I don't think sparsity in itself causes the model to overfit, but it can increase the chance of it.
Overfitting is not a state in itself. What I mean is: if I say that my training accuracy is 92%, can you predict whether the model is overfit?
It's a relative state, i.e. you can only know it once I tell you the accuracy on test/new data.
It means that the farther the new data is from the training data, the higher the chance that the model will overfit.
In a high-dimensional space, let's say a dataset with 10 features, each taking around 10 distinct values:
To fill the space evenly you need 10^10 records, which is 10Bn. Normally you might have 1Mn.
So there is almost a 9999/10000 chance that a new data point will fall outside the training data.
The model will definitely have a very high chance to fail on such points (assuming no regularization is in place; I am not saying that it will fail 9999 times out of 10000).
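A quick back-of-the-envelope sketch of that arithmetic. The 10-values-per-feature discretization and the perfectly even spread of training records are both simplifying assumptions, not something you'd see in real data:

```python
# Coverage arithmetic for the 10-feature case above.
# Assumes each feature is discretized into 10 values and the
# training records are spread perfectly evenly over the cells.
n_features = 10
values_per_feature = 10
n_train = 1_000_000

cells = values_per_feature ** n_features   # 10**10 = 10Bn cells
covered = min(n_train, cells) / cells      # fraction of cells seen in training
print(f"cells to fill evenly: {cells:,}")  # 10,000,000,000
print(f"chance a new point lands in an unseen cell: {1 - covered:.4f}")  # 0.9999
```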
But this is completely based on the assumption that every feature has values spanning the full space.
Let's say 6 out of 10 features have only 2 possible values. Then we need only 2^6 × 10^4 = 640K records to fill the space evenly. In this case there is much less chance of the model being overfit compared to the last case (see the sketch below).
The model may still be very complex and wiggly (a lot of branches/leaves), but not heavily overfit.
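The same sketch for the reduced case, under the same even-spread assumption, shows why the coverage problem mostly disappears:

```python
# Same coverage arithmetic, but now 6 of the 10 features are binary.
n_train = 1_000_000
cells = (2 ** 6) * (10 ** 4)               # 64 * 10,000 = 640K cells
covered = min(n_train, cells) / cells      # 1Mn records can cover every cell
print(f"cells to fill evenly: {cells:,}")  # 640,000
print(f"chance a new point lands in an unseen cell: {1 - covered:.4f}")  # 0.0000
```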