How much data is enough?

A Reddit reader asked how much data is needed for a machine learning project to get meaningful results. Prof. Yaser Abu-Mostafa from Caltech answered this very question in his online course.

The answer is that as a rule of thumb, you need roughly 10 times as many examples as there are degrees of freedom in your model.

In the case of a linear model, the degrees of freedom essentially equal the data dimensionality (the number of columns). We find that thinking in terms of dimensionality versus the number of examples is a convenient shortcut.
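To make the rule of thumb concrete, here is a minimal sketch of the arithmetic. The function names and the example numbers are ours, made up for illustration. It assumes a linear model, so that the degrees of freedom can be taken as the number of columns.

```python
def examples_needed(degrees_of_freedom, factor=10):
    """Rule-of-thumb number of training examples: factor x degrees of freedom."""
    if degrees_of_freedom < 0:
        raise ValueError("degrees of freedom must be non-negative")
    return factor * degrees_of_freedom


def examples_to_features_ratio(n_examples, n_features):
    """How many examples we have per column (per degree of freedom, if linear)."""
    return n_examples / n_features


# Hypothetical dataset with 50 columns.
n_features = 50
print(examples_needed(n_features))                   # 500
print(examples_to_features_ratio(2000, n_features))  # 40.0
```

A ratio of 10 or more satisfies the rule. A lower ratio is not necessarily fatal, but it is a hint to keep the model simple and to watch the validation score.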

The more powerful the model, the more prone it is to overfitting, and so the more examples you need. The way to keep this in check is validation: measure performance on data the model has not seen during training.
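Here is a toy sketch of why validation matters, using only the standard library. Everything in it is an assumption made for illustration: the synthetic one-dimensional data, the 20% label noise and the choice of k-nearest neighbours as the model. With k=1 the model simply memorizes the training set, which makes it the most flexible setting.

```python
import random


def make_data(n, noise, rng):
    """1-D points in [0, 1); the label is x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data


def predict_knn(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)


def accuracy(train, data, k):
    """Fraction of points in `data` that a k-NN model built on `train` gets right."""
    hits = sum(predict_knn(train, x, k) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)
train = make_data(200, noise=0.2, rng=rng)
valid = make_data(200, noise=0.2, rng=rng)

for k in (1, 15):
    print(k, accuracy(train, train, k), accuracy(train, valid, k))
```

With k=1 the training accuracy is perfect because each training point is its own nearest neighbour, noisy labels included. That number tells us nothing about new data. The validation column is the one to trust when comparing models.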

Breaking the rules

In practice you can get away with fewer than 10x, especially if your model is simple and uses regularization. In Kaggle competitions the ratio is often closer to 1:1, and sometimes the dimensionality is far greater than the number of examples, depending on how you pre-process the data.

Specifically, text represented as a bag of words may be very high-dimensional and very sparse. For instance, consider the Online Learning Library experiments on a binary version of the News20 dataset. With 15k training points and well over a million features you can get 96% accuracy (and not because the classes are skewed: they are perfectly balanced).
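To see how a bag of words ends up with more columns than rows, consider this small sketch. The three documents are invented for illustration and tokenization is a plain whitespace split. Real pipelines are more elaborate, but the shape of the result is the same: one column per distinct word, and mostly zeros.

```python
from collections import Counter

# Three made-up documents standing in for a text dataset.
docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "stock markets fell sharply today",
]

# Vocabulary: one column index per distinct word across all documents.
words = sorted({w for d in docs for w in d.split()})
vocab = {w: i for i, w in enumerate(words)}


def bag_of_words(doc):
    """Sparse row as {column index: count}, storing only non-zero entries."""
    counts = Counter(doc.split())
    return {vocab[w]: c for w, c in counts.items()}


rows = [bag_of_words(d) for d in docs]
n_examples, n_features = len(rows), len(vocab)
nonzero = sum(len(r) for r in rows)
density = nonzero / (n_examples * n_features)

print(n_examples, n_features)  # 3 13
print(round(density, 2))       # 0.33
```

Even in this tiny case there are 13 columns for 3 examples, and only a third of the cells are non-zero. On a real corpus the vocabulary keeps growing with every new document, so the matrix gets wider and sparser still.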

UPDATE: Jake Vanderplas has an extensive article about the topic at hand: The Model Complexity Myth. It says that one can successfully fit underdetermined models (those with more parameters than examples) thanks to conditioning, that is, regularization or priors.
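A minimal sketch of that idea, assuming ridge regression (an L2 penalty) as the form of regularization. This is our own toy example with made-up data, not code from the article. It has three parameters and only two examples, so the plain least-squares normal equations are singular. Adding a small penalty lam makes the solution unique.

```python
def solve(A, b, eps=1e-12):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < eps:
            raise ValueError("singular system: no unique solution")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution on the upper-triangular system.
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w


def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 via (X^T X + lam I) w = X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve(A, b)


def predict(X, w):
    """Linear predictions X w."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


# Two examples, three parameters: an underdetermined problem.
X = [[1, 0, 1], [0, 1, 1]]
y = [1, 2]

try:
    w_plain = ridge_fit(X, y, lam=0.0)  # no regularization
except ValueError:
    w_plain = None                      # infinitely many exact fits

w_ridge = ridge_fit(X, y, lam=0.01)
preds = predict(X, w_ridge)
print(w_plain, w_ridge, preds)
```

Without the penalty there is no single answer to pick. With it, the solver returns one small-norm set of weights whose predictions stay close to the targets.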

FastML

Machine learning made easy
