5.7 What proportion should be used?

https://twitter.com/asmae_toumi/status/1356024351720689669?s=20

Some people said the 80/20 split comes from the Pareto principle/distribution or the power law. Some said because it works nicely with 5-fold cross-validation (which we will see in the later chapters).

I believe the point is to use enough data in the training set to allow for solid parameter estimation but not too much that it hurts performance. 80/20 or 70/30 seems reasonable for most problems at hand, as it’s what is widely used. Max Kuhn notes that a test set is almost always a good idea, and it should only be avoided when the data is “pathologically small”.