1.4 Important concepts

  • Overfitting
  • Supervised and unsupervised
  • Model bias and variance
  • Experience-driven versus empirically driven modeling

As a general caution, the risk of overfitting is always present: new data can hide anomalous patterns that the fitted model has never seen.

1.4.1 Acknowledge vulnerabilities

Points to consider:

  • small number of observations compared to the number of predictors

  • low bias models can have a higher likelihood of overfitting

  • supervised analysis can be used to assess predictor significance

  • no free lunch theorem (Wolpert, 1996): domain knowledge is an important part of modeling

  • bias-variance trade-off

    • Low variance: linear/logistic regression and PLS
    • High variance: trees, nearest neighbors, neural networks
    • Bias: how far a model's average estimate is from the true value
  • irrelevant predictors can cause excess model variation

  • be data-driven rather than experience-driven

  • big data does not mean better data

  • unlabeled data can improve autoencoder models

  • compensatory effects: there may not be a unique best set of predictors.
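To make the bias-variance contrast above concrete, the following sketch (pure Python; the linear data-generating process and sample sizes are made up for illustration) refits a low-variance learner (simple linear regression) and a high-variance one (1-nearest-neighbour regression) on many fresh training samples, then compares how much each learner's prediction at a fixed point fluctuates:

```python
import random
import statistics

random.seed(0)

def true_f(x):
    # Made-up data-generating process (an assumption for illustration).
    return 2.0 * x + 1.0

def sample_data(n=30):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, 0.5) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b: a stable, low-variance learner.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def fit_1nn(xs, ys):
    # 1-nearest-neighbour regression: flexible but high-variance.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Refit both learners on 200 fresh training sets and record each one's
# prediction at a fixed query point.
x0 = 0.5
lin_preds, nn_preds = [], []
for _ in range(200):
    xs, ys = sample_data()
    lin_preds.append(fit_linear(xs, ys)(x0))
    nn_preds.append(fit_1nn(xs, ys)(x0))

lin_var = statistics.pvariance(lin_preds)
nn_var = statistics.pvariance(nn_preds)
print(f"linear variance: {lin_var:.4f}  1-NN variance: {nn_var:.4f}")
```

The 1-NN prediction inherits the full noise of a single training point, while the least-squares line averages noise over the whole sample, so its predictions vary far less from one training set to the next.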

Finally, it is important to consider strategies for supervised and unsupervised feature selection.

Supervised selection methods can be divided into:

  • wrapper methods, such as backward and stepwise selection
  • embedded methods, such as decision tree variable selection
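As an illustration of a wrapper method, here is a hedged sketch of backward elimination wrapped around a 1-nearest-neighbour model, scored by leave-one-out error; the synthetic data (only the first two of four predictors matter) and the greedy acceptance rule are assumptions made for the example:

```python
import random

random.seed(1)

# Synthetic data: y depends only on features 0 and 1; features 2 and 3 are noise.
n = 60
X = [[random.uniform(0, 1) for _ in range(4)] for _ in range(n)]
y = [3 * x[0] - 2 * x[1] + random.gauss(0, 0.1) for x in X]

def loo_1nn_error(feats):
    """Leave-one-out mean squared error of a 1-NN fit using only `feats`."""
    err = 0.0
    for i in range(n):
        j = min((k for k in range(n) if k != i),
                key=lambda k: sum((X[i][f] - X[k][f]) ** 2 for f in feats))
        err += (y[i] - y[j]) ** 2
    return err / n

# Backward elimination: keep dropping the feature whose removal gives the
# best score, as long as the score does not get worse.
feats = [0, 1, 2, 3]
improved = True
while improved and len(feats) > 1:
    improved = False
    current = loo_1nn_error(feats)
    candidates = [[f for f in feats if f != d] for d in feats]
    best = min(candidates, key=loo_1nn_error)
    if loo_1nn_error(best) <= current:
        feats, improved = best, True

print("selected features:", sorted(feats))
```

Because the model is refit and rescored for every candidate subset, wrapper methods are expensive but can capture interactions between predictors that simple per-variable filters miss.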

Unsupervised selection methods:

  • variable encoding, such as dummy or indicator variables
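A minimal sketch of dummy/indicator variable encoding in pure Python (the helper name `dummy_encode` is made up for illustration):

```python
def dummy_encode(values):
    """One-hot (indicator) encoding: one 0/1 column per observed category."""
    levels = sorted(set(values))
    return levels, [[1 if v == lvl else 0 for lvl in levels] for v in values]

levels, encoded = dummy_encode(["red", "green", "red", "blue"])
print(levels)   # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

In a regression design matrix one level is usually dropped, since keeping all indicators is perfectly collinear with the intercept.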

1.4.2 The modeling process

A brief summary of the steps:

  1. EDA
  2. summary and correlation
  3. model methods evaluation
  4. model tuning
  5. summary measures and EDA
  6. residual analysis/ check for systematic issues
  7. more feature engineering
  8. model selection
  9. final bake off
  10. prediction
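The later steps above (candidate evaluation, model selection, and final prediction) can be sketched as a toy pipeline; the data, the two candidate models, and the single holdout split are all simplifying assumptions standing in for real resampling and tuning:

```python
import random

random.seed(2)

# Toy data with a linear signal (a stand-in for a real dataset).
X = [random.uniform(0, 10) for _ in range(100)]
y = [1.5 * x + random.gauss(0, 1) for x in X]

# Candidate modeling methods (step 3): each returns a fitted predictor.
def fit_mean(xs, ys):
    # Baseline: always predict the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Simple least-squares line y = a*x + b.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sum((xi - mx) ** 2 for xi in xs)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    return sum((model(x) - yv) ** 2 for x, yv in zip(xs, ys)) / len(xs)

# A single holdout split, standing in for proper resampling (steps 4-6).
train_x, train_y = X[:70], y[:70]
test_x, test_y = X[70:], y[70:]

candidates = {"mean baseline": fit_mean, "linear": fit_linear}
scores = {name: mse(fit(train_x, train_y), test_x, test_y)
          for name, fit in candidates.items()}

# Model selection (step 8): keep the method with the lowest holdout error,
# then refit on all data and predict (steps 9-10).
best_name = min(scores, key=scores.get)
final_model = candidates[best_name](X, y)
print("selected:", best_name, "| prediction at x=4:", round(final_model(4.0), 2))
```

The key design point is that selection is based on error measured on data the models never saw, which is what guards against the overfitting risks listed in 1.4.1.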