1.4 Important concepts
- Overfitting
- Supervised and unsupervised
- Model bias and variance
- Experience and empirically driven modeling
In broad terms, the risk of overfitting is ever-present: a model that fits the training data too closely can capture anomalous patterns that new data will not reproduce.
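A minimal sketch of this risk, using only the Python standard library (the data and all names here are hypothetical): a model that simply memorizes the training set scores perfectly on it, while a constrained straight-line fit generalizes better to fresh data.

```python
import random

random.seed(0)

# Hypothetical toy data: y = 2*x + noise, same process for train and test.
def make_data(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + random.gauss(0, 1.0)) for x in xs]

train, test = make_data(20), make_data(20)

# "Memorizing" model: a lookup table over the training points.
lookup = {x: y for x, y in train}

def memorizer(x):
    # Return the label of the closest memorized x (1-nearest neighbor).
    nearest = min(lookup, key=lambda xt: abs(xt - x))
    return lookup[nearest]

# Constrained model: simple least-squares line.
def fit_line(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    b = sum((x - mx) * (y - my) for x, y in data) / \
        sum((x - mx) ** 2 for x, _ in data)
    return lambda x: my + b * (x - mx)

line = fit_line(train)

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

print(mse(memorizer, train))  # 0.0: it has memorized the noise
print(mse(memorizer, test))   # larger on unseen data
print(mse(line, train), mse(line, test))
```

The memorizer's zero training error is exactly the "anomalous patterns" problem: it has learned the noise, not the signal.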
1.4.1 Acknowledge vulnerabilities
To consider:
- a small number of observations relative to the number of predictors
- low-bias models can have a higher likelihood of overfitting
- supervised analysis can be used to assess the significance of predictors
- No free lunch theorem (Wolpert, 1996): domain knowledge is an important part of modeling
- variance-bias trade-off:
  - Low variance: linear/logistic regression and PLS
  - High variance: trees, nearest neighbors, neural networks
  - Bias: how far the model's typical estimate falls from the true value
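The trade-off above can be illustrated with a small resampling experiment (hypothetical toy data, stdlib only): refit a rigid least-squares line and a flexible 1-nearest-neighbor model on many fresh training samples, and compare the spread of their predictions at a single query point.

```python
import random
import statistics

random.seed(1)

def signal(x):
    return 2 * x  # hypothetical underlying signal

def sample_train(n=15):
    xs = [random.uniform(0, 10) for _ in range(n)]
    return [(x, signal(x) + random.gauss(0, 1.0)) for x in xs]

def fit_line(data):
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    b = sum((x - mx) * (y - my) for x, y in data) / \
        sum((x - mx) ** 2 for x, _ in data)
    return lambda x: my + b * (x - mx)

def fit_1nn(data):
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

# Refit each model on many fresh samples and record its prediction at a
# single query point; the spread of those predictions is the model's variance.
x0 = 5.0
line_preds, nn_preds = [], []
for _ in range(200):
    data = sample_train()
    line_preds.append(fit_line(data)(x0))
    nn_preds.append(fit_1nn(data)(x0))

print(statistics.pstdev(line_preds))  # small: rigid model, low variance
print(statistics.pstdev(nn_preds))    # larger: flexible model, high variance
```

The line's prediction barely moves between resamples, while the nearest-neighbor prediction tracks whichever noisy point happens to land near the query.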
- irrelevant predictors can cause excess model variation
- be data-driven rather than experience-driven
- big data does not mean better data
- unlabeled data can improve autoencoder modeling
- compensatory effects among predictors mean there may not be a unique best set of predictors
Finally, it is important to consider strategies for supervised and unsupervised feature selection.
Supervised selection methods can be divided into:
- wrapper methods, such as backward and stepwise selection
- embedded methods, such as decision tree variable selection
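A minimal sketch of a wrapper method on hypothetical toy data (the wrapped model here is a small k-nearest-neighbor regressor; every name is an assumption, not a standard API): backward elimination starts from all predictors and greedily drops any whose removal does not hurt validation error.

```python
import random

random.seed(2)

# Hypothetical task: y depends on features 0 and 1; feature 2 is pure noise.
def make_row():
    f = [random.uniform(0, 1) for _ in range(3)]
    return f, 3 * f[0] + 2 * f[1] + random.gauss(0, 0.05)

train = [make_row() for _ in range(80)]
valid = [make_row() for _ in range(40)]

def knn_predict(features, row, data, k=3):
    # k-NN with squared distance computed over the selected features only.
    nearest = sorted(data, key=lambda r: sum((r[0][i] - row[i]) ** 2
                                             for i in features))[:k]
    return sum(y for _, y in nearest) / k

def score(features):
    # Validation MSE of the wrapped model on a candidate feature subset.
    err = sum((knn_predict(features, f, train) - y) ** 2 for f, y in valid)
    return err / len(valid)

# Backward elimination: drop any feature whose removal does not hurt
# the validation error, until no such feature remains.
selected = [0, 1, 2]
improved = True
while improved and len(selected) > 1:
    improved = False
    current = score(selected)
    for f in list(selected):
        reduced = [s for s in selected if s != f]
        if score(reduced) <= current:
            selected, improved = reduced, True
            break

print(selected)  # the noise feature (index 2) is typically the one dropped
```

The key property of a wrapper method is visible here: the selection criterion is the predictive performance of the model itself, not a model-free statistic.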
Unsupervised selection methods include:
- variable encoding, such as dummy or indicator variables
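A minimal example of dummy (indicator) encoding in plain Python; the function name and the data are hypothetical:

```python
# One-hot (dummy) encoding of a categorical variable, stdlib only.
def dummy_encode(values):
    # Sort the categories so every row gets the same column layout.
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

rows, cols = dummy_encode(["red", "green", "red", "blue"])
print(cols)  # ['blue', 'green', 'red']
print(rows)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

This is unsupervised in the sense that it uses no outcome variable: the encoding depends only on the categories present in the predictor itself.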