1.4 Important concepts

  • Overfitting
  • Supervised and unsupervised
  • Model bias and variance
  • Experience-driven versus empirically driven modeling

As a general caution, the risk of overfitting is always present: new data can hide anomalous patterns that the fitted model has never seen.

1.4.1 Acknowledge vulnerabilities

Points to consider:

  • small number of observations compared to the number of predictors

  • low bias models can have a higher likelihood of overfitting

  • supervised analysis can be used to assess predictor significance

  • no free lunch theorem (Wolpert, 1996): domain knowledge is an important part of modeling

  • bias-variance trade-off

    • Low variance: linear/logistic regression and PLS
    • High variance: trees, nearest neighbors, neural networks
    • Bias: how far a model's average estimate is from the true value
  • irrelevant predictors can cause excess model variation

  • be data-driven rather than experience-driven

  • big data does not mean better data

  • unlabeled data can improve autoencoder models

  • compensatory effects: there may not be a unique best set of predictors.
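To make the bias-variance contrast above concrete, the following sketch (pure Python; the linear data-generating process and sample sizes are made up for illustration) refits a low-variance learner (simple linear regression) and a high-variance one (1-nearest-neighbour regression) on many fresh training samples, then compares how much each learner's prediction at a fixed point fluctuates:

```python
import random
import statistics

random.seed(0)

def true_f(x):
    # Made-up data-generating process (an assumption for illustration).
    return 2.0 * x + 1.0

def sample_data(n=30):
    xs = [random.uniform(0, 1) for _ in range(n)]
    ys = [true_f(x) + random.gauss(0, 0.5) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    # Ordinary least squares for y = a*x + b: a stable, low-variance learner.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return lambda x: a * x + b

def fit_1nn(xs, ys):
    # 1-nearest-neighbour regression: flexible but high-variance.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

# Refit both learners on 200 fresh training sets and record each one's
# prediction at a fixed query point.
x0 = 0.5
lin_preds, nn_preds = [], []
for _ in range(200):
    xs, ys = sample_data()
    lin_preds.append(fit_linear(xs, ys)(x0))
    nn_preds.append(fit_1nn(xs, ys)(x0))

lin_var = statistics.pvariance(lin_preds)
nn_var = statistics.pvariance(nn_preds)
print(f"linear variance: {lin_var:.4f}  1-NN variance: {nn_var:.4f}")
```

The 1-NN prediction inherits the full noise of a single training point, while the least-squares line averages noise over the whole sample, so its predictions vary far less from one training set to the next.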

Finally, it is important to consider strategies for supervised and unsupervised feature selection.

Supervised selection methods can be divided into:

  • wrapper methods, such as backward and stepwise selection
  • embedded methods, such as decision tree variable selection
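As an illustration of a wrapper method, here is a hedged sketch of backward elimination wrapped around a 1-nearest-neighbour model, scored by leave-one-out error; the synthetic data (only the first two of four predictors matter) and the greedy acceptance rule are assumptions made for the example:

```python
import random

random.seed(1)

# Synthetic data: y depends only on features 0 and 1; features 2 and 3 are noise.
n = 60
X = [[random.uniform(0, 1) for _ in range(4)] for _ in range(n)]
y = [3 * x[0] - 2 * x[1] + random.gauss(0, 0.1) for x in X]

def loo_1nn_error(feats):
    """Leave-one-out mean squared error of a 1-NN fit using only `feats`."""
    err = 0.0
    for i in range(n):
        j = min((k for k in range(n) if k != i),
                key=lambda k: sum((X[i][f] - X[k][f]) ** 2 for f in feats))
        err += (y[i] - y[j]) ** 2
    return err / n

# Backward elimination: keep dropping the feature whose removal gives the
# best score, as long as the score does not get worse.
feats = [0, 1, 2, 3]
improved = True
while improved and len(feats) > 1:
    improved = False
    current = loo_1nn_error(feats)
    candidates = [[f for f in feats if f != d] for d in feats]
    best = min(candidates, key=loo_1nn_error)
    if loo_1nn_error(best) <= current:
        feats, improved = best, True

print("selected features:", sorted(feats))
```

Because the model is refit and rescored for every candidate subset, wrapper methods are expensive but can capture interactions between predictors that simple per-variable filters miss.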

Unsupervised selection methods:

  • variable encoding, such as dummy or indicator variables
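A minimal sketch of dummy/indicator variable encoding in pure Python (the helper name `dummy_encode` is made up for illustration):

```python
def dummy_encode(values):
    """One-hot (indicator) encoding: one 0/1 column per observed category."""
    levels = sorted(set(values))
    return levels, [[1 if v == lvl else 0 for lvl in levels] for v in values]

levels, encoded = dummy_encode(["red", "green", "red", "blue"])
print(levels)   # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

In a regression design matrix one level is usually dropped, since keeping all indicators is perfectly collinear with the intercept.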

1.4.2 The modeling process

A brief summary of the steps:

  1. EDA
  2. summary and correlation
  3. model methods evaluation
  4. model tuning
  5. summary measures and EDA
  6. residual analysis/ check for systematic issues
  7. more feature engineering
  8. model selection
  9. final bake off
  10. prediction
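The later steps above (candidate evaluation, model selection, and final prediction) can be sketched as a toy pipeline; the data, the two candidate models, and the single holdout split are all simplifying assumptions standing in for real resampling and tuning:

```python
import random

random.seed(2)

# Toy data with a linear signal (a stand-in for a real dataset).
X = [random.uniform(0, 10) for _ in range(100)]
y = [1.5 * x + random.gauss(0, 1) for x in X]

# Candidate modeling methods (step 3): each returns a fitted predictor.
def fit_mean(xs, ys):
    # Baseline: always predict the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m

def fit_linear(xs, ys):
    # Simple least-squares line y = a*x + b.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, ys)) / sum((xi - mx) ** 2 for xi in xs)
    b = my - a * mx
    return lambda x: a * x + b

def mse(model, xs, ys):
    return sum((model(x) - yv) ** 2 for x, yv in zip(xs, ys)) / len(xs)

# A single holdout split, standing in for proper resampling (steps 4-6).
train_x, train_y = X[:70], y[:70]
test_x, test_y = X[70:], y[70:]

candidates = {"mean baseline": fit_mean, "linear": fit_linear}
scores = {name: mse(fit(train_x, train_y), test_x, test_y)
          for name, fit in candidates.items()}

# Model selection (step 8): keep the method with the lowest holdout error,
# then refit on all data and predict (steps 9-10).
best_name = min(scores, key=scores.get)
final_model = candidates[best_name](X, y)
print("selected:", best_name, "| prediction at x=4:", round(final_model(4.0), 2))
```

The key design point is that selection is based on error measured on data the models never saw, which is what guards against the overfitting risks listed in 1.4.1.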