10.6 A case study

Incorrectly combining feature selection and resampling

Goal: identify a subset of the predictors that achieves roughly 80% classification accuracy

  • 75 samples from each of the two classes
  • 10,000 predictors for each data point
  • 70% of the data for a training set
  • 10-fold cross-validation for model training
  • implicit feature selection via the glmnet and random forest models (see the setup sketch after this list)
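
As a rough illustration of this setup, the sketch below builds a simulated stand-in data set of the same shape and runs the same kind of 10-fold cross-validation. It uses scikit-learn analogs (an L1-penalized logistic regression in place of glmnet, plus a random forest); the data, regularization strength, tree count, and random seeds are all assumptions for illustration, not values from the original study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# 75 samples per class and 10,000 predictors; pure noise as a stand-in for the
# real (unavailable) data set.
n_per_class, n_pred = 75, 10_000
X = rng.normal(size=(2 * n_per_class, n_pred))
y = np.repeat([0, 1], n_per_class)

# 70% of the data for the training set, stratified by class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.70, stratify=y, random_state=42
)

# 10-fold cross-validation for model training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Two models with implicit feature selection: an L1-penalized logistic
# regression (a rough glmnet analog) and a random forest.
models = {
    "l1 logistic": make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
    ),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=42),
}

for name, model in models.items():
    acc = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {acc.mean():.2f}")
```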

Initial results:

  • a best cross-validation accuracy of 60%

First adjustment approach, which produced worse results (see the sketch after this list):

  • PCA for reducing dimensions
  • linear discriminant analysis
  • partial least squares discriminant analysis
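
A hedged sketch of what these three adjustments could look like, again with scikit-learn analogs and the stand-in `X_train`, `y_train`, and `cv` from the setup sketch above. A `PLSRegression` on the 0/1 label, thresholded at 0.5, stands in for PLS discriminant analysis; the component counts are illustrative assumptions, not settings from the original analysis.

```python
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.pipeline import make_pipeline

# PCA for reducing dimensions, followed by a simple linear classifier.
pca_clf = make_pipeline(PCA(n_components=20), LogisticRegression(max_iter=1000))

# Linear discriminant analysis (the default SVD solver handles p >> n).
lda_clf = LinearDiscriminantAnalysis()

for name, model in [("pca + logistic", pca_clf), ("lda", lda_clf)]:
    acc = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {acc.mean():.2f}")

# PLS-DA analog: regress the 0/1 class label with PLS, then threshold the
# cross-validated predictions at 0.5.
pls = PLSRegression(n_components=2)
pls_pred = (cross_val_predict(pls, X_train, y_train, cv=cv).ravel() > 0.5).astype(int)
print(f"pls-da analog: CV accuracy = {accuracy_score(y_train, pls_pred):.2f}")
```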

Second adjustment approach:

First, identify and select predictors that show a univariate signal with the response (see the sketch after this list):

  • a t-test performed for each predictor
  • predictors ranked by significance
  • top 300 predictors selected for modeling
  • PCA on the 300 predictors
  • plotted first two components colored by response category
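
The sketch below walks through that screening-then-PCA sequence with scipy and scikit-learn, again on the stand-in data from the setup sketch. Note that the t-test screening is applied to the whole training set before any resampling, which is exactly the "feature selection outside of resampling" situation the figure caption describes.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

# A t-test for each predictor, comparing the two classes on the training set.
t_stat, p_val = stats.ttest_ind(X_train[y_train == 0], X_train[y_train == 1], axis=0)

# Rank predictors by significance and keep the top 300 for modeling.
top300 = np.argsort(p_val)[:300]

# PCA on the 300 selected predictors.
scores = PCA(n_components=2).fit_transform(X_train[:, top300])

# Plot the first two components colored by response category.
plt.scatter(scores[:, 0], scores[:, 1], c=y_train, cmap="coolwarm")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("PCA on predictors screened outside of resampling")
plt.show()
```

Even on pure-noise stand-in data like this, screening 10,000 predictors down to the 300 most significant and then plotting the first two components typically yields visible class separation, which is the selection bias this section warns about.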

Figure 10.3: First two components obtained when feature selection is performed outside of resampling

This second approach seemed to provide a clear answer: the first two components showed a complete separation of the two classes. However, because the predictors were screened on the same data before any resampling, this apparent separation can reflect selection bias rather than a genuine signal.
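
For contrast, a minimal sketch of keeping the same kind of screening inside resampling: the selection step is placed in a pipeline so it is refit on each fold's training portion only. `SelectKBest` with `f_classif` (equivalent to a two-sample t-test when there are two classes) and `k=300` are scikit-learn analogs chosen for illustration, not the original code.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# The screening step lives inside the pipeline, so it is refit on each fold's
# training portion only and the held-out fold never influences the selection.
inside = make_pipeline(
    SelectKBest(score_func=f_classif, k=300),
    LogisticRegression(max_iter=1000),
)
acc = cross_val_score(inside, X_train, y_train, cv=cv, scoring="accuracy")
print(f"selection inside resampling: mean CV accuracy = {acc.mean():.2f}")
```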