12.3 Genetic Algorithms

The basic procedure of a genetic algorithm

Genetic algorithms (GAs) are optimization tools that can find globally optimal solutions by mimicking Charles Darwin's theory of natural selection. Each GA starts with an initial population of feature subsets. The subsets with the highest performance essentially 'mate', and their genes (features) are combined through crossover. The offspring that are created also have a small subset of their genes randomly mutated. This helps the algorithm avoid local optima.
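This loop can be sketched in a few lines. Everything here is illustrative: the placeholder fitness function stands in for a real resampled performance estimate (e.g., ROC AUC from a fitted model), and the uniform-crossover and per-feature-flip operators are common choices, not necessarily the exact ones used in the book.

```python
import random

random.seed(42)

FEATURES = list("ABCDEFGHI")  # candidate features, as in the table below

def fitness(subset):
    # Placeholder: in practice this would be a resampled model-performance
    # estimate (e.g., ROC AUC) for a model using only these features.
    return len(subset) / len(FEATURES)

def random_subset():
    return {f for f in FEATURES if random.random() < 0.5}

def crossover(p1, p2):
    # Uniform crossover: each feature's membership comes from one parent
    # chosen at random.
    return {f for f in FEATURES
            if (f in p1 if random.random() < 0.5 else f in p2)}

def mutate(subset, rate=0.01):
    # Flip each feature's membership with a small probability.
    return {f for f in FEATURES
            if (f not in subset if random.random() < rate else f in subset)}

def next_generation(population, size):
    scores = [fitness(s) for s in population]
    total = sum(scores)
    probs = [s / total for s in scores]  # selection probabilities
    children = []
    while len(children) < size:
        p1, p2 = random.choices(population, weights=probs, k=2)
        children.append(mutate(crossover(p1, p2)))
    return children

population = [random_subset() for _ in range(12)]
for _ in range(10):
    population = next_generation(population, len(population))
best = max(population, key=fitness)
```

High-fitness subsets are more likely to be chosen as parents, so good features tend to propagate while mutation keeps some exploration alive.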

Unlike simulated annealing, a GA groups feature subsets into generations instead of considering one subset at a time. A generation in a GA is analogous to an iteration in simulated annealing.

The key is to set the initial population size to something that will capture the maximum amount of variability in the system.

ID  Feature Subset     Performance  Probability (%)
 1  C F                0.54          6.4
 2  A D E F H          0.55          6.5
 3  D                  0.51          6.0
 4  E                  0.53          6.2
 5  D G H I            0.75          8.8
 6  B E G I            0.64          7.5
 7  B C F I            0.65          7.7
 8  A C E G H I        0.95         11.2
 9  A C D F G H I      0.81          9.6
10  C D E I            0.79          9.3
11  A B D E G H        0.85         10.0
12  A B C D E F G I    0.91         10.7

Higher performance is generally better, but in order to reduce the likelihood of getting stuck in a local optimum, a weighted selection probability is created. A simple way to compute the selection probability is to divide an individual feature subset's performance value by the sum of all the performance values. The rightmost column in the table above shows the results for this generation.
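This computation can be checked directly against the table. A short sketch, using the twelve performance values above:

```python
# Selection probability: each subset's performance divided by the sum of
# all performances in the generation, expressed as a percentage.
performance = [0.54, 0.55, 0.51, 0.53, 0.75, 0.64,
               0.65, 0.95, 0.81, 0.79, 0.85, 0.91]

total = sum(performance)  # 8.48
probability = [round(100 * p / total, 1) for p in performance]
# reproduces the Probability (%) column above (up to rounding)
```

Subset 8, with the best performance (0.95), gets the largest share (11.2%) and is therefore the most likely to be chosen as a parent.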

Selected parents from the first generation:

ID  Feature Subset
 6  B E G I
12  A B C D E F G I

Their offspring:

ID  Feature Subset
13  B E F G I
14  A B C D E G I

After this, mutation kicks in; each feature typically has only a small chance (1-2%) of being flipped. Here, feature I was removed from offspring 13:

ID  Feature Subset
13  B E F G
14  A B C D E G I
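The crossover and mutation steps can be sketched with the two parents from the tables above. This is a minimal illustration; the uniform-crossover and per-feature-flip details are assumptions, since the exact operators are not specified here.

```python
import random

random.seed(1)
FEATURES = list("ABCDEFGHI")
p1 = {"B", "E", "G", "I"}                      # parent 6
p2 = {"A", "B", "C", "D", "E", "F", "G", "I"}  # parent 12

def uniform_crossover(a, b):
    # Each feature's membership is inherited from one parent at random,
    # so a child can only contain features present in at least one parent.
    return {f for f in FEATURES if f in random.choice((a, b))}

def mutate(subset, rate=0.01):
    # With a small chance per feature (1-2%), flip membership; this is the
    # only way a feature outside both parents (here, "H") can enter a child.
    return {f for f in FEATURES
            if (f in subset) != (random.random() < rate)}

child = mutate(uniform_crossover(p1, p2))
```

Note that crossover alone always keeps features shared by both parents (B, E, G, I here) and never introduces new ones; only mutation can do that.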

This process is continued for however many generations you set. The best-performing feature subsets can be retained across generations through a process known as elitism.

12.3.1 External Validation

Basically, the number of iterations is once again a tuning parameter, except that in a GA the iterations are called generations.

A genetic algorithm makes gradual improvements in predictive performance through changes to feature subsets over time. Similar to simulated annealing, an external resampling procedure should be used to select an optimal number of generations. This will help prevent the GA from overfitting.

Just like SA, 10-fold cross-validation was used. The GA used the following default parameters:

  • Generation size: 50,

  • Crossover probability: 80%,

  • Mutation probability: 1%,

  • Elitism: No,

  • Number of generations: 14

Understanding characteristics of the members within a generation can also be useful. For example, the diversity of the subsets within and between generations can be quantified using a measure such as Jaccard similarity (Tan, Steinbach, and Kumar 2006). This will show us how the GA is converging on a solution.
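Jaccard similarity is simple to compute: the size of the intersection of two feature subsets divided by the size of their union. A small sketch, reusing the two parents from the tables above:

```python
from itertools import combinations

def jaccard(a, b):
    # Jaccard similarity: |intersection| / |union| of two feature subsets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def mean_similarity(generation):
    # Average pairwise similarity within a generation; values near 1
    # indicate the population is converging on similar subsets.
    pairs = list(combinations(generation, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Parents 6 and 12 share {B, E, G, I} out of 8 distinct features:
print(jaccard({"B", "E", "G", "I"},
              {"A", "B", "C", "D", "E", "F", "G", "I"}))  # 0.5
```

Tracking the mean within-generation similarity over time shows whether the population is collapsing onto one region of the feature space.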

Figure 12.8: The change in subset size, similarity, and performance over time for each external resample of the genetic algorithm.

Figure 12.8 shows that the subset size converges, that the generations generally become more similar, and that the area under the ROC curve improves over time.

Looking at the entire dataset:

Figure 12.9: Genetic algorithm results for the final search using the entire training set.

To gauge the effectiveness of the search, 100 random subsets of size 63 were chosen and a naive Bayes model was developed for each. The GA-selected subset performed better than all 100 randomly selected subsets. This indicates that the GA did find a useful subset for predicting the response.

12.3.2 Coercing Sparsity

With GAs there is less of a penalty for keeping a feature that has no impact on predictive performance. Ultimately, this is not what we want.

A simple method for reducing the number of predictors is to use a surrogate measure of performance that has an explicit penalty based on the number of features. Section 11.4 introduced the Akaike information criterion (AIC) which augments the objective function with a penalty that is a function of the training set size and the number of model terms.
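The AIC idea is easy to sketch: the penalty grows with the number of model terms, so a subset with a slightly worse fit but far fewer features can still win. The likelihood values below are made up for illustration.

```python
def aic(log_likelihood, k):
    # Akaike information criterion: 2k - 2*log-likelihood. Smaller is
    # better; each extra model term costs 2 units of log-likelihood.
    return 2 * k - 2 * log_likelihood

def aicc(log_likelihood, k, n):
    # Small-sample correction: the penalty also depends on the training
    # set size n, growing sharply as k approaches n.
    return aic(log_likelihood, k) + (2 * k * (k + 1)) / (n - k - 1)

# Illustrative comparison: a 5-feature subset with a slightly worse fit
# beats a 40-feature subset once the penalty is applied (n = 200).
small = aicc(log_likelihood=-100, k=5, n=200)
large = aicc(log_likelihood=-98, k=40, n=200)
```

Using a criterion like this as the GA's internal fitness discourages subsets that keep uninformative features.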

At this point, you'll be optimizing two things at once, increasing model performance AND reducing the number of features, which is called multi-parameter optimization (MPO). Using these two objectives you can create a 'desirability' function, which will reject candidate solutions that fall below a specific threshold.

Figure 12.10: Examples of two types of desirability functions and the overall desirability surface when the two are combined.

For the OkCupid data, a desirability function with the following objectives was used to guide the GA:

  • maximize the area under the ROC curve between A = 0.50 and B = 1.00 with a scale factor of s = 2.0.

  • minimize the subset size to be within A = 10 and B = 100 with s = 1.0.
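These two objectives can be sketched with standard Derringer-Suich-style desirability functions, combined by a geometric mean. The example ROC value (0.80) and subset size (25) are made up; the A, B, and s settings are the ones listed above.

```python
def d_max(x, a, b, s=1.0):
    # Desirability for a quantity to MAXIMIZE: 0 below a, 1 above b,
    # and ((x - a) / (b - a)) ** s in between.
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return ((x - a) / (b - a)) ** s

def d_min(x, a, b, s=1.0):
    # Desirability for a quantity to MINIMIZE: 1 below a, 0 above b.
    if x < a:
        return 1.0
    if x > b:
        return 0.0
    return ((x - b) / (a - b)) ** s

def overall(ds):
    # Overall desirability: geometric mean of the individual scores, so
    # any single score of 0 rejects the candidate outright.
    prod = 1.0
    for d in ds:
        prod *= d
    return prod ** (1.0 / len(ds))

# e.g., ROC AUC of 0.80 with a 25-feature subset (illustrative values):
score = overall([d_max(0.80, 0.50, 1.00, s=2.0),
                 d_min(25, 10, 100, s=1.0)])
```

The GA then uses this overall score, rather than raw performance, as its fitness, trading a little ROC for a much smaller subset.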

The result is a much smaller predictor space:

Figure 12.11: Internal performance profiles for naive Bayes models using the genetic algorithm in conjunction with desirability functions.

Comparing the unconstrained GA to the constrained GA:

Figure 12.12: Test set results for two models derived from genetic algorithms.