5.2 Encoding Predictors for Many Categories
What happens when the number of factor levels gets very large?
For example, there are more than 40K possible ZIP codes and, depending on how the data are collected, this might produce an overabundance of dummy variables for the size of the data available. Also, ZIP codes in highly populated areas may have a higher rate of occurrence in the data, leading to a “long tail” of locations that are infrequently observed.
Also, resampling will exclude some of the rarer categories from the analysis set.
The first way to handle this issue is to create the full set of dummy variables and simply remove the zero and low-variance predictors.
Still, we may not desire to filter these out.
Another approach is to feature engineer an “other” category that pools the rarely occurring categories, assuming that such a pooling is sensible.
Another way to combine categories is to use a hashing function that maps each factor level key to a hash value. The number of possible hashes is set by the user and, for numerical purposes, is a power of 2. Some computationally interesting aspects to hash functions are
- The only data required is the value being hashed and the resulting number of hashes. The translation process is completely deterministic.
- Hash functions are unidirectional; once the hash values are created, there is no way of knowing the original values. If there are a known and finite set of original values, a table can be created to do the translation but, otherwise, the keys are indeterminable when only the hash value is known.
- There is no free lunch when using this procedure; some of the original categories will be mapped to the same hash value (called a “collision”). The number of collisions will be largely determined by the number of features that are produced.
Categories involved in collisions are not related in any meaningful way. Because of the arbitrary nature of the collisions, it is possible to have different categories whose true underlying effect are counter to one another. This might have the effect of negating the impact of the hashed feature.
Hashing functions have no notion of the probability that each key will occur. As such, it is conceivable that a category that occurs with great frequency is aliased with one that is rare. In this case, the more abundant value will have a much larger influence on the effect of that hashing feature.