5.6 Creating Features for Text Data

Often, data contain textual fields that are gathered from questionnaires, articles, reviews, tweets, and other sources.

Are there words or phrases that would make good predictors of the outcome? To determine this, the text data must first be processed and cleaned.

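One approach is to screen keywords for “importance” using the odds ratio between a keyword's presence and the outcome: keywords with an odds ratio of at least 2 in either direction (that is, at least 2 or at most 1/2) are considered for modeling.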
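As a rough illustration of this screening step, the sketch below computes the odds ratio from a 2x2 table of keyword presence against a binary outcome. The function names, the toy reviews, and the 0.5 continuity correction (added to every cell so that empty cells do not cause division by zero) are assumptions made for this example, not part of the original text.

```python
def keyword_odds_ratio(texts, outcomes, keyword, correction=0.5):
    """Odds ratio of a positive outcome when `keyword` is present vs. absent.

    `correction` is added to each cell of the 2x2 table (an assumption of
    this sketch) so that a zero count does not cause division by zero.
    """
    a = b = c = d = 0  # a: present & positive, b: present & negative,
                       # c: absent & positive,  d: absent & negative
    for text, outcome in zip(texts, outcomes):
        present = keyword in text.lower().split()
        if present and outcome:
            a += 1
        elif present:
            b += 1
        elif outcome:
            c += 1
        else:
            d += 1
    a, b, c, d = (cell + correction for cell in (a, b, c, d))
    return (a * d) / (b * c)


def is_candidate(odds_ratio, threshold=2.0):
    """True if the odds ratio is at least `threshold` in either direction."""
    return odds_ratio >= threshold or odds_ratio <= 1.0 / threshold


# Hypothetical toy data: review text and whether the review was positive.
texts = [
    "great product", "great value", "great but slow", "poor quality",
    "poor fit", "average item", "works fine", "broke quickly",
]
outcomes = [1, 1, 0, 0, 0, 1, 1, 0]

for word in ("great", "poor"):
    ratio = keyword_odds_ratio(texts, outcomes, word)
    print(word, round(ratio, 3), is_candidate(ratio))
```

With very few occurrences of a keyword the estimated odds ratio is unstable, so in practice a minimum count per keyword would probably be required before applying the threshold.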

See also Supervised Machine Learning for Text Analysis in R for a more thorough explanation.

Other methods for preprocessing text data include:

  • removing commonly used stop words, such as “is”, “the”, and “and”.
  • stemming the words so that similar words, such as the singular and plural versions, are represented as a single entity.
  • filtering for the most common tokens, and then calculating the term frequency-inverse document frequency (tf-idf) statistic for each token.
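
The three preprocessing steps listed above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hand-picked set, `crude_stem` is a toy stemmer that only strips a trailing plural "s" (a real analysis would more likely use a Porter or Snowball stemmer), and tf-idf is computed as (count / document length) * log(N / document frequency), which is one common variant among several. All function names and the example documents are hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"is", "the", "and", "a", "of", "to", "in"}


def crude_stem(token):
    """Toy stemmer: strip a trailing plural 's' so 'cats' and 'cat' match."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def tokenize(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    words = (w.strip(".,!?;:\"'()").lower() for w in text.split())
    return [crude_stem(w) for w in words if w and w not in STOP_WORDS]


def tf_idf(documents, top_n=10):
    """Return (vocab, rows): the top_n most common tokens and, for each
    document, a dict mapping every vocab token to its tf-idf value."""
    docs = [tokenize(d) for d in documents]
    totals = Counter(t for doc in docs for t in doc)
    vocab = [t for t, _ in totals.most_common(top_n)]
    n_docs = len(docs)
    # Document frequency: number of documents containing the token.
    df = {t: sum(t in doc for doc in docs) for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        rows.append(
            {t: (counts[t] / length) * math.log(n_docs / df[t]) for t in vocab}
        )
    return vocab, rows


documents = ["The cats and the dog.", "A dog is loyal.", "Cats sleep."]
vocab, rows = tf_idf(documents)
for doc, row in zip(documents, rows):
    print(doc, {t: round(v, 3) for t, v in row.items()})
```

Tokens that appear in every document receive a tf-idf of zero under this formula, which is the intended effect: they carry no information for telling documents apart.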