Chapter 1 The Tidy Text Format

Tidy data (Wickham 2014) has a specific structure:

  • Each variable is a column.
  • Each observation is a row.
  • Each type of observational unit is a table.

In text analysis, the tidy text format is a table with one token per row.

Token: a meaningful unit of text (e.g., a word, n-gram, sentence, or paragraph).
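To make the one-token-per-row idea concrete, here is a minimal sketch using unnest_tokens() from the tidytext package. The two-line sample text and the column names (line, text, word, bigram) are assumptions chosen for illustration only.

```r
library(dplyr)     # provides tibble() and the %>% pipe
library(tidytext)  # provides unnest_tokens()

# Illustrative input (an assumption): one row per line of text
text_df <- tibble(
  line = 1:2,
  text = c("Because I could not stop for Death -",
           "He kindly stopped for me -")
)

# Word tokens: one word per row. By default, unnest_tokens()
# lowercases the text and strips punctuation.
tidy_words <- text_df %>%
  unnest_tokens(word, text)

# Bigram tokens: the same table, with a different token unit per row
tidy_bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

print(tidy_words)
print(tidy_bigrams)
```

The other columns (here, line) are carried along, so each token stays linked to the row it came from. The token argument also accepts values such as "sentences" and "paragraphs" for coarser units.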

tidytext package: keeps text data in a tidy format, so it can be processed with the tidyverse packages for tidy data.
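Because the tokens live in an ordinary table, standard tidyverse verbs apply directly. The sketch below counts word frequencies with dplyr's count(). The sample sentences and column names are assumptions made up for illustration.

```r
library(dplyr)
library(tidytext)

# Illustrative input (an assumption), not data from the source
text_df <- tibble(
  line = 1:2,
  text = c("the cat sat on the mat", "the dog sat too")
)

# Tokenize into one word per row, then count with a regular dplyr verb
word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)  # most frequent words first

print(word_counts)
```

Nothing here is specific to text: once the data is tidy, filtering, joining, and plotting work the same way as for any other data frame.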

Other R packages for text-mining or text analysis: tm, quanteda, sentiment, text2vec, etc.

See the CRAN Task View on Natural Language Processing for more R packages for text analysis.