Chapter 1 The Tidy Text Format
Tidy data (Wickham 2014) has the following structure:
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit is a table.
In text analysis, the tidy text format is a table that contains one token per row.
A token is a meaningful unit of text (e.g., a word, n-gram, sentence, or paragraph).
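To make the one-token-per-row idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the table shape, not the tidytext API: the function name `unnest_words`, the column names `line` and `word`, and the simple regex tokenizer are all assumptions made for this example.

```python
import re

def unnest_words(lines):
    """Return one {"line", "word"} row per word token (hypothetical helper)."""
    rows = []
    for line_number, text in enumerate(lines, start=1):
        # Lowercase the text, then treat runs of letters/apostrophes as words.
        for word in re.findall(r"[a-z']+", text.lower()):
            rows.append({"line": line_number, "word": word})
    return rows

# Illustrative sample text: two short lines.
text = ["Because I could not stop for Death", "He kindly stopped for me"]

tidy_rows = unnest_words(text)
for row in tidy_rows[:3]:
    print(row)
```

Each row keeps the line number it came from, so the token table can still be grouped, counted, or joined back to the original text. Choosing a different token (n-gram, sentence, paragraph) would change only the splitting step, not the shape of the table.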
The tidytext package keeps text data in a tidy format (i.e., the text can be processed with the tidyverse
packages used for tidy data).
Other R packages for text mining or text analysis: tm, quanteda, sentiment, text2vec, etc.
See the CRAN Task View: Natural Language Processing for more R packages for text analysis.