1.1 Contrasting Tidy Text with Other Data Structures

  • String: text stored as a character vector; text data is often first read into memory in this form.
  • Corpus: a collection of documents stored as raw strings annotated with additional metadata and details.
  • Document-term matrix: a sparse matrix describing a collection of documents (i.e., a corpus), with one row for each document and one column for each term. Each value is typically a word count or a tf-idf score (see Chapter 3).
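
The three structures above, together with the tidy one-token-per-row layout they are being contrasted with, can be sketched in a few lines. This is only an illustration in plain Python, not the tooling these notes are about: the two documents, the variable names, and the `n_chars` metadata field are all invented for the example, and tokens are produced by a naive whitespace split.

```python
from collections import Counter

# Hypothetical two-document collection, invented for illustration.
raw = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog sat",
}

# String: plain text with no structure beyond the characters themselves.
text = raw["doc1"]

# Corpus: raw strings annotated with metadata about each document.
corpus = [
    {"id": doc_id, "text": body, "n_chars": len(body)}
    for doc_id, body in raw.items()
]

# Document-term matrix: one row per document, one column per term,
# each cell holding a word count (a missing term counts as 0).
counts = {doc["id"]: Counter(doc["text"].split()) for doc in corpus}
terms = sorted({term for c in counts.values() for term in c})
dtm = [[counts[doc["id"]][term] for term in terms] for doc in corpus]

# Tidy text: one row per (document, token) pair, in reading order.
tidy = [(doc["id"], token) for doc in corpus for token in doc["text"].split()]

print(terms)
for doc, row in zip(corpus, dtm):
    print(doc["id"], row)
print(tidy[:3])
```

Even in this tiny case the matrix already contains zeros for terms a document does not use; with a realistic vocabulary most cells are zero, which is why a sparse representation is used. The tidy layout keeps one observation per row instead, so it grows with the number of tokens rather than with documents times terms.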