1.1 Contrasting Tidy Text with Other Data Structures
- String: text stored as character vectors (e.g., individual letters, words, or whole passages)
- Corpus: raw strings annotated with additional metadata and details (the original text is kept intact, not yet reduced to a bag of words)
- Document-term matrix: a sparse matrix describing a collection of documents with one row for each document and one column for each term. The values are typically word counts or tf-idf (see Chapter 3).
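
To make the contrast concrete, here is a minimal sketch of the three structures above, plus a tidy one-token-per-row layout for comparison. It is written in plain Python rather than the R tooling these notes are about. The two toy documents, the `author` field, and every variable name are made up for illustration. The matrix is stored densely so it is easy to read, whereas a real document-term matrix would normally be sparse.

```python
from collections import Counter

# 1. String: raw text held as plain strings (toy data, invented for this sketch).
texts = ["the cat sat", "the dog sat down"]

# 2. Corpus: the same raw strings plus metadata; the text itself stays intact.
#    The "id" and "author" fields are hypothetical examples of metadata.
corpus = [
    {"id": "doc1", "author": "A", "text": texts[0]},
    {"id": "doc2", "author": "B", "text": texts[1]},
]

# 3. Document-term matrix: one row per document, one column per term,
#    with word counts as the cell values (dense here, for readability only).
counts = [Counter(doc["text"].split()) for doc in corpus]
vocab = sorted(set().union(*counts))  # column labels, one per distinct term
dtm = [[c[term] for term in vocab] for c in counts]

# 4. Tidy text, for contrast: one row per (document, token) pair.
tidy = [(doc["id"], token) for doc in corpus for token in doc["text"].split()]

for doc, row in zip(corpus, dtm):
    print(doc["id"], dict(zip(vocab, row)))
```

Word order survives in the string, corpus, and tidy forms but is lost in the document-term matrix, which keeps only how often each term occurs in each document.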