1.2 The unnest_tokens Function
text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")
text
## [1] "Because I could not stop for Death -"
## [2] "He kindly stopped for me -"
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"
We need to put this into a data frame to convert it into a tidy text dataset.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
text_df <- tibble(line = 1:4, text = text)
text_df
## # A tibble: 4 × 2
## line text
## <int> <chr>
## 1 1 Because I could not stop for Death -
## 2 2 He kindly stopped for me -
## 3 3 The Carriage held but just Ourselves -
## 4 4 and Immortality
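Now we can extract tokens (words, in this example) from the data frame with the unnest_tokens()
function. Its first argument names the output column to create (word) and its second names the input column that holds the text (text).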
library(tidytext)
text_df %>%
  unnest_tokens(word, text)
## # A tibble: 20 × 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 for
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## 11 2 for
## 12 2 me
## 13 3 the
## 14 3 carriage
## 15 3 held
## 16 3 but
## 17 3 just
## 18 3 ourselves
## 19 4 and
## 20 4 immortality
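To make the one-token-per-row idea concrete, here is a minimal base-R sketch that reproduces the same shape of result for this small example. The helper name simple_unnest is hypothetical and exists only for illustration; it is not how tidytext works internally (tidytext delegates tokenization to the tokenizers package), and its crude regular expression only handles plain ASCII letters and apostrophes.

```r
# Hypothetical helper, for illustration only: a rough base-R approximation
# of word tokenization. It is NOT tidytext's implementation.
text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -",
          "The Carriage held but just Ourselves -",
          "and Immortality")

simple_unnest <- function(lines) {
  # Lowercase each line, then split on runs of characters that are
  # neither ASCII letters nor apostrophes (this strips punctuation)
  pieces <- strsplit(tolower(lines), "[^a-z']+")
  # Drop any empty strings left behind by leading separators
  pieces <- lapply(pieces, function(p) p[nzchar(p)])
  # One row per token, keeping the line number each token came from
  data.frame(
    line = rep(seq_along(lines), lengths(pieces)),
    word = unlist(pieces),
    stringsAsFactors = FALSE
  )
}

tokens <- simple_unnest(text)
head(tokens)
```

The sketch shows the three things to look for in the real output: the line column is repeated for every word of that line, punctuation is gone, and all words are lowercase.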
Notes on the output of the unnest_tokens() function:
- Other columns, such as the line number each word came from, are retained.
- Punctuation has been stripped.
- By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets (set to_lower = FALSE to turn this off).
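As a small illustration of the last point, the sketch below reruns the tokenization on the first two lines of the poem with to_lower = FALSE. The object names two_lines and cased are made up for this example; the rest uses the same tibble() and unnest_tokens() calls as above.

```r
library(tibble)
library(tidytext)

# First two lines of the poem only, to keep the example short
two_lines <- tibble(
  line = 1:2,
  text = c("Because I could not stop for Death -",
           "He kindly stopped for me -")
)

# Keep the original capitalization instead of lowercasing the tokens
cased <- unnest_tokens(two_lines, word, text, to_lower = FALSE)
cased$word
```

Here "Because", "I", "Death" and "He" keep their capitals, while punctuation is still stripped and the line column is still retained. Keeping case can be useful when capitalization carries meaning, but it also means "For" and "for" would be treated as different tokens when counting or joining.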