1.2 The unnest_tokens Function

text <- c("Because I could not stop for Death -",
          "He kindly stopped for me -", 
          "The Carriage held but just Ourselves -",
          "and Immortality")
text
## [1] "Because I could not stop for Death -"  
## [2] "He kindly stopped for me -"            
## [3] "The Carriage held but just Ourselves -"
## [4] "and Immortality"

At this point, text is a plain character vector. To turn it into a tidy text dataset, we first need to put it into a data frame.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
text_df <- tibble(line=1:4, text=text)
text_df
## # A tibble: 4 × 2
##    line text                                  
##   <int> <chr>                                 
## 1     1 Because I could not stop for Death -  
## 2     2 He kindly stopped for me -            
## 3     3 The Carriage held but just Ourselves -
## 4     4 and Immortality

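Now we can extract tokens (words, in this example) from the data frame with the unnest_tokens() function. Its first argument after the data frame, word, names the output column that will hold the tokens; the second, text, is the input column to split.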

library(tidytext)
text_df %>% 
  unnest_tokens(word, text)
## # A tibble: 20 × 2
##     line word      
##    <int> <chr>     
##  1     1 because   
##  2     1 i         
##  3     1 could     
##  4     1 not       
##  5     1 stop      
##  6     1 for       
##  7     1 death     
##  8     2 he        
##  9     2 kindly    
## 10     2 stopped   
## 11     2 for       
## 12     2 me        
## 13     3 the       
## 14     3 carriage  
## 15     3 held      
## 16     3 but       
## 17     3 just      
## 18     3 ourselves 
## 19     4 and       
## 20     4 immortality
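
Words are only the default token unit. As a rough sketch of another unit, the token argument of unnest_tokens() can ask for n-grams instead; the example below requests two-word sequences (bigrams) from the first two lines of the poem. The names poem_df, bigrams, and bigram are illustrative and do not appear in the original notes.

```r
library(tibble)
library(tidytext)

# Illustrative data frame holding only the first two lines of the poem
poem_df <- tibble(
  line = 1:2,
  text = c("Because I could not stop for Death -",
           "He kindly stopped for me -")
)

# token = "ngrams" with n = 2 returns overlapping two-word sequences,
# one per row, still tagged with the line they came from
bigrams <- unnest_tokens(poem_df, bigram, text, token = "ngrams", n = 2)
bigrams
```

As with single words, the line column is kept, and the tokens are lowercased with punctuation removed.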

Notes on the unnest_tokens() function:
- Other columns, such as the line number each word came from, are retained.
- Punctuation has been stripped.
- By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets (use to_lower = FALSE to turn off this option).
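
The lowercasing default is easy to check directly. The sketch below reruns the tokenization on the first two lines of the poem with to_lower = FALSE; cased_df and cased_tokens are illustrative names, not part of the original notes.

```r
library(tibble)
library(tidytext)

# Illustrative data frame holding only the first two lines of the poem
cased_df <- tibble(
  line = 1:2,
  text = c("Because I could not stop for Death -",
           "He kindly stopped for me -")
)

# to_lower = FALSE keeps the original capitalisation of each token;
# punctuation is still stripped and the line column is still retained
cased_tokens <- unnest_tokens(cased_df, word, text, to_lower = FALSE)
cased_tokens
```

Keeping the case can be useful when capitalisation carries meaning, but it also means that "Death" and "death" would be counted as different words.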