2.2 Different types of tokens

We can tokenize texts at a variety of units including:

characters
words
sentences
lines
paragraphs
n-grams

Let’s explore how tokenizers handle these.

sample_vector <- c("Far down in the forest",
                   "grew a pretty little fir-tree")
sample_tibble <- tibble(text = sample_vector)

tokenize_words(sample_vector)

## [[1]]
## [1] "far"    "down"   "in"     "the"    "forest"
## 
## [[2]]
## [1] "grew"   "a"      "pretty" "little" "fir"    "tree"

tidytext::unnest_tokens does the same thing, but in a different data structure. Under the hood, it uses paste0("tokenize_", token) from the tokenizers package.

library(tidytext)
sample_tibble %>%
  unnest_tokens(word, text, token = "words")

## # A tibble: 11 × 1
##    word  
##    <chr> 
##  1 far   
##  2 down  
##  3 in    
##  4 the   
##  5 forest
##  6 grew  
##  7 a     
##  8 pretty
##  9 little
## 10 fir   
## 11 tree

You can pass in arguments used in tokenize_words() through unnest_tokens() using …

sample_tibble %>%
  unnest_tokens(word, text, token = "words", strip_punct = FALSE)

## # A tibble: 12 × 1
##    word  
##    <chr> 
##  1 far   
##  2 down  
##  3 in    
##  4 the   
##  5 forest
##  6 grew  
##  7 a     
##  8 pretty
##  9 little
## 10 fir   
## 11 -     
## 12 tree

2.2.1 Token type: Character

tokenize_characters() splits the text into letters. If strip_non_alphanum is TRUE, it strips all [:punct:] & [:whitespace:] before doing the the word boundaries.

tft_token_characters <- tokenize_characters(x = the_fir_tree,
                                            lowercase = TRUE,
                                            strip_non_alphanum = TRUE,
                                            simplify = FALSE)

head(tft_token_characters) %>%
  glimpse()

## List of 6
##  $ : chr [1:57] "f" "a" "r" "d" ...
##  $ : chr [1:57] "r" "e" "s" "t" ...
##  $ : chr [1:61] "w" "i" "s" "h" ...
##  $ : chr [1:56] "a" "r" "o" "u" ...
##  $ : chr [1:64] "l" "i" "t" "t" ...
##  $ : chr [1:64] "t" "h" "e" "m" ...

The same thing in unnest_tokens()

tibble(text = the_fir_tree) %>% 
  unnest_tokens(word, text, token = "characters")

## # A tibble: 13,429 × 1
##    word 
##    <chr>
##  1 f    
##  2 a    
##  3 r    
##  4 d    
##  5 o    
##  6 w    
##  7 n    
##  8 i    
##  9 n    
## 10 t    
## # … with 13,419 more rows

Watch out for ligatures!

Ligatures are when multiple letters are combined as a single character.

tokenize_characters("straße")

## [[1]]
## [1] "s" "t" "r" "a" "ß" "e"

Is it a meaningful feature to keep?
Is is sylistic or functional?

“Wie trinken die Schweizer Bier? – In Massen.” (“How do the Swiss drink beer? – In mass” instead of other meaning if it had been written as “in Maßen”: “in moderation”)

2.2.2 Token type: Word

We’ve already seen this before, but chaining this with dplyr can result in interesting analyses.

hcandersen_en %>%
  filter(book %in% c("The fir tree", "The little mermaid")) %>%
  unnest_tokens(word, text) %>%
  count(book, word) %>%
  group_by(book) %>%
  arrange(desc(n)) %>%
  slice(1:5)

## # A tibble: 10 × 3
## # Groups:   book [2]
##    book               word      n
##    <chr>              <chr> <int>
##  1 The fir tree       the     278
##  2 The fir tree       and     161
##  3 The fir tree       tree     76
##  4 The fir tree       it       66
##  5 The fir tree       a        56
##  6 The little mermaid the     817
##  7 The little mermaid and     398
##  8 The little mermaid of      252
##  9 The little mermaid she     240
## 10 The little mermaid to      199

2.2.3 Token type: n-grams

A continuous sequence of n items. (syllables, letters, words, …)

unigram: “Hello,” “day,” “my,” “little”
bigram: “fir tree,” “fresh air,” “to be,” “Robin Hood”
trigram: “You and I,” “please let go,” “no time like,” “the little mermaid”

n-grams can capture meaningful word orders that can otherwise be lost. (“fir tree”) It does so, by sliding across the text, to create overlapping sets of tokens.

tft_token_ngram <- tokenize_ngrams(x = the_fir_tree,
                                   lowercase = TRUE,
                                   n = 3L,
                                   n_min = 3L,
                                   stopwords = character(),
                                   ngram_delim = " ",
                                   simplify = FALSE)

tft_token_ngram[[1]]

##  [1] "far down in"      "down in the"      "in the forest"    "the forest where"
##  [5] "forest where the" "where the warm"   "the warm sun"     "warm sun and"    
##  [9] "sun and the"      "and the fresh"    "the fresh air"    "fresh air made"  
## [13] "air made a"       "made a sweet"

Using unigram is fast, but lose some information. Using higher n-grams keeps more information, but token counts decrease. You have to balance this trade off.

You can of course, tokenize many n-grams in the same place, by using n & n_min parameters.

tft_token_ngram <- tokenize_ngrams(x = the_fir_tree,
                                   n = 2L,
                                   n_min = 1L)
tft_token_ngram[[1]]

##  [1] "far"          "far down"     "down"         "down in"      "in"          
##  [6] "in the"       "the"          "the forest"   "forest"       "forest where"
## [11] "where"        "where the"    "the"          "the warm"     "warm"        
## [16] "warm sun"     "sun"          "sun and"      "and"          "and the"     
## [21] "the"          "the fresh"    "fresh"        "fresh air"    "air"         
## [26] "air made"     "made"         "made a"       "a"            "a sweet"     
## [31] "sweet"

This is beneficial, because unigrams would capture the frequency of the words, and the bigrams would supplement the meaning of tokens, that unigrams didn’t catch.

2.2.4 Token type: Lines, sentence, and paragraph

These rather large token types are rarely used for modelling purposes. The common approach for these, is to collapse all strings into one giant string, and split them using a delimeter.

2.2.4.1 Chapters/paragraphs

add_paragraphs <- function(data) {
  pull(data, text) %>%
    paste(collapse = "\n") %>%
    tokenize_paragraphs() %>%
    unlist() %>%
    tibble(text = .) %>%
    mutate(paragraph = row_number())
}

library(janeaustenr)

northangerabbey_paragraphed <- tibble(text = northangerabbey) %>%
  mutate(chapter = cumsum(str_detect(text, "^CHAPTER "))) %>%
  filter(chapter > 0,
         !str_detect(text, "^CHAPTER ")) %>%
  nest(data = text) %>%
  mutate(data = map(data, add_paragraphs)) %>%
  unnest(cols = c(data))

glimpse(northangerabbey_paragraphed)

## Rows: 1,020
## Columns: 3
## $ chapter   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, …
## $ text      <chr> "No one who had ever seen Catherine Morland in her infancy w…
## $ paragraph <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…

2.2.4.2 Sentences

Convert the fir tree from “one line per element” to “one line per sentence”.

the_fir_tree_sentences <- the_fir_tree %>%
  paste(collapse = " ") %>%
  tokenize_sentences()

head(the_fir_tree_sentences[[1]])

## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet resting-place, grew a pretty little fir-tree; and yet it was not happy, it wished so much to be tall like its companions– the pines and firs which grew around it."
## [2] "The sun shone, and the soft air fluttered its leaves, and the little peasant children passed by, prattling merrily, but the fir-tree heeded them not."                                                                                       
## [3] "Sometimes the children would bring a large basket of raspberries or strawberries, wreathed on a straw, and seat themselves near the fir-tree, and say, \"Is it not a pretty little tree?\""                                                  
## [4] "which made it feel more unhappy than before."                                                                                                                                                                                                
## [5] "And yet all this while the tree grew a notch or joint taller every year; for by the number of joints in the stem of a fir-tree we can discover its age."                                                                                     
## [6] "Still, as it grew, it complained."