2.1 Define “Token”
In tokenization, we take an input text and break it up into pieces of meaningful size. We refer to these pieces of text as tokens.
Most commonly, input texts are broken up into words. But this approach isn’t perfect:
- Some languages use no white space between words (她是我最好的朋友。- ‘She is my best friend’ in Mandarin)
- Pronouns and negation words contract onto neighboring words (Je n’aime pas le chocolat - ‘I don’t like chocolate’ in French)
- Contractions of two words (would’ve, didn’t)
Knowing this, let’s take a first stab at tokenization: we’ll split on anything that is not an alphanumeric character.
library(tidyverse)
library(hcandersenr)
the_fir_tree <- hcandersen_en %>%
  filter(book == "The fir tree") %>%
  pull(text)
head(the_fir_tree, 9)
## [1] "Far down in the forest, where the warm sun and the fresh air made a sweet"
## [2] "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"
## [3] "wished so much to be tall like its companions– the pines and firs which grew"
## [4] "around it. The sun shone, and the soft air fluttered its leaves, and the"
## [5] "little peasant children passed by, prattling merrily, but the fir-tree heeded"
## [6] "them not. Sometimes the children would bring a large basket of raspberries or"
## [7] "strawberries, wreathed on a straw, and seat themselves near the fir-tree, and"
## [8] "say, \"Is it not a pretty little tree?\" which made it feel more unhappy than"
## [9] "before."
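The call that produced the split output below is not shown; a minimal base-R sketch that reproduces it would look like the following (the two sample lines are copied from the output above so the snippet is self-contained):

```r
# First two lines of "The Fir Tree", copied from the output above
the_fir_tree <- c(
  "Far down in the forest, where the warm sun and the fresh air made a sweet",
  "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"
)

# Split each line on runs of anything that is not an alphanumeric character
strsplit(the_fir_tree[1:2], "[^a-zA-Z0-9]+")
```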
## [[1]]
## [1] "Far" "down" "in" "the" "forest" "where" "the" "warm"
## [9] "sun" "and" "the" "fresh" "air" "made" "a" "sweet"
##
## [[2]]
## [1] "resting" "place" "grew" "a" "pretty" "little" "fir"
## [8] "tree" "and" "yet" "it" "was" "not" "happy"
## [15] "it"
This is pretty good, but the name of the hero of this story (fir-tree) has been split in half. This kind of information loss can be harmful, so instead of brute-forcing the problem with a regular expression, we need more careful splitting methods.
Luckily, you don’t have to write all the custom logic yourself! This chapter introduces the tokenizers package.
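A call along the following lines would produce the output below; this is a sketch assuming the defaults of `tokenize_words()`, which lowercases the text and drops punctuation (the sample lines are copied in so the snippet is self-contained):

```r
library(tokenizers)

# First two lines of "The Fir Tree" (copied here for self-containment)
the_fir_tree <- c(
  "Far down in the forest, where the warm sun and the fresh air made a sweet",
  "resting-place, grew a pretty little fir-tree; and yet it was not happy, it"
)

# Word-level tokenization: lowercases and strips punctuation by default
tokenize_words(the_fir_tree)
```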
## [[1]]
## [1] "far" "down" "in" "the" "forest" "where" "the" "warm"
## [9] "sun" "and" "the" "fresh" "air" "made" "a" "sweet"
##
## [[2]]
## [1] "resting" "place" "grew" "a" "pretty" "little" "fir"
## [8] "tree" "and" "yet" "it" "was" "not" "happy"
## [15] "it"
The word-level tokenization in this package is done by finding word boundaries, following a set of sophisticated rules rather than a single regular expression.
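One place where boundary-based splitting behaves differently from our earlier regex is contractions; the comparison below is a sketch assuming the `tokenize_words()` defaults:

```r
library(tokenizers)

# The naive regex splits the contraction apart:
strsplit("Isn't it pretty?", "[^a-zA-Z0-9]+")[[1]]
# "Isn"  "t"  "it"  "pretty"

# Word-boundary rules keep the contraction together (and lowercase it)
tokenize_words("Isn't it pretty?")[[1]]
```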