2.3 Where does tokenization break down?

Tokenization is a crucial first step in any kind of text analysis. The defaults work well for the most part, but some decisions have to be made carefully. Consider this sentence:

Don’t forget you owe the bank $1 million for the house.

  • Don’t: is it one token, or two (“do” & “n’t”)?
  • $1 & the final .: should the punctuation be kept or stripped (strip_punct)? The sketch below shows how these choices play out.
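
To make these choices concrete, here is a minimal sketch in Python, assuming NLTK purely as an illustration (the section doesn’t prescribe a library): a naive whitespace split versus a Treebank-style tokenizer, applied to the example sentence.

```python
# Minimal sketch: two reasonable tokenizations of the same sentence (NLTK assumed).
from nltk.tokenize import word_tokenize  # may require nltk.download("punkt") first

sentence = "Don't forget you owe the bank $1 million for the house."

# Naive whitespace split: "Don't" and "$1" stay whole, but "house." keeps its period.
print(sentence.split())
# ["Don't", 'forget', 'you', 'owe', 'the', 'bank', '$1', 'million', 'for', 'the', 'house.']

# Treebank-style tokenizer: the contraction and the currency amount are split apart,
# and punctuation becomes its own token.
print(word_tokenize(sentence))
# ['Do', "n't", 'forget', 'you', 'owe', 'the', 'bank', '$', '1', 'million', 'for', 'the', 'house', '.']
```

Neither output is wrong; which one you want depends on whether “Don’t” and “$1” should survive as single units in your analysis.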

Context matters.

On Twitter, you’ll run into ungrammatical sentences, multiple spaces, deliberate capitalization, different styles, … If you’re only interested in which words are used, this may not worry you. However, if you’re doing a social grouping analysis, you need to be more careful about how those features are handled.
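
As a hedged illustration (again assuming NLTK; the tweet below is made up), a tweet-aware tokenizer treats this kind of text very differently from a Treebank-style one:

```python
# Sketch: Treebank-style vs. tweet-aware tokenization of a made-up tweet (NLTK assumed).
from nltk.tokenize import TweetTokenizer, word_tokenize

tweet = "@user OMG    this is SOOOO cool!!! #nlp :-)"

# Treebank-style tokenization breaks the handle, hashtag, and emoticon apart
# (roughly: '@', 'user', ..., '#', 'nlp', ':', '-', ')').
print(word_tokenize(tweet))

# A tweet-aware tokenizer keeps hashtags and emoticons whole; the options below also
# lowercase the text, shorten repeated letters, and drop @handles.
tknzr = TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
print(tknzr.tokenize(tweet))
# roughly: ['omg', 'this', 'is', 'sooo', 'cool', '!', '!', '!', '#nlp', ':-)']
```

Note that preserve_case=False and reduce_len=True deliberately discard capitalization and letter repetition; if those carry social signal for your grouping analysis, leave them at their defaults.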

Another thing to consider is the trade-off between the degree of compression & the speed that each tokenization method provides.

  • Don’t choose a method just because it’s faster or produces fewer tokens; you may lose information your analysis needs. A rough comparison sketch follows below.
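
One way to see the trade-off on your own data is to measure token count and runtime for each candidate tokenizer before committing. This is only a rough sketch; the repeated example sentence stands in for a real corpus, and the timings will vary by machine.

```python
# Rough sketch: compare tokenizers on compression (total tokens) and speed.
import time
from nltk.tokenize import word_tokenize

corpus = ["Don't forget you owe the bank $1 million for the house."] * 10_000

def profile(name, tokenize):
    start = time.perf_counter()
    n_tokens = sum(len(tokenize(doc)) for doc in corpus)
    elapsed = time.perf_counter() - start
    print(f"{name:>10s}  {n_tokens:8d} tokens  {elapsed:6.2f}s")

profile("whitespace", str.split)    # fastest and fewest tokens, but '$1' and 'house.' stay glued
profile("treebank", word_tokenize)  # slower and more tokens, but punctuation is separated out
```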