2.3 Where does tokenization break down?
Tokenization is a crucial first step in any kind of text analysis. The defaults work well for the most part, but we still have to make the decisions carefully.
Don’t forget you owe the bank $1 million for the house.
- Don’t: one token? Or “do” & “n’t”?
- $1 & the final .: strip_punct or keep?
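The two decisions above become concrete when you compare a naive whitespace split against a Treebank-style split. The snippet below is a minimal sketch using only the standard library (the regex is hand-rolled for this one sentence, not a real tokenizer package):

```python
import re

sentence = "Don't forget you owe the bank $1 million for the house."

# Whitespace split: keeps "Don't" whole, but "$1" and "house." retain punctuation.
whitespace = sentence.split()

# Treebank-style sketch: separates "n't" from its stem and detaches the final period.
treebank_ish = re.findall(r"n't|\w+(?=n't)|\$?\w+|[^\w\s]", sentence)

print(whitespace[0])       # Don't
print(treebank_ish[:2])    # ['Do', "n't"]
print(treebank_ish[-1])    # .
```

Neither answer is wrong; they just encode different choices about contractions and punctuation, which is exactly the decision the bullet points raise.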
Context matters.
On Twitter, you’ll run into grammatically incorrect sentences with multiple spaces, deliberate capitalization, different styles, … If you’re just interested in which words are used, you may not need to worry. But if you’re doing a social grouping analysis, you may need to be more careful.
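For tweets, a context-aware tokenizer typically keeps @mentions, #hashtags, and emphatic punctuation intact rather than splitting them apart. Here is a minimal sketch with an assumed regex pattern (real libraries such as NLTK ship a dedicated tweet tokenizer; this is only an illustration):

```python
import re

tweet = "OMG!!  @alice that's SOOO cool #nlp #tokenize"

# Tweet-aware sketch: @mentions and #hashtags stay whole,
# runs of punctuation are kept as single tokens.
tokens = re.findall(r"[@#]\w+|\w+'?\w*|[^\w\s]+", tweet)

print(tokens)  # ['OMG', '!!', '@alice', "that's", 'SOOO', 'cool', '#nlp', '#tokenize']
```

A default word tokenizer would likely split `#nlp` into `#` and `nlp`, destroying exactly the signal a social grouping analysis depends on.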
Another thing to consider is the degree of compression & the speed each tokenizing method provides.
- Don’t choose a method just because it’s faster or gives fewer tokens. You may lose some information.
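One simple way to see the compression trade-off is to measure characters per token at the two extremes (word-level vs. character-level). This sketch uses only built-ins; “characters per token” is an assumed illustrative metric, not a standard one:

```python
text = "Don't forget you owe the bank $1 million for the house."

word_tokens = text.split()   # coarse: fewest tokens, fastest to process
char_tokens = list(text)     # finest: most tokens, nothing merged away

# Compression measure: average characters absorbed into each token.
word_ratio = len(text) / len(word_tokens)
char_ratio = len(text) / len(char_tokens)   # always 1.0 by construction

print(len(word_tokens), len(char_tokens))
print(word_ratio > char_ratio)  # True: words compress more, but "Don't" vs "do"+"n't" detail is gone
```

Higher compression means shorter sequences (cheaper downstream computation), but each merged token hides internal structure you can no longer recover.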