3.4 Using off-the-shelf stop word lists

  • A quick option for using stop words is to get a list that has already been created.

  • There are many lists available, but not all lists are created equal.

  • quanteda provides multilingual stop word lists through the stopwords package; list the available sources with:

library(quanteda)
stopwords::stopwords_getsources()
## [1] "snowball"      "stopwords-iso" "misc"          "smart"        
## [5] "marimo"        "ancient"       "nltk"          "perseus"
  • Get the languages supported by a stop word source
stopwords::stopwords_getlanguages("snowball")
##  [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
stopwords::stopwords_getlanguages("nltk")
##  [1] "ar" "az" "da" "nl" "en" "fi" "fr" "de" "el" "hu" "id" "it" "kk" "ne" "no"
## [16] "pt" "ro" "ru" "sl" "es" "sv" "tg" "tr"
stopwords::stopwords_getlanguages("stopwords-iso")
##  [1] "af" "ar" "hy" "eu" "bn" "br" "bg" "ca" "zh" "hr" "cs" "da" "nl" "en" "eo"
## [16] "et" "fi" "fr" "gl" "de" "el" "ha" "he" "hi" "hu" "id" "ga" "it" "ja" "ko"
## [31] "ku" "la" "lt" "lv" "ms" "mr" "no" "fa" "pl" "pt" "ro" "ru" "sk" "sl" "so"
## [46] "st" "es" "sw" "sv" "th" "tl" "tr" "uk" "ur" "vi" "yo" "zu"
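
Once a source and a language code are chosen, the list itself can be retrieved and used to filter tokens. The following is a minimal sketch, assuming the stopwords package is installed; the token vector is a made-up example, and the filtering uses plain base R rather than any quanteda helper.

```r
# Fetch the German ("de") list from the Snowball source
de_stop <- stopwords::stopwords(language = "de", source = "snowball")

# Hypothetical, already-tokenised and lower-cased input
tokens <- c("das", "ist", "ein", "kurzer", "satz")

# Keep only the tokens that are not on the stop word list
kept <- tokens[!tokens %in% de_stop]
kept
```

The same pattern works for any source/language pair reported by stopwords_getlanguages().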
  • The default stop word source in quanteda is Snowball. Why?
length(stopwords::stopwords(source = "smart"))
## [1] 571
length(stopwords::stopwords(source = "snowball"))
## [1] 175
length(stopwords::stopwords(source = "stopwords-iso"))
## [1] 1298
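
The three length() calls above can also be collapsed into one comparison. This is a small sketch, assuming the English lists and the stopwords package; the variable names are illustrative only, and the exact counts may differ between package versions.

```r
# Sources to compare (names as reported by stopwords_getsources())
sources <- c("snowball", "smart", "stopwords-iso")

# Number of English stop words in each source
sizes <- vapply(
  sources,
  function(src) length(stopwords::stopwords(language = "en", source = src)),
  integer(1)
)

# Smallest list first
sort(sizes)
```

A shorter list removes fewer words, which may be one reason a conservative default is attractive.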
  • These lists do overlap.

  • But some words appear in the Snowball and ISO lists but not in the SMART list.

setdiff(stopwords(source = "snowball"),
        stopwords(source = "smart"))
##  [1] "she's"   "he'd"    "she'd"   "he'll"   "she'll"  "shan't"  "mustn't"
##  [8] "when's"  "why's"   "how's"
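
The setdiff() call above compares only Snowball against SMART. To check the overlap directly, and to confirm which of those words are also in the ISO list, something like the following could be used. It is a sketch under the assumption that the English lists are meant; the variable names are illustrative.

```r
snowball <- stopwords::stopwords(language = "en", source = "snowball")
smart    <- stopwords::stopwords(language = "en", source = "smart")
iso      <- stopwords::stopwords(language = "en", source = "stopwords-iso")

# How many words Snowball and SMART share
length(intersect(snowball, smart))

# Words found in both Snowball and ISO that SMART lacks
in_both_not_smart <- setdiff(intersect(snowball, iso), smart)
in_both_not_smart
```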