4.3 Counting and correlating pairs of words with widyr
Understanding where words co-occur in documents, even though they may not occur next to each other is another piece of useful information.
widyr
= a package to make matrix operations, like pairwise counts or correlations between two words in the same document, on tidy data easier.
For the purposes of our covid example, lets look at only tweets that are “Extremely Positive” and then group them into chunks of 10 tweets. By the way, these tweets span between 03/01/2020 to 04/01/2020. It would have probably been better to convert the TweetAt
variable into a date and sort so that we can have chunks in chronological order…
<- covid_tweets %>%
covid_chunk_words filter(Sentiment == "Extremely Positive") %>%
mutate(chunk = row_number() %/% 10) %>%
filter(chunk > 0) %>%
unnest_tokens(word, OriginalTweet) %>%
filter(!word %in% stop_words$word)
covid_chunk_words
## # A tibble: 123,673 × 7
## UserName ScreenName Location TweetAt Sentiment chunk word
## <dbl> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 share
## 2 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 65
## 3 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 living
## 4 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 struggling
## 5 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 2
## 6 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 local
## 7 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 supermarket
## 8 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 due
## 9 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 issues
## 10 3845 48797 Drogheda 16-03-2020 Extremely Positive 1 19
## # … with 123,663 more rows
library(widyr)
# count words co-occuring within sections
<- covid_chunk_words %>%
word_pairs pairwise_count(word, chunk, sort = TRUE)
## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
word_pairs
## # A tibble: 9,056,928 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 t.co https 653
## 2 https t.co 653
## 3 https coronavirus 639
## 4 t.co coronavirus 639
## 5 coronavirus https 639
## 6 coronavirus t.co 639
## 7 covid 19 621
## 8 https 19 621
## 9 t.co 19 621
## 10 19 covid 621
## # … with 9,056,918 more rows
Most common pair of words are links, but outside of those is “covid-19” and “store”
%>%
word_pairs filter(item1 == "store")
## # A tibble: 19,409 × 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 store https 543
## 2 store t.co 543
## 3 store coronavirus 538
## 4 store 19 521
## 5 store covid 514
## 6 store grocery 492
## 7 store supermarket 431
## 8 store amp 412
## 9 store food 403
## 10 store prices 389
## # … with 19,399 more rows
4.3.1 Pairwise correlation
# we need to filter for at least relatively common words first
<- covid_chunk_words %>%
word_cors group_by(word) %>%
filter(n() >= 20) %>%
pairwise_cor(word, chunk, sort = TRUE)
word_cors
## # A tibble: 761,256 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 t.co https 1
## 2 https t.co 1
## 3 paper toilet 0.857
## 4 toilet paper 0.857
## 5 covid 19 0.841
## 6 19 covid 0.841
## 7 doctors nurses 0.819
## 8 nurses doctors 0.819
## 9 sanitizer hand 0.783
## 10 hand sanitizer 0.783
## # … with 761,246 more rows
Let’s find words most correlated to “store”
%>%
word_cors filter(item1 == "store")
## # A tibble: 872 × 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 store grocery 0.649
## 2 store employees 0.185
## 3 store workers 0.171
## 4 store weeks 0.113
## 5 store retail 0.110
## 6 store service 0.110
## 7 store coronacrisis 0.109
## 8 store clerks 0.109
## 9 store community 0.108
## 10 store 30 0.106
## # … with 862 more rows
%>%
word_cors filter(item1 %in% c("store", "workers", "online", "sanitizer")) %>%
group_by(item1) %>%
slice_max(correlation, n = 6) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()