2.3 Sentiment analysis with inner join
With data in a tidy format, sentiment analysis can be done as an inner join.
This is one of the payoffs of treating text mining as a tidy data analysis task: much as removing stop words is an anti-join operation, performing sentiment analysis is an inner join operation.
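The contrast between the two joins can be seen on a toy example. This is a minimal sketch, assuming the dplyr package is installed; the tibbles `toy_words`, `toy_stop`, and `toy_lexicon` are made up for illustration and are not part of the Austen data.

```r
library(dplyr)

# Hypothetical one-token-per-row data (not from the Austen corpus)
toy_words <- tibble(words = c("the", "happy", "day", "of", "joy"))

# Hypothetical stop-word list and sentiment lexicon
toy_stop    <- tibble(words = c("the", "of"))
toy_lexicon <- tibble(words = c("happy", "joy"), sentiment = "joy")

# anti_join() keeps the rows of toy_words that have NO match in toy_stop
no_stop <- anti_join(toy_words, toy_stop, by = "words")

# inner_join() keeps only the rows that DO match the lexicon
# and attaches the lexicon's sentiment column
with_sentiment <- inner_join(toy_words, toy_lexicon, by = "words")
```

Here `no_stop` retains "happy", "day", and "joy", while `with_sentiment` retains only "happy" and "joy", each labeled with its sentiment.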
library(janeaustenr)  # provides austen_books()
austen_books()
## # A tibble: 73,422 × 2
## text book
## * <chr> <fct>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility
## 2 "" Sense & Sensibility
## 3 "by Jane Austen" Sense & Sensibility
## 4 "" Sense & Sensibility
## 5 "(1811)" Sense & Sensibility
## 6 "" Sense & Sensibility
## 7 "" Sense & Sensibility
## 8 "" Sense & Sensibility
## 9 "" Sense & Sensibility
## 10 "CHAPTER 1" Sense & Sensibility
## # … with 73,412 more rows
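library(dplyr)    # group_by(), mutate(), row_number(), ungroup()
library(stringr)  # str_detect(), regex()
# number the lines within each book and count chapter headings cumulatively
(books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup())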
## # A tibble: 73,422 × 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 "by Jane Austen" Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 "(1811)" Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
## 7 "" Sense & Sensibility 7 0
## 8 "" Sense & Sensibility 8 0
## 9 "" Sense & Sensibility 9 0
## 10 "CHAPTER 1" Sense & Sensibility 10 1
## # … with 73,412 more rows
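The `chapter` column above is 0 until the first heading and 1 from "CHAPTER 1" onward. A minimal sketch of why, assuming the stringr package is installed; the vector `toy_text` is invented for illustration:

```r
library(stringr)

# Hypothetical lines of text (not from the Austen corpus)
toy_text <- c("A TITLE", "", "CHAPTER 1", "first line",
              "Chapter II", "second line")

# TRUE wherever a line starts with "chapter " followed by a digit or
# one of the roman-numeral letters i, v, x, l, c (case-insensitive)
is_heading <- str_detect(toy_text,
                         regex("^chapter [\\divxlc]", ignore_case = TRUE))

# cumsum() counts TRUE as 1, so the running total steps up at each heading
chapter <- cumsum(is_heading)
```

Lines before the first heading get chapter 0, and every later line carries the number of headings seen so far.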
library(tidytext)  # unnest_tokens()
(tidy_books <- books %>%
  unnest_tokens(words, text))
## # A tibble: 725,055 × 4
## book linenumber chapter words
## <fct> <int> <int> <chr>
## 1 Sense & Sensibility 1 0 sense
## 2 Sense & Sensibility 1 0 and
## 3 Sense & Sensibility 1 0 sensibility
## 4 Sense & Sensibility 3 0 by
## 5 Sense & Sensibility 3 0 jane
## 6 Sense & Sensibility 3 0 austen
## 7 Sense & Sensibility 5 0 1811
## 8 Sense & Sensibility 10 1 chapter
## 9 Sense & Sensibility 10 1 1
## 10 Sense & Sensibility 13 1 the
## # … with 725,045 more rows
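The output above shows one lowercase token per row, with blank lines contributing no rows. The same behavior can be sketched on a small made-up tibble, assuming the dplyr and tidytext packages are installed; `toy_lines` is a hypothetical name:

```r
library(dplyr)
library(tidytext)

# Hypothetical two-line document (not the full Austen data)
toy_lines <- tibble(
  linenumber = 1:2,
  text       = c("Sense and Sensibility", "by Jane Austen")
)

# Split the text column into one word per row, stored in a column named
# `words`; the other columns are carried along and `text` is dropped
toy_tokens <- toy_lines %>%
  unnest_tokens(words, text)
```

Each token keeps the `linenumber` of the line it came from, which is what makes later grouping by line or chapter possible.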
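# nrc_joy <- get_sentiments("nrc") # does not work
library(readr)  # read_tsv()
# read the lexicon from a local file and name its three columns
nrc_emotions_lex <- read_tsv("data/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt", col_names = FALSE) %>%
  rename("words" = 1, "sentiment" = 2, "score" = 3) %>%
  select(-score)  # note: score is dropped here, so rows are kept regardless of their score
nrc_joy <- nrc_emotions_lex %>%
  filter(sentiment == "joy")
nrc_joy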
nrc_joy
## # A tibble: 14,182 × 2
## words sentiment
## <chr> <chr>
## 1 aback joy
## 2 abacus joy
## 3 abandon joy
## 4 abandoned joy
## 5 abandonment joy
## 6 abate joy
## 7 abatement joy
## 8 abba joy
## 9 abbot joy
## 10 abbreviate joy
## # … with 14,172 more rows
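The tibble above has 14,182 rows and labels words such as "abacus" as joy, which suggests that the filter keeps every word's joy row whatever its `score`. If, as assumed here, the third column of the lexicon file is a 0/1 association flag, filtering on it before dropping it would restrict the lexicon to words actually associated with joy. A minimal sketch on invented rows, assuming the dplyr package is installed; `toy_lex` and its scores are hypothetical:

```r
library(dplyr)

# Hypothetical rows in the assumed words / sentiment / score layout;
# the scores are invented for illustration
toy_lex <- tibble(
  words     = c("abacus", "abacus", "happy", "happy"),
  sentiment = c("joy", "trust", "joy", "trust"),
  score     = c(0, 1, 1, 1)
)

# Filter on the association flag first, then drop it
toy_joy <- toy_lex %>%
  filter(sentiment == "joy", score == 1) %>%
  select(-score)
```

With this ordering, `toy_joy` contains only "happy"; an inner join against such a lexicon would then count only joy-associated words.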
tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy) %>%
  count(words, sort = TRUE)
## Joining, by = "words"
## # A tibble: 3,504 × 2
## words n
## <chr> <int>
## 1 miss 599
## 2 thing 397
## 3 good 359
## 4 time 279
## 5 dear 241
## 6 thought 226
## 7 man 218
## 8 frank 200
## 9 young 192
## 10 day 186
## # … with 3,494 more rows