2.3 Sentiment analysis with inner join

  • With data in a tidy format, sentiment analysis can be done as an inner join.

  • This treats text mining as a tidy data analysis task: just as removing stop words is an anti-join operation, performing sentiment analysis is an inner-join operation.

austen_books() 
## # A tibble: 73,422 × 2
##    text                    book               
##  * <chr>                   <fct>              
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility
##  2 ""                      Sense & Sensibility
##  3 "by Jane Austen"        Sense & Sensibility
##  4 ""                      Sense & Sensibility
##  5 "(1811)"                Sense & Sensibility
##  6 ""                      Sense & Sensibility
##  7 ""                      Sense & Sensibility
##  8 ""                      Sense & Sensibility
##  9 ""                      Sense & Sensibility
## 10 "CHAPTER 1"             Sense & Sensibility
## # … with 73,412 more rows
# number the lines within each book and build a running chapter index
(books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup())
## # A tibble: 73,422 × 4
##    text                    book                linenumber chapter
##    <chr>                   <fct>                    <int>   <int>
##  1 "SENSE AND SENSIBILITY" Sense & Sensibility          1       0
##  2 ""                      Sense & Sensibility          2       0
##  3 "by Jane Austen"        Sense & Sensibility          3       0
##  4 ""                      Sense & Sensibility          4       0
##  5 "(1811)"                Sense & Sensibility          5       0
##  6 ""                      Sense & Sensibility          6       0
##  7 ""                      Sense & Sensibility          7       0
##  8 ""                      Sense & Sensibility          8       0
##  9 ""                      Sense & Sensibility          9       0
## 10 "CHAPTER 1"             Sense & Sensibility         10       1
## # … with 73,412 more rows
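The chapter column comes from cumsum() over a logical vector: str_detect() flags the lines that look like chapter headings, and the cumulative sum turns those flags into a running chapter number. A minimal sketch on made-up lines (toy data, not from the novels):

library(stringr)

toy <- c("CHAPTER 1", "It was a dark night.", "Chapter II", "More text.")
cumsum(str_detect(toy, regex("^chapter [\\divxlc]", ignore_case = TRUE)))
## [1] 1 1 2 2   (heading lines bump the counter; the other lines inherit it)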
# tokenize: one row per word, keeping the book, linenumber, and chapter columns
(tidy_books <- books %>%
  unnest_tokens(words, text))
## # A tibble: 725,055 × 4
##    book                linenumber chapter words      
##    <fct>                    <int>   <int> <chr>      
##  1 Sense & Sensibility          1       0 sense      
##  2 Sense & Sensibility          1       0 and        
##  3 Sense & Sensibility          1       0 sensibility
##  4 Sense & Sensibility          3       0 by         
##  5 Sense & Sensibility          3       0 jane       
##  6 Sense & Sensibility          3       0 austen     
##  7 Sense & Sensibility          5       0 1811       
##  8 Sense & Sensibility         10       1 chapter    
##  9 Sense & Sensibility         10       1 1          
## 10 Sense & Sensibility         13       1 the        
## # … with 725,045 more rows
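As noted above, the same one-word-per-row structure makes stop-word removal an anti-join. A quick sketch using tidytext's built-in stop_words table (its column is named word, so the join key has to be spelled out because this data calls the column words):

tidy_books %>%
  anti_join(stop_words, by = c("words" = "word"))

This drops every row whose word appears in a stop-word lexicon; the sentiment join below works the same way, except that matching rows are kept rather than dropped.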
# nrc_joy <- get_sentiments("nrc")  # not usable in this setup (needs an interactive
# lexicon download via the textdata package), so read the raw NRC file instead

# The raw NRC file has three tab-separated columns: word, emotion, and a 0/1
# flag marking whether the word is associated with that emotion.
nrc_emotions_lex <- read_tsv("data/NRC-Emotion-Lexicon-Wordlevel-v0.92.txt",
                             col_names = FALSE) %>%
  rename("words" = 1, "sentiment" = 2, "score" = 3) %>%
  filter(score == 1) %>%   # keep only word-emotion pairs that actually apply
  select(-score)

nrc_joy <- nrc_emotions_lex %>%
  filter(sentiment == "joy")

nrc_joy
## # A tibble: 14,182 × 2
##    words       sentiment
##    <chr>       <chr>    
##  1 aback       joy      
##  2 abacus      joy      
##  3 abandon     joy      
##  4 abandoned   joy      
##  5 abandonment joy      
##  6 abate       joy      
##  7 abatement   joy      
##  8 abba        joy      
##  9 abbot       joy      
## 10 abbreviate  joy      
## # … with 14,172 more rows
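The lexicon covers the eight NRC emotions plus positive and negative sentiment, so any category can be pulled out the same way as joy. A quick way to list what is available (sketch, output not shown):

nrc_emotions_lex %>%
  count(sentiment)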
tidy_books %>%
  filter(book == "Emma")%>%
  inner_join(nrc_joy) %>%
  count(words, sort = TRUE)
## Joining, by = "words"
## # A tibble: 3,504 × 2
##    words       n
##    <chr>   <int>
##  1 miss      599
##  2 thing     397
##  3 good      359
##  4 time      279
##  5 dear      241
##  6 thought   226
##  7 man       218
##  8 frank     200
##  9 young     192
## 10 day       186
## # … with 3,494 more rows
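dplyr detected the join key automatically here, hence the Joining, by = "words" message. Naming the key explicitly makes the code self-documenting and silences the message; a minimal variant:

tidy_books %>%
  filter(book == "Emma") %>%
  inner_join(nrc_joy, by = "words") %>%
  count(words, sort = TRUE)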