3.6 Creating your own stop words list

  • We can create our own stop word list from a monolingual corpus.
  • Monolingual corpus vs. multilingual corpus: stop words are language-specific, so each language needs its own list built from text in that language alone.
  • Using the Fir-Tree dataset, let's take the words and rank them by their count (frequency).
# Rank every word in "The Fir-Tree" by frequency and keep the top 120;
# columnize() is a helper that prints the list in columns
tidy_fir_tree %>%
  count(word, sort = TRUE) %>% 
  slice(1:120) %>% 
  mutate(word = paste0(row_number(), ": ", word)) %>%
  pull(word) %>% 
  columnize()

1: the        31: its         61: or          91: last
2: and        32: out         62: shall       92: much
3: tree       33: be          63: there       93: no
4: it         34: them        64: while       94: princess
5: a          35: this        65: will        95: tall
6: in         36: branches    66: after       96: young
7: of         37: came        67: by          97: asked
8: to         38: for         68: come        98: can
9: i          39: now         69: happy       99: could
10: was       40: one         70: my          100: cried
11: they      41: story       71: old         101: going
12: fir       42: would       72: only        102: grew
13: were      43: forest      73: their       103: if
14: all       44: have        74: which       104: large
15: with      45: how         75: again       105: looked
16: but       46: know        76: am          106: made
17: on        47: thought     77: are         107: many
18: then      48: mice        78: beautiful   108: seen
19: had       49: trees       79: evening     109: stairs
20: is        50: we          80: him         110: think
21: at        51: been        81: like        111: too
22: little    52: down        82: me          112: up
23: so        53: oh          83: more        113: yes
24: not       54: very        84: about       114: air
25: said      55: when        85: christmas   115: also
26: as        56: where       86: do          116: away
27: that      57: who         87: fell        117: birds
28: he        58: children    88: fresh       118: corner
29: you       59: dumpty      89: from        119: cut
30: its? — see column layout  90: here        120: did
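As a quick check on this top-120 list, we can compare it against an established English lexicon: any word that survives an anti-join with tidytext's Snowball stop word list is a content word we would wrongly discard. A minimal sketch, assuming the `tidy_fir_tree` tibble from above:

```r
library(dplyr)
library(tidytext)

# Words in the fir-tree top 120 that are NOT in the Snowball stop word
# lexicon: the content words a naive frequency cutoff would throw away
tidy_fir_tree %>%
  count(word, sort = TRUE) %>%
  slice(1:120) %>%
  anti_join(get_stopwords(source = "snowball"), by = "word")
```

Words like "tree", "fir", and "branches" survive this filter, confirming that raw frequency alone is not enough.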

  • This list includes "tree" (ranked 3rd) as a stop word, which does not make sense; other content words such as "fir", "branches", "forest", and "christmas" appear as well.
  • How can we solve this issue?
      – Use a larger corpus, which will give a more decent list.
      – Use multiple genre- or subject-specific corpora; the stop word list can then be the intersection of the stop words found in each corpus.
      – Leverage native speakers of the language and domain experts.
library(readr)
library(dplyr)
library(tidytext)

# First Yoruba corpus: one text per line, no header
yo <- read_csv("https://raw.githubusercontent.com/Niger-Volta-LTI/yoruba-text/master/Alabi_YorubaTwi_Embedding/alakowe.txt", col_names = FALSE) %>% 
  select(1) %>% 
  rename(text = 1)

# Candidate stop words: the 20 most frequent words seen more than twice
yo1_stop_mean <- yo %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>% 
  mutate(mean_frequency = n / nrow(yo),  # frequency per line of text
         rank_freq = row_number()) %>% 
  filter(n > 2) %>% 
  select(word) %>% 
  slice(1:20)

yo1_stop_mean 
## # A tibble: 20 × 1
##    word 
##    <chr>
##  1 ni   
##  2 tí   
##  3 àwọn 
##  4 ní   
##  5 wọ́n  
##  6 ó    
##  7 ti   
##  8 náà  
##  9 tó   
## 10 a    
## 11 ṣe   
## 12 pé   
## 13 yìí  
## 14 sí   
## 15 bá   
## 16 ń    
## 17 àti  
## 18 mo   
## 19 máa  
## 20 fún
# Second Yoruba corpus, processed the same way
yo_two <- read_csv("https://raw.githubusercontent.com/Niger-Volta-LTI/yoruba-text/master/Alabi_YorubaTwi_Embedding/edeyorubarewa.txt", col_names = FALSE) %>% 
  select(1) %>% 
  rename(text = 1)

yo_two_stop <- yo_two %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>% 
  mutate(mean_frequency = n / nrow(yo_two),
         rank_freq = row_number()) %>% 
  filter(n > 2) %>% 
  select(word) %>%
  slice(1:20)
  
yo_two_stop
## # A tibble: 20 × 1
##    word 
##    <chr>
##  1 tí   
##  2 ni   
##  3 ó    
##  4 ń    
##  5 àwọn 
##  6 ní   
##  7 yìí  
##  8 ẹ    
##  9 kí   
## 10 fún  
## 11 ọba  
## 12 a    
## 13 pé   
## 14 sí   
## 15 bá   
## 16 ṣe   
## 17 bí   
## 18 tó   
## 19 mo   
## 20 ti
# Words that appear in BOTH candidate lists (dplyr's set intersection on tibbles)
intersect(yo_two_stop, yo1_stop_mean) 
## # A tibble: 16 × 1
##    word 
##    <chr>
##  1 tí   
##  2 ni   
##  3 ó    
##  4 ń    
##  5 àwọn 
##  6 ní   
##  7 yìí  
##  8 fún  
##  9 a    
## 10 pé   
## 11 sí   
## 12 bá   
## 13 ṣe   
## 14 tó   
## 15 mo   
## 16 ti
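Once we settle on the intersected list, the usual next step is to remove those words from a tokenized corpus. A minimal sketch, assuming the objects defined above (`yo`, `yo1_stop_mean`, `yo_two_stop`):

```r
library(dplyr)
library(tidytext)

# Keep only words that ranked highly in both corpora
yo_stopwords <- intersect(yo_two_stop, yo1_stop_mean)

# Drop the custom stop words from a tokenized corpus with an anti-join
yo %>%
  unnest_tokens(word, text) %>%
  anti_join(yo_stopwords, by = "word")
```

The anti-join keeps every token that does not match a word in `yo_stopwords`, which is exactly how `stop_words` from tidytext is typically applied to English text.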