3.6 Creating your own stop words list
- We can create our own stop word list from a monolingual corpus.
- Monolingual vs. multilingual corpora
- Using the fir-tree dataset, let's take the words and rank them by their count (frequency).
tidy_fir_tree %>%
  count(word, sort = TRUE) %>%
  slice(1:120) %>%
  mutate(word = paste0(row_number(), ": ", word)) %>%
  pull(word) %>%
  columnize()
1: the
2: and
3: tree
4: it
5: a
6: in
7: of
8: to
9: i
10: was
11: they
12: fir
13: were
14: all
15: with
16: but
17: on
18: then
19: had
20: is
21: at
22: little
23: so
24: not
25: said
26: what
27: as
28: that
29: he
30: you
31: its
32: out
33: be
34: them
35: this
36: branches
37: came
38: for
39: now
40: one
41: story
42: would
43: forest
44: have
45: how
46: know
47: thought
48: mice
49: trees
50: we
51: been
52: down
53: oh
54: very
55: when
56: where
57: who
58: children
59: dumpty
60: humpty
61: or
62: shall
63: there
64: while
65: will
66: after
67: by
68: come
69: happy
70: my
71: old
72: only
73: their
74: which
75: again
76: am
77: are
78: beautiful
79: evening
80: him
81: like
82: me
83: more
84: about
85: christmas
86: do
87: fell
88: fresh
89: from
90: here
91: last
92: much
93: no
94: princess
95: tall
96: young
97: asked
98: can
99: could
100: cried
101: going
102: grew
103: if
104: large
105: looked
106: made
107: many
108: seen
109: stairs
110: think
111: too
112: up
113: yes
114: air
115: also
116: away
117: birds
118: corner
119: cut
120: did
- Notice that this list includes “tree” (and “fir”) as stop words, which does not make sense: the corpus is a single story about a fir tree, so content words dominate the frequency ranking.
- How can we solve this issue?
  - Use a larger corpus, which will give a more decent list.
  - Use multiple genre- or subject-specific corpora; the stop word list can then be the intersection of the stop words found in each corpus.
  - Leverage native speakers of the language and domain experts.
# Read the first Yoruba corpus (one line of text per row)
yo <- read_csv("https://raw.githubusercontent.com/Niger-Volta-LTI/yoruba-text/master/Alabi_YorubaTwi_Embedding/alakowe.txt", col_names = FALSE) %>%
  select(1) %>%
  rename(text = 1)

# Candidate stop words: rank words by frequency, drop rare words
# (seen two times or fewer), and keep the top 20
yo1_stop_mean <- yo %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(mean_frequency = n / nrow(yo)) %>%
  mutate(rank_freq = row_number()) %>%
  filter(n > 2) %>%
  select(word) %>%
  slice(1:20)

yo1_stop_mean
## # A tibble: 20 × 1
## word
## <chr>
## 1 ni
## 2 tí
## 3 àwọn
## 4 ní
## 5 wọ́n
## 6 ó
## 7 ti
## 8 náà
## 9 tó
## 10 a
## 11 ṣe
## 12 pé
## 13 yìí
## 14 sí
## 15 bá
## 16 ń
## 17 àti
## 18 mo
## 19 máa
## 20 fún
# Read a second Yoruba corpus and apply the same recipe:
# the top 20 words seen more than twice
yo_two <- read_csv("https://raw.githubusercontent.com/Niger-Volta-LTI/yoruba-text/master/Alabi_YorubaTwi_Embedding/edeyorubarewa.txt", col_names = FALSE) %>%
  select(1) %>%
  rename(text = 1)

yo_two_stop <- yo_two %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  mutate(mean_frequency = n / nrow(yo_two)) %>%
  mutate(rank_freq = row_number()) %>%
  filter(n > 2) %>%
  select(word) %>%
  slice(1:20)

yo_two_stop
## # A tibble: 20 × 1
## word
## <chr>
## 1 tí
## 2 ni
## 3 ó
## 4 ń
## 5 àwọn
## 6 ní
## 7 yìí
## 8 ẹ
## 9 kí
## 10 fún
## 11 ọba
## 12 a
## 13 pé
## 14 sí
## 15 bá
## 16 ṣe
## 17 bí
## 18 tó
## 19 mo
## 20 ti
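Taking the intersection of the two candidate lists keeps only words that are frequent in *both* corpora, which filters out corpus-specific content words. The code that produced the output below is not shown in the original; a plausible sketch, assuming the `yo1_stop_mean` and `yo_two_stop` tibbles computed above (the object name `yo_stopwords` is mine), is:

```r
# Keep only the words that appear in both candidate lists.
# dplyr::intersect() on data frames returns the common rows,
# preserving the row order of its first argument.
yo_stopwords <- dplyr::intersect(yo_two_stop, yo1_stop_mean)

yo_stopwords
```

Words that are frequent in only one of the two corpora (such as ọba above) drop out of the intersection.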
## # A tibble: 16 × 1
## word
## <chr>
## 1 tí
## 2 ni
## 3 ó
## 4 ń
## 5 àwọn
## 6 ní
## 7 yìí
## 8 fún
## 9 a
## 10 pé
## 11 sí
## 12 bá
## 13 ṣe
## 14 tó
## 15 mo
## 16 ti