---
title: "How it works"
author: "Denis Rasulev"
date: "April, 2016"
output: html_document
---
### Introduction
Here you will learn the complete process, from obtaining the initial data to the working app:

1. Preparation and Data Download
2. Function Declarations Block
3. Read Data and Create Sample
4. Tokenization and Frequency Tables
5. How does it all work together?
6. Shiny App
7. Contacts
### 1.1 Basic Preparation and Environment Setup
Here we load the necessary libraries (we need only four :)) and set the working directory.
```{r setup, eval=FALSE}
# load necessary R libraries
library(tm) # Framework for text mining applications within R
library(NLP) # Basic classes and methods for Natural Language Processing
library(slam) # Data structures and algorithms for sparse arrays and matrices
library(ggplot2) # Implementation of the grammar of graphics in R
# set our working directory
setwd("/Volumes/data/coursera/capstone")
```
### 1.2 Downloading Data
The data is downloaded from the internet. The archive contains texts in four languages and, out of the four available (DE, US, FI, RU), we will use the "US" training data for this project. We check whether the initial file ("Coursera-SwiftKey.zip") already exists and download / unzip it if required.
```{r download, eval=FALSE}
# check if zip file already exists and download if it is missing
if (!file.exists("./data/Coursera-SwiftKey.zip")) {
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
destfile = "./data/Coursera-SwiftKey.zip", method = "libcurl")
}
# check if data file exists and unzip it if necessary
if (!file.exists("./data/en_US/en_US.blogs.txt")) {
    unzip("./data/Coursera-SwiftKey.zip", exdir = "./data/en_US")
}
```
### 2. Function Declarations Block
Next we define some helper functions to process the data: a function for text cleaning ('clean'), a function that counts term frequencies and sorts them in decreasing order ('sort.freq'), and functions for tokenization ('nXgram').
I would like to pay special attention to the cleaning function. It took some time and numerous attempts to arrive at a version that produces good, clean text.
* First of all, we convert everything to lower case in order to avoid duplicates.
* After that we remove all digits and control characters.
* Then we apply a small hand-built dictionary that expands very common apostrophed phrases, like 'you're' -> 'you are'. Preparing it was entirely manual work.
* Then we remove all web urls.
* Only after this do we remove all punctuation marks! So yes, the order of the removal steps is important.
* After punctuation we expand misspelled contractions like 'youre' -> 'you are'. They come in two types: strings that cannot be part of any other word and strings that can, so different patterns are required to catch them all correctly.
* Then the same substitution has to be executed at least five times. The reason is simple: after removing digits there were lots of 'hanging' single letters, and I did not find a better way to remove them in a single pass, so I repeat it five times. It seems to work fine. Please notice that the single letter 'i' is not removed, as it is an important part of the language!
* Next, we do the same for single letters at the beginning of sentences.
* Finally, we remove leading and trailing spaces.
* Done :)
One important finding: if you create a relatively small corpus (<= 1 GB), the base 'rowSums' function used to calculate term frequencies works fine, but as soon as you go bigger it starts generating errors. I therefore used the 'row_sums' function from the 'slam' package, which was developed specifically for sparse matrices, and it works absolutely fine. A small illustration of the swap follows.
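This is a minimal sketch of the difference, assuming 'tdm' is a TermDocumentMatrix (tm stores it as a sparse simple_triplet_matrix from slam); it is meant only as an illustration.
```{r rowsums, eval=FALSE}
library(slam)
# the base R route first converts the sparse matrix to a dense one,
# which runs out of memory on larger corpora:
# freq <- rowSums(as.matrix(tdm))

# slam operates on the sparse representation directly
freq <- row_sums(tdm, na.rm = TRUE)
```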
Another interesting finding concerns the library used for tokenization. The majority of people use RWeka for this purpose, but I tested RWeka against the 'ngrams' function from the NLP package and found that the latter works TWICE as fast! For reference, I am using a MacBook Pro with a 2.8 GHz Intel i7 and 16 GB RAM; RWeka may well perform better on other platforms, but on Mac OS X 'ngrams' cut the processing time in half.
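If you want to reproduce the comparison, here is a rough timing sketch. It assumes RWeka (and a working Java installation) is available and that `txt` holds a large cleaned character string, for example the sample built in section 3 collapsed into one string; it is an illustration rather than the exact benchmark I ran.
```{r ngram-timing, eval=FALSE}
library(NLP)     # ngrams()
library(RWeka)   # NGramTokenizer()
# bigrams via NLP::ngrams on a simple whitespace split
system.time({
    toks   <- unlist(strsplit(txt, "\\s+"))
    ng_nlp <- unlist(lapply(ngrams(toks, 2L), paste, collapse = " "), use.names = FALSE)
})
# bigrams via RWeka
system.time({
    ng_rweka <- NGramTokenizer(txt, Weka_control(min = 2, max = 2))
})
```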
```{r functions, eval=FALSE}
clean <- function(x)
{
# convert everything to lower case
x <- tolower(x)
# remove numbers and control symbols
x <- gsub("[[:digit:]]", "", x)
x <- gsub("[[:cntrl:]]", "", x)
# replace common apostrophed phrases
x <- gsub("i'm", "i am", x)
x <- gsub("i've", "i have", x)
x <- gsub("it's", "it is", x)
x <- gsub("he's", "he is", x)
x <- gsub("isn't", "is not", x)
x <- gsub("let's", "let us", x)
x <- gsub("she's", "she is", x)
x <- gsub("i'll", "i will", x)
x <- gsub("you're", "you are", x)
x <- gsub("you'll", "you will", x)
x <- gsub("she'll", "she will", x)
x <- gsub("won't", "will not", x)
x <- gsub("we'll", "we will", x)
x <- gsub("he'll", "he will", x)
x <- gsub("it'll", "it will", x)
x <- gsub("can't", "can not", x)
x <- gsub("that's", "that is", x)
x <- gsub("thats", "that is", x)
x <- gsub("don't", "do not", x)
x <- gsub("didn't", "did not", x)
x <- gsub("wasn't", "was not", x)
x <- gsub("weren't", "were not", x)
x <- gsub("they'll", "they will", x)
x <- gsub("couldn't", "could not", x)
x <- gsub("there's", "there is", x)
# remove web addresses and urls
x <- gsub(" www(.+) ", "", x)
x <- gsub(" http(.+) ", "", x)
# remove punctuation marks
x <- gsub("[[:punct:]]", "", x)
# replace misspelled apostrophed phrases that are not part of any word
x <- gsub("isnt", "is not", x)
x <- gsub("youre", "you are", x)
x <- gsub("itll", "it will", x)
x <- gsub("didnt", "did not", x)
x <- gsub("wasnt", "was not", x)
x <- gsub("youll", "you will", x)
x <- gsub("theyll", "they will", x)
x <- gsub("couldnt", "could not", x)
# replace misspelled apostrophed phrases, could be part of word
x <- gsub("\\bim\\b", "i am", x)
x <- gsub("\\bive\\b", "i have", x)
x <- gsub("\\bdont\\b", "do not", x)
x <- gsub("\\bcant\\b", "can not", x)
x <- gsub("\\bwont\\b", "will not", x)
x <- gsub("\\byouve\\b","you have", x)
# remove remaining single letters (repeat 5 times)
x <- gsub("\\b [a-hj-z]\\b", "", x)
x <- gsub("\\b [a-hj-z]\\b", "", x)
x <- gsub("\\b [a-hj-z]\\b", "", x)
x <- gsub("\\b [a-hj-z]\\b", "", x)
x <- gsub("\\b [a-hj-z]\\b", "", x)
# remove single letters at the beginning of a sentence
x <- gsub("\\b[a-hj-z]\\b ", "", x)
# remove leading and trailing spaces
x <- gsub("^\\s+|\\s+$", "", x)
return(x)
}
# function for sorting n-grams in decreasing order
sort.freq <- function(x){
srt <- sort(row_sums(x, na.rm = T), decreasing = TRUE)
frf <- data.frame(ngram = names(srt), freq = srt, row.names = NULL, check.rows = TRUE, stringsAsFactors = FALSE)
return(frf)
}
# functions for tokenization
n1gram <- function(x) unlist(lapply(ngrams(words(x), 1), paste, collapse = " "), use.names = FALSE)
n2gram <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
n3gram <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
n4gram <- function(x) unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
```
### 3. Read Data and Create Sample
As our raw data is quite large, we create a smaller sample combining data from all three files (news, blogs, tweets). After that we clean it of everything that is not required or useful for our purposes, namely digits, punctuation, control characters, etc. The 'numLines' variable controls how many lines are read from each file: the larger it is, the bigger the corpus and the longer everything takes to process. So, we read the data in, create one combined sample, clean it and build a temporary corpus.
```{r sample, eval=FALSE}
# read data
numLines <- 5000
news <- readLines("./data/en_US/en_US.news.txt", numLines, encoding = "UTF-8")
blogs <- readLines("./data/en_US/en_US.blogs.txt", numLines, encoding = "UTF-8")
twits <- readLines("./data/en_US/en_US.twitter.txt", numLines, encoding = "UTF-8")
# combine into single block and clean
sample <- c(news, blogs, twits)
sample <- clean(sample)
# create our corpus and clean it
corp <- VCorpus(VectorSource(sample))
corp <- tm_map(corp, stripWhitespace)
corp <- tm_map(corp, removeWords, stopwords("english"))
# clear some memory
rm(sample, news, blogs, twits)
```
### 4. Tokenization and Frequency Tables
At this stage we do the following four times, since the model uses n-grams up to 4-grams:
- Create a Term Document Matrix, tokenizing the corpus with the corresponding n-gram function.
- Remove very rare terms from the TDM.
- Count term frequencies and sort them in decreasing order.
- Save the n-gram table to the file system.
Together, these tables form our n-gram model.
```{r corpus, eval=FALSE}
# create term document matrix for unigrams, reduce sparsity and save
tdm1 <- TermDocumentMatrix(corp, control = list(tokenize = n1gram))
tdm1 <- removeSparseTerms(tdm1, 0.9999)
frq1 <- sort.freq(tdm1)
saveRDS(frq1, file = "data1.RDS")
# create term document matrix for bigrams, reduce sparsity and save
tdm2 <- TermDocumentMatrix(corp, control = list(tokenize = n2gram))
tdm2 <- removeSparseTerms(tdm2, 0.9999)
frq2 <- sort.freq(tdm2)
saveRDS(frq2, file = "data2.RDS")
# create term document matrix for trigrams, reduce sparsity and save
tdm3 <- TermDocumentMatrix(corp, control = list(tokenize = n3gram))
tdm3 <- removeSparseTerms(tdm3, 0.9999)
frq3 <- sort.freq(tdm3)
saveRDS(frq3, file = "data3.RDS")
# create term document matrix for fourgrams, reduce sparsity and save
tdm4 <- TermDocumentMatrix(corp, control = list(tokenize = n4gram))
tdm4 <- removeSparseTerms(tdm4, 0.9999)
frq4 <- sort.freq(tdm4)
saveRDS(frq4, file = "data4.RDS")
```
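As a quick sanity check, the saved tables can be reloaded and inspected; the column names ('ngram', 'freq') come from the sort.freq() function above. This is just an illustration.
```{r inspect, eval=FALSE}
# reload one of the saved frequency tables and look at the top entries
frq2 <- readRDS("data2.RDS")
head(frq2, 5)      # the five most frequent bigrams and their counts
sum(frq2$freq)     # total number of bigram occurrences in the sample
```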
### 5. How does it all work together?
Now that we have generated the statistically most frequent 1-, 2-, 3- and 4-grams and saved them all to the file system, let me briefly explain how the Shiny app works.
- First of all, the app loads the frequency tables generated earlier.
- Then, when the user enters some words into the text input field, the application takes this input and cleans it.
To predict the next word, the app does the following (a minimal sketch of this back-off lookup follows the list):
- Searches the 4-grams for an entry starting with the last three words of the user input
- If nothing is found, it searches the 3-grams for the last two words of the user input
- If nothing is found, it searches the 2-grams for the last word of the user input
- If nothing is found again, the app falls back to the unigram with the highest frequency
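Below is a minimal sketch of this back-off lookup. It assumes the frequency tables frq1 ... frq4 from section 4 are loaded and reuses the clean() function from section 2; the function predict_next and its exact matching rules are an illustration of the idea, not the app's actual code.
```{r predict, eval=FALSE}
# illustrative back-off lookup, not the exact code used in the Shiny app;
# assumes frq1..frq4 are loaded and clean() from section 2 is defined
predict_next <- function(input, frq1, frq2, frq3, frq4) {
    words_in <- unlist(strsplit(clean(input), "\\s+"))
    n <- length(words_in)
    # try 4-grams (last 3 words), then 3-grams (last 2), then 2-grams (last 1)
    for (k in 3:1) {
        if (n < k) next
        prefix <- paste(tail(words_in, k), collapse = " ")
        tbl    <- list(frq2, frq3, frq4)[[k]]
        hits   <- tbl[startsWith(tbl$ngram, paste0(prefix, " ")), ]
        if (nrow(hits) > 0) {
            # tables are already sorted by frequency, so take the first match
            # and return its last word
            return(tail(unlist(strsplit(hits$ngram[1], " ")), 1))
        }
    }
    # nothing matched: fall back to the most frequent unigram
    frq1$ngram[1]
}
```
For example, `predict_next("thanks for the", frq1, frq2, frq3, frq4)` would return the word most often seen after "thanks for the" in the sample, backing off to shorter n-grams (and finally to the top unigram) when no match exists.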
That's basically it. While working on this project, I learned a lot of new and interesting things. Many ideas were generated and carefully written down in a notebook to work on later. For instance, could we use 'trie' structures, and would they be more efficient in our case? Some features, like spell-checking the user's entry, error messaging and many others, were not implemented due to limited time. But all of them are in the notebook :)
### 6. Shiny App
You can find the working app here: [ShinyApp](https://www.linkedin.com/in/denisrasulev)
### 7. Contacts
Please feel free to contact me if you have any questions / comments / requests / projects to work on / etc. :)
- [Facebook](https://www.facebook.com/denis.rasulev)
- [LinkedIn](https://www.linkedin.com/in/denisrasulev)
- [Twitter](https://twitter.com/drasulev)
- [GitHub](https://github.com/denrasulev)
- [Pinterest](https://www.pinterest.com/denisrasulev)