Abstract

In this project we develop a word prediction system for typing. For this task the JHU partner SwiftKey provides us with a large corpus of texts. We are to use these texts to extract frequent combinations of words used in daily life and build our prediction algorithm from them.

Data summary

First, let's look at the scale of the problem. Since the files are very large and R works in-memory, I will use the operating system's wc utility to count lines, words and characters.
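
As a rough sketch (not necessarily the exact commands used here), these counts could be gathered from R on a Unix-like system by shelling out to wc; the helper name countFile is mine:

countFile <- function(path) {  # lines, words, bytes of a file without loading it into R
  out <- system2("wc", args = c("-l", "-w", "-c", shQuote(path)), stdout = TRUE)
  as.numeric(strsplit(trimws(out), "\\s+")[[1]][1:3])
}
countFile('en_US.blogs.txt')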

file                 size (MiB)       lines         words          chars   words/line   chars/line
en_US.blogs.txt             200     899,288    37,334,117    210,160,014           42          234
en_US.news.txt              196   1,010,242    34,365,936    205,811,889           34          204
en_US.twitter.txt           159   2,360,148    30,373,559    167,105,338           13           71
Total                       555   4,269,678   102,073,612    583,077,241           24          137

As one can see, the numbers are truly enormous. I will use 1% of the lines to perform some preliminary analyses. Also, looking ahead, running the analysis on all three files yielded very similar results, so I will present it only for the ‘en_US.blogs.txt’ file to avoid clogging up the report with duplicate information.

Creating the corpus

file                 1% of lines
en_US.blogs.txt            9,000
en_US.news.txt            10,100
en_US.twitter.txt         23,600
Total                     42,700

For each file, 1% of the lines is on the order of 10,000 lines: a sizeable sample, but small enough that operations execute in a matter of seconds and without a heavy load on memory. So in this step I:
- Read 10,000 random lines from the file
- Convert everything to lower case and remove foreign-language (non-ASCII) characters
- Create a corpus object

library(LaF)       # get_lines, sample_lines, determine_nlines
library(quanteda)  # corpus, tokens, dfm
library(data.table); library(ggplot2); library(RColorBrewer); library(gridExtra)  # used further below

currentFile = 'en_US.blogs.txt'
if (is.na(get_lines(currentFile, 1:2)[1])){stop("cannot read from file; the path is probably wrong")} # check that the file is readable
set.seed(42)
Lines <- sample_lines(currentFile, 1e4, files[currentFile,]$lines) # sample random lines; files[currentFile,]$lines is the total line count from the table above
Encoding(Lines) <- "latin1"                     # mark the encoding, then drop non-ASCII characters
Lines <- iconv(Lines, "latin1", "ASCII", sub="")
Lines <- gsub('_+', ' ', Lines, perl=T)         # replace underscores (including runs of _____) with whitespace
Corp <- corpus(char_tolower(Lines))             # convert everything to lower case and create the corpus
writeLines(head(texts(Corp)))
sea salt and cracked pepper

i believe this is an experience that every single study abroad student has on the first friday that they spend in morocco:

but back to the on-point-former-sewing-machine-cover. i ended up needing to trim it quite a bit and add borders in order to square it up, but i think i like it even more bordered than without. i think it will eventually become a mini wall-hanging, probably to be hung near my sewing machine.

they were.

even though theres no risk of montezumas revenge, fransisco still drowns the dish in lime juice, a tradition in mexico to keep bad bacteria at bay.

while forest city ratner says it owns or controls 90 percent of the project site, that percentage is a bit misleading; of ten buildings that still contain rental tenants, six are owned by the developer. even though those six buildings are counted in the 90 percent, fcr has not yet gotten tenants in those buildings to move.

Tokenization

Creating word tokens, while removing unwanted tokens:
- numbers
- punctuation
- symbols
- separators
- hyphens
- URLs
- twitter tags (but leaving the word after # in #hashtag)

Toks<- tokens(Corp, what = "word", remove_numbers = T, remove_punct = T, remove_symbols = T,
  remove_separators = T, remove_twitter = T, remove_hyphens = T, remove_url = T)

And then also removing:
- twitter rt tag
- english stopwords
- profane words

l <- determine_nlines('profanity_words.txt')                    # number of lines in the profanity vocabulary
profane <- char_tolower(get_lines('profanity_words.txt', 1:l))  # read it and convert to lower case
profane <- profane[!is.na(profane)]
Toks <- tokens_remove(Toks, c(stopwords("english"), profane, "rt"))  # also remove the ReTweet tag

Stemming

Stemming is the operation of converting a word to its root form, for example caresses -> caress.
I decided not to do stemming because:
- it converts some words in a strange manner, e.g. ponies -> poni;
- the model would output stemmed words as predictions, which is undesirable.
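
For illustration only (stemming is not used further in this report), the behaviour described above can be reproduced with quanteda's stemmer:

char_wordstem(c("caresses", "ponies"))  # returns "caress" "poni"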

Document-Feature Matrix (DFM)

A DFM is a matrix that stores the number of appearances of each token in each document. Essentially, it extracts all the unique words from a text together with the total number of appearances of each word. At this step the DFM is useful for spotting unwanted words.

Dfm1gr <- dfm(Toks, tolower = TRUE, stem = F) 

Check that non-ASCII characters actually got removed

If present, these characters would be immediately visible at the beginning or at the end of the sorted token list.
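
One way to produce such a listing (a sketch, not necessarily the exact code behind the output below) is to sort the feature names of the DFM and look at both ends:

feats <- sort(featnames(Dfm1gr))         # all unique tokens in alphabetical order
paste(head(feats, 50), collapse = "; ")  # beginning of the list
paste(tail(feats, 50), collapse = "; ")  # end of the list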

## 0.01p; 0071jke; 00am; 00pm; 00p.m; 00s; 0.75l; 0.7kg; 08th; 0.9l; 0f; 1000yen; 100f; 100gr; 100s; 100th; 105s; 1080p; 10am; 10ears; 10k; 10mg; 10pm; 10s; 10th; 10x; 11.00am; 113b; 11am; 11dp5dt; 11pm; 11th; 12,000ft; 1200ft; 120v; 12,16,18,30,40yrs; 12pm; 12pt; 12sg; 12th; 130m; 133gms; 13lbs; 13th; 140gms; 14th; 1500m; 15for; 15ft; 15km
## zenithfilms; zenn; zentangle; zeppelin; zero; zest; zibbet; zich; ziga; zimbabwe; zimbabweans; zimmerman; zimmern; zinc; zine; zines; zinns; ziolkowski; zip; zipped; zipper; zipperthe; zips; zodiac; zoe; zoey; zombi; zombie; zombies; zone; zones; zoning; zoo; zooey; zooga; zoom; zotten; zp; zrn; zta; zub's; zucca; zucchini; zuckerberg; zuffa; zulfs; zulu; zuma; zunami; zymurgy

And I see that there are none. Even though there are some pseudo-words here, like “100,00th”, these will be filtered out at a later step because of their low frequency in the n-grams.

Statistics

Since the data is very big, I have to find ways to make it smaller. First of all, it becomes smaller due to the structure of the model, which compresses the data into a database of unique tokens and their frequencies. Another way is to filter out terms that have a very low frequency.
For this I need to determine how many terms can be filtered out while still covering 90% of the words.

freq1gr <- colSums(Dfm1gr)                                 # total count of each unique token
freq1gr <- data.frame(words = names(freq1gr), freq = freq1gr, stringsAsFactors = F)
freq1gr <- freq1gr[order(freq1gr$freq, decreasing = T),]   # sort by frequency
freq1gr$wc <- 1:nrow(freq1gr)                              # rank of the token
freq1gr$cs <- cumsum(freq1gr$freq)                         # cumulative count
freq1gr$perc <- freq1gr$cs/tail(freq1gr$cs,1)              # cumulative share of all token occurrences

closest <- function(w,x) #find closest value to x in a vector w. w must be sorted.
{
  dt = data.table(w, val = w) 
  setattr(dt, "sorted", "w")
  return(dt[J(x), .I, roll = "nearest", by = .EACHI])
}

cl<-closest(freq1gr$perc,c(.5, .9, .99))
cl$p <- round((cl$I/nrow(freq1gr))*100)
cl$col <- brewer.pal(3, "Set1")
coverage <- ggplot(data = freq1gr, aes(x = wc, y = perc)) + geom_line() +
              geom_hline(aes(yintercept = w, col = col), data = cl, show.legend = F) +
              geom_label(aes(x = I, y = w+.04, label=paste(I, " ≈ ", p ,"%"), col = col), data = cl, show.legend = F) +
              xlab("number of unique words (most frequent first)") + ylab("cumulative share of all token occurrences")
coverage

So I see that the coverage curve grows very fast: roughly the 20% most frequent unique tokens cover 90% of all token occurrences in the sample.
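
As a sketch of the pruning this suggests, low-frequency features could later be dropped with quanteda's dfm_trim; the threshold of 2 is only an illustration (older quanteda versions call the argument min_count instead of min_termfreq):

Dfm1grSmall <- dfm_trim(Dfm1gr, min_termfreq = 2)  # keep features seen at least twice
nfeat(Dfm1grSmall) / nfeat(Dfm1gr)                 # share of the vocabulary kept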

N-grams

An n-gram is a sequence of n consecutive words taken from a text and treated as a single token.
So if the text were “The quick brown fox jumps over another quick brown fox”, then its 3-grams would be:

The_quick_brown
     quick_brown_fox
           brown_fox_jumps
                 fox_jumps_over
                     jumps_over_another
                           over_another_quick
                                another_quick_brown
                                        quick_brown_fox

The next step is to create a document-feature matrix (DFM) out of these n-grams; in the case of 3-grams each token now represents 3 consecutive words. All the tokens in our example appear once, except for “quick_brown_fox”, which appears twice.
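
As a quick check, the toy example above can be reproduced directly with quanteda (case handling is simplified here):

toySent <- char_tolower("The quick brown fox jumps over another quick brown fox")
toy3 <- tokens_ngrams(tokens(toySent), n = 3)  # the eight 3-grams listed above
topfeatures(dfm(toy3))                         # quick_brown_fox counted twice, the rest once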

I will create 1-, 2- and 3-grams for the document to get a general picture of the data. The 1-gram case is just the DFM of single words, which I have already created earlier.

bigrams <- tokens_ngrams(Toks, 2)
trigrams <- tokens_ngrams(Toks, 3)

Dfm2gr <- dfm(bigrams, tolower = TRUE, stem = F)
Dfm3gr <- dfm(trigrams, tolower = TRUE, stem = F)

A plot of the most frequent 1-, 2- and 3-grams shows that even such a simple approach is indeed able to locate frequent combinations of words we use in daily life.

tf <- topfeatures(Dfm2gr,30, scheme =  "count")
tf2 <- data.frame(words = names(tf), freq = tf , stringsAsFactors = F)

tf <- topfeatures(Dfm3gr,30, scheme =  "count")
tf3 <- data.frame(words = names(tf), freq = tf , stringsAsFactors = F)
                   
plt1grm <- ggplot(head(freq1gr,30), aes(x=reorder(words, freq), y=freq)) + 
        geom_bar(stat="identity", fill='red') + xlab("") +
        coord_flip() + ylab("Number of Appearances") + ggtitle("most frequent 1-grams") 

plt2grm <- ggplot(head(tf2,30), aes(x=reorder(words, freq), y=freq)) + 
        geom_bar(stat="identity", fill='red') + xlab("") +
        coord_flip() + ylab("Number of Appearances") + ggtitle("most frequent 2-grams") 

plt3grm <- ggplot(head(tf3,30), aes(x=reorder(words, freq), y=freq)) + 
        geom_bar(stat="identity", fill='red') + xlab("") +
        coord_flip() + ylab("Number of Appearances") + ggtitle("most frequent 3-grams") 
      
grid.arrange(plt1grm, plt2grm, plt3grm, ncol=3)

Questions to consider

Here are some preprocessing steps that could further improve the model; I decided to skip them because they would require a lot of time to develop.
- Segment twitter hashtags into separate words (like #hashtag -> hash tag).
- Substitute or remove tokens that start with a number (like “10,000th” -> “th”).
- Perhaps do not remove stopwords at all, since they are an essential part of speech and without them the predicted sentences will be syntactically incorrect.

Next steps

Judging from this overview, a straightforward solution to the project's goal emerges:
- Pre-calculate a frequency database for all 1-, 2-, 3-, 4- and maybe 5-grams.
- For every new word in the sequence that a user types, find in the database all n-grams that begin with the same words (see the sketch below).
- Calculate a degree of fit for every matching n-gram according to some formula.
- Select the n-gram with the best fit; its last word is the prediction.
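
A minimal sketch of the lookup step, assuming the pre-calculated frequencies end up in a data.table with columns prefix, nextWord and freq (all names here are illustrative placeholders, not the final design):

library(data.table)

ngramDB <- data.table(                    # toy 3-gram table; the real one would be built from the corpus
  prefix   = c("quick_brown", "quick_brown", "brown_fox"),
  nextWord = c("fox", "dog", "jumps"),
  freq     = c(2L, 1L, 1L)
)
setkey(ngramDB, prefix)

predictNext <- function(typed, db = ngramDB, top = 3) {
  words <- tail(strsplit(tolower(typed), "\\s+")[[1]], 2)   # last two typed words
  hits  <- db[.(paste(words, collapse = "_")), nomatch = 0L]
  head(hits[order(-freq)]$nextWord, top)                    # candidates ranked by raw frequency
}

predictNext("the quick brown")  # "fox" ranked above "dog"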

Implementation

Since the given data is truly enormous and cannot fit into memory, I will have to process it in chunks, saving only the n-grams and their frequencies in an SQL database.
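
A rough sketch of that chunked pipeline, assuming the LaF and RSQLite packages; the chunk size, the table name and the idea of summing the per-chunk counts later in SQL are placeholder choices:

library(LaF); library(quanteda); library(DBI); library(RSQLite)

con        <- dbConnect(SQLite(), "ngrams.sqlite")  # illustrative database file
chunkSize  <- 50000                                 # lines per chunk (placeholder)
totalLines <- determine_nlines('en_US.blogs.txt')

for (start in seq(1, totalLines, by = chunkSize)) {
  lines <- get_lines('en_US.blogs.txt', start:min(start + chunkSize - 1, totalLines))
  toks  <- tokens(corpus(char_tolower(lines)), remove_numbers = TRUE, remove_punct = TRUE)
  freqs <- colSums(dfm(tokens_ngrams(toks, 3)))     # 3-gram counts for this chunk
  dbWriteTable(con, "trigrams",
               data.frame(ngram = names(freqs), freq = as.integer(freqs)),
               append = TRUE)                       # per-chunk counts, aggregated later in SQL
}
dbDisconnect(con)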

Conclusion

In this report I have given a brief overview of the available data and arrived at an idea of how it may be used to create a prediction model.