31 January 2019

What is this

SlothKey is an app that suggests words as you type, so you can type less and write more!



Using the app is simple and intuitive. Just type your thoughts into the text box, and the app will suggest words each time you press space or select a suggestion. Clicking on a suggestion adds it to the end of your text. Convenient!
If none of the suggestions fit, just keep typing.

If you've spent some time playing around with the app, you might be curious: how does it make such good suggestions? Surely there must be some rather complicated machine learning algorithm at work? Well, no. The whole app relies on a lot of training data and two simple ideas: n-grams and stupid backoff.

N-grams

N-grams are sequences of n consecutive words taken from a text. So if the text is "brown fox jumped over another brown fox", its 2-grams are:

brown_fox
       fox_jumped
           jumped_over
                  over_another
                       another_brown
                               brown_fox

Here every 2-gram has frequency 1 except "brown_fox", which has frequency 2.
When this approach is applied to a vast corpus of text, you get a large database that contains a lot of common phrases and their frequencies in the language.
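For illustration, here is a minimal Python sketch of how such a table of counts could be built (the function name and the "_"-joined key format are my assumptions, not the app's actual code):

    from collections import Counter

    def count_ngrams(words, n):
        """Count every n-gram in a list of words, joining words with '_'."""
        return Counter(
            "_".join(words[i:i + n]) for i in range(len(words) - n + 1)
        )

    words = "brown fox jumped over another brown fox".split()
    print(count_ngrams(words, 2))
    # Counter({'brown_fox': 2, 'fox_jumped': 1, 'jumped_over': 1,
    #          'over_another': 1, 'another_brown': 1})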

Now then, to predict the next word a user will type, you take the last n words they have already typed and look for an (n+1)-gram that begins with those words. The last word of the matching (n+1)-gram is your prediction.
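Reusing count_ngrams and words from the sketch above, the lookup itself could be as simple as a prefix scan (again, just a sketch; the real app presumably uses an indexed table rather than a linear search):

    def predict_next(typed_words, ngram_counts, n):
        """Find (n+1)-grams that start with the last n typed words
        and return their final words as predictions."""
        prefix = "_".join(typed_words[-n:]) + "_"
        return [
            gram.rsplit("_", 1)[1]
            for gram in ngram_counts
            if gram.startswith(prefix)
        ]

    bigrams = count_ngrams(words, 2)
    print(predict_next(["another", "brown"], bigrams, 1))
    # ['fox'] -- found via the 2-gram "brown_fox"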

But this raises two questions:
- What if a user types a phrase so rare it does not exist in our database?
- If several predictions were found, how do you determine the best one to suggest?

The second idea: Stupid backoff

The idea is simple and answers both questions at once: if the algorithm doesn't find an n-gram that begins with the given phrase, it looks for an (n-1)-gram that begins with the same words minus the first one. If no such (n-1)-gram is found either, it backs off to (n-2)-grams and so on, until 1-grams are reached, at which point the most frequent single words are returned as suggestions.
This process is expressed by the second case of the following example formula:

\[S(\text{tips} \mid \text{here are a few}) = \begin{cases} \dfrac{f(\text{here are a few tips})}{f(\text{here are a few})}, & \text{if } f(\text{here are a few tips}) > 0 \\ \alpha\, S(\text{tips} \mid \text{are a few}), & \text{otherwise.} \end{cases}\]

Here \(S\) is the score of a suggestion, \(f\) is the frequency of an n-gram, and \(\alpha\) is a backoff factor that keeps scores comparable across different n-gram orders.
The first case of the formula calculates scores for the n-grams that were found, which makes it possible to rank them and select the best suggestions for the input.
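Putting both ideas together, here is a simplified backoff scorer (a sketch under my own assumptions about the data layout, reusing count_ngrams from above; the app's actual tables and tuning may differ):

    ALPHA = 0.4  # backoff factor; 0.4 is a commonly used value

    def score_suggestions(context, tables):
        """Score next-word candidates with stupid backoff.
        `tables[n]` maps '_'-joined n-grams to frequencies,
        for orders 1 through len(context) + 1."""
        scores, discount = {}, 1.0
        # Try the longest context first, backing off one word at a time.
        for n in range(len(context), 0, -1):
            prefix = "_".join(context[-n:])
            prefix_count = tables[n].get(prefix, 0)
            if prefix_count:
                for gram, count in tables[n + 1].items():
                    head, _, word = gram.rpartition("_")
                    if head == prefix and word not in scores:
                        scores[word] = discount * count / prefix_count
            discount *= ALPHA
        # Last resort: fall back to the most frequent single words.
        total = sum(tables[1].values())
        for word, count in tables[1].items():
            scores.setdefault(word, discount * count / total)
        return sorted(scores.items(), key=lambda kv: -kv[1])

    text = "brown fox jumped over another brown fox".split()
    tables = {n: count_ngrams(text, n) for n in (1, 2, 3)}
    print(score_suggestions(["another", "brown"], tables)[:3])
    # [('fox', 1.0), ('brown', 0.0457...), ('jumped', 0.0228...)]

Note how a word found with the full context keeps its undiscounted score, while anything found only at a shorter context is multiplied by \(\alpha\) once per backoff step, exactly as in the formula.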

Performance

The following report sums up the main performance characteristics of the model. While 13% top-1 precision may seem low, the perceived accuracy is close to that of similar solutions on the market, and such precision is actually quite close to the upper bound of what a backoff model can achieve.

Overall top-3 score:     17.54 %
Overall top-1 precision: 13.10 %
Overall top-3 precision: 21.30 %
Average runtime:         114.39 msec
Number of predictions:   29096
Total memory used:       1.99 MB
 
Dataset details
  Dataset "blogs" (599 lines, 14587 words, hash 5c3fe67d1857e2d)
  Score: 17.71 %, Top-1 precision: 13.03 %, Top-3 precision: 21.66 %
  Dataset "tweets" (793 lines, 14071 words, hash 5debebc89c18e70)
  Score: 17.37 %, Top-1 precision: 13.17 %, Top-3 precision: 20.94 %

A 100-millisecond delay per suggestion is barely acceptable in terms of user experience (hence the name (⊙ω⊙)). It could be improved further, but that is out of scope for the current work.




Questions?