Saturday, March 24, 2018

My wrap-up after an NLP machine learning competition

I recently participated in a natural language processing (NLP) machine learning competition, and there were some interesting lessons that I think are worth sharing. NLP is a bit different from my usual topics, but I think machine learning is the most interesting thing currently happening in supply chain software, and working with text is an important part of machine learning.

The competition was Jigsaw's Toxic Comment Classification Challenge, with over 4,500 teams competing, and my team finished in a great 5th place. The goal was to build a classification system able to label each comment with one or more of the given types (toxic, insult, obscene, etc.) or as clean.

The best results were around 0.9885 AUC, which is impressive; it is another task where the machine performs at roughly the same level as humans.

To achieve these results, all the top teams used deep learning; the best models were almost always recurrent neural networks (LSTM or GRU) using pre-trained word vectors. In case you don't know, word vectors are mappings between words and high-dimensional vectors (300 dimensions is common), where these vectors result from training on very large collections of text (like Wikipedia). In the word vector space, words with similar meaning have close vectors, and to some extent the distances between words in the space relate to the underlying concepts. One typical example is that if we take the vector for the word "King", subtract the vector for "Man" and add the vector for "Woman", we get a vector that is closest to the word "Queen". This example is known to be a bit cherry-picked and word vectors are not perfect yet, but they are for sure one of the best tools in NLP. And there are many good sets of pre-trained vectors to choose from, so, as always in machine learning, combining them leads to a better result.
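To make this a bit more concrete, here is a rough sketch of both ideas: checking the word-analogy property with the gensim library and its downloadable GloVe vectors, and wiring pre-trained vectors into a minimal bidirectional GRU classifier in Keras. The tiny vocabulary, layer sizes and other hyper-parameters are placeholders for illustration, not the exact models the top teams used.

import numpy as np
import gensim.downloader as api
from tensorflow import keras
from tensorflow.keras import layers

# Pre-trained 300-dimensional GloVe vectors (trained on Wikipedia + Gigaword).
glove = api.load("glove-wiki-gigaword-300")

# The classic analogy: king - man + woman lands closest to queen.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Build an embedding matrix for our own word index (toy vocabulary here).
word_index = {"comment": 1, "toxic": 2}
max_words, embed_dim = 20000, 300
embedding_matrix = np.zeros((max_words, embed_dim))
for word, idx in word_index.items():
    if word in glove and idx < max_words:
        embedding_matrix[idx] = glove[word]

# Minimal bidirectional GRU classifier with six sigmoid outputs,
# one per toxicity label, on top of the frozen pre-trained embeddings.
model = keras.Sequential([
    layers.Embedding(max_words, embed_dim,
                     embeddings_initializer=keras.initializers.Constant(embedding_matrix),
                     trainable=False),
    layers.Bidirectional(layers.GRU(64, return_sequences=True)),
    layers.GlobalMaxPooling1D(),
    layers.Dense(6, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC()])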

One interesting fact from the data is that even though some word vectors cover more than 2 million words, the comment texts contained a very large number of words missing from the vectors (around 30%). This is because of bad spelling, foreign-language words, and in many cases it is on purpose that people write "heeeyyyy", "d0n't" or "bs'ing". One thing that helps in these cases is subword embeddings like fastText, which train on parts of words and can therefore build vectors for unknown words from the smaller pieces.
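As a small illustration (assuming the official fasttext Python package and the pre-trained English model cc.en.300.bin downloaded from fasttext.cc), a word that appears in no dictionary still gets a usable vector built from its character n-grams:

import fasttext

# Pre-trained English vectors with subword information (large download from fasttext.cc).
model = fasttext.load_model("cc.en.300.bin")

# "heeeyyyy" is in no vocabulary, but fastText composes a vector for it
# from its character n-grams instead of dropping it as unknown.
vec = model.get_word_vector("heeeyyyy")
print(vec.shape)
print(model.get_nearest_neighbors("heeeyyyy", k=3))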

Since the comments are full of misspelled words and non-letter characters, one could expect that heavy pre-processing to clean the text and fix the misspellings would improve the results. This was not the case in my experience, and other teams reported the same conclusion. The combination of subword embeddings and the ability of neural networks to internally learn the necessary filtering seems to work better.

Deep learning systems are hungry for data, and if we give them more data we will probably get better results. An interesting trick I learned was that we can easily use machine translation to generate small variations of the texts. For example, using the TextBlob Python package it is as easy as doing

from textblob import TextBlob

new_text = TextBlob(text).translate(to="de").translate(to="en")

And we get a new text that results from translating the original English text to German and then translating it back to English (using the Google translation web service). In a few cases the result is exactly the same, but in most cases there is some small change: different words but hopefully the same meaning.

Data generated this way can be used for additional training of the models, or for what is called test-time data augmentation. This last concept is quite simple: it is often used in image classification, where rotating an image sometimes makes it easier for the model to make a better prediction, so doing several rotations and averaging the predictions leads to a better result. With text it also worked quite well, using the different translation variants.
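Here is a rough sketch of test-time augmentation with back-translation, reusing the TextBlob call from above; predict_fn is just a placeholder for whatever trained model returns the per-label probabilities for a comment:

import numpy as np
from textblob import TextBlob

def back_translate(text, lang):
    # Round-trip the comment through another language; fall back to the
    # original text if the translation service fails.
    try:
        return str(TextBlob(text).translate(to=lang).translate(to="en"))
    except Exception:
        return text

def predict_with_tta(predict_fn, text, langs=("de", "es")):
    # Predict on the original comment and on its back-translated variants,
    # then average the predicted probabilities.
    variants = [text] + [back_translate(text, lang) for lang in langs]
    return np.mean([predict_fn(v) for v in variants], axis=0)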

From my experience the language that worked best for this trick was German, followed by Spanish. I also tried a few other European languages like Portuguese, French and Swedish, and also something quite different like Japanese.

I did not try it, but other teams reported that simply translating the comments and training with the word vectors for the other language also improved the results.

This competition was great fun, and my team was absolutely amazing.