What is the main difference between word2vec and FastText?

If you post a status update on Facebook about purchasing a smartphone, don’t be surprised if Facebook serves you a smartphone ad. This is not black magic; it is Facebook leveraging text data to serve you relevant ads.

One of the biggest problems with this approach, and not only for Facebook, has always been disambiguating similar or identical words. That disambiguation is possible only by considering the context, not just the word itself.

Imagine the challenge for Facebook: it deals with an enormous amount of text data every day in the form of status updates, comments and so on, and it needs to use that data to serve its users better. Computing word representations from text generated by billions of users was a very time-consuming task until Facebook developed its own open-source library, FastText, for word representations and text classification.

FastText is a library created by the Facebook Research Team for efficient learning of word representations and sentence classification. The library has gained a lot of traction in the NLP community and is a possible alternative to the Gensim package, which provides word-vector functionality.

Let’s start by asking ourselves why we should look for a difference between the two. The basic idea is to convert a word into a vector, an “array of numbers”, as a simple mechanism to input and process words for any natural language processing task. Turning words into vectors is the modern approach in Natural Language Processing for analyzing a text corpus in both supervised and unsupervised ways. Representing words as vectors (numbers) has many benefits, including the possibility of using modern deep learning models built into libraries such as Gensim, TensorFlow and, of course, FastText.

A text can be interpreted from different perspectives; among them, let’s consider the words, the sentences, and the full document. In modern NLP (as opposed to gimmicks such as keyword search), different methodologies consider those three levels when they try, for instance, to perform topic detection.

Word2vec treats each word in the corpus as an atomic entity and generates a vector for each word. In this sense Word2vec is very similar to GloVe: both treat words as the smallest unit to train on.

FastText, which is essentially an extension of the word2vec model, treats each word as composed of character n-grams, so the vector for a word is the sum of the vectors of its character n-grams. For example, the vector for the word “apple” is the sum of the vectors of the n-grams:

“<ap”, “<app”, “<appl”, “<apple”, “app”, “appl”, “apple”, “apple>”, “ppl”, “pple”, “pple>”, “ple”, “ple>”, “le>”

(assuming the hyperparameters for the smallest n-gram [minn] and the largest n-gram [maxn] are 3 and 6, and where “<” and “>” mark the beginning and end of the word).
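The extraction of character n-grams can be sketched in a few lines of Python. This is a minimal illustration rather than the library's own code: the function name `char_ngrams` is hypothetical, and it assumes the word is wrapped in the boundary markers `<` and `>` before it is sliced into every substring of length minn to maxn.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` after wrapping it in < and >."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


apple_ngrams = char_ngrams("apple")
print(apple_ngrams)
```

Setting minn and maxn to the same value, for example `char_ngrams("apple", 3, 3)`, restricts the output to n-grams of that single length.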

The key difference between FastText and Word2Vec is the use of n-grams. Word2Vec learns vectors only for complete words found in the training corpus. FastText, on the other hand, learns vectors for the n-grams found within each word, as well as for each complete word. At each training step in FastText, the mean of the target word vector and its component n-gram vectors is used for training. The adjustment calculated from the error is then applied uniformly to each of the vectors that were combined to form the target. This adds a lot of computation to the training step: at each point, a word needs to sum and average its n-gram components. The trade-off is a set of word vectors that contain embedded sub-word information. These vectors have been shown to be more accurate than Word2Vec vectors by a number of different measures.
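The training step described above can be sketched with toy numbers. This is only an illustration under stated assumptions: the helper names `average` and `apply_update`, the 2-dimensional vectors, the gradient values and the learning rate are all made up, and real implementations may scale the update differently.

```python
def average(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(values) / count for values in zip(*vectors)]


def apply_update(vectors, gradient, learning_rate=0.1):
    """Apply the same gradient step to every component vector, in place."""
    for vec in vectors:
        for i, g in enumerate(gradient):
            vec[i] -= learning_rate * g


# Toy vectors: one for the whole word, two for its character n-grams.
components = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

# The combined representation of the target word is the mean of its parts.
hidden = average(components)

# A made-up error signal; in training it would come from the loss.
gradient = [0.5, -0.5]
apply_update(components, gradient)

print(hidden)
print(components)
```

The extra cost relative to word2vec is visible here: every step touches all component vectors of a word instead of a single row.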

This difference manifests as follows.

  1. FastText generates better word embeddings for rare words (even if a word is rare, its character n-grams are still shared with other words, so its embedding can still be good).
  2. This is simply because in word2vec a rare word (e.g. 10 occurrences) has fewer neighbors to be tugged by than a word that occurs 100 times; the latter has more neighboring context words and is tugged more often, resulting in better word vectors.
  3. Out-of-vocabulary words: FastText can construct the vector for a word from its character n-grams even if the word doesn’t appear in the training corpus. Neither Word2vec nor GloVe can.
  4. From a practical standpoint, the choice of hyperparameters for generating FastText embeddings becomes key. Since training happens at the character n-gram level, it takes longer to generate FastText embeddings than word2vec embeddings, and the hyperparameters controlling the minimum and maximum n-gram sizes have a direct bearing on this time.
  5. As the corpus size grows, the memory requirement grows too, and so does the number of n-grams hashed into the same n-gram bucket. The choice of the hyperparameter controlling the total number of hash buckets, along with the minimum and maximum n-gram sizes, therefore has a bearing. For example, even a machine with 256 GB of RAM (with swap space explicitly set very low to avoid swapping) is insufficient to create word vectors for a corpus with ~50 million unique vocabulary words with minn=3, maxn=3 and a minimum word count of 7. The minimum word count had to be raised to 15 (thereby dropping a large number of words that occur fewer than 15 times) to generate word vectors.
  6. The use of character embeddings (individual characters as opposed to n-grams) for downstream tasks has recently been shown to boost the performance of those tasks compared to using word embeddings such as word2vec or GloVe.
  7. While the papers reporting these improvements tend to use character LSTMs to generate embeddings, they do not cite usage of FastText embeddings. It is perhaps worth considering FastText embeddings for these tasks, since FastText embedding generation (despite being slower than word2vec) is likely to be faster than LSTMs. This is just a hunch based on the time LSTMs take and needs to be validated; for instance, one test could be to compare FastText with minn=1, maxn=1 against a corresponding character LSTM and evaluate performance on a POS tagging task.
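The out-of-vocabulary and hash-bucket points can be illustrated with a small sketch. Everything here is an assumption made for illustration: the helper names, the use of a plain 32-bit FNV-1a hash, the tiny table of 1000 buckets with 4 dimensions, and the random values standing in for trained vectors. It does not reproduce FastText's actual implementation.

```python
import random


def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text`."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) % (1 << 32)
    return h


def char_ngrams(word, minn=3, maxn=6):
    """Character n-grams of the word wrapped in < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(wrapped) - n + 1)]


def oov_vector(word, bucket_vectors, minn=3, maxn=6):
    """Build a vector for any word by averaging its n-gram bucket vectors."""
    num_buckets = len(bucket_vectors)
    rows = [bucket_vectors[fnv1a_32(gram) % num_buckets]
            for gram in char_ngrams(word, minn, maxn)]
    dim = len(bucket_vectors[0])
    return [sum(row[d] for row in rows) / len(rows) for d in range(dim)]


# Random values stand in for trained n-gram vectors.
random.seed(0)
NUM_BUCKETS, DIM = 1000, 4
bucket_vectors = [[random.uniform(-1, 1) for _ in range(DIM)]
                  for _ in range(NUM_BUCKETS)]

# No per-word entry is needed: the vector comes from the n-gram buckets alone.
vec = oov_vector("applesauce", bucket_vectors)
print(vec)
```

In this sketch the n-gram table has a fixed size of `NUM_BUCKETS * DIM` values, so a larger set of distinct n-grams means more of them share a bucket, which is the collision effect mentioned in point 5.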

At Sandsiv we just released version 9.4 of our VOC HUB (Voice of the Customer Hub). FastText is one of the “engines” of our topic detection and sentiment analysis solution. It replaced the previous model based on n-grams “only”. If you are serious about topic detection and sentiment analysis, feel free to contact me.
