Text mining with vectors, explained for business people

In recent years, many researchers have invested time and resources in a new approach to Natural Language Processing: forget linguistic algorithms, turn words into mathematical vectors and plot them in a multi-dimensional space. Applying linear algebra then lets us perform specific tasks, such as topic detection, sentiment analysis, etc., without writing a single line of purely linguistic code. Too good to be true? Let's discover the magic world of vectors using a very simple example.

One of the pioneers in this area of computational linguistics is Tomas Mikolov. Tomas was one of many extremely smart researchers at Google (today he is still a smart researcher, but at Facebook AI, working on FastText). His original idea was quite nice and simple: create a model for learning high-quality word vectors from huge data sets with billions of words and millions of distinct words in the vocabulary. You can read the result of his research here.

How is it possible that reducing words to vectors can solve typical business problems such as topic detection, sentiment analysis, spotting newly emerging topics, etc.?

Understanding word vectors

Before I start my explanation, please be aware that the aim is to help business people better understand how word embedding works; I will not open new academic discussions on word embeddings. This is a very high-level, simple explanation. I will use an example created by Allison Parrish. Allison is a poet but also a programmer, and she uses Python in a very 'artistic' way. You can find very interesting articles on her blog.

Let’s start by considering a very small group of words: 14 animal names. What we are going to do boils down to two steps:

  • Turn words (names of animals) into vectors
  • Plot those vectors in a space – in our case, a two-dimensional space.

Turn words into vectors

The way we will turn words into vectors in our case is purely subjective. In reality, this task is a much more complex operation involving several approaches, such as the frequency of certain words in the corpus, term frequency – inverse document frequency, etc. As I said, in our case we will follow a purely subjective approach: we consider animals as words – of course – and we will turn those words into vectors using two subjective attributes:

  • the cuteness (0-100) of the animal, based on my purely subjective feeling
  • the size (0-100) of the animal, based on my purely subjective ignorance 😉

The values themselves are simply based on my own judgment; your taste in cuteness and your evaluation of size may differ significantly from mine. My goal, however, is to explain word vectors to you, not to push you to agree with my subjective ratings 😉

This is a very simplistic and subjective way to turn words into vectors. Within the limited space of those 14 animals, our task is to find similarities among these words. We start by listing the animals with their two numerical attributes in a table, something like this:

  • Dolphin: cuteness = 60, size = 45
  • Lobster: cuteness = 2, size = 15

Let’s have a look at the full table:

Anyway, as you can see, we have already created vectors out of words. For instance:

Dolphin = (60, 45), meaning v dolphin = (60, 45)

Lobster = (2, 15), so v lobster = (2, 15)

Of course, these are really simple vectors. Vectors generated by word2vec or similar algorithms usually have hundreds of dimensions and are stored as a matrix. In our example we have just two coordinates and a very limited space on a Cartesian chart… but we can still do a lot of interesting operations, don’t worry.
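In Python, this word-to-vector mapping is nothing more than a dictionary of coordinate pairs. A minimal sketch, using only the two animals whose values we listed above:

```python
# Each word maps to its (cuteness, size) vector, taken from the table above.
animal_vectors = {
    "dolphin": (60, 45),
    "lobster": (2, 15),
}

# Each word is now just a point in a 2D space.
for word, (cuteness, size) in animal_vectors.items():
    print(f"{word}: cuteness={cuteness}, size={size}")
```

The full table would simply add twelve more entries to the same dictionary.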

Plot those vectors in a space

Next step: we are going to plot the words as vectors in a space. In our case it will be a limited space, due to the few words we want to use: our table contains 14 words in total, and the space we are going to create is limited to those 14 words. If we considered a bigger corpus, for instance a collection of 10’000 customer feedback comments, we could easily create a word space of 2’000-3’500 words – much bigger than the 14 words of our example.

Despite having few words (animals) and a very small space, the values give us everything we need to determine which animals are similar (at least, with respect to the properties we’ve subjectively included in the data). For instance, let’s try to answer the following question: which animal is most similar to a capybara? You could go through the values one by one and do the math to make that evaluation, but visualizing the data as points in 2-dimensional space makes finding the answer very intuitive.

The plot shows us that the closest animal to the capybara is the panda bear (again, in terms of their subjective size and cuteness). One way of calculating how “far apart” two points are is to find their Euclidean distance: simply the length of the straight line connecting the two points. The distance between “capybara” (70, 30) and “panda” (74, 40) is sqrt((74-70)² + (40-30)²) ≈ 10.77, which is, for instance, far less than the distance between “tarantula” and “elephant”: 104.00691.
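The Euclidean distance is a one-liner in Python. A small sketch, using the capybara and panda coordinates quoted above:

```python
import math

def euclidean_distance(a, b):
    """Length of the straight line connecting two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

capybara = (70, 30)
panda = (74, 40)
print(round(euclidean_distance(capybara, panda), 2))  # → 10.77
```

The same function works unchanged for vectors with hundreds of dimensions, which is exactly why this measure is so popular in NLP.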

Modeling animals in this way has a few other interesting properties. For example, you can pick an arbitrary point in “animal space” and then find the animal closest to that point. If you imagine an animal of size 25 and cuteness 30, you can easily look at the space to find the animal that most closely fits that description: the chicken.
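Finding the animal closest to an arbitrary point is just a minimum-distance search. A sketch below, restricted to the four animals whose coordinates appear in this post (the full table would fill in the other ten the same way); note the attribute order is (cuteness, size), so "size 25, cuteness 30" is the point (30, 25):

```python
import math

# Subset of the animal space; only coordinates quoted in the text are used.
animals = {
    "dolphin": (60, 45),
    "lobster": (2, 15),
    "capybara": (70, 30),
    "panda": (74, 40),
}

def nearest(point, space):
    """Return the word whose vector is closest to the given point."""
    return min(space, key=lambda w: math.dist(point, space[w]))

# With only this subset the lobster wins; the full table would give the chicken.
print(nearest((30, 25), animals))
```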

Reasoning visually, you can also answer questions like what’s halfway between a chicken and an elephant? Simply draw a line from “elephant” to “chicken,” mark off the midpoint and find the closest animal. (According to our chart, halfway between an elephant and a chicken is a horse.)

You can also ask: what’s the difference between a hamster and a tarantula? According to our plot, it’s about seventy-five units of cute (and a few units of size).

The relationship of “difference” is an interesting one because it allows us to reason about analogous relationships. In the chart below, I’ve drawn an arrow from “tarantula” to “hamster” (in red).

You can understand this arrow as the relationship between a tarantula and a hamster, in terms of their size and cuteness (i.e., hamsters and tarantulas are about the same size, but hamsters are much cuter). In the same diagram, I’ve also transposed this same arrow (this time in a different color) so that its origin point is “chicken.” The arrow ends closest to “kitten.” What we’ve discovered is that the animal that is about the same size as a chicken but much cuter is… a kitten. To put it in terms of an analogy: tarantulas are to hamsters as chickens are to kittens.
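Transposing the arrow is plain vector arithmetic: subtract to get the difference, add it to a new origin. A sketch of the "tarantula : hamster = chicken : ?" analogy, with illustrative stand-in coordinates (the chart's real values would follow the same logic):

```python
def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

# Illustrative (cuteness, size) values, not the chart's exact numbers.
tarantula = (8, 3)
hamster = (80, 8)
chicken = (25, 30)

# The tarantula→hamster arrow, moved so it starts at "chicken".
difference = sub(hamster, tarantula)   # mostly a gain in cuteness
target = add(chicken, difference)
print(target)  # → (97, 35); the animal nearest this point is the answer
```

This is exactly the arithmetic behind the famous word2vec result "king − man + woman ≈ queen", just in two dimensions instead of hundreds.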

Another interesting possibility of reasoning with the animal space is clustering. By the way, clustering is also a very interesting unsupervised-learning approach for discovering topics in a specific corpus. In the following picture, I did something really ‘naive’: I drew a first round shape enclosing chicken, lobster, tarantula, goldfish, and mosquito. Then, as a second step, I transposed the exact same round shape to group together other animals, using a specific point as the center of the shape.

Doesn’t this remind you of clustering? Considering our subjective attributes, we can say we found 3 clusters in our dataset, plus two outliers: elephant and crocodile. Now imagine applying the same concept using k-means clustering on a bigger dataset of customer feedback, for instance. The clustering would work perfectly to find the main topics in the corpus. Of course, the level of sophistication would need to be much higher than in this simple example. As I said, you can use k-means clustering, or a more sophisticated approach such as, for instance, Latent Dirichlet allocation.
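To make the k-means idea concrete, here is a bare-bones sketch on toy 2D points (not the real chart values): repeatedly assign each point to its nearest centroid, then move each centroid to the mean of its cluster. Production code would use a library implementation such as scikit-learn's KMeans instead.

```python
import math

def kmeans(points, k, iterations=10):
    # Seed centroids deterministically with the first k points (toy choice;
    # real implementations pick smarter starting centroids).
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assign each point to its nearest centroid...
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        # ...then move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(d) / len(cluster) for d in zip(*cluster))
    return centroids, clusters

# Two obvious groups of toy points: small critters and big cute ones.
points = [(5, 5), (8, 3), (10, 7), (70, 30), (74, 40), (80, 35)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```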

A sequence of numbers used to identify a point is called a vector, and the kind of math we’ve been doing so far is called linear algebra. (Linear algebra is surprisingly useful across many domains: it’s the same kind of math you might use to, e.g., simulate the velocity and acceleration of a sprite in a video game.)

A set of vectors that are all part of the same data set is often called a vector space. The vector space of animals in this section has two dimensions, by which I mean that each vector in the space has two numbers associated with it (i.e., two columns in the spreadsheet). The fact that this space has two dimensions just happens to make it easy to visualize the space by drawing a 2D plot. But most vector spaces you’ll work with will have more than two dimensions, sometimes many hundreds. In those cases, it’s more difficult to visualize the “space,” but the math works pretty much the same.

How is the power of vectors applied to NLP?

Now let’s work together with a much more complex vector space: I have 10’000 customer feedback comments and I want to discover the main drivers of satisfaction and dissatisfaction mentioned by my clients. The process will follow, more or less, exactly what we did with our vector space of animals, except for an initial phase where we will ‘clean’ the text.

Clean the text

In order for our vector space to be effective, we will reduce all the words in our corpus to tokens. For instance, the sentence ‘I like your product’ will be reduced to the tokens ‘I’, ‘like’, ‘your’, ‘product’. This will make it easier to ‘clean’ the corpus. We will also remove useless noisy words such as ‘the’, ‘a’, ‘an’, ‘and’, etc., as well as punctuation such as ‘.’, ‘;’, ‘!’, etc.
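The cleaning step above can be sketched in a few lines: lowercase, strip punctuation, split into tokens, and drop the stop words. The stop-word set here is the tiny illustrative one from the text; real pipelines use longer lists (e.g. from NLTK or spaCy).

```python
import string

# Illustrative stop-word list, matching the examples in the text.
STOP_WORDS = {"the", "a", "an", "and"}

def clean(text):
    """Lowercase, remove punctuation, tokenize, drop stop words."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [t for t in text.split() if t not in STOP_WORDS]

print(clean("I like your product!"))          # → ['i', 'like', 'your', 'product']
print(clean("The product and the service."))  # → ['product', 'service']
```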

At this point, we will have a long dataset of tokens. One of the problems with those tokens is ambiguity: a token (word) can assume different meanings according to the context, for instance ‘I went for a run’ and ‘I run’. The token ‘run’ is exactly the same but plays two different roles: a noun and a verb. To resolve this ambiguity we can use a part-of-speech (POS) tagging algorithm, which will identify the two occurrences of ‘run’ as different parts of speech. We will then keep them separate for the rest of the analysis.
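A toy illustration of the idea: tag each token with its part of speech so that the two ‘run’s become distinct tokens. A real project would use a trained tagger (NLTK, spaCy, etc.); here a single hand-written rule stands in for it.

```python
def tag(tokens):
    """Toy POS tagger: 'run' after an article is a noun, otherwise a verb."""
    tagged = []
    for i, tok in enumerate(tokens):
        if tok == "run":
            pos = "NOUN" if i > 0 and tokens[i - 1] in {"a", "the"} else "VERB"
        else:
            pos = "OTHER"
        tagged.append(f"{tok}_{pos}")
    return tagged

print(tag(["i", "went", "for", "a", "run"]))  # last token → 'run_NOUN'
print(tag(["i", "run"]))                      # last token → 'run_VERB'
```

Once tagged, `run_NOUN` and `run_VERB` are simply two different tokens for the rest of the pipeline.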

At this point, we will try to ‘summarize’ the tokens: in a few words, we will group them together to reduce the noise. We will use a specific family of algorithms to do that: lemmatization. Lemmatization reduces nouns to their singular form, verbs to their infinitive form, etc. We will then have a long list of tokens (words), a group similar to our 14 animals but, of course, bigger: probably we will end up with something like 3’000 tokens.
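Conceptually, lemmatization is just a mapping from inflected forms to a base form. A hand-rolled stand-in below, with a made-up mini-dictionary; real pipelines would use NLTK's WordNetLemmatizer or spaCy instead of a hard-coded table.

```python
# Illustrative lemma table; real lemmatizers derive this from a dictionary.
LEMMAS = {
    "products": "product", "likes": "like", "liked": "like",
    "running": "run", "ran": "run", "feet": "foot",
}

def lemmatize(tokens):
    """Map each token to its base form, leaving unknown tokens unchanged."""
    return [LEMMAS.get(t, t) for t in tokens]

print(lemmatize(["liked", "products", "running"]))  # → ['like', 'product', 'run']
```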

Now it is time to turn them into vectors. I will not go into detail here; as I said before, there are a lot of methodologies to do that. If you are really interested, there are a couple of Python libraries able to do it. For instance, scikit-learn contains a TfidfVectorizer class able to turn tokens (words) into vectors; see the scikit-learn documentation for more details.
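To give a feel for what such a vectorizer computes, here is a simplified tf-idf in pure Python: a term scores high when it is frequent in one document but rare across the corpus. (Scikit-learn's TfidfVectorizer adds smoothing and normalization on top of this basic idea; the tiny corpus below is invented for illustration.)

```python
import math

# Three toy "customer feedback" documents, already cleaned and tokenized.
docs = [
    ["like", "product"],
    ["product", "broken"],
    ["like", "service"],
]

def tf_idf(term, doc, corpus):
    """Term frequency in the document, weighted down by corpus-wide frequency."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)        # documents containing the term
    idf = math.log(len(corpus) / df)
    return tf * idf

# "product" appears in two of three docs, so it scores lower than "broken".
print(tf_idf("product", docs[1], docs))
print(tf_idf("broken", docs[1], docs))
```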

We will end up with a set of vectors and we will plot them in a 3D space. One difference from our animal example is the space: 2D vs 3D. And that is still a simplification of how we work in reality: the vectors are now large matrices, no longer two coordinates in a 2D plane. But the logic is similar, don’t worry.

Now we can apply exactly the same algebraic operations we applied to our animals. We can, for instance, cluster vectors together by selecting a specific point in our 3D space, which will allow us to understand what kinds of topics clients are talking about. Or we can combine the vectors with sentiment analysis and compare the positive-sentiment space with the negative one.

We can also use the vectors to train a neural network able to categorize the feedback into specific topics, or to assign multiple labels capturing several topics inside the same sentence.

Conclusion

My aim in this post was to explain in very simple words how turning a corpus into vectors enables Natural Language Processing. Once words become vectors, they can be analyzed using linear algebra, machine learning, and deep learning algorithms. It is a new and fascinating world, opening a lot of promising scenarios in Natural Language Processing. I hope you are now in a better position to understand the power of vectors in NLP.
