Why keyword search cannot be considered text mining
In my daily work, customers constantly ask me questions such as: “Is your text mining solution able to read specific words?”, “How does it deal with sarcasm?”, “What if a word is misspelled or contains a typo?” and many others. Unfortunately, most people who are not familiar with text mining confuse keyword (or word) search with text mining. Those are two totally different worlds.
When you use keyword (word) search, which is basically searching for words in a corpus, you type some words into a search engine and the software brings back one or more documents that contain those words. Each hit corresponds to one document, and typically you need to read a document to decide if it is relevant or not. So, if you have 1’000 hits, you need to read 1’000 documents.
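To make this concrete, here is a minimal sketch of what a keyword search boils down to. The function name `keyword_search` and the three feedback sentences are made up for illustration; a real search engine adds indexing and ranking, but the principle is the same.

```python
import re

def keyword_search(documents, query):
    """Return the indices of documents that contain every word of the query."""
    terms = set(re.findall(r"\w+", query.lower()))
    hits = []
    for index, doc in enumerate(documents):
        words = set(re.findall(r"\w+", doc.lower()))
        if terms <= words:  # all query words appear in the document
            hits.append(index)
    return hits

# Hypothetical customer feedback, invented for this example.
feedback = [
    "The delivery was late and the box was damaged.",
    "Great delivery, arrived a day early!",
    "Customer care never answered my email.",
]

hits = keyword_search(feedback, "delivery")
for index in hits:
    print(index, feedback[index])
```

Both a complaint and a compliment come back as hits for “delivery”. The search only tells you that the word is there; you still have to read each document to know what is being said about it.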
At the other end, text mining software is able to “read” and “interpret” the meaning of the data inside the document. It identifies concepts and relationships and presents the results back to you in a structured form. The results are fragments of text that correspond to facts, associations or relationships. You only need to read the document once you have found a relevant hit.
I personally think this confusion has been generated by vendors, especially in the CXM space, setting the wrong expectations. Common wrong expectations are:
- With a click of the mouse you get all topics out of a big data set (corpus) of customer feedback.
- It is magic: you don’t need to do anything, the software itself will understand all topics, sentiment and correlations inside the text (corpus). Don’t believe this bullshit, even if the guy telling it to you is called Watson 😉
Unfortunately this is not the case, even if (especially for the first point) we are not far from a solution. The hilarious fact is that, as I said, many CXM vendors are selling keyword search as text mining …and I can tell you, they put a very high price tag on that gimmick!
Why is keyword search not the right way to identify topics in a customer feedback corpus? As I said before, “…typically you need to read a document to decide if it is relevant or not.” It turns into a nightmare: the quality of your classification will be horrible, and maintaining the rules to identify topics by keywords will be a never-ending sad story.
With “real text mining”, especially deep machine learning, you will be able to get close to a “push the button” unsupervised process to identify topics. The process, at a very high level, is as “easy” as 3 steps:
- “Clean” your corpus using specific linguistic techniques such as tokenisation, part-of-speech tagging, stop-word removal, disambiguation, lemmatisation, etc.
- Turn words into numbers (vectors) using different techniques: for instance continuous “bag of words”, “sparse matrix”, “tf-idf matrix”, etc. The reason to turn words into numbers is that deep machine learning (neural networks) loves numbers.
- Apply specific machine learning techniques such as word2vec, Latent Dirichlet Allocation (LDA), K-means clustering, etc. to automatically discover topics and sentiment. All those techniques are “code-able” with libraries such as TensorFlow, Gensim, GloVe, spaCy, etc.
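The three steps can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not a production pipeline: the stop-word list, the four feedback sentences and the helper names (`clean`, `tfidf_vectors`, `kmeans`) are invented for the example, the cleaning step skips lemmatisation and disambiguation, and K-means on tf-idf vectors stands in for heavier techniques such as word2vec or LDA.

```python
import math
import re
from collections import Counter

# Tiny hand-made stop-word list, an assumption for this example only.
STOP_WORDS = {"the", "a", "is", "was", "and", "my", "very", "to", "it", "of"}

def clean(text):
    """Step 1: lowercase, tokenise and drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

def tfidf_vectors(token_lists):
    """Step 2: turn each token list into a unit-length tf-idf vector."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Smoothed idf, so rare words weigh more than common ones.
        vec = [tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Step 3: plain K-means. Centroids start at the first k vectors,
    a simplification that keeps the example deterministic."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical feedback corpus with two obvious topics.
corpus = [
    "The delivery was late",
    "Customer care was rude",
    "Late delivery and damaged delivery box",
    "Rude customer care agent",
]

tokens = [clean(doc) for doc in corpus]
vocab, vectors = tfidf_vectors(tokens)
labels = kmeans(vectors, k=2)
for doc, label in zip(corpus, labels):
    print(label, doc)
```

Nobody told the program that “delivery” and “customer care” are topics: the grouping falls out of the vectors. On a real corpus you would swap each step for the corresponding library implementation and spend most of your time on step 1.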
This approach allows you to answer some of the specific requests mentioned before:
- The computer will not ‘read’ words; it will turn them into vectors and understand them in a mathematical way (e.g. word2vec, lda2vec, etc.)
- The approach is language agnostic: it will analyse English the same way as Swiss-German, Arabic or Japanese …and I can assure you, this is a big advantage.
- The solution will solve the problem of misspelled words: the vectors of “my father is coming home” and “my father is coming ome” are so close that there will be practically no difference as input for a neural network.
- New words and concepts popping up will not be a problem: new vectors will appear in the vector space, and you will be able to identify them and follow them as a trend over time.
- What we call “multitopics” in the same sentence, such as “Your product is great but your customer care sucks!”, can be easily detected by LDA models, which also report the right sentiment.
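The point about misspelled words is easy to check with a small experiment. The sketch below is a simplified stand-in, not word2vec: it assumes each sentence is represented by counts of its character trigrams and compares sentences with cosine similarity. The helper names and the third, unrelated sentence are invented for the example.

```python
import math
from collections import Counter

def char_trigrams(text):
    """Represent a sentence as counts of its overlapping 3-character pieces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

correct = char_trigrams("my father is coming home")
typo = char_trigrams("my father is coming ome")
other = char_trigrams("the invoice was wrong")  # unrelated sentence, for contrast

sim_typo = cosine(correct, typo)
sim_other = cosine(correct, other)
print(f"with typo: {sim_typo:.2f}  unrelated: {sim_other:.2f}")
```

The sentence with the typo stays very close to the correct one, while the unrelated sentence lands far away. A keyword rule looking for “home” would simply miss the misspelled version.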
Easy as 1, 2 and 3 …well easy if you are working with the right partner 😉