Over the past few years, I have spent most of my time exploring Natural Language Processing (NLP). In particular, I am extremely interested in solving specific problems related to Customer Experience Management feedback analysis, such as:
1. Supervised and unsupervised topics detection
2. Text categorization
3. Sentiment analysis (in the end, a text categorization problem)
My research is mainly focused on Python. I don’t have much of a Java background, so this post probably sounds a little Python-biased. But there are several good libraries available in Java as well; if any Java expert wants to list some in the comments, feel free to do so. We are all here to learn from each other 😉
First important thing: I really focus on NLP as ‘real text mining’. I see solutions sold as text mining that are nothing more than useless keyword search. Those solutions have nothing to do with text mining. Keyword search belongs to search engines, not text mining. If you want a deeper explanation of the difference, you can read this article.
NLP is an exciting field of data science and artificial intelligence that deals with teaching computers how to extract meaning from text. It has nothing to do with creating tons of keyword search rules to find ‘topics’. I know that several American SaaS solutions in the Customer Experience and Voice of the Customer area are selling keyword search as text mining: good for them, bad for their customers. Keyword search needs a big, never-ending effort to maintain search rules, and it is unable to find new topics. In a few words, a big waste of money for useless insights.
In this overview, I will introduce some interesting Python libraries able to perform real NLP. These solutions handle a wide range of tasks such as tokenization, sentence extraction, part-of-speech (POS) tagging, topic detection, text clustering, sentiment analysis, document classification, topic modeling, and much more.
An important point: out there you can find more than 5 Python NLP libraries, but the 5 I selected are probably the ‘must-knows’. I am sure that once you master them, you will be able to find other important libraries that best fit your particular problem.
Second important point: you can always combine these 5 libraries with a particular deep learning library (indeed, some of the 5 also cover that area). That topic is intentionally not covered in this post. However, there are fantastic opportunities to combine the 5 with libraries such as TensorFlow, Scikit-learn, Keras, and many others.
The 5 kings are NLTK, spaCy, TextBlob, Stanford CoreNLP and Gensim.
NLTK, the mother of all Python NLP libraries
NLTK is ‘the mother’ of all Python NLP libraries. If you are a practitioner in the Python/NLP field, NLTK is probably the first library you met: it is the most famous Python NLP library, and it has led to incredible breakthroughs in the field. NLTK has solved many text analysis problems, and no one can say otherwise.
On its website, NLTK claims to be “a leading platform for building Python programs to work with human language data”. It is a perfect library for education and research. NLTK has over 50 corpora and lexicons, 9 stemmers, and dozens of algorithms to choose from. It is the kingdom of NLP research.
The cons: it is heavyweight and it has a steep learning curve. However, you will find a lot of resources out on the web to learn NLTK – as I said, it is probably the mother of all NLP Python packages. The other big con: it is slow and not production-ready.
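To give you a taste, here is a minimal sketch of two NLTK basics, tokenization and stemming, using components that work out of the box without any corpus download (the example sentence is my own):

```python
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import PorterStemmer

# Tokenize a sentence with the Penn Treebank tokenizer
# (no corpus download needed, unlike nltk.word_tokenize).
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("NLTK solves many text analysis problems.")
print(tokens)  # ['NLTK', 'solves', 'many', 'text', 'analysis', 'problems', '.']

# Reduce each token to its stem with the classic Porter stemmer.
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]
print(stems)
```

This barely scratches the surface – NLTK also ships taggers, parsers, classifiers, and the corpora I mentioned above – but it shows the typical flow: raw text in, linguistic units out.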
spaCy, my little favorite
Once again, pay attention to my words because I am biased: I love spaCy. First of all, it is European; second, it is damned cool! It is presented as an “industrial-strength” Python NLP library – the fastest in the world, it gets things done, and it covers deep learning – and the nice thing is: it is true!
If you are new to the NLP field, I strongly suggest you start with NLTK. Once you learn the basics, move on to spaCy. First, you will build a solid background; second, you will better understand what spaCy is doing inside its kind of black box. spaCy is minimal: it presents only one algorithm (the best one) for each purpose. It really focuses on productivity.
spaCy 2.0 – the latest version – covers several languages and ships with prebuilt language models. It is light and fast because it is built on Cython.
spaCy is the best way to prepare text for deep learning. It interoperates seamlessly with TensorFlow, PyTorch, Scikit-learn, Gensim and the rest of Python’s awesome AI ecosystem. With spaCy, you can easily construct linguistically sophisticated statistical models for a variety of NLP problems.
Its main weakness is that it is new, so the community is not as large as NLTK’s and you will not find as many resources out on the web… but it is unbelievably cool, have a look!
TextBlob, simplify text processing
TextBlob is a Python (2 and 3) library for processing textual data. It provides a simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more. TextBlob sits on the mighty shoulders of NLTK and another package called Pattern. I was not initially sure whether to include TextBlob: is it really an NLP library? Well, it is probably not as sophisticated as NLTK and spaCy, but I think yes, it is.
The big advantage of TextBlob is that it sits on top of NLTK, making text processing simple by providing an intuitive interface. It is a nice addition to the solid Python NLP libraries. It is easy to learn and offers a lot of nice features.
If you want to take your first steps in NLP, TextBlob is probably the library to start with.
TextBlob is a simple, fun library that makes text analysis a joy. You can use TextBlob at least for initial prototyping in almost every NLP project.
Stanford CoreNLP, not really Python
Stanford CoreNLP provides a set of human language technology tools. It can give the base forms of words and their parts of speech; recognize names of companies, people, and so on; normalize dates, times, and numeric quantities; mark up the structure of sentences in terms of phrases and syntactic dependencies; indicate which noun phrases refer to the same entities; indicate sentiment; extract particular or open-class relations between entity mentions; get the quotes people said; and more.
Stanford CoreNLP is written in Java, not Python. You can get around this with Python wrappers built by the community.
Many organizations use CoreNLP for production implementations. It’s fast, accurate, and able to support several major languages.
Gensim, my first vector love
Once again, I am biased: I spent several all-nighters with Gensim. A lot of pain and a lot of joy. Gensim is not a ‘universal’ NLP package. It does a few things, but it does them very well. It is definitely not an all-purpose, general NLP library.
What Gensim does well: topic modeling and document similarity analysis. Among the 5 Python NLP libraries listed here, it’s the most specialized.
I did my first Latent Dirichlet Allocation (LDA) implementation with Gensim, and it is where I turned text into vectors for the first time. It is robust, efficient, and scalable, it plays nicely with scikit-learn, and it covers word2vec, lda2vec, and more.
To sum up:
- NLTK is perfect for education and research. It is a must for learning and exploring NLP concepts. Not my personal choice for production.
- spaCy is ‘the new NLP library’. It is designed to be fast, streamlined, and production-ready. Unfortunately, the community is not yet big… but stay tuned 😉
- TextBlob is built on top of NLTK, and it is more easily accessible. It is good for prototyping or for building a solid solution, though probably not cutting-edge. A good library for beginners.
- Stanford CoreNLP is a Java library with Python wrappers. Due to its speed, it is the choice for fast production solutions.
- Gensim: if vectors are your world, go for it. It is most commonly used for topic modeling and similarity detection. It is not a general-purpose NLP library. As I mentioned before, it was my first love in NLP: it taught me how to turn words into vectors and how to work with those vectors in a vector space… you will probably need a quick intro to linear algebra, but that is the price you need to pay. You can combine it with scikit-learn.
Of course, in our Voice of the Customer solution at Sandsiv we use several of these libraries in combination with deep learning solutions such as, for instance, TensorFlow. The advantage is that you work with them through a business-friendly, easy-to-use graphical web interface. No need to code in any language, no need to be a data scientist. We created what I usually call ‘the iPhone of text mining’. If you are interested, feel free to get in touch with me.