Evolving a communications intelligence
Whispir is a cutting-edge communications platform, used to deliver clear and compelling messages.
Whispir sends out millions of communications per day, including email, SMS, and WhatsApp messages.
In this article, we illustrate how we use automated text analysis to classify and group Whispir messages based on the content of the message.
The concept of using artificial intelligence (AI) to classify text has been around for a fair amount of time (think of automated filtering of spam in Gmail). This is a classic data science problem.
Our team liaised with stakeholders within the company (e.g. our solutions architects) and as a first step came up with 12 message types based on the types of messages that are sent within Australia. These include messages relating to use cases like (but not limited to) medical appointment reminders, bushfire warnings, shift time updates, and marketing messages.
This allows for greater internal visibility of the types of messages that we send out and provides helpful information that can be used in the design of future intelligent product features.
This project was a proof-of-concept, part of our vision of evolving our platform into a genuine communications intelligence. This article therefore focuses mainly on the workflow and tools we used to train our message type predictor model and classify our data.
Finding a common language
Machine learning is a branch of AI which focuses on creating algorithms that enable computers to learn tasks based on examples and/or improve over time without being programmed to do so.
Computers are great at working with structured data such as numbers, spreadsheets and database tables. But our messages are conveyed using words and human languages, not in numbers such as binary code.
So, we need to break our messages down into a format that our computer (aka “machine”) can read. And this is where things get interesting.
To do this, we use a branch of machine learning known as Natural Language Processing or “NLP”.
In particular, we use a type of machine learning known as supervised learning, where the model is trained on data that has already been labelled by humans. The labelled examples are converted into a numerical form that the model can work with, its native ‘language’. After being fed several examples, the model learns to differentiate between types of messages and starts making associations and predictions of its own.
To obtain good levels of accuracy, we need to train our model on a large number of examples that are representative of the everyday subset of the messages that we send out.
Creating our training data
The first step in training a classifier model is uploading a set of examples and tagging them manually.
To ensure the security and privacy of all contacts, any personally identifiable information (PII) is scrubbed out of the messages using Amazon’s Comprehend tool.
We then took a computer-generated, randomised sample of 5,000 de-identified messages sent between July and December 2020. The labelling was done manually by a team of Whispir staff to ensure confidentiality.
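The PII scrubbing step could be sketched roughly as follows. This is a minimal illustration rather than our production code: it assumes the Boto3 Comprehend client and its detect_pii_entities call, and the helper names redact_pii and scrub_message, the 0.5 score threshold and the [TYPE] placeholder format are all hypothetical.

```python
def redact_pii(text, entities, min_score=0.5):
    """Replace each detected PII span with its entity type, e.g. [NAME]."""
    # Work backwards through the text so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if ent["Score"] >= min_score:
            placeholder = f"[{ent['Type']}]"
            text = text[:ent["BeginOffset"]] + placeholder + text[ent["EndOffset"]:]
    return text


def scrub_message(text, language_code="en"):
    """Detect PII with Amazon Comprehend, then redact it (needs AWS credentials)."""
    import boto3  # imported here so redact_pii can be used without AWS access

    comprehend = boto3.client("comprehend")
    response = comprehend.detect_pii_entities(Text=text, LanguageCode=language_code)
    return redact_pii(text, response["Entities"])
```

Splitting detection from redaction means the redaction logic can be checked offline with hand-written entity offsets, without calling AWS.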
The team came up with 12 distinct categories based on the content of these messages that largely reflect our suite of use-cases within the Australian market (https://www.whispir.com.au/industries). Some examples of common use-cases at Whispir are:
- Booking confirmations
- Appointment reminders
- Delivery tracking
- Marketing campaigns
- Customer support
- Emergency alerts
- Two-factor authentication
- Employer support like rostering
- Transport alerts
With the labelled sample in hand, our workflow followed five steps.
First, we access our data, which is stored securely within Amazon Web Services’ (AWS) Simple Storage Service (S3). To do this we use boto3.
Second, we clean, filter, and create a simplified version of our text data using SpaCy.
Third, we vectorize the text with Scikit-Learn and save the resulting vectors for future use.
Next, we compare candidate model types, again with Scikit-Learn.
Finally, we save our model for inference using pickle.
The workflow in detail
Download dataset from S3
For data security and privacy reasons, our data is stored in S3 on AWS, so that is where we needed to read it from. Boto3 is the AWS software development kit (SDK) for Python, and it lets Python code work with AWS services such as S3 directly.
First, you need an AWS account and a set of access credentials. The AWS and Boto3 documentation describe how to set these up.
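As a rough sketch of this step, the function below downloads a single object from S3 with Boto3's download_file call. The function name download_dataset, the bucket and key in the usage note, and the local data directory are illustrative assumptions, not our actual configuration.

```python
from pathlib import Path


def download_dataset(bucket, key, dest_dir="data", s3_client=None):
    """Download one S3 object into dest_dir and return the local path."""
    if s3_client is None:
        import boto3  # credentials are read from the environment or ~/.aws

        s3_client = boto3.client("s3")

    local_path = Path(dest_dir) / Path(key).name
    local_path.parent.mkdir(parents=True, exist_ok=True)
    s3_client.download_file(bucket, key, str(local_path))
    return local_path
```

A call might look like download_dataset("example-bucket", "exports/messages.csv"), where both names are placeholders. Passing the client in as an argument makes it easy to substitute a stub when testing.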
Clean and tokenize the text using SpaCy
SpaCy is an open-source natural language processing library for Python.
It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently. We wrote our own custom tokenizer function using SpaCy and applied it to our messages.
Whispir lexicon: Tokenization
In Natural Language Processing (NLP), tokenization is the process of breaking a piece of text into a sequence of smaller units called tokens. This can mean breaking a paragraph into sentences, sentences into words, or words into subwords or characters.
In short, the text was tokenized, lowercased and lemmatised. It had punctuation, numbers and stop words removed. The contractions were also expanded out. The output looks something like this:
Whispir lexicon: Lemmatised
Lemmatization can be thought of as word normalisation: it is a way to reduce a given word to its meaningful base or dictionary form, known as a lemma. SpaCy is clever in that it will convert a word to lowercase and change past tense to present tense. It determines the part-of-speech tag by default and then assigns the corresponding lemma. For example, “runs”, “running” and “ran” will all be converted to the lemma “run”.
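A simplified version of such a tokenizer might look like the sketch below. It is not our exact function: the names clean_tokens, load_pipeline and tokenize_message are hypothetical, the en_core_web_sm model is an assumption, and the contraction-expansion step is left out.

```python
def clean_tokens(doc):
    """Return lowercased lemmas, skipping stop words, punctuation, numbers and whitespace."""
    return [
        token.lemma_.lower()
        for token in doc
        if not (token.is_stop or token.is_punct or token.like_num or token.is_space)
    ]


def load_pipeline(model_name="en_core_web_sm"):
    """Load a spaCy pipeline; the model name is an assumption for this sketch."""
    import spacy  # needs the spacy package and the named model to be installed

    # The parser and named-entity recogniser are not needed for lemmas.
    return spacy.load(model_name, disable=["parser", "ner"])


def tokenize_message(text, nlp):
    """Run the pipeline on one message and keep only the cleaned lemmas."""
    return clean_tokens(nlp(text))
```

In use, the pipeline would be loaded once with nlp = load_pipeline() and then reused for every message via tokenize_message(message, nlp), since loading a model is far slower than processing a single text.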
Turning our text into numbers
We needed to create features from the text. To do this, we needed to turn the words into numbers because machines read numbers like we read our native language characters on a page. The features are created from processed text, not raw text.
There are a few ways to turn words into numbers.
For each word or “term” in our messages, we chose to calculate a measure called Term Frequency - Inverse Document Frequency (TF-IDF).
Whispir lexicon: TF-IDF
TF-IDF measures how distinctive a word is to a document by weighing the number of times the word appears in that document against the number of documents that contain it. We used sklearn.feature_extraction.text.TfidfVectorizer to calculate a TF-IDF vector for each of our messages.
With these vector representations of the text, we can train supervised classifiers to predict the message type of unseen messages.
We were now ready to experiment with different machine learning models, evaluate their accuracy and find the source of any potential issues.
Models for investigation:
- Logistic Regression
- (Multinomial) Naive Bayes
- Linear Support Vector Machine
- Random Forest
To evaluate the performance of our models, we often look at accuracy, precision, recall, F1 score, and related indicators.
Whispir lexicon: Measuring model performance
When you build a model for a classification problem, you almost always want to look at its accuracy, that is, the number of correct predictions divided by the total number of predictions made. However, accuracy alone is rarely enough to judge a model, particularly when some classes are much more common than others.
There are a number of other metrics that we look at, including: precision, which is the number of correct positive results divided by the number of positive results predicted by the classifier; recall, which is the number of correct positive results divided by the number of all relevant samples; and F1 score, which is the harmonic mean of precision and recall.
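These definitions can be written out in a few lines of plain Python. The sketch below treats one class as "positive" and everything else as "negative"; the function name and the example labels are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


y_true = ["medical", "medical", "marketing", "emergency", "marketing"]
y_pred = ["medical", "medical", "medical", "medical", "marketing"]
print(precision_recall_f1(y_true, y_pred, positive="medical"))
```

Here both medical messages are found (recall 1.0), but only two of the four "medical" predictions are correct (precision 0.5), so the F1 score of about 0.67 sits below the simple average of 0.75.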
However, you can’t evaluate the predictive performance of a model with the same data you used for training.
You need to evaluate the model with fresh data that hasn’t been seen by the model before. You can accomplish that by splitting your dataset before you use it.
So, we used the train-test split procedure sklearn.model_selection.train_test_split to estimate the performance of our machine learning models. This subsets our dataset so we can get an unbiased evaluation of our model.
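One common way to obtain training, validation and test sets is to call train_test_split twice. The sketch below does this on invented placeholder data; the 60/20/20 proportions, the stratification and the random seed are assumptions, not necessarily the settings we used.

```python
from sklearn.model_selection import train_test_split

# Invented stand-ins for cleaned messages and their labels.
texts = [f"message {i}" for i in range(100)]
labels = ["medical" if i % 2 else "marketing" for i in range(100)]

# First hold out 20% of the data as the test set...
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
# ...then split the remainder 75/25 into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Passing stratify keeps the mix of message types roughly the same in every subset, which matters when some categories are rare.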
Once our data is split into training, validation and test datasets, we look to the performance of each model:
Our best-performing model was the linear support vector classifier (SVC), with an accuracy of 79.83%. That’s the fraction of predictions our model got right.
Whispir lexicon: Linear SVC
A linear SVC (Support Vector Classifier) constructs a “best fit” hyper-plane or a set of hyperplanes that can then be used to divide or categorise your data. Once you have the hyperplane, you can then feed some new data to your classifier to see what the "predicted" class is.
We used the Scikit-Learn implementation of this model, LinearSVC, which is imported with from sklearn.svm import LinearSVC.
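The model comparison could be sketched as follows. The toy dataset is generated from invented templates and is far easier than real messages, so the scores it produces say nothing about our actual results. The hyperparameters shown are also assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy dataset: 10 cleaned messages for each of 3 made-up labels.
templates = {
    "medical": ["appointment reminder clinic {}", "doctor appointment confirm {}"],
    "emergency": ["bushfire warning evacuate {}", "emergency alert flood {}"],
    "rostering": ["shift roster update {}", "shift time change {}"],
}
days = ["monday", "tuesday", "wednesday", "thursday", "friday"]
texts, labels = [], []
for label, patterns in templates.items():
    for pattern in patterns:
        for day in days:
            texts.append(pattern.format(day))
            labels.append(label)

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, stratify=labels, random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, classifier in candidates.items():
    # The vectorizer is fitted on the training split only, so nothing
    # from the held-out messages leaks into training.
    pipeline = make_pipeline(TfidfVectorizer(), classifier)
    pipeline.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, pipeline.predict(X_test))

for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2%}")
```

Wrapping the vectorizer and classifier in one pipeline keeps every candidate on exactly the same preprocessing, so differences in score come from the model alone.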
We then looked at the confusion matrix, which shows the discrepancies between predicted and actual labels:
Whispir lexicon: Confusion matrix
In the field of machine learning and specifically the problem of statistical classification, a confusion matrix, also known as an error matrix, is a specific table layout that allows visualisation of the performance of an algorithm. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.
The vast majority of the predictions end up on the diagonal (predicted label = actual label), where we want them to be. However, there are a number of misclassifications, and we need to check those and see what was happening.
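A confusion matrix like this can be produced with sklearn.metrics.confusion_matrix. The labels and predictions in the sketch below are invented to keep it small; they are not our results.

```python
from sklearn.metrics import confusion_matrix

labels = ["medical", "emergency", "marketing"]
y_true = ["medical", "medical", "emergency", "emergency", "marketing", "marketing"]
y_pred = ["medical", "emergency", "emergency", "emergency", "marketing", "medical"]

# Rows are actual labels and columns are predicted labels, in the order of `labels`.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Off-diagonal cells are the misclassifications worth inspecting.
for i, actual in enumerate(labels):
    for j, predicted in enumerate(labels):
        if i != j and cm[i, j] > 0:
            print(f"{cm[i, j]} '{actual}' message(s) predicted as '{predicted}'")
```

Listing the off-diagonal cells by size is a quick way to decide which pairs of categories to review first.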
Check the misclassifications
Some of the misclassified messages were those that touch on more than one subject. For example, a medical appointment reminder also contains information similar to that of emergency messages (e.g., COVID-19 warnings) due to the medical terminology used within both message types.
This sort of error will always happen. We chose to decide for ourselves which class was more appropriate: we examined the three worst-performing categories and manually re-labelled those messages into the categories our team deemed most appropriate. Another approach could be to perform multi-label classification, where a message could be classed as both a medical message and an emergency message, but this would involve re-training our model using a slightly different approach.
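For reference, the multi-label alternative could look something like the sketch below, which we did not implement. It assumes a one-vs-rest wrapper around LinearSVC and uses invented messages in which two examples carry both tags.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Invented examples; the last two belong to both categories.
texts = [
    "clinic appointment reminder tomorrow",
    "doctor appointment confirm monday",
    "bushfire warning evacuate area",
    "emergency alert flood road",
    "covid warning clinic appointment cancel",
    "emergency alert clinic close",
]
tags = [
    {"medical"},
    {"medical"},
    {"emergency"},
    {"emergency"},
    {"medical", "emergency"},
    {"medical", "emergency"},
]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)  # one 0/1 column per label

# One binary classifier is trained per label, so a message can receive several.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)

predicted = model.predict(["covid warning clinic appointment"])
print(binarizer.inverse_transform(predicted))
```

The trade-off is that the training data must then be labelled with every applicable category, which increases the manual labelling effort.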
After cleaning and checking our labels, we took another look at our performance. The model's accuracy had risen from 79.83% to 84.37%.
Given these encouraging results, we decided to go ahead and save this trained classifier model so that it can be tested and potentially used to predict and label our outgoing messages in real-time.
Pickling and saving the model for later use
Pickle is the standard way of serializing objects in Python. You can use pickle.dump() to serialize a trained machine learning model and save it to a file. You can then load the saved model at a later date using pickle.load(), either to calculate its accuracy score on test data or to predict outcomes for new, unseen messages.
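A minimal sketch of saving and reloading a model is shown below. The tiny training set is invented, and the file name and temporary directory are placeholders. Pickling the vectorizer together with the classifier as one pipeline is a design choice assumed here, not something the steps above require.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented, already-cleaned training examples.
texts = [
    "appointment reminder clinic tomorrow",
    "doctor appointment confirm monday",
    "bushfire warning evacuate area",
    "emergency alert flood road",
    "shift roster update friday",
    "shift time change tuesday",
]
labels = ["medical", "medical", "emergency", "emergency", "rostering", "rostering"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

# Serialize the fitted pipeline to a file (placeholder location).
model_path = Path(tempfile.gettempdir()) / "message_classifier.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, load it back and classify a new message.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(["clinic appointment friday"]))
```

Saving the whole pipeline means new messages are vectorized with exactly the vocabulary and weights learned during training.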
At this point, we have trained a model that will be able to classify new messages sent through the Whispir platform into message types. Machine learning model training, however, is only one small step in an end-to-end machine learning project.
There are many steps that still need to be undertaken before this kind of data can be used within Whispir. Machine learning pipelines are iterative as every step is repeated continuously to improve and ensure the accuracy of the model. There are also several other considerations like data governance and data ethics that need to be carefully reviewed in relation to any machine learning project.