
Deep Learning on Text Data

By Benny Cheung, Senior Technical Architect

January 24, 2019

A large share of human communication consists of text written in natural language. Recent advances in the field of Machine Learning have shown that meaningful knowledge can be extracted from it effectively. Once the general techniques of Natural Language Processing are combined with Machine Learning, a wide range of enterprise applications becomes practical.

Dream of Text

Figure 1. Text data projected onto Van Gogh's Starry Night painting, as an analogy to the dream of finding patterns in deceptive chaos.

What we want to learn is how to apply Deep Learning techniques to analyze text data. Not surprisingly, businesses have an abundance of text data, usually in unstructured form: emails, comments, documents. Machine Learning engineers dream of finding patterns in all of this chaos, and to this end have developed Natural Language Processing (NLP) techniques to do so.

In this article, we start with a look at the theory behind NLP as applied to ML. Then, we try out these techniques on the Twitter sentiment analysis problem, applying Deep Learning to model and predict the sentiment behind tweets.

This is a long article that covers:

  • Sentiment Analysis with Machine Learning
  • Types of Machine Learning in NLP
  • Mechanics of Machine Learning on Text Data
  • Twitter Sentiment Analysis

Sentiment Analysis with Machine Learning

Sentiment Analysis is a pretty interesting problem in the NLP space. Whenever an email arrives in the customer service inbox, the business wants to be able to identify the customer's sentiment: if it is negative, we want to send the message to the customer service folder for attention; otherwise, we want to send it to the happy feedback inbox. One way to solve this problem is to hand-craft a complex set of hard-coded rules and then write a program that checks whether these rules are satisfied.

A static rule could be a check for whether specific keywords, pre-determined to be associated with negative emotions, occur in the email. If those keywords are present, the program directs the message to the customer service folder. This kind of approach is perfectly acceptable for some businesses and cases, but given the complexity of natural language, even human experts may find it difficult to articulate a rule that applies all the time.
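
As a minimal sketch (the keyword list and folder names are illustrative assumptions, not from any real system), such a static rule could look like this:

# A hand-coded routing rule: flag the email as negative if it contains
# any keyword from a pre-determined list. Keywords are illustrative only.
NEGATIVE_KEYWORDS = {"refund", "broken", "disappointed", "cancel", "complaint"}

def route_email(text):
    words = set(text.lower().split())
    if words & NEGATIVE_KEYWORDS:
        return "customer_service"   # needs attention
    return "happy_feedback"

print(route_email("I am disappointed and want a refund"))  # customer_service

The brittleness is easy to see: a message like "I am not disappointed at all" would still be routed to customer service.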

If a business has access to a large number of historical emails (or another representative body of text), and if the patterns or relationships that the business is trying to model are continuously changing, then a Machine Learning approach will be a better option than a rule-based approach.

With an ML approach, we would design the email ingest system to run each message through a trained model that checks the sentiment of the email. The difference here is that the rules used to check whether an email is positive or negative in sentiment are derived from historical emails.

In other words, an ML algorithm looks at historical emails to derive the rules, and it keeps updating those rules as the business accumulates a larger body of emails to learn from. The historical emails, already marked as positive or negative in sentiment, embed a lot of information about what gives an email its sentiment. The ML algorithm infers those rules by itself and uses them to check any new email, subsequently classifying it as positive or negative.

Types of Machine Learning in NLP

There are two approaches that we could take while solving any NLP problem:

The rule-based approach would involve the programmer either knowing what the rules are or empirically identifying the rules by using data analytics and exploration.

In the machine learning approach, the rules will be identified by an ML algorithm. But there is a workflow that needs to be set up first.

Machine learning problems generally fall under a specific set of categories. You start by identifying which type of problem, or category, the problem you are trying to solve falls into. Once the problem's category has been identified, the data needs to be transformed into numeric attributes. Then, we apply a standard ML algorithm to that numeric data.

As previously mentioned, machine learning problems generally fall under a broad set of categories: classification, clustering, recommendations, and regression.

Since the email sentiment problem that we are trying to solve belongs to classification, we concentrate our investigation on classification and clustering.

Types of Machine Learning in NLP

Figure 2. The key differences between clustering and classification in machine learning.

Classification

Classification is one of the most common types of NLP tasks in machine learning: for example, performing sentiment analysis to classify a piece of text as positive or negative, or sorting a collection of articles into a set of genres. An email or a tweet could be the target problem instance.

We need to assign a category or a label to the problem instance. In the sentiment analysis problem, the labels are positive (numerically as 1) and negative (numerically as 0).

ML algorithms of this type are known as Classifiers. In order to use a Classifier, there is a prerequisite: a set of instances for which the correct category membership is known. In our example, we need a set of tweets that are already classified as positive or negative. This set of tweets is called training data, and it is from this data that the Classifier infers the rules or patterns that actually help to classify a new tweet as positive or negative.

Clustering

Another class of machine learning problems that is pretty common in NLP is Clustering. For example, a set of articles needs to be associated with the themes or topics occurring in their content. Articles that have some attributes in common are identified as relating to each other. This differs from classification in that the articles' groupings are unknown before we start.

Using a clustering algorithm, the large set of articles is clustered into smaller groups. Each resulting group shares attributes and meaning that represent a particular theme or topic. Clustering is usually used to explore the data and to identify patterns within it.

With these identified patterns as guidance, we can determine the best machine learning workflow to apply.

Mechanics of Machine Learning on Text Data

After the type of problem has been identified, the next step is to calculate numeric attributes from the text. These numeric values are used to represent each piece of text. Since ML algorithms only take numeric data as input, the text needs to be converted into a numeric representation. The numeric attributes are called features. There are several ways in which the text can be transformed into these numeric attributes.

One example is the "term frequency" representation, which looks at the frequency with which words occur in the text. Another is "term frequency-inverse document frequency" (tf-idf). We'll look at both of these methods as we continue through the example.
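
As a minimal sketch of the term frequency representation (the two example sentences are illustrative assumptions), scikit-learn's CountVectorizer builds exactly such a matrix of word counts:

from sklearn.feature_extraction.text import CountVectorizer

texts = ["the service was great",
         "the service was terrible, really terrible"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names())  # the vocabulary discovered from the texts
print(counts.toarray())                # one row of word counts per text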

Once the data is in numeric form, you can take a standard algorithm and apply it to the data. The job of the algorithm is to find patterns in the historical data.

The rules that the ML algorithm identifies are collectively referred to as a Model. A model can be something like a mathematical equation, or it could be a set of rules represented as if-then-else statements. In a neural network, it is the set of neuron weights in the network.

  • For a classification problem, algorithms like Naive Bayes, Support Vector Machines, or Deep Learning can be applied (a classification sketch follows this list).
  • Alternatively, for a clustering problem, algorithms like K-means Clustering or Hierarchical Clustering can be chosen.
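
As a minimal sketch of the classification case (the toy data set is an illustrative assumption), a tf-idf representation feeding a Naive Bayes classifier takes only a few lines with scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["love this product", "great service",
         "terrible experience", "very disappointed"]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative sentiment

# chain the numeric feature extraction and the classifier into one model
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(texts, labels)
print(clf.predict(["what a great experience"]))  # expected: [1]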

Twitter Sentiment Analysis

Applying the essential ML workflow on text data looks like this:

  1. Prepare Data - using tokenization, stopword removal, word sense disambiguation, etc.
  2. Train Model - using unsupervised clustering or supervised classification ML algorithms.
  3. Evaluate Model - measuring the model performance to continue improving the model.

Prepare Data

The text data is usually run through a series of pre-processing tasks to take a large piece of text and break it down into smaller, meaningful components. Some combination of the following tasks is applied, depending on the data set's requirements:

  • Tokenization - breaking text down into individual sentences and words is the process called tokenization.
  • Stopwords - after isolating the individual words or tokens, we can filter out words that do not add information or meaning to the text and are present solely to add structure or for grammatical purposes. These are called stopwords, and include words such as "is" and "the". They can be filtered out before further processing of the text.
  • N-Gram - identifying the most commonly occurring words in the text, because these are often the most important words, and finding groups of words that occur together (n-grams) extracts more meaning out of the text.
  • Stemming - extracting only the root of a word, for example reducing 'closed' to 'close' because they share the same meaning, is called stemming (a sketch follows Figure 3 below).

Text Pre-processing Workflow

Figure 3. An example of a text pre-processing workflow and the processing results.
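
The tweet tokenizer we build later covers tokenization and stopword removal, so here is a minimal sketch of the two remaining steps, stemming and n-gram extraction, using NLTK (the sample words are illustrative):

from nltk.stem import PorterStemmer
from nltk.util import ngrams

stemmer = PorterStemmer()
print([stemmer.stem(w) for w in ["closed", "closing", "closes"]])
# -> ['close', 'close', 'close']

tokens = ["deep", "learning", "on", "text", "data"]
print(list(ngrams(tokens, 2)))  # bigrams: ('deep', 'learning'), ('learning', 'on'), ...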

Techniques to convert textual data into other representations include:

  • Disambiguation - identifying the meaning of a word based on the context in which it occurs. The sentence is parsed to identify each part of speech, i.e. whether a word is a noun, an adverb, etc.
  • Word2Vec - mapping words into a multi-dimensional vector space based on their context and meaning within a body of text.

Prepare Tweets Data

The tweets data can be downloaded from The Twitter Sentiment Analysis Dataset. It's a csv file that contains 1.6 million rows. Each row contains, amongst other things, the text of the tweet and the corresponding sentiment. Each tweet is marked as 1 for positive sentiment or 0 for negative sentiment.

The downloaded sentiment tweets need some preprocessing to clean up the csv file. In particular, the SentimentText column needs to be quoted properly. The following script surrounds the original tweets with quotes and removes quotes within the outer quotes, to make the data format consumable by the Pandas csv parser when constructing the dataframe.

fname = "tweets_download.csv"
with open(fname) as f:
    content = f.readlines()

content = [x.strip().split(',', 3) for x in content]
f.close()

fname = "tweets.csv"
with open(fname, 'w') as f:
    f.write('"ItemID","Sentiment","SentimentSource","SentimentText"\n')
    for x in content[1:]:
        f.write('"' + x[0] + '"')
        f.write(',')
        f.write('"' + x[1] + '"')
        f.write(',')
        f.write('"' + x[2] + '"')
        f.write(',')
        y = x[3].replace('"', '')
        f.write('"' + y + '"')
        f.write('\n')

f.close()

After cleaning up the quote problems, we can load the tweets from the resulting tweets.csv into a Pandas dataframe.

import pandas as pd

tweets = pd.read_csv("tweets.csv")

The dataframe tweets can be inspected for deeper insight into the tweet data, for example by displaying the first 10 rows.
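
For instance, a few standard Pandas inspection calls (a minimal sketch):

tweets.head(10)                     # first 10 rows
tweets.info()                       # column types and row counts
tweets['Sentiment'].value_counts()  # balance between the 0 and 1 classes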

Tweets Data Frame

Tweet Text Preprocessing

We want to break each tweet into meaningful words.

The tokenizer_tweet() function, which calls tokenizer(), performs the tweet's tokenization, stopword removal, and punctuation removal. The tweet-specific filtering is isolated in tokenizer_tweet(), while tokenizer() stays generic for any text data and potential reuse.

import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from string import punctuation

# list of stopwords like articles, prepositions
# (requires a one-time nltk.download('stopwords') and nltk.download('punkt'))
stop = set(stopwords.words('english'))

def tokenizer(text):
    try:
        tokens_ = [word_tokenize(sent) for sent in sent_tokenize(text)]

        tokens = []
        for token_by_sent in tokens_:
            tokens += token_by_sent

        tokens = list(filter(lambda t: t.lower() not in stop, tokens))
        tokens = list(filter(lambda t: t not in punctuation, tokens))

        filtered_tokens = []
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)

        filtered_tokens = list(map(lambda token: token.lower(), filtered_tokens))

        return filtered_tokens
    except Exception as e:
        print(e)

def tokenizer_tweet(tweet):
    tweet = tweet.lower()
    tokens = tokenizer(tweet)
    # drop Twitter-specific tokens: @mentions, #hashtags, and links
    tokens = list(filter(lambda t: not t.startswith('@'), tokens))
    tokens = list(filter(lambda t: not t.startswith('#'), tokens))
    tokens = list(filter(lambda t: not t.startswith('http'), tokens))
    return tokens

Discover Important Words

The exploration does not stop after text pre-processing.

With the tokenized words, we proceed to answer the question: what are the important words for the problem domain? First, we present a typical text analytics step, using tf-idf (term frequency-inverse document frequency) to find the important words. Understanding the tf-idf concept is still important for text analysis in general; however, we shall show why tf-idf words do not fit the sentiment analysis problem.

Then, we show that simple word counting is sufficient for identifying the important tweet words.

(1) By tf-idf

To analyze a corpus of text, we would like to know what the important words are. tf-idf stands for term frequency-inverse document frequency. It is a numerical statistic intended to reflect how important a word is to a document within a corpus (i.e. a collection of documents).

To relate this to our problem, words correspond to tokens and documents correspond to tweets. The corpus is therefore the collection of tweets.

The tf-idf of a term t in a document d is proportional to the number of times the word t appears in document d, but is offset by the frequency of the term t across the documents of the corpus. This adjusts for the fact that some words appear more frequently in general without carrying special meaning.

tf-idf therefore acts as a weighting scheme to extract the relevant words in a document.

$$tfidf(t,d) = tf(t,d) \cdot idf(t)$$

$$tf(t,d)$$ is the term frequency of t in document d (i.e. how many times the token t appears in document d)

$$idf(t)$$ is the inverse document frequency of the term t. With scikit-learn's default smoothing, it is computed by this formula:

$$idf(t) = \log\left(\frac{1 + n_d}{1 + df(d,t)}\right) + 1$$

  • $$n_d$$ is the number of documents (tweets)
  • $$df(d,t)$$ is the number of documents containing the term t
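
To make the formula concrete, here is a minimal sketch verifying the smoothed idf on a toy three-document corpus (the texts are illustrative assumptions):

import numpy as np

docs = ["good day", "good movie", "bad day"]
n_d = len(docs)
df_good = sum(1 for d in docs if "good" in d.split())  # appears in 2 documents

idf_good = np.log((1 + n_d) / (1 + df_good)) + 1
print(idf_good)  # log(4/3) + 1 ~ 1.29; a common word gets a low idf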

Computing the tfidf matrix is done using the TfidfVectorizer class from scikit-learn. Let's see how to do this:

from sklearn.feature_extraction.text import TfidfVectorizer

# min_df is minimum number of documents that contain a term t
# max_features is maximum number of unique tokens (across documents) that we'd consider
# TfidfVectorizer preprocesses the descriptions using the tokenizer we defined above
vectorizer = TfidfVectorizer(min_df=10, max_features=10000, tokenizer=tokenizer, ngram_range=(1, 2))
vz = vectorizer.fit_transform(list(tweets['SentimentText']))

vz is a tfidf matrix.

  • its number of rows is the total number of documents (tweets)
  • its number of columns is the total number of unique terms (tokens) across the documents

$$x_{dt} = tfidf(t,d)$$ where $$x_{dt}$$ is the element at index (d,t) in the matrix.

Let's create a dictionary mapping the tokens to their idf scores (the corpus-level part of tf-idf):

tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
tfidf = pd.DataFrame.from_dict(tfidf, orient='index')
tfidf.columns = ['tfidf']

We can visualize the distribution of the scores with a histogram:

tfidf.tfidf.hist(bins=50, figsize=(15,7))

tf-idf Histogram Distribution

Figure 4. Histogram of tf-idf scores; a high score means a more discriminative word, while a low score means a less discriminative word.

We checked the tf-idf score for some of the obvious sentiment words. Not surprisingly, these words are not very discriminative because they are used very often within the corpus.

tfidf.loc[['sorry','happy','sad']]
tfidf
sorry 5.710631
happy 5.092447
sad 4.721503

Some of the most discriminative words can be found by:

tfidf.sort_values(by=['tfidf'], ascending=False).head(3)
tfidf
window 7.812545
maths 7.812545
grandma 7.812545

This tells us that {window, maths, grandma} are words seldom used in the tweets. If a tweet contains these words, they are a lot more distinctive than the other words. But sentiment analysis is not concerned with locating highly discriminative words; instead, words that are commonly used and recognized as sentimental should be ranked higher.

(2) By Counting

Since the more sophisticated and popular tf-idf technique failed to discover the important words for sentiment, we try the mundane technique of word counting.

import collections

counter = collections.Counter()
maxlen = 0
for index, tweet in tweets.iterrows():
    try:
        words = [x.lower() for x in tokenizer_tweet(tweet["SentimentText"])]
        if len(words) > maxlen:
            maxlen = len(words)
        for word in words:
            counter[word] += 1
    except:
        pass

Surprisingly, just by inspecting the top 10 most commonly used words in the tweets, we discover that sentiment words are among the most commonly used words.

counter.most_common(10)

[(u"'s", 179509),
 (u"n't", 173837),
 (u"'m", 130868),
 (u'good', 88554),
 (u'day', 82165),
 (u'get', 81338),
 (u'like', 77753),
 (u'go', 72547),
 (u'quot', 71387),
 (u'got', 69689)]

We settle on the simple counting technique to build our vocabulary for machine learning. Our vocabulary will be the top VOCAB_SIZE (=5000) most common words from the tweets.

As a side note, we reserve word index 0 for the unknown token _UNK_. Any word that does not appear in our vocabulary is assigned as unknown. The Deep Learning model effectively treats all unknown words as the anonymous _UNK_.

VOCAB_SIZE = 5000

word2index = collections.defaultdict(int)
for wid, word in enumerate(counter.most_common(VOCAB_SIZE)):
    word2index[word[0]] = wid + 1
vocab_sz = len(word2index) + 1
index2word = {v:k for k, v in word2index.items()}
index2word[0] = "_UNK_"

Converting Word to Numeric Representation by word2vec

Invented relatively recently, the technique of word2vec is used to convert words into a numeric representation. word2vec is a group of Deep Learning models developed at Google with the aim of capturing the context of words while at the same time proposing a very efficient way of preprocessing raw text data. The model takes as input a large corpus of documents, like tweets or news articles, and generates a vector space of typically several hundred dimensions. Each word in the corpus is assigned a unique vector in this vector space.

For example, as shown in the following figure from Tomas Mikolov's presentation at NIPS 2013, vectors connecting words that have similar meanings but opposite genders are approximately parallel in the reduced 2D space, and we can often get very intuitive results by doing arithmetic with the word vectors.

This presentation provides many other examples.

word2vec Vector Space Proximity Example

Figure 5. (left) word2vec places words with similar meanings in close proximity. (right) word2vec vectors detect multiple features at different levels of abstraction.

For our example, we download the word2vec GoogleNews Vectors as a starting point. This model has been pre-trained on the Google News corpus (3 billion words) and contains 3 million 300-dimension English word vectors.

After downloading the word2vec data, we can load it with gensim.models.KeyedVectors.

from gensim.models import KeyedVectors

WORD2VEC_MODEL = "data/GoogleNews-vectors-negative300.bin.gz"
word2vec = KeyedVectors.load_word2vec_format(WORD2VEC_MODEL, binary=True)

To gain some intuition about word2vec: for example, the word man is vectorized as a 300-dimension point, as shown.

print(word2vec['man'])

[ 0.32617188  0.13085938  0.03466797 -0.08300781  0.08984375 -0.04125977
 -0.19824219  0.00689697  0.14355469  0.0019455   0.02880859 -0.25
 -0.08398438 -0.15136719 -0.10205078  0.04077148 -0.09765625  0.05932617
  0.02978516 -0.10058594 -0.13085938  0.001297    0.02612305 -0.27148438
  0.06396484 -0.19140625 -0.078125    0.25976562  0.375      -0.04541016
  ...
  0.07177734  0.13964844  0.15527344 -0.03125    -0.20214844 -0.12988281
 -0.10058594 -0.06396484 -0.08349609 -0.30273438 -0.08007812  0.02099609 ]

The powerful concept behind word2vec is that word vectors that are close to each other in the vector space represent words that are not only of the same meaning but of the same context as well. When we are looking for the most similar words, word2vec yields the following result.

word2vec.most_similar('man')

[(u'woman', 0.7664012908935547),
 (u'boy', 0.6824870705604553),
 (u'teenager', 0.6586930751800537),
 (u'teenage_girl', 0.6147903800010681),
 (u'girl', 0.5921714901924133),
 (u'suspected_purse_snatcher', 0.571636438369751),
 (u'robber', 0.5585119128227234),
 (u'Robbery_suspect', 0.5584409236907959),
 (u'teen_ager', 0.5549196600914001),
 (u'men', 0.5489763021469116)]

What we find interesting about the vector representation of words is that it automatically embeds several features that normally have to be handcrafted.

Since word2vec relies on a deep neural network to detect patterns, we can also rely on it to detect multiple features at different levels of abstraction.
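
For instance, the famous analogy arithmetic from Mikolov's presentation can be reproduced with the gensim model loaded above; a minimal sketch (with the GoogleNews vectors, the top match is typically 'queen'):

# vector('king') - vector('man') + vector('woman') should land near 'queen'
print(word2vec.most_similar(positive=['king', 'woman'], negative=['man'], topn=3))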

Notice the funny thing that "man" is most similar to "robber" and "Robbery_suspect". This must be influenced by the set of news sources from which the word2vec model was constructed.

For each word index, the corresponding word2vec vector is assigned into an embedding_weights matrix. This matrix will be used later for converting words into their word2vec representation.

import numpy as np

EMBED_SIZE = 300  # dimension of the GoogleNews word2vec vectors

embedding_weights = np.zeros((vocab_sz, EMBED_SIZE))
for word, index in word2index.items():
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass  # words missing from word2vec keep the all-zero vector

For example, if the word man is in the vocabulary with word2index['man']=123, then the vectorized word man is represented by embedding_weights[123].

Train Model with Keras

To keep focus, we shall not explain the Keras deep learning framework here. Interested readers should consult the book by Antonio Gulli & Sujit Pal, Deep Learning with Keras.

Prepare Data Set

Even more data preparation! We need to reformat the tweets into a suitable vector format for the Keras layers to consume.

The following vectors are constructed,

  • X is the input tweet word vectors
  • Y is the sentiment label, 0 is negative and 1 is positive sentiment
  • Text is the tweet "SentimentText" (we shall use this vector for displaying the tweet along with the prediction results later)

from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils

ts = []
xs = []
ys = []
for index, tweet in tweets.iterrows():
    try:
        ts.append(tweet["SentimentText"])
        ys.append(tweet["Sentiment"])
        words = [x.lower() for x in tokenizer_tweet(tweet["SentimentText"])]
        wids = [word2index[word] for word in words]
        xs.append(wids)
    except:
        print(tweet)  # report any row that fails to convert

Text = ts
X = pad_sequences(xs, maxlen=maxlen)
Y = np_utils.to_categorical(ys)

Split Data Set

The usual recommendation is a 70/30 split of the data into training and testing sets. We use the train_test_split() utility function from sklearn.model_selection to randomize the split. Notice the parameter test_size=0.3; this allocates 30% to the testing data set. We set random_state=42 to ensure reproducibility of the split, so that multiple runs (including the second split for the corresponding Text vector below) produce exactly the same splitting sequence.

from sklearn.model_selection import train_test_split

Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y,
                                                test_size=0.3,
                                                random_state=42)
# perform the exactly same splitting for the corresponding Text vector
Xtrain, Xtest, Ttrain, Ttest = train_test_split(X, Text,
                                                test_size=0.3,
                                                random_state=42)

Deep Learning

We define and train the Deep Learning neural network with keras.

(1) Define Deep Neural Network

We initialize the weights of the embedding layer with the embedding_weights matrix that we built in the previous section.

The model is compiled with the binary cross-entropy loss function (because we only have 2 classes) and the adam optimizer.

from keras.layers.core import Dense, Dropout
from keras.layers.convolutional import Conv1D
from keras.layers.embeddings import Embedding
from keras.layers.pooling import GlobalMaxPooling1D
from keras.models import Sequential
from keras import regularizers

EMBED_SIZE = 300
NUM_FILTERS = 128
NUM_WORDS = 3
sentiment_len = 2  # two output classes: 0 (negative) and 1 (positive)

model = Sequential()
model.add(Embedding(vocab_sz, EMBED_SIZE, input_length=maxlen,
                    weights=[embedding_weights],
                    embeddings_regularizer=regularizers.l2(0.01),
                    trainable=True))
model.add(Dropout(0.5))
model.add(Conv1D(filters=NUM_FILTERS, kernel_size=NUM_WORDS,
                 activation="relu"))
model.add(Dropout(0.5))
model.add(GlobalMaxPooling1D())
model.add(Dense(sentiment_len, activation="sigmoid"))

model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

The Deep Learning network is summarized by the summary() function.

model.summary()

Tweets Keras Network Summary

(2) Train Deep Neural Network

We invoke the network training with model.fit(), train the network with batch size 16 for 20 epochs, and then evaluate the trained model.

BATCH_SIZE = 16
NUM_EPOCHS = 20
history = model.fit(Xtrain, Ytrain, batch_size=BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    validation_data=(Xtest, Ytest))

After 20 epochs of training, the accuracy converges to ~72%.

Train on 1104929 samples, validate on 473542 samples
Epoch 1/20
1104929/1104929 [==============================] - 1966s 2ms/step - loss: 1.4675 - acc: 0.7124 - val_loss: 0.8175 - val_acc: 0.7212
...
Epoch 20/20
1104929/1104929 [==============================] - 1941s 2ms/step - loss: 0.8374 - acc: 0.7174 - val_loss: 0.8260 - val_acc: 0.7221

Tweets Training Accuracy Chart

Figure 6. The chart shows the training accuracy and validation accuracy. The training accuracy converged quickly.

(3) Evaluate Trained Model

The model gives us an accuracy of 72.2% on the test set after 20 epochs of training.

score = model.evaluate(Xtest, Ytest, verbose=1)
print("Test score: {:.3f}, accuracy: {:.3f}".format(score[0], score[1]))

473542/473542 [==============================] - 85s 180us/step
Test score: 0.826, accuracy: 0.722

Evaluate Model for Further Improvements

The model's prediction accuracy is good but not outstanding. Since Keras can return the prediction vector for each classification (in our case there are 2 classes: 0 for negative and 1 for positive sentiment), we can perform further analysis comparing the actual vs. predicted results; hopefully, the analysis reveals clues on how to improve the model.

Ypredict = model.predict(Xtest, batch_size=32, verbose=1)
print(Ypredict[:5])

We can see the prediction probability vector for each row (a tweet) and each column (the probability of each class). The probabilities in each row should sum to approximately 1 (with a sigmoid output layer this is not enforced exactly, as it would be with a softmax).

array([[ 0.25863048,  0.74136955],
       [ 0.69782579,  0.30217424],
       [ 0.20225964,  0.79774034],
       [ 0.16227043,  0.83772963],
       [ 0.74099219,  0.25900784]], dtype=float32)

(0) Find the Corresponding Actual and Predicted Sentiment

This is a data preparation step to construct vectors that support the analysis. We collect the actual_sentiment label and the predict_sentiment label for each tweet. For the sentiment classification, selecting the index with the highest probability yields the predicted label.

# map class indices back to label names; an assumed mapping for the two classes
id2sentiment = {0: '0', 1: '1'}

actual_sentiment_idx = np.argmax(Ytest, axis=1)
actual_sentiment = [id2sentiment[k] for k in actual_sentiment_idx]
predict_sentiment_idx = np.argmax(Ypredict, axis=1)
predict_sentiment = [id2sentiment[k] for k in predict_sentiment_idx]
predict_probs = np.array([Ypredict[k][predict_sentiment_idx[k]] for k in np.arange(len(predict_sentiment_idx))])

(1) Review Correct Predicted Labels at Random

We can review the correctly predicted labels at random and print them out with the tweet.

from numpy.random import permutation

correct = np.where(predict_sentiment_idx==actual_sentiment_idx)[0]
print("Found %d correct labels of %d" % (len(correct), len(predict_sentiment_idx)))
idx = permutation(correct)[:5]
for i in idx:
    print("[%d] predict '%s' -> actual '%s': '%s'" %(i, predict_sentiment[i], actual_sentiment[i], Ttest[i]))

The results show the prediction along with the tweets.

Found 341950 correct labels of 473542
[61746] predict '0' -> actual '0': 'I'm off to work, be back around 11pm  I hate long work days'
[25913] predict '1' -> actual '1': 'Jus woke up!!!!!!! Crazy night'
[85662] predict '1' -> actual '1': '&quot;Kids&quot; on repeat. Absolutely amazing! So fatigued right now but we're leaving 4 Norway today'
[202167] predict '0' -> actual '0': '@RoxygirlSLB nah didn't say that but she probably would have one! Hehe I want a black one, but they don't do them in UK'
[165979] predict '1' -> actual '1': 'Aww! Zack &amp; Vanessa are so cute together.'

(2) Review Incorrect Predicted Labels at Random

We can review the incorrectly predicted labels at random and print them out with the tweet.

incorrect = np.where(predict_sentiment_idx!=actual_sentiment_idx)[0]
print("Found %d incorrect labels of %d" % (len(incorrect), len(predict_sentiment_idx)))
idx = permutation(incorrect)[:5]
for i in idx:
    print("[%d] predict '%s' -> actual '%s': '%s'" %(i, predict_sentiment[i], actual_sentiment[i], Ttest[i]))

The results show the incorrect predictions along with the tweets. The failures show the subtlety of the tweets; for example, for 'Oh happy day. Not', the network incorrectly predicted positive sentiment. The distant word 'Not' negates the word 'happy' in the tweet.

Found 131592 incorrect labels of 473542
[241731] predict '0' -> actual '1': '@astynes Yr norty!  When we try copy a big disc like that we have to forfeit some of the quality 2 fit it on. It's a gr8 disc 2 watch tho!'
[235609] predict '1' -> actual '0': 'Oh happy day. Not'
[220989] predict '0' -> actual '1': 'My first day in Twitter It will be interesting I think So do you want to see my hometown?  http://www.admkrsk.ru/doc.asp?id=12'
[315661] predict '0' -> actual '1': 'Working on transfer papers, then lunch with Luke.'
[340588] predict '1' -> actual '0': 'business business business  bye bye awards

(3) Review the Most Confident Predicted Labels that are Correct

We can review the most confident predicted labels that are correct, for each class.

for sent in range(0, sentiment_len):
    correct_sent = np.where((predict_sentiment_idx==sent) & (predict_sentiment_idx==actual_sentiment_idx))[0]
    print("Found %d confident correct %d:'%s' category" % (len(correct_sent), sent, id2sentiment[sent]))
    if (len(correct_sent) > 0):
        # positions of the 4 highest-confidence correct predictions in this class
        most_correct_sent = np.argsort(predict_probs[correct_sent])[::-1][:4]
        for k in most_correct_sent:
            i = correct_sent[k]
            print("[%d] predict '%s' with %.3f confidence: '%s'" % (i, predict_sentiment[i], predict_probs[i], Ttest[i]))

The results show the most confident predictions along with the tweets.

Found 185735 confident correct 1:'1' category
[57266] predict '1' with 0.974 confidence: '@n8moses What fun! Hope you brought yummies for them'
[443611] predict '1' with 0.973 confidence: 'can't say I've had a Friday go this bad this fast before!'
[100613] predict '1' with 0.973 confidence: 'OMG my dog just attacked the baby possum  we think its ok but not positive.... -fingers crossed-'
[253526] predict '1' with 0.969 confidence: 'E3 looks awsome this year i missed microsofts keynote though  damn!'

(4) Review the Most Confident Predicted Labels that are Incorrect

We can review the most confident predicted labels that are incorrect, for each class.

for sent in range(0, sentiment_len):
    incorrect_sent = np.where((predict_sentiment_idx==sent) & (predict_sentiment_idx!=actual_sentiment_idx))[0]
    print("Found %d confident incorrect %d:'%s' category" % (len(incorrect_sent), sent, id2sentiment[sent]))
    if (len(incorrect_sent) > 0):
        # positions of the 4 highest-confidence incorrect predictions in this class
        most_incorrect_sent = np.argsort(predict_probs[incorrect_sent])[::-1][:4]
        for k in most_incorrect_sent:
            i = incorrect_sent[k]
            print("[%d] predict '%s' but actual '%s' with %.3f confidence: '%s'" % (i, predict_sentiment[i], actual_sentiment[i], predict_probs[i], Ttest[i]))

The results show the most confident incorrect predictions along with the tweets.

Found 80332 confident incorrect 0:'0' category
[398966] predict '1' but actual '0' with 0.962 confidence: 'I cnt bel eive im saying this an ino i sudnt but i kinda love lois'
[425603] predict '1' but actual '0' with 0.961 confidence: 'was almost in a wreck and mcdonalds was too crowded for me to stop, so now i'm shaking AND hungry'
[340460] predict '1' but actual '0' with 0.961 confidence: '#phpkonferenca anže is beeing funny'
[87609] predict '1' but actual '0' with 0.958 confidence: 'Also my niece has better brand names than me. Girl is rockin pink baby phat blankets'

(5) Review by the Confusion Matrix

Perhaps the most common way to analyze the result of a classification model is to use a confusion matrix. scikit-learn has a convenient confusion_matrix() function for this purpose:

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(actual_sentiment_idx, predict_sentiment_idx)

We can just print out the confusion matrix, or we can show a graphical view (which is mainly useful for a larger number of categories).

import itertools
import matplotlib.pyplot as plt

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    (Adapted from the scikit-learn docs.)
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    print(cm)

    plt.figure(figsize=(8, 8), dpi=100)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    # write each count into its cell, in a colour that contrasts with the background
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], '.2f' if normalize else 'd'),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Calling plot_confusion_matrix() to plot,

plot_confusion_matrix(cm, [id2sentiment[i] for i in range(0,sentiment_len)])

The confusion matrix looks like this:

[[156215  80332]
 [ 51260 185735]]

Tweets Confusion Matrix

Figure 7. We can use the confusion matrix to easily inspect the distribution of correct and incorrect predictions.

The model tends to incorrectly predict negative sentiment as positive sentiment more often.
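
One way to quantify this tendency is to compute per-class precision and recall from the same label vectors; a minimal sketch using scikit-learn:

from sklearn.metrics import classification_report

print(classification_report(actual_sentiment_idx, predict_sentiment_idx,
                            target_names=['negative (0)', 'positive (1)']))

From the matrix above, the recall for the negative class (156215 / 236547, about 0.66) is noticeably lower than for the positive class (185735 / 236995, about 0.78), which confirms the imbalance numerically.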

Conclusion

Through this article's presentation of Twitter sentiment analysis and prediction, we have learned how to apply NLP techniques to analyze text data and to use Deep Learning to predict the sentiment behind it. We believe that, equipped with this technical knowledge and working with domain experts, we can apply a similar procedure to other business text problems. Although it is hard to predict how many applications can grow out of this procedure, the sentiment is definitely "positive", as the machine might have predicted.

Hope you enjoyed the machine learning adventure on text data as much as we did!

Further Improvements

For further improvements, the analysis definitely gives some clues on how to enhance the model:

  • increase the vocabulary size
  • increase the weights for sentiment-bearing words
  • ignore unknown words that are not in the English dictionary
  • improve the network with more advanced deep learning layers, e.g. an RNN, to capture a bigger context (a sketch follows below)
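
As a minimal sketch of the last suggestion (reusing vocab_sz, EMBED_SIZE, maxlen, embedding_weights, and sentiment_len from earlier; the LSTM size is an assumption), the convolutional layers could be swapped for an LSTM that carries longer-range context, such as the distant 'Not' negating 'happy':

from keras.layers.core import Dense, Dropout
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Sequential

rnn_model = Sequential()
rnn_model.add(Embedding(vocab_sz, EMBED_SIZE, input_length=maxlen,
                        weights=[embedding_weights], trainable=True))
rnn_model.add(Dropout(0.5))
rnn_model.add(LSTM(128))  # reads the whole tweet before deciding
rnn_model.add(Dense(sentiment_len, activation="sigmoid"))
rnn_model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])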
