# Hands-On Tutorial: Deep Learning for Sentiment Analysis

**Binary Sentiment Analysis** is the task of automatically analyzing text to decide whether it is *positive* or *negative*. This is useful when you have more text than you could label manually in a reasonable amount of time.

In addition to simpler options, Dataiku enables you to build a convolutional neural network model for binary sentiment analysis.

## Objectives

This tutorial walks through how to build a convolutional network for sentiment analysis, using Keras code in Dataiku’s Visual Machine Learning. After building an initial model, we’ll use pre-trained word embeddings to improve the preprocessing of inputs, and then evaluate both models on test data.

## Prerequisites

Before beginning this tutorial you should have:

* Some experience with Deep Learning with code in Dataiku.

* Some familiarity with Keras.

You will also need to create a code environment, or use an existing one, that has the necessary libraries. When creating a code environment, you can add sets of packages on the **Packages to Install** tab. Choose the Visual Deep Learning package set that corresponds to the hardware you’re running on.

## Preparing the Data

We will be working with IMDB movie reviews. The original data is from the Large Movie Review Dataset, which is a compressed folder with many text files, each corresponding to a review. In order to simplify this how-to, we have provided two CSV files:

* Training data

* Test data

Download these CSV files, then create a new project and upload the CSV files into two new datasets.

## The Deep Learning Model

In a Visual Analysis for the training dataset, create a new model with:

* **Prediction** as the task

* *polarity* as the target variable

* **Expert mode** as the prediction style

* **Deep learning** as the Expert mode, then click **Create**

This creates a new machine learning task and opens the Design tab for the task. On the Target panel, verify that Dataiku DSS has correctly identified this as a Two-class classification type of ML task.

### Features Handling

On the Features Handling panel, turn off *sentiment* as an input, since *polarity* is derived from *sentiment*.

Dataiku should recognize *text* as a column containing text data, set the variable type to Text, and implement custom preprocessing using the TokenizerProcessor.

We can set two parameters for the Tokenizer:

* *num\_words* is the maximum number of words that are kept in the analysis, sorted by frequency. In this case we are keeping only the top 10,000 words.

* *max\_len* is the maximum text length, in words. 32 words is too short for these reviews, so we’ll raise the limit to the first 500 words of each review.
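
To make the two parameters concrete, the following plain-Python sketch mimics their effect on a toy corpus. It is an illustration only, not the TokenizerProcessor implementation: the helper names `fit_word_index` and `encode` are hypothetical, and the sketch assumes that index 0 is reserved for padding, that only words with an index below `num_words` are kept, and that long texts are cut after the first `max_len` words.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency: the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so that ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _count) in enumerate(ranked, start=1)}


def encode(text, word_index, num_words, max_len):
    """Map a text to a fixed-length list of word indices."""
    # Drop unknown words and words whose index is num_words or higher (assumption)
    ids = [word_index[word] for word in text.lower().split()
           if word_index.get(word, num_words) < num_words]
    # Keep the first max_len tokens, then pad with zeros up to max_len (assumption)
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


corpus = ["good good movie", "bad movie"]
word_index = fit_word_index(corpus)
encoded = encode("bad good movie", word_index, num_words=3, max_len=4)
print(word_index)
print(encoded)
```

With `num_words=3`, the rarest word in this toy corpus (`bad`, index 3) is dropped, and the remaining indices are padded with zeros to length 4.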

### Deep Learning Architecture

We now have to create our network architecture in the `build_model()` function. We won’t use the default architecture, so just remove all the code. Then, click on **{} Code Samples** on the top right and search for “text”. Select the **CNN1D** architecture for text classification.

Insert the CNN1D code then click on **Display inputs** on the top left. You should see that the “main” feature is empty because we are only using the review text, which is in the input *text\_preprocessed*.

In order to build the model, we only need to make a small change to the code. In the line that defines `text_input_name`, change `name_of_your_text_input_preprocessed` to `text_preprocessed`.
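
For intuition about what a 1D convolutional layer computes over a sequence of word vectors, here is a plain-Python sketch of a single filter followed by max pooling. This is not the CNN1D code sample: the function names are hypothetical, the numbers are made up, and the sketch ignores bias terms and activation functions.

```python
def conv1d_single_filter(embedded, kernel):
    """Slide one filter over a sequence of token vectors (no padding, stride 1)."""
    width = len(kernel)
    outputs = []
    for start in range(len(embedded) - width + 1):
        window = embedded[start:start + width]
        # Dot product between the window of token vectors and the filter weights
        outputs.append(sum(x * w
                           for token, weights in zip(window, kernel)
                           for x, w in zip(token, weights)))
    return outputs


def max_pool1d(values, pool_size):
    """Keep the largest activation in each non-overlapping window."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


# Four tokens with 2-dimensional embeddings, and one filter spanning two tokens
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]

activations = conv1d_single_filter(embedded, kernel)
pooled = max_pool1d(activations, pool_size=2)
print(activations, pooled)
```

A real convolutional layer applies many such filters in parallel and learns their weights during training.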

### Model Results

Click **Train**. When training completes, deploy the model to the Flow, create an evaluation recipe from the model, and evaluate it on the test data. In the resulting dataset, you can see that the model has an accuracy of about 80% and an AUC of about 0.89. Pre-trained word embeddings may improve on these results.
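
For reference, both metrics can be computed from the true labels and the predicted probabilities. The sketch below uses plain Python and toy values: the labels, the scores, and the 0.5 decision threshold are assumptions for illustration, not output from this model. In Dataiku, the evaluation recipe computes the metrics for you.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of rows whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in y_prob]
    return sum(p == t for p, t in zip(predictions, y_true)) / len(y_true)


def roc_auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    positives = [p for p, t in zip(y_prob, y_true) if t == 1]
    negatives = [p for p, t in zip(y_prob, y_true) if t == 0]
    # Count pairwise wins; ties count as half a win
    wins = sum(1.0 if pos > neg else 0.5 if pos == neg else 0.0
               for pos in positives for neg in negatives)
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = positive review) and predicted probabilities
y_true = [1, 0, 1, 0, 1]
y_prob = [0.9, 0.6, 0.7, 0.2, 0.4]

acc = accuracy(y_true, y_prob)
auc = roc_auc(y_true, y_prob)
print(acc, auc)
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank positive reviews relative to negative ones, which is why the two numbers can differ noticeably.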

## Building a Model using Pre-Trained Word Embeddings

The fastText repository includes a list of links to pre-trained word vectors (or embeddings) (P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information). In order to use the fastText library with our model, there are a few preliminary steps:

* Download the English bin+text word vector and unzip the archive

* Create a folder in the project called *fastText\_embeddings* and add the *wiki.en.bin* file to it

* Add the fastText library to your deep learning code environment (or create a new deep learning code environment that includes it). You can add it by entering `git+https://github.com/facebookresearch/fastText.git` in the Requested Packages list.

### Features Handling

In the Features Handling panel of the Design tab for our deep learning ML task, add the following lines to the custom processing of the text input.

    from dataiku.doctor.deep_learning.shared_variables import set_variable

    # Share the tokenizer so that the model architecture code can retrieve it
    set_variable("tokenizer_processor", processor)

### Model Architecture

We need to make a few changes to the model architecture code. First, add the following imports to the top of the code.

    import dataiku
    from dataiku.doctor.deep_learning.shared_variables import get_variable
    import os
    import fasttext
    import numpy as np

Within the `build_model()` specification, add the code for loading the embeddings and making the embedding matrix. This needs to occur before the line that defines `emb`.

    # Locate the fastText model file in the project's managed folder
    folder = dataiku.Folder('fastText_embeddings')
    folder_path = folder.get_path()

    embedding_size = 300  # must match the dimension of the loaded vectors
    embedding_model_path = os.path.join(folder_path, 'wiki.en.bin')
    embedding_model = fasttext.load_model(embedding_model_path)

    # Retrieve the tokenizer shared from the Features Handling code
    processor = get_variable("tokenizer_processor")

    # Keep the most frequent words, ordered by their tokenizer index
    sorted_word_index = sorted(processor.tokenizer.word_index.items(),
                               key=lambda item: item[1])[:vocabulary_size-1]

    # Row i holds the vector of the word with index i; unassigned rows stay at zero
    embedding_matrix = np.zeros((vocabulary_size, embedding_size))
    for word, i in sorted_word_index:
        embedding_matrix[i] = embedding_model.get_word_vector(word)
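
The indexing in this snippet is easy to get wrong, so here is a toy version of the same logic that runs without fastText or NumPy. The `toy_vector` function is a hypothetical stand-in for `embedding_model.get_word_vector`, the word index is made up, and the sketch assumes word indices start at 1 so that row 0 is left for padding. Only the slicing and the row assignment mirror the code above.

```python
def build_embedding_matrix(word_index, vocabulary_size, embedding_size, get_word_vector):
    """Return a vocabulary_size x embedding_size matrix as a list of rows."""
    # Keep the vocabulary_size - 1 most frequent words (those with the lowest indices)
    kept = sorted(word_index.items(), key=lambda item: item[1])[:vocabulary_size - 1]
    # Row 0 is never assigned, so it stays all zeros (assumed padding index)
    matrix = [[0.0] * embedding_size for _ in range(vocabulary_size)]
    for word, i in kept:
        matrix[i] = list(get_word_vector(word))
    return matrix


def toy_vector(word):
    """Hypothetical stand-in for a pre-trained embedding lookup."""
    return [float(len(word)), 1.0]


word_index = {"the": 1, "movie": 2, "great": 3, "dull": 4}
matrix = build_embedding_matrix(word_index, vocabulary_size=3,
                                embedding_size=2, get_word_vector=toy_vector)
print(matrix)
```

With `vocabulary_size=3`, only the words at indices 1 and 2 receive a vector, so every index the matrix can hold is either filled from the lookup or left as the zero row.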

Change the definition of the embedding layer as follows, in order to use the fastText pre-trained word embeddings.

    emb = Embedding(vocabulary_size,
                    embedding_size,
                    input_length=text_length,
                    weights=[embedding_matrix],
                    trainable=False)(text_input)

Replace the second `MaxPooling1D` layer with a `GlobalMaxPooling1D` layer.

    x = GlobalMaxPooling1D()(x)

Finally, remove the `x = Flatten()(x)` line.
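
The `Flatten` step is no longer needed because global max pooling already collapses the sequence dimension, leaving one value per filter. The plain-Python sketch below contrasts the two approaches on a toy feature map; the function names are hypothetical and the numbers are made up.

```python
def max_pool_then_flatten(feature_map, pool_size):
    """Pool over non-overlapping windows of steps, then flatten to one vector."""
    n_filters = len(feature_map[0])
    pooled = [
        [max(step[f] for step in feature_map[i:i + pool_size])
         for f in range(n_filters)]
        for i in range(0, len(feature_map) - pool_size + 1, pool_size)
    ]
    return [value for step in pooled for value in step]


def global_max_pool(feature_map):
    """Take the maximum of each filter over the whole sequence."""
    return [max(column) for column in zip(*feature_map)]


# Toy feature map: 4 sequence steps (rows) x 2 filters (columns)
feature_map = [[1, 5], [3, 2], [0, 7], [4, 1]]

flattened = max_pool_then_flatten(feature_map, pool_size=2)
global_pooled = global_max_pool(feature_map)
print(flattened, global_pooled)
```

The flattened output grows with the sequence length, whereas the globally pooled output always has one entry per filter, which means fewer weights in the dense layer that follows.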

### Model Results

Click **Train**. When training completes, redeploy the model to the Flow and re-evaluate it on the test data. In the resulting dataset, you can see that the model now has an accuracy of about 87% and an AUC of about 0.94, up from about 80% and 0.89 for the initial model.

## Wrap Up

* See a completed version of this project on the Dataiku gallery.

* See the Dataiku DSS reference documentation on deep learning.
