# Setting Up

Suppose you are the editor of a browser extension that allows users to block clickbait news articles. As a first step, you have crawled the web to gather a set of news titles.

## Supporting Data

We will use the following two files:

* clickbait\_reported\_by\_users.csv. This contains 50 titles that you have already manually labeled.

* clickbait\_to\_classify.csv. This contains unlabeled titles.

These datasets are reformatted versions of the data provided in this repository (see [CPKG16]).

## Create the Project and Set the Code Environment

Create a new project and within the new project go to **… > Settings > Code env selection**.

* Deselect **Use DSS builtin Python env** and select an environment that supports Python 3.

* Click **Save**.

## Prepare the Data

Upload the CSV files as new Uploaded files datasets with the following settings on the **Format/Preview** tab:

* Select Quoting style **Escaping only**

* Enter a Separator value of `_`

* Enter a Skip first lines value of `0`

* Check **Parse next line as column headers**

Note

The data to classify has only one column, because it is not classified yet.
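For readers who want to check the files outside DSS, the format settings above can be approximated with Python's standard `csv` module. This is only a sketch under stated assumptions: the separator is taken to be a single underscore, the escape character is assumed to be a backslash, the helper name `read_titles` is hypothetical, and the sample rows (including the `0`/`1` labels) are invented for illustration.

```python
import csv
import io


def read_titles(stream):
    """Parse an underscore-separated file whose first line holds the column names."""
    reader = csv.DictReader(
        stream,
        delimiter="_",           # assumed separator
        quoting=csv.QUOTE_NONE,  # "escaping only": quote characters are ordinary text
        escapechar="\\",         # assumed escape character
    )
    return list(reader)


# Invented sample standing in for clickbait_reported_by_users.csv
sample = io.StringIO(
    'title_clickbait\n10 "Facts" You Missed_1\nMarkets close lower_0\n'
)
rows = read_titles(sample)
print(rows)
```

Because quoting is disabled, the double quotes in the first title are kept as part of the text rather than treated as field delimiters.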

Later on, we’ll merge newly labeled data with the data that is already labeled. To prepare for this, one additional step is needed now.

* Select the *clickbait\_reported\_by\_users* dataset and create a Stack recipe.

* Create as output a dataset named *clickbait\_stacked* and change no other parameters.

* Run the recipe.

Later on, the dataset of newly labeled samples will be added as a second input of this recipe.
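Conceptually, stacking appends the rows of each input to a single output. The sketch below illustrates that idea in plain Python, assuming columns are matched by name; the function `stack` and the sample rows are hypothetical and are not the recipe's actual implementation.

```python
def stack(*datasets):
    """Concatenate row lists, matching columns by name (union of all columns)."""
    columns = []
    for rows in datasets:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Rows missing a column get None for it
    return [
        {name: row.get(name) for name in columns}
        for rows in datasets
        for row in rows
    ]


# Invented rows for illustration
reported = [{"title": "You won't believe this trick", "clickbait": "1"}]
newly_labeled = [{"title": "Parliament passes budget", "clickbait": "0"}]

stacked_now = stack(reported)                   # one input: output equals input
stacked_later = stack(reported, newly_labeled)  # second input added later
```

With a single input the output is just a copy of that input, which is why the recipe can be created now and extended once newly labeled samples exist.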

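The flow should now contain the two uploaded datasets, with *clickbait\_reported\_by\_users* feeding a Stack recipe that outputs *clickbait\_stacked*.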

## Building a First Model

Before considering active learning, fit a model on the data that is already labeled. If its performance is good enough, active learning may not be needed at all.

* Select the *clickbait\_stacked* dataset, and then click **Lab**.

* Click **Quick model > Prediction**

* Select *clickbait* as the target column and then click **Automated Machine Learning > Quick prototypes > Create**.

DSS tries to guess the best parameters for the current task and sets up an analysis. However, because our only input is a text column, which is not a standard feature type, DSS rejects it by default.

* Go to the **Design** tab and then **Features handling**.

+ Click on *title* and set its Role as an input.

+ Choose TF/IDF vectorization and add English stop words.

* In the **Algorithms** settings, disable Random Forest, leaving Logistic Regression as the only algorithm.

* In the **Runtime environment** settings, choose **Inherit project default** as the selection behavior for the code environment.

* Click **Train** and see your model learn! The performance of the model should be around 0.5.

* Click on the model and deploy it.
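For intuition, the design chosen above (TF/IDF vectorization of *title* with English stop words, followed by logistic regression) can be approximated outside DSS with scikit-learn. This is a rough sketch, not what DSS runs internally: the titles and labels are invented, the hyperparameters are scikit-learn defaults, and scikit-learn is assumed to be installed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training data standing in for clickbait_stacked
titles = [
    "You won't believe what happened next",
    "10 shocking secrets doctors hide",
    "This trick will change your life",
    "Parliament passes annual budget",
    "Central bank holds interest rates",
    "Storm damages coastal villages",
]
labels = [1, 1, 1, 0, 0, 0]  # assumed encoding: 1 = clickbait, 0 = not clickbait

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # TF/IDF with English stop words removed
    LogisticRegression(),                   # the only algorithm left enabled
)
model.fit(titles, labels)

predictions = model.predict(["Shocking trick doctors hide"])
print(predictions)
```

A model trained on so few titles is not expected to be reliable; the point is only to show how the two settings fit together.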

Note

**Troubleshooting.** Did your training fail because of a missing module? This is because your code environment doesn’t have the set of packages needed for Visual Machine Learning. Choose another code environment or talk to your DSS administrator.
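One way to see which module is missing is to run a small check inside the project's code environment, for example from a notebook. The sketch below is an assumption-laden illustration: the helper `missing_modules` is hypothetical, and the package list is a guess rather than the official set required by Visual Machine Learning.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Hypothetical list; consult the DSS documentation for the actual required packages
required = ["sklearn", "pandas", "numpy"]
print(missing_modules(required))
```

An empty list means every module in the list can be found in the current environment.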
