# Advanced Data Preparation Quick Start

Prepare data and build a machine learning model in this quick start tutorial for data analysts or citizen data scientists new to Dataiku.

Contents

* Getting Started

* Create and Explore the Project

* Data Preparation Part I - Connect to Data Sources

* Data Preparation Part II - Combine Input Datasets

* Data Preparation Part III - Geo Processing

* Charts

* Build a Machine Learning Model

* Schedule the Build using a Scenario

* Join Datasets and Keep Unmatched Rows

* What’s Next?

## Getting Started

**Dataiku** is a collaborative, end-to-end machine learning platform that unites analysts, citizen data scientists, and data scientists in a common space to bring faster business insights.

In this tutorial, you’ll get hands-on practice with Dataiku by preparing and joining data to predict credit card fraud. You’ll also use the Dataiku automated machine learning engine to build a highly optimized model with minimal manual effort.

Tip

To check your work or view the final project, visit the read-only completed version on the public Gallery.

### Prerequisites

To complete this tutorial, you’ll need the following:

* Access to a Dataiku instance, version 11.0 or above. The free edition is compatible, or you can start a 14-Day Free Online Trial.

* The cardholder\_info CSV Zip file. You’ll upload this file during the tutorial.

* The Reverse Geocoding plugin. This is needed to complete the section on Geographical processing. If your instance of Dataiku does not already have this plugin, you’ll need to install it. To learn about installing a plugin, visit Installing plugins. This plugin is included in the 14-Day Free Online Trial.

### Objectives

Our primary goal is to explore and improve an existing project Flow (workflow) that will be used to build an interpretable machine learning model.

You’ll see how Dataiku can be used to meet your data preparation and machine learning needs–and more.

#### What We’re Building

We’ll be working with an existing project that contains input datasets. We’ll build a pipeline by stacking and joining datasets and applying data transformations. With our prepared data, we’ll build a machine learning model and deploy it to the Flow. When you have completed the tutorial, your project Flow will look like this:

Note

If you are using Dataiku Online, the dataset icons in the Flow may appear differently than shown here. Dataiku uses an icon that represents the underlying storage format of the dataset.

The final Flow contains recipes and datasets. **Recipes** can be thought of as **tools**. The initial datasets (also known as input datasets) are on the left of the Flow. In this project, the input datasets are CSV files containing information that, when joined and processed, can be used to predict if a credit card transaction is fraudulent.

##### About the Visual Flow

The final Flow is organized into two Flow Zones:

* one for loading and preparing the data, and

* another zone for building the machine learning model pipeline.

The Flow is composed of data pipeline elements, such as datasets and recipes. Recipes in Dataiku are used to prepare and transform datasets. Their icons are easy to spot in the Flow because they are yellow circles, whereas datasets are blue squares. Machine learning processes are represented in green.

The data pipeline starts at the left and includes both input and intermediate datasets (both are indicated as blue squares). These intermediate datasets make it possible for you to start anywhere, not just from the beginning at the left, and build part of the Flow.

#### How We’ll Build The Project

Our primary goal is to build an interpretable machine learning model that can be used to predict whether or not a credit card transaction is fraudulent.

To build our pipeline, we’ll first prepare and join datasets. Then, we’ll build a binary classification model.

#### How to Navigate in Dataiku

Throughout this tutorial, we’ll be using the top navigation bar, the flow zones, and the right-side panel to navigate and perform actions.

In upcoming sections, we’ll explore the datasets in the Flow and plan the project. Then we’ll apply transformations to the dataset and use it to build a machine learning model. Finally, we’ll learn how to customize a join recipe.


## Create and Explore the Project

In this section, we’ll create and plan our project; identify business needs; and pinpoint the necessary transformations to our input data.

### Create the Project

In this section, we’ll create the project. To do this, we’ll open a tutorial that already has the input datasets we want to transform.

To open the tutorial:

* Sign in to your Dataiku instance.

Upon launching Dataiku, the first page you’ll see is the homepage. The homepage lets you browse projects, recent items, dashboards, and applications shared with you on your instance.

* From the Dataiku homepage, click on **+New Project > DSS Tutorials > Quick Start > Advanced Data Preparation (Tutorial)**.

Note

You can also download the starter project from this website and import it as a zip file.

Dataiku opens the Summary tab of the project, also known as the project homepage.

* Click on the **Flow** in the top navigation bar.

### View a Dataset Discussion and the Project Wiki

The many collaboration features available in Dataiku make it easy for team members on the same instance to share and communicate.

To help us get started analyzing our input datasets, we can explore project comments, descriptions, and features like the project Wiki. Doing this will help us get oriented whenever we open a project. On this project, we don’t have to look far.

There is already a discussion on one of our input datasets, *transactions\_2017*. This is indicated by a discussion icon.

Let’s view the discussion:

* Select the *transactions\_2017* dataset then open the right-side panel.

* Click the **Discussions** icon, and then click anywhere on the text of the discussion to open it.

The discussion displays a message from a business analyst requesting information about the dataset. Specifically, they want to know what is meant by the values in the *authorized\_flag* column. We can check the project Wiki to see if there is any preliminary information about this dataset.

* From the top navigation bar, click the **Wiki** menu.

Since there is only one article, **Project Read Me**, Dataiku opens it. This article includes some preliminary information about the Flow zones and the input dataset. Using this information, we know that the fraudulent (or unauthorized) transactions are labeled as “0” in the *authorized\_flag* column.

In the next section, we’ll formally explore the *transactions\_2017* dataset by obtaining some quick insights and checking the data quality. Later, we can revisit the discussion and leave a reply.

### Explore the Dataset

Dataiku has many features that help you quickly explore a dataset. In this section, we’ll try out a few. Specifically, we’ll explore the *transactions\_2017* dataset by configuring a sample of its data from the Explore tab. We’ll then analyze a column of the dataset on this sample.

#### Explore the Dataset Using the Explore Tab

Whenever you open a dataset from the Flow, Dataiku displays the **Explore** tab. In this section, we’ll explore the *transactions\_2017* dataset using its Explore tab.

* From the top navigation bar, click the **Flow** icon to go to the Flow.

* In the Flow, double-click the *transactions\_2017* dataset to open it.

The **Explore** tab provides a tabular view where we can start examining the data. For this dataset, each row is a transaction. The *authorized\_flag* column shows whether the transaction was approved or not, where 1 = authorized and 0 = fraudulent.

Beneath each column name is the storage type and meaning. For example, Dataiku detects a Date meaning for the *purchase\_date* column since the sample in this column contains date values.

We can use the data quality bar at the top of the column to visually gauge whether or not the sample contains:

* values that match the detected meaning (green bar),

* values that do not match the detected meaning (red bar), or

* missing values (gray bar).

In this sample, there are no missing values, and all the values match the detected meaning. Therefore, the bar is completely green.
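The three categories behind the data quality bar can be sketched outside Dataiku with pandas. This is only an illustration of the idea, not how Dataiku computes it: the sample values and the `%Y-%m-%d %H:%M:%S` date format are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample of a date-like column (values invented for illustration).
values = pd.Series(
    ["2017-06-25 15:33:07", "2017-07-15 12:10:45", "not a date", None]
)

# Missing values: cells that are empty to begin with (gray bar).
missing = values.isna()

# Try to interpret each value as a date; unparseable values become NaT.
parsed = pd.to_datetime(values, format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Valid: parsed successfully (green). Invalid: present but not parseable (red).
valid = parsed.notna()
invalid = parsed.isna() & ~missing

print("valid:", valid.sum(), "invalid:", invalid.sum(), "missing:", missing.sum())
```

A fully green bar corresponds to the case where both the invalid and missing counts are zero.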

Note

The Explore tab offers many options for examining your dataset, including selecting which columns display, and getting quick stats.

#### Explore Using a Data Sample

In this section, we’ll open the Sample settings to see the different ways we could explore the *transactions\_2017* dataset using different sampling methods.

Dataiku only shows a sample of the dataset when you are working interactively with it. This is known as sampling.

* To see the total number of records in this dataset, click on the **Compute row count** button (the two arrows next to **not computed**).

* Then click the **Sample** button to display the **Sample settings**.

By default, the sampling method is set to **First records** and the number of records to `10,000`. This is the fastest sampling method. You can configure the sampling method and the number of rows depending on your use case.

* Click the **Sample** button again to close the **Sample settings**.
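To see why the choice of sampling method matters, here is a minimal pandas sketch that contrasts a "first records" sample with a random sample. The DataFrame is a hypothetical stand-in for *transactions\_2017*, and the code only mirrors the concept; it does not call Dataiku.

```python
import pandas as pd

# Hypothetical stand-in for the transactions_2017 dataset.
transactions = pd.DataFrame({
    "card_id": [f"C_{i:05d}" for i in range(50_000)],
    "authorized_flag": [1 if i % 10 else 0 for i in range(50_000)],
})

# "First records": the first N rows in storage order. Fast, but the result
# depends on how the data happens to be ordered.
first_records = transactions.head(10_000)

# A random sample of the same size; random_state is fixed for repeatability.
random_sample = transactions.sample(n=10_000, random_state=0)

print(len(first_records), len(random_sample))
```

If the data is sorted (for example, by date), a first-records sample may not be representative of the whole dataset, which is one reason to consider another sampling method.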

#### Explore by Analyzing a Column

In this section, we’ll explore the *transactions\_2017* dataset by analyzing a column.

Often you want to perform a quick statistical analysis while exploring your data. Using the **Analyze tool** on a column shows the distribution and key metrics that can guide tasks, such as data cleaning or class rebalancing. These statistics can be calculated on the sample or the whole dataset.

* Click the *authorized\_flag* column name and choose **Analyze** from the menu.

Dataiku displays the percentage of valid, invalid, and empty values, as well as those values which appear only once. You can also see that 88.6% of the records from the sample are flagged as authorized, while the rest are flagged as fraudulent. The dataset is imbalanced - this is very common in machine learning.

As we build our data pipeline and prepare our data for training a machine learning model, we’ll need to take this imbalance into account. We’ll also need to better describe the “0’s” and “1’s” in the *authorized\_flag* column. But first, we’ll start with connecting to data to make sure we have all the data we need for this project.
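The class balance reported by the Analyze tool is a share of records per value. As a rough sketch of the same calculation in pandas (the ten values below are invented and do not reproduce the tutorial's figures):

```python
import pandas as pd

# Hypothetical sample of the authorized_flag column
# (1 = authorized, 0 = fraudulent).
flags = pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 0, 1], name="authorized_flag")

# Share of each class in the sample.
shares = flags.value_counts(normalize=True)
print(shares)
```

A large gap between the two shares is what "imbalanced" means here.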

##### On Your Own

In this section, we used the Analyze tool to calculate statistics on a sample of the dataset. Try calculating the statistics on the **Whole data**. You should see that closer to 90% of the records are actually flagged as authorized.


## Data Preparation Part I - Connect to Data Sources

In this section, we’ll add a new dataset to the Flow, in addition to those already present in the initial starting project.

* Go to the **Flow**.

* Click **+Dataset** in the top right corner of the Flow. Dataiku displays options for connecting to data such as uploading a file and connecting to SQL databases or Cloud storage.

When creating a new project, you’ll likely have data coming from various sources such as SQL databases. Dataiku makes it easy to connect to your data.

* Click **Upload your files**.

* Add the cardholder\_info file.

* Create the dataset. Dataiku creates the dataset and displays the Explore tab. This dataset contains cardholder information and includes a column, *internal\_card\_mapping*, that maps to the *card\_id* column in our transactions datasets. Later, we’ll use this column to join our datasets.

* Return to the **Flow**. To do this, you can always use the top navigation bar or the keyboard shortcut `G+F`.

In the next section, we’ll combine our input datasets and create a lookup table to describe the values in the *authorized\_flag* column of our transactions datasets.


## Data Preparation Part II - Combine Input Datasets

One of our goals is to build a machine learning model that can be used to predict if a credit card transaction is fraudulent or not. To do this, we’ll need to feed our model variables (or features) about these transactions. Before we can do this, we need to create a dataset that combines all of the information from the different input datasets.

In this section, we’ll use visual recipes such as **Stack** and **Join** to combine our input datasets. We’ll also add a lookup table, to satisfy the business request of making the *authorized\_flag* more descriptive. Later, we’ll apply transformations using built-in processors found in the Prepare recipe, and analyze the results.

### Stack Two Datasets

The *transactions\_2017* and *transactions\_2018* datasets contain information about credit card transactions that took place in 2017 and 2018, respectively. We can combine these datasets using the Stack recipe.

* From the top navigation bar, click the **Flow** icon to go to the Flow.

* Select the *transactions\_2017* dataset.

* Hold down the **Shift** key and select the *transactions\_2018* dataset.

* Open the right-side panel. Dataiku displays available actions applicable when you select two datasets in the Flow.

* Under **Visual recipes**, choose **Stack**. Dataiku displays the **New stack recipe** window.

* Name the output dataset `transactions_stacked` instead of the default name.

* **Create** the recipe. Dataiku displays the settings step.

* In the **Selected columns** step, for the **Columns selection** option, choose **Union of input schemas** (which is selected by default).

* **Run** the recipe.

Dataiku selects the execution engine that it detects to be optimal among those available. For this tutorial, that is the **DSS engine**.

If, for example, the input and output datasets were stored in a SQL database, and the recipe was a SQL-compatible operation, then the **in-database engine** would be chosen. This is not just for SQL code recipes. Many visual recipes and processors in the Prepare recipe can be translated to SQL queries.

In other situations, such as when working with HDFS or S3 datasets, the **Spark engine** would be available if it is installed.

* When the job has successfully completed, click **Explore dataset transactions\_stacked**.

Note

Actions in Dataiku, such as running a recipe or training a model, generate a job. You can view ongoing or recent jobs in the Jobs page from the top navigation bar or using the keyboard shortcut `G+J`.

Our output dataset looks like this:

* Return to the **Flow**.

Each time we apply a transformation to a dataset in the Flow, Dataiku places the output dataset directly after the recipe. Dataiku continues to organize the Flow for us, filtering out choices that would be inappropriate. This visual guidance gives us a way to immediately view the transformed dataset and easily return to its parent recipe.
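Conceptually, stacking with **Union of input schemas** appends the rows of both inputs and keeps every column found in either one. A minimal pandas sketch of that idea follows. The column names and values are hypothetical, and `item_category` is added to one input only to show how a column missing from the other input is filled with empty values.

```python
import pandas as pd

# Hypothetical miniature versions of the two transactions datasets.
t2017 = pd.DataFrame({
    "card_id": ["C_1", "C_2"],
    "purchase_amount": [10.0, 25.5],
    "authorized_flag": ["1", "0"],
})
t2018 = pd.DataFrame({
    "card_id": ["C_3"],
    "purchase_amount": [7.25],
    "authorized_flag": ["1"],
    "item_category": ["A"],  # hypothetical column present in one input only
})

# Append rows; the default outer join on columns yields the union of schemas.
stacked = pd.concat([t2017, t2018], ignore_index=True, sort=False)
print(stacked)
```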

### Create the Lookup Table

We can create an empty editable dataset to be used as a lookup table. This will help make the numeric values in the *authorized\_flag* column more meaningful. The new column we create, *authorized*, will be used to build the machine learning model–it will be our target (what we want to predict).

In our lookup table, we’ll describe “1” as “authorized” and “0” as “fraudulent”. When we have completed this task, the *authorized\_flag* column and our new *authorized* column will have the same statistics.

* From the top navigation bar, click the **Flow** icon to go to the Flow.

* In the upper-right corner of your screen, click **+ Dataset** then choose **Editable**.

* With **Create empty editable dataset** selected, name it `authorized_flag` then click **Create**.

Let’s now populate our table with values.

* Go to the **Edit** tab.

* Right-click the table to open the menu, then choose **Insert column after** to add a column.

* Right-click the table to open the menu, then choose **Insert row below** to add a row.

* Right-click the first column, then choose **Edit column** to edit the column’s name. Name the first column `authorized`.

Note

Both column storage types should be *string*. This is necessary because the column, *authorized\_flag*, in our transactions dataset is set to *string*. Later, when we join our datasets on the *authorized\_flag* column, we’ll want the storage types to match.

* Edit the second column and name it `authorized_flag`.

* Type the values `authorized` and `1` in the first row, and the values `fraudulent` and `0` in the second row.

Hint

Recall that when we viewed the project Wiki we discovered that the input dataset *transactions\_2017* contains known fraudulent transactions which are identified as “0” in the authorized\_flag column. For this reason, be sure that the first row contains the values `authorized` and `1`, while the second row contains the values `fraudulent` and `0`.

* **Save** the dataset.

Now we can join our stacked dataset with our lookup table.

### Join Datasets

In this section, we’ll join *transactions\_stacked* and *authorized\_flag*.

* Return to the Flow and double-click the dataset *transactions\_stacked* to open it.

* Open the right-side panel and choose **Join with…** from the **Visual recipes** section. Dataiku displays the **New join recipe** window.

* Select *authorized\_flag* as the second dataset.

* Name the output dataset `transactions_joined` instead of the default name.

* Leave the default options for **Store into** and **Format**. If you are using Dataiku Online, these options will be **dataiku-managed-storage** and **Parquet** rather than **filesystem\_managed** and **CSV**.

* **Create** the recipe.

Dataiku displays the settings for the Join step. We’ll need to review these settings before running the recipe. Dataiku has selected a left join by default and detected the column on which to join: *authorized\_flag*. This is correct.

To view and change any of these settings, you can click on the **Left join** to view join types and on the visual graphic to view the join conditions.

Next, we’ll review the selected columns, and select the columns of each dataset whose values we want to keep in the output dataset.

* Go to the **Selected columns** step in the left-side panel.

Dataiku has selected just the *authorized* column from our *authorized\_flag* dataset. We’ll keep this setting. For now, we’ll need to keep the *authorized\_flag* column in the *transactions\_stacked* dataset. We’ll make use of this numeric column later on when creating a map for visual analysis. We’ll need to remove it before training our machine learning model.

* Keep the default column selection.

* **Save** the recipe.

* **Run** the recipe.

Now that we’ve joined our stacked dataset with the lookup table, let’s add our other input datasets. There is no reason to add another Join recipe; we’ll simply add the datasets to our existing Join recipe.
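The left join with the lookup table can be sketched in pandas as follows. The rows are invented for illustration. Note that the key column is a string on both sides, which is the reason the lookup table's storage types were set to *string* earlier.

```python
import pandas as pd

# Hypothetical rows from the stacked transactions (flag stored as string).
transactions_stacked = pd.DataFrame({
    "card_id": ["C_1", "C_2", "C_3"],
    "authorized_flag": ["1", "0", "1"],
})

# The lookup table created in the editable dataset.
lookup = pd.DataFrame({
    "authorized": ["authorized", "fraudulent"],
    "authorized_flag": ["1", "0"],
})

# Left join: keep every transaction and attach the matching label.
transactions_joined = transactions_stacked.merge(
    lookup, on="authorized_flag", how="left"
)
print(transactions_joined)
```

With a left join, a transaction whose flag has no match in the lookup table would still appear in the output, with an empty *authorized* value.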

### Join More Datasets

Recall that one of our goals is to train a machine learning model by feeding it meaningful variables, or features. We’ve already discovered that two of our input datasets represent transactional data.

The other two input datasets represent merchant and cardholder information. These datasets contain latitude and longitude columns, among other useful information. If we could combine this information with our transactional data, the information could be useful in analyzing credit card fraud patterns and training our machine learning model.

Let’s add these datasets to our *transactions\_joined* dataset.

* From the Flow, open the Join recipe, *compute\_transactions\_joined*.

* Go to the **Join** step.

* Click the **+** button next to *transactions\_stacked* to add a new dataset to join with *transactions\_stacked*.

* In the info window that appears, select the *cardholder\_info* dataset.

* Click **Add dataset**.

As before, Dataiku displays the settings for the Join step. We’ll need to review these settings before continuing.

Warning

Dataiku has selected a left join by default and has not detected the columns on which to join. We’ll need to remap this join.

* Click **Add a condition** to display the column selection settings.

* Choose the *card\_id* column from *transactions\_stacked*.

* Check that *internal\_card\_mapping* is selected from *cardholder\_info*.

* Select **OK** to close the Join settings pop-up window.

Before we review the selected columns step, let’s add our final dataset, *merchant\_info*.

* Go back to the **Join** step and add another dataset to join with *transactions\_stacked*.

* Choose *merchant\_info* as the new input dataset and click **Add dataset**. The default join type of left join is correct.

* Click the visual join graphic to view the conditions of this newly added join. Dataiku has detected two join columns: *merchant\_id* and *merchant\_category\_id*. This is correct, so click **OK** to go back to the Join recipe settings.

As before, we’ll need to review the selected columns and select the columns of each dataset whose values we want to keep in the output dataset.

* Go to the **Selected columns** step.

* In *cardholder\_info* add a prefix of `card` to help identify the origin of the columns.

* Remove the first column, *card\_internal\_card\_mapping*. We won’t need this column.

* In *merchant\_info* add a prefix of `merchant`.

* Select *merchant\_latitude* and *merchant\_longitude* as additional columns.

* **Save** and **Run** the recipe to build the output dataset, *transactions\_joined*.

* When the job has successfully completed, click **Explore dataset transactions\_joined**.

* Return to the **Flow**.

The Flow now looks like this:

In the next section, we’ll apply date formatting, ID handling, and geographic processing to our dataset. Once our dataset is prepared, we’ll build a machine learning model.
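For readers who think in code, the two additional joins configured in this section roughly correspond to the pandas sketch below. All rows are invented, and the un-prefixed column names in the cardholder and merchant tables (`first_active_month`, `latitude`, `longitude`) are assumptions; only the join keys and the prefixed output names come from the tutorial.

```python
import pandas as pd

# Hypothetical miniature inputs.
transactions = pd.DataFrame({
    "card_id": ["C_1", "C_2"],
    "merchant_id": ["M_1", "M_9"],
    "merchant_category_id": [5, 7],
})
cardholder_info = pd.DataFrame({
    "internal_card_mapping": ["C_1", "C_2"],
    "first_active_month": ["2016-03", "2017-01"],
})
merchant_info = pd.DataFrame({
    "merchant_id": ["M_1"],
    "merchant_category_id": [5],
    "latitude": [40.7],
    "longitude": [-74.0],
})

# Join 1: keys have different names on each side, so map them explicitly.
# Prefix the cardholder columns, then drop the redundant key column.
cards = cardholder_info.add_prefix("card_")
joined = transactions.merge(
    cards, left_on="card_id", right_on="card_internal_card_mapping", how="left"
).drop(columns="card_internal_card_mapping")

# Join 2: two key columns with the same names on both sides.
merchants = merchant_info.rename(
    columns={"latitude": "merchant_latitude", "longitude": "merchant_longitude"}
)
joined = joined.merge(
    merchants, on=["merchant_id", "merchant_category_id"], how="left"
)
print(joined)
```

Because both joins are left joins, the transaction for merchant `M_9` is kept even though it has no match in the merchant table; its merchant columns are simply empty.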


#### On Your Own

Now would be a great time to post a reply to the discussion on the *transactions\_2017* dataset, letting your team members know that you’ve added a new column, *authorized*, to the *transactions\_joined* dataset that describes the *authorized\_flag* column. To do this, select the *transactions\_2017* dataset, select to view the discussion, then post a reply.

## Data Preparation Part III - Geo Processing

Now that we’ve stacked and joined our input datasets and added a lookup table, we’ll continue transforming our data to create features that will be used by our machine learning model. Specifically, we’ll apply date formatting, ID handling, and geographic processing. We’ll accomplish all of this in a single recipe, the Prepare recipe. The Prepare recipe has built-in processors, including geographic processors. We’ll even be able to group similar steps together.

### Format Dates

In this section, we’ll apply date formatting and calculations to several columns in our joined dataset. We’ll also create a column using a formula.

#### Create the Prepare Recipe

* In the Flow, click the *transactions\_joined* dataset once to select it.

* Open the right-side panel and choose the **Prepare** recipe from the **Visual recipes** section.

* Name the output dataset `transactions_joined_prepared`.

* Create the recipe.

Dataiku displays the **Script** tab of our *compute\_transactions\_joined\_prepared* recipe.

#### Parse Dates

Our first task is to parse, or format, the dates of our *card\_first\_active\_month* column to a standard date format so that we can make calculations with it.

* Go to the column *card\_first\_active\_month*.

Note

To make this task easier when working with many columns, you can press **C** on your keyboard to display the column search, then start typing *card* to search for the *card\_first\_active\_month* column.

* Click the column header to view the drop-down menu. Dataiku displays suggested actions that are contextual–that is, based on each column’s meaning. In this case, Dataiku has detected unparsed dates in this column, and one of the suggested actions is Parse date.

* Click **Parse date**.

Dataiku has detected two formats for this date column along with sample inputs and a preview of the output column. We could even add our own custom format here.

* Click **Use Date Format** to parse the date using the detected format **yyyy-MM**.

* Using this same method, parse the date in the *purchase\_date* column, creating a new *purchase\_date\_parsed* column.
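As a point of comparison, parsing a year-month string into a proper date looks like this in pandas. The Dataiku pattern **yyyy-MM** corresponds to `%Y-%m` in Python's format syntax; the sample values are invented.

```python
import pandas as pd

# Hypothetical unparsed values in the yyyy-MM format.
df = pd.DataFrame({"card_first_active_month": ["2016-03", "2017-11"]})

# Parse into datetimes; with no day given, pandas uses the first of the month.
df["card_first_active_month_parsed"] = pd.to_datetime(
    df["card_first_active_month"], format="%Y-%m"
)
print(df)
```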

#### Extract Date Components

Now that we have parsed the date in *purchase\_date*, we can extract date components from it.

* Go to the column *purchase\_date\_parsed*.

* Click the column header to view the drop-down menu and suggested actions.

Dataiku displays the suggested action, **Extract date components**, based on the values in the column.

* Click **Extract date components**.

* In the **Script** tab, define the output columns as follows:

+ Set the **Year column** to `purchase_year`.

+ Set the **Month column** to `purchase_month`.

+ Set the **Day column** to `purchase_day`.

+ Set the **Day of week column** to `purchase_dow`.

+ Set the **Hour column** to `purchase_hour`.

* Click **Save**.

* Collapse the **Extract date** step by clicking on it.
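The same extraction can be sketched in pandas. The dates are invented. One detail worth noting: pandas numbers weekdays from Monday = 0, so the sketch adds 1 to match a Monday = 1 through Sunday = 7 convention, which is what the weekend formula in the next section relies on.

```python
import pandas as pd

# Hypothetical parsed purchase dates (a Saturday and a Monday).
df = pd.DataFrame({
    "purchase_date_parsed": pd.to_datetime(
        ["2017-06-24 15:33:07", "2017-06-26 09:05:00"]
    )
})

d = df["purchase_date_parsed"].dt
df["purchase_year"] = d.year
df["purchase_month"] = d.month
df["purchase_day"] = d.day
# pandas uses Monday=0..Sunday=6; shift to Monday=1..Sunday=7.
df["purchase_dow"] = d.dayofweek + 1
df["purchase_hour"] = d.hour
print(df)
```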

#### Compute a Column Using a Formula

Next, we want to create a column, *purchase\_weekend*, using a formula. This column will identify which credit card transactions occurred on a weekend.

* Click **+ Add a New Step** in the **Script** tab. Dataiku displays the processors library.

* In the processors library, search for and select the **Formula** processor.

* Name the **Output column** `purchase_weekend`.

* Click **Open Editor Panel**.

* In the **Expression** field, type: `if(purchase_dow>5,1,0)`. Dataiku validates the formula.

* Click **Apply**.

Note

The day of week column identifies Saturday and Sunday as 6 and 7. This expression labels these days as the weekend.

* Click **Save**.

* Collapse the step by clicking on it.
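The formula is a simple conditional: any day-of-week value above 5 is flagged as a weekend. An equivalent sketch in Python, using invented day-of-week values:

```python
import numpy as np
import pandas as pd

# Day of week with Monday=1 ... Saturday=6, Sunday=7.
df = pd.DataFrame({"purchase_dow": [1, 5, 6, 7]})

# Same logic as the formula if(purchase_dow>5,1,0).
df["purchase_weekend"] = np.where(df["purchase_dow"] > 5, 1, 0)
print(df)
```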

#### Compute the Time Difference Between Two Columns

Now that we have the parsed columns *card\_first\_active\_month\_parsed* and *purchase\_date\_parsed*, we can use the standardized dates to compute their time difference. This information could be useful in analyzing credit card fraud patterns and could be a useful feature for training our machine learning model.

* Go to the column *card\_first\_active\_month\_parsed* and click the header to view the drop-down menu.

* Click **Compute time since**.

* In the **Script** tab, define the output columns as follows:

+ Set **until** to **Another date column**.

+ Set the **Other column** to *purchase\_date\_parsed*.

+ Name the **Output column** `days_active`.

* Click **Save**, and collapse the step by clicking on it. Notice that the Prepare recipe lets us preview each transformation before running the recipe.

* **Run** the recipe.
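In code, the time difference amounts to subtracting one date column from the other and expressing the result in days. The sketch below uses invented dates and assumes whole days, truncating any partial day; the processor's own rounding may differ.

```python
import pandas as pd

# Hypothetical parsed dates.
df = pd.DataFrame({
    "card_first_active_month_parsed": pd.to_datetime(
        ["2017-01-01", "2016-12-01"]
    ),
    "purchase_date_parsed": pd.to_datetime(
        ["2017-01-31 10:00:00", "2017-01-01 00:00:00"]
    ),
})

# Time from card activation until the purchase, in whole days.
elapsed = df["purchase_date_parsed"] - df["card_first_active_month_parsed"]
df["days_active"] = elapsed.dt.days
print(df)
```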

#### Group Similar Steps Together

Since we want to continue working in this Prepare recipe to add further transformations, let’s group our steps so they are easier to manage.

* In the **Script** tab, select all steps by clicking the checkbox for each step.

* Click the **Actions** menu, and then choose **Group**.

* Name the group something like `Date Processing`.

We’ll continue working in this Prepare recipe in the next section.

### Compute Geographic Features

Using the merchant and cardholder geographical data, we can compute the distance between card location and merchant location. This information could be useful in analyzing credit card fraud patterns. Having these additional columns will make useful features for training our machine learning model.

To accomplish this, we’ll use geographical processors available in the Prepare recipe: Create GeoPoint, Reverse-geocode, and Compute distance.

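Specifically, we’ll be adding five new steps to our Script:

* Create a GeoPoint for the merchant location.

* Create a GeoPoint for the cardholder location.

* Reverse geocode the merchant location.

* Reverse geocode the card location.

* Compute the distance between the two GeoPoints.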

Let’s begin!

#### Create GeoPoints

In this section, we’ll create merchant and cardholder GeoPoints. To make this process more efficient, we’ll first create our merchant GeoPoint, then copy the step, and use it to create our cardholder GeoPoint.

##### Create Merchant GeoPoint

* Return to the Prepare recipe if you’ve closed it.

* In the **Script** tab, click **+ Add a New Step**.

* In the processors library, search for `Create Geo`, and then choose **Create GeoPoint from lat/lon**.

* In the **Script**, define the configuration as follows:

+ Set the **Input latitude column** to `merchant_latitude`.

+ Set the **Input longitude column** to `merchant_longitude`.

+ Set the **Output GeoPoint column** to `merchant_location`.

Note

If the latitude and longitude columns are missing, return to the Join recipe and check the Selected columns step to be sure you’ve selected the columns.

##### Create Cardholder GeoPoint

* Click the **More options** menu of the last step and choose **Duplicate step**.

* In the new, duplicated step, define the configuration as follows:

+ Set the **Input latitude column** to `card_latitude`.

+ Set the **Input longitude column** to `card_longitude`.

+ Set the **Output GeoPoint column** to `card_location`.

* Click **Save**, and collapse the last step by clicking on it.
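
To make the idea of a GeoPoint concrete, here is a minimal sketch that formats a latitude/longitude pair as a well-known text (WKT) point. It assumes the common WKT convention `POINT(longitude latitude)`; the function name, the range checks, and the sample coordinates are illustrative assumptions rather than a description of the processor's internals.

```python
def make_geopoint(latitude: float, longitude: float) -> str:
    """Format a coordinate pair as WKT; note that WKT puts longitude first."""
    if not -90 <= latitude <= 90:
        raise ValueError(f"latitude out of range: {latitude}")
    if not -180 <= longitude <= 180:
        raise ValueError(f"longitude out of range: {longitude}")
    return f"POINT({longitude} {latitude})"


# Hypothetical merchant coordinates.
merchant_location = make_geopoint(latitude=40.7128, longitude=-74.006)
```

A frequent source of errors with geographic data is swapping latitude and longitude, which is why the sketch takes keyword arguments and validates both ranges.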

#### Apply Reverse GeoCoding

Let’s reverse geocode our merchant location. This will allow us to add additional columns to our dataset including merchant and cardholder state.

Note

To use the Reverse-geocoding processor, you must first install the Dataiku plugin called **Reverse Geocoding**. Please see Installing plugins. This plugin is included in the 14-Day Free Online Trial.

* In the **Script** tab, click **+ Add a New Step**.

* In the processors library, search for `reverse` and select **Reverse geocoding**.

Dataiku displays several output columns to configure, but we’ll only need one.

* Set the **Input column** to `merchant_location`.

* Go to **Output column for level 4 (region)** and name it `merchant_state`.

Similarly, we’ll reverse geocode our card location.

* In the **Script** tab, click **+ Add a New Step**.

* In the processors library, search for `reverse` and select **Reverse geocoding**.

As before, we’ll only need one of the output columns.

* Set the **Input column** to `card_location`.

* Go to **Output column for level 4 (region)** and name it `card_state`.

* Click **Save**, and collapse the last step by clicking on it.

#### Compute the Distance Between Two GeoPoints

With our merchant and card location geopoints computed, we can compute the distance between them.

* In the **Script** tab, click **+ Add a New Step**.

* In the processors library, search for `compute distance`, and then select **Compute distance between geopoints**.

* In the **Script**, define the configuration as follows:

+ Compute the distance between the *card\_location* column and **Another geopoint column**.

+ Set the **Other column** to `merchant_location`.

+ Set the **Output distance unit** to **Miles**.

+ Set the **Output column** to `merchant_cardholder_distance`.

* Click **Save**, and collapse the last step by clicking on it.
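
One standard way to measure the distance between two points on the globe is the haversine (great-circle) formula. The sketch below applies it in miles. It is an assumption that this approximates what the processor returns; Dataiku's exact method is not stated in this tutorial, and the function name, Earth radius constant, and sample coordinates are illustrative.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an approximation


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


# Hypothetical cardholder (New York) and merchant (Los Angeles) coordinates.
merchant_cardholder_distance = haversine_miles(40.7128, -74.0060, 34.0522, -118.2437)
```

A large distance between where a card is registered and where a purchase happens is the kind of signal this feature is meant to expose to the model.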

#### Group the Geo Processing Steps Together

Let’s group our geo processing steps together so that they are easier to identify.

* In the **Script** tab, select all geo processing steps by clicking the checkbox for each step.

* Click the **Actions** menu, and then choose **Group**.

* Name the group something like `Geo Processing`.

* **Save** and **Run** the recipe. Wait while Dataiku finishes running the recipe.

* Return to the **Flow** and proceed to the next lesson.

Tip

To check your work or view the final project, visit the read-only completed version on the public Gallery.

## Charts

In this section, we’ll build a type of chart known as a filled administrative map, showing the average of authorized transactions by merchant location. Such a map will allow us to visualize the relationship between merchant location and transactions flagged as fraudulent.

At the end of this section, you will have created a map and published it to the dashboard.

### Create a Chart

In the Explore the Dataset section, we learned about a few ways to perform preliminary dataset investigations using the Explore tab. In this section, we’ll work with the Charts tab. Similar to the Explore tab, the Charts tab is available whenever you open a dataset in the Flow.

Let’s create a map that will allow us to visualize the geographic relationship between merchant location and transactions flagged as fraudulent.

Since the values of *authorized\_flag* are numeric (consisting of “0’s” and “1’s”), we can find the average and map this average by merchant location.
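
The aggregation behind the map can be sketched in a few lines of Python: because the flag is 0 or 1, its average per region is simply the share of rows in that region whose flag is 1. The rows and state names below are hypothetical, and the grouping key stands in for the region that the map derives from *merchant\_location*.

```python
from collections import defaultdict

# Hypothetical rows: (merchant_state, authorized_flag).
rows = [
    ("California", 1),
    ("California", 0),
    ("Texas", 1),
    ("Texas", 1),
    ("California", 1),
]

totals = defaultdict(lambda: [0, 0])  # state -> [sum of flags, row count]
for state, flag in rows:
    totals[state][0] += flag
    totals[state][1] += 1

avg_authorized_by_state = {state: s / n for state, (s, n) in totals.items()}
```

In this toy data, Texas averages 1.0 and California averages two thirds, so the two states would be shaded differently on a filled map.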

* Go to the **Flow**.

* Open the *transactions\_joined\_prepared* dataset.

* Click on the **Charts** tab.

* Click the chart type menu (which by default is set to **Vertical bars**) in the upper left to view the chart options and types.

* Scroll down to the map chart types, then select the **Administrative map (filled)**.

* From the **Columns** panel on the left, search for the *merchant\_location* column.

* Drag and drop *merchant\_location* to the **Show** box on the right.

* From the **Columns** panel, search for *authorized\_flag*.

* Drag and drop *authorized\_flag* to **Details** to define the color of our zones.

* In the **Show** box, click on the column name, *merchant\_location*, to view the **Admin level** menu.

* Choose **Region/State** from the menu.

* Open the **Color** panel, then select **Red-green** from the **Palette** option.

* Edit the title of the chart to `Average of Authorized Flag by Merchant Location`.

Your chart now looks like this upon zooming in:

**Review the Initial Chart**

The chart appears to reveal that there is a higher than average number of transactions flagged as fraudulent in very few regions on the map. However, this result may not reflect the true composition of our data.

Recall that when we analyzed our authorized\_flag column using the Analyze tool, we discovered that only about 10% of the transactions were flagged as fraudulent. By default, Dataiku uses the same sample in Charts as in the Explore tab. Let’s check the sampling configuration for this chart to optimize it.

**Reconfigure the Sampling**

* Go to the **Sampling & Engine** panel, which is next to the Columns panel.

* Clear the checkbox **Use same sample as explore** so that you can choose a different sampling method.

* In **Sampling method**, select **Random (approx. nb. records)**.

* Change the **Nb. records** from 10000 to `100000` (one hundred thousand).

* Click **Save and Refresh Sample** to update the map.

Due to the sampling method, your chart may differ from the one shown below.

Using a random sampling method rather than the default sampling method of First records results in a significant change in our visualization. The random-sampling results reveal that the higher-than-average number of transactions flagged as fraudulent are concentrated in parts of the country that the default sampling was hiding.
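
The effect of the sampling method can be reproduced with a small, deliberately skewed example. In the sketch below, the data is constructed so that the first rows are not representative of the whole; the dataset, sample size, and seed are all assumptions chosen for illustration.

```python
import random

# Hypothetical dataset of authorized_flag values, ordered so that the
# first 10,000 rows are all 1 and the rest contain a 0 every tenth row.
flags = [1] * 10_000 + [0 if i % 10 == 0 else 1 for i in range(90_000)]


def mean(values):
    return sum(values) / len(values)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

first_records_mean = mean(flags[:10_000])
random_sample_mean = mean(rng.sample(flags, 10_000))
overall_mean = mean(flags)
```

The first-records sample reports an average of 1.0, while the true average is 0.91 and the random sample lands close to it. Any ordering in the source data (by date, region, or load batch) can bias a first-records sample in the same way.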

### Publish the Chart to a Dashboard

Using Dataiku, you can share visual insights with AI consumers. For example, a project owner could configure a project dashboard so that it displays on the homepage for those who already have access to the project.

Note

Visit the AI Consumer Quick Start–Consume Insights in a Dashboard lesson to learn how you can share elements of your data project with other users, including ones who may not have full access to your project.

To publish the chart:

* In the top right corner of the chart, click **Publish**.

Dataiku displays the **Create insight and add to dashboard** info window. The window includes a warning that the chart is based on a sample of the data.

* Click **Create**.

Dataiku creates the insight and adds it to the **Credit Card Fraud Patterns** dashboard on the **Credit Card Fraud Patterns** slide.

* Resize the chart on the slide by dragging the handles.

* **Save** your changes.

You can interact with and zoom in on the map in the **View** tab (near the top right of your window).

Let’s add a description to our dashboard to help AI consumers understand its purpose.

* Go to the **Details** panel on the right by clicking on the **i** button.

* Click **+ Add a description**.

* Type the following description:

Using a random sampling of 100,000 records, this chart helps to visualize the relationship between merchant location and transactions flagged as fraudulent.

* **Save** the description.

Note

Aside from coding your own, you can also build more advanced visualizations by sharing a dataset through a plugin with a dedicated visualization tool like PowerBI or Tableau. To find out more, visit Visualization Plugins.

* Proceed to the next lesson.

Tip

To check your work or view the final project, visit the read-only completed version on the public Gallery.

## Build a Machine Learning Model

Congratulations! You have prepared the data and now you are ready to build your machine learning model!

You might recall from the Getting Started section that our main goal is to build a machine learning model to predict whether or not a credit card transaction is fraudulent.

The model we’ll be building is a binary classification model. To build it, we’ll be using Dataiku’s visual machine learning interface to perform AutoML.

At the end of this section, your Machine Learning flow zone will look like this:

### Remove a Column to Prepare the Dataset

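Recall that our dataset still contains the column *authorized\_flag*, which carries the same information as our target column, *authorized*. We’ll need to remove this column before training our model.

Having both columns in our training dataset would result in data leakage, because the model could read the answer directly from *authorized\_flag*.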

To remove *authorized\_flag*:

* Go to the **Flow**.

* Select the *transactions\_joined\_prepared* dataset and add a new **Prepare** recipe to it.

* Name the output `transactions_before_split`.

* Click **Create Recipe**.

* Select the column header of the *authorized\_flag* column and click **Delete**.

Dataiku adds the **Delete** step to the Script.

* **Save** and **Run** the recipe.

* When Dataiku finishes running the recipe, click **Explore dataset transactions\_before\_split**.

The *authorized* column remains in the new output dataset while the redundant column, *authorized\_flag*, has been removed.

The dataset is ready to be split into known and unknown transactions.

Note

Keep the Explore tab open to complete the next section.

### Split the Transactions into Known and Unknown

To train our machine learning model, we’ll first need to split our transactions into known (those where we know the value of the *authorized* column) and unknown (those where we want to predict the value of the *authorized* column). To do this, we’ll use the **Split** recipe.

#### Use a Split Recipe

To split the transactions into known and unknown:

* With the *transactions\_before\_split* dataset open, go to the right-side panel and select **Split** from the Visual recipes. Dataiku displays the **Input/Output** window of the Split recipe.

* Click **+Add** then add a new dataset named `transactions_known`.

* Click **Create Dataset**.

* Click **+Add** then add another dataset named `transactions_unknown`.

* Click **Create Dataset**. Dataiku is now ready to create two new output datasets.

* Click **Create Recipe**.

Dataiku displays the **Splitting** panel of the **Settings** step.

* Click **Define filters**.

* Define the filter to match rows that satisfy **the following conditions** where *authorized* **is defined**.

* Be sure the output dataset for this conditional filter is set to *transactions\_known*.

Dataiku will put the remaining rows, where *authorized* is not defined, into the *transactions\_unknown* dataset.

* **Save** and **Run** the recipe. After performing the split, the first dataset contains all transactions where the authorized flag is known (i.e., the transaction is either “authorized” or “fraudulent”). The second dataset contains all transactions where the authorized flag is not known (i.e., empty).

* Return to the **Flow**.
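
The logic of this split can be sketched as a simple filter: rows where *authorized* has a value go to one output, and every remaining row goes to the other. The sample rows and the `transaction_id` field are hypothetical, and treating both `None` and the empty string as "not defined" is an assumption made for the example.

```python
# Hypothetical rows; None stands in for an empty "authorized" value.
transactions_before_split = [
    {"transaction_id": 1, "authorized": "authorized"},
    {"transaction_id": 2, "authorized": None},
    {"transaction_id": 3, "authorized": "fraudulent"},
    {"transaction_id": 4, "authorized": ""},
]


def is_defined(value) -> bool:
    """Treat None and empty strings as undefined."""
    return value is not None and value != ""


transactions_known = [
    row for row in transactions_before_split if is_defined(row["authorized"])
]
transactions_unknown = [
    row for row in transactions_before_split if not is_defined(row["authorized"])
]
```

Every input row lands in exactly one of the two outputs, which mirrors how the remaining rows are routed to *transactions\_unknown*.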

#### Move Flow Objects to a Flow Zone

Let’s move our new output datasets, as well as their parent recipe, the Split recipe, into the **Build a Machine Learning Model** Flow zone.

To do this:

* In the Flow, select the dataset, *transactions\_known*, hold down the **Shift** key, then select *transactions\_unknown*.

* Right-click *transactions\_unknown* to see the context menu.

* Choose **Move to a flow zone**.

* Select the **Build a Machine Learning Model** flow zone.

Dataiku lets us know that moving the dataset will also result in moving the Split recipe.

* Click **Move** to see the results.

The dataset, *transactions\_before\_split*, is part of both flow zones so Dataiku displays its icon with a dashed line.

### Train the Model

We want to train our machine learning model using the *transactions\_known* dataset so that we can later score the *transactions\_unknown* dataset.

Machine learning in Dataiku is a two-step process. First we explore models, design, train, and evaluate them in the **Lab**. Then, once we are satisfied with our best-performing model, we **Deploy** it from the lab to the Flow, where it appears as a **Saved model**.

Let’s get started!

* In the Flow, click the *transactions\_known* dataset once to select it.

* From the right-side panel, click on the **Lab**.

* Under **Visual ML**, choose **AutoML Prediction**.

* Set the target feature to *authorized*.

* With **Quick Prototypes** selected, click **Create**.

* Click **Train**. Dataiku begins building quick prototypes.

For this task, Dataiku trains two models by default: a random forest model and a logistic regression model.

Wait while Dataiku trains our model and displays results in the **Result** tab. The Result tab is where we inspect our model and assess its performance.

Note

In the Result tab, Dataiku keeps the history of all our trained models so that we can easily compare models side-by-side and reproduce results. This removes the burden of having to remember the feature selection methods used and model parameters specified alongside performance metrics.

If you ever want to pause and come back to the model building process later, you can find your quick modeling session by going to the Visual Analyses icon in the top navigation bar, then clicking on **Quick modeling of authorized on transactions\_known**. Once there, go to the **Models** tab.

### Assess the Model’s Performance

In this section, we’ll assess our model’s performance.

Building a machine learning model is an iterative process. Dataiku provides tools to help you assess your model’s performance and tune it.

#### View the Model Report

* When Dataiku finishes building the quick prototypes, explore the results.

In our **Result** tab, we can see two algorithms in **Session 1**. In this session, the random forest model is the higher-performing model based on the ROC AUC metric.

* Click **Random forest** to open it.

Dataiku displays the model’s Report page. This page includes interpretations such as the variable importance chart, and performance reports such as the confusion matrix.

Note

You can visit the course Machine Learning Basics to learn more about visual machine learning and to get hands-on practice with lessons like Hands-On: Evaluate the Model.

#### View the Confusion Matrix

One way to evaluate our model is by using a confusion matrix.

The **Confusion matrix** compares the actual values of the target variable with the predicted values (yielding counts such as false positives and false negatives) and reports associated metrics: precision, recall, and F1-score.

* Click **Confusion matrix** in the Performance section.

##### Interpreting the Confusion Matrix

Our machine learning model is a binary classification model. It classifies credit card transactions as fraudulent (positive) or not (negative), where “not fraudulent” is “authorized”. Therefore, there are four possible outcomes:

* True Positive (TP): A transaction is classified as fraudulent (positive) and is actually positive.

* True Negative (TN): A transaction is classified as not fraudulent (negative), or authorized, and is actually not fraudulent.

* False Positive (FP): A transaction is classified as fraudulent (positive) and is actually not fraudulent (negative).

* False Negative (FN): A transaction is classified as not fraudulent (negative), or authorized, and is actually fraudulent (positive).
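
The four outcomes above can be counted directly from pairs of actual and predicted labels. The following is a minimal sketch with made-up labels; the function name and the label strings are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive="fraudulent"):
    """Count TP, TN, FP, FN, treating `positive` as the positive class."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels for five transactions.
actual = ["fraudulent", "authorized", "authorized", "fraudulent", "authorized"]
predicted = ["fraudulent", "authorized", "fraudulent", "authorized", "authorized"]

counts = confusion_counts(actual, predicted)
```

For these five transactions the result is one true positive, two true negatives, one false positive, and one false negative.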

To help visualize the performance of the model, the Confusion matrix plots the four outcomes:

The decision chart can help us decide which evaluation metric Dataiku should use to evaluate our models.

#### Decide on an Evaluation Metric

We need to compare models based on their performance. A common evaluation metric is accuracy. However, accuracy only measures the proportion of correct predictions out of **all** predictions made.

Since our training dataset is imbalanced (only about 10% of the transactions are flagged as fraudulent), accuracy may not be the best metric for our model: a model that labeled every transaction as authorized would still be about 90% accurate while catching no fraud. We want to take into account false positives and false negatives. F1-score lets us do that.
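
A short calculation shows why accuracy can mislead on data like ours. The sketch below uses the standard definitions of accuracy, precision, recall, and F1-score on a hypothetical test set of 100 transactions; the counts are invented for illustration.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical test set: 100 transactions, 10 of them fraudulent (positive).
# A model that predicts "authorized" for every row never finds any fraud.
always_authorized = {"tp": 0, "tn": 90, "fp": 0, "fn": 10}

acc = accuracy(**always_authorized)
f1 = f1_score(always_authorized["tp"], always_authorized["fp"], always_authorized["fn"])
```

The do-nothing model scores 0.9 on accuracy but 0.0 on F1-score, which is why F1-score is the more informative metric when the positive class is rare.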

In the next section, we’ll iterate on the design of our model in the **Design** tab. The Design tab is where we modify the model’s design in an attempt to improve the model’s performance. In the Design tab, we can also do things like specify F1-score as our evaluation metric.

### Iterate on the Model Design

To review the model Design:

* In the model report, click **Models** in the breadcrumb at the top.

* Go to the **Design** tab.

#### Configure the Train / Test Set

* In the **Basic** section in the left-side panel, click **Train/Test Set**.

Dataiku displays the Train / Test Set settings.

You might recall that when we analyzed our *authorized\_flag* column using the Analyze tool, we discovered that only about 10% of the transactions were flagged as fraudulent. Since the default sampling method is the first 100,000 records, let’s try a class rebalancing sampling method to see if this results in model improvement.

Configure **Sampling & Splitting** as follows:

* Set the **Sampling method** to **Class rebalance (approx. nb. records)**

* Set the **Column** to use in rebalancing to *authorized*.

* **Save** your changes.

Note

If Dataiku displays the error, `Invalid argument`, letting you know the column chosen for class rebalancing does not exist, then go back to the **Sampling & Splitting** section and be sure you’ve chosen `authorized` as the **Column**.
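
To give an intuition for class rebalancing, the sketch below draws an equal number of rows from each class up to a target sample size. This is one simple way to rebalance, shown under stated assumptions; it is not a description of how Dataiku implements its **Class rebalance** sampling method, and the function name, seed, and sample data are hypothetical.

```python
import random
from collections import defaultdict


def class_rebalance(rows, column, target_size, seed=0):
    """Draw roughly target_size rows with an equal share per class where possible."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[column]].append(row)

    rng = random.Random(seed)
    per_class = target_size // len(by_class)
    sample = []
    for members in by_class.values():
        # A class with fewer rows than its share contributes all of its rows.
        sample.extend(rng.sample(members, min(per_class, len(members))))
    rng.shuffle(sample)
    return sample


# Hypothetical data: 900 authorized rows and 100 fraudulent rows.
rows = [{"authorized": "authorized"}] * 900 + [{"authorized": "fraudulent"}] * 100
balanced = class_rebalance(rows, "authorized", target_size=200)
fraud_share = sum(r["authorized"] == "fraudulent" for r in balanced) / len(balanced)
```

The input is 10% fraudulent, while the rebalanced sample of 200 rows is 50% fraudulent, giving the model more examples of the rare class relative to the common one.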

#### Select an Evaluation Metric

* In the **Basic** section in the left-side panel, click **Metrics**.

* In **Hyperparameter optimization and model evaluation**, click **AUC** to view the available metrics, then choose **F1 Score**.

#### Retrain the Model

* Click **Save** in the top right corner of the Design window.

* Click **Train** to let Dataiku know we want to retrain the model using these new settings.

* Click **Train** again.

While Dataiku retrains the model, we can set the evaluation metric in the Result tab to F1-score.

* In the **Result** tab, click the **Metric** drop-down list, then choose **F1 Score**.

When model training is complete, we can see two algorithms in **Session 2**.

The random forest model from Session 2 is now the higher-performing model based on the F1-score metric.

The amount of time devoted to iterating the model design and interpreting the results can vary considerably depending on your objectives. We could continue to iterate on our model in an effort to improve its performance. For now, we’ll use our best-performing model from Session 2 and deploy it to the Flow.

### Deploy the Model to the Flow

We’ll take our best performing model and use it to generate predictions for new data that the model has not seen.

Let’s deploy our random forest model from Session 2 to the Flow.

* Click the best performing model – the **Random Forest** model from Session 2 – to open it.

* Click **Deploy** near the top right corner, then click **Create**.

Dataiku deploys the model to the Flow. The Flow now looks like this:

Notice that there is now a training recipe (green circle), and a deployed model (green diamond) in the Flow.

In the next section, we’ll use the deployed model to score our unknown transactions.

Note

To learn more, visit the Machine Learning course on the Dataiku Academy (registration required).

### Score the Unknown Transactions Dataset

Now that we have a deployed model in the Flow, we can use it to generate predictions on new, unseen data. To do this, we’ll use the **Score** recipe. The Score recipe requires two inputs: a deployed model, and new, unseen data (*transactions\_unknown*).

Let’s use the model to predict if the unknown transactions are fraudulent or not.

* From the Flow, select the deployed model and add a **Score** recipe from the right-side panel.

* Choose *transactions\_unknown* as the input dataset.

* Name the output *transactions\_unknown\_scored*.

* Click **Create Recipe**.

* Click **Run**.

* Return to the Flow.

### Inspect the Scored Data

Let’s look at the scored data and review the predictions.

* Open the *transactions\_unknown\_scored* dataset and observe the three new columns appended to the end:

+ *proba\_authorized* is the probability that a transaction is authorized.

+ *proba\_fraudulent* is the probability that a transaction is fraudulent.

+ *prediction* is the model’s prediction of whether the transaction is fraudulent.

Note

If we had used the *authorized\_flag* column instead of the *authorized* column, the two “proba” columns would have been named “proba\_0” and “proba\_1”. For our use case, the more descriptive columns are easier for us to interpret.
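
The relationship between the probability columns and the *prediction* column can be sketched as a threshold rule. The threshold of 0.5 used here is purely an assumption for illustration; the threshold that Dataiku actually applies to a deployed model may be different, and the function name and sample probabilities are hypothetical.

```python
def predict_label(proba_fraudulent: float, threshold: float = 0.5) -> str:
    """Label a row as fraudulent when its probability reaches the threshold."""
    return "fraudulent" if proba_fraudulent >= threshold else "authorized"


# Hypothetical scored rows; in a binary model the two probabilities sum to 1.
scored = [{"proba_fraudulent": p, "proba_authorized": 1 - p} for p in (0.05, 0.5, 0.92)]
predictions = [predict_label(row["proba_fraudulent"]) for row in scored]
```

Lowering the threshold flags more transactions as fraudulent (fewer missed frauds, more false alarms), and raising it does the opposite.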

* Proceed to the next lesson.

Tip

To check your work or view the final project, visit the read-only completed version on the public Gallery.

#### On Your Own

Try using the **Analyze** tool on the *prediction* column to see what percentage of the 2018 transactions are predicted to be fraudulent by the model. To do this, click the column header of the column then select **Analyze**. Choose **Whole data** instead of **Sample** to get statistics on the full dataset.

## Schedule the Build using a Scenario

In this section, we’ll create an automation scenario to automatically compute metrics, run checks, and rebuild the deployed model in the Flow.

### Configure Dataset Metrics and Checks

Metrics and checks are a crucial element used in automation and can help ensure the quality of a workflow.

Let’s say we expect to add new data to our workflow every day. Adding new data means that the percentage of new, unknown transactions–that is, those transactions with no value in the *authorized\_flag* column–will increase. We might want to be able to measure the number of “No value” transactions so that we can tell Dataiku what action we want it to take when our scenario runs. To do this, we need to define a metric.

Let’s take a look at the types of metrics that we’ll be checking for.

* Go to the **Flow**.

* Open the dataset, *transactions\_joined\_prepared*.

* Navigate to the **Status** tab.

* Click **Compute**. These metrics count the number of columns and records.

* Click the **Edit** subtab to display the **Metrics** panel.

Two built-in metrics have been turned on: Columns count and Records count. Metrics can be set to auto compute after the dataset is built.

Let’s create a new metric.

* Toggle the **Column Statistics** metric to **On**.

* Locate the *authorized\_flag* column, and select **Empty value count**.

* Click **Save**.

* Compute the metric by clicking **Click to run this now** > **Run**.

Dataiku displays the last run results.

Now we’ll add this new metric to **Metrics to display** and recompute the metrics.

* Navigate to the **Metrics** subtab again.

* Click the **Metrics selection** (the button that says **4/12 Metrics**) to view **Metrics display settings**.

* Add **Empty value count of authorized\_flag** to **Metrics to display**.

* Click **Save** to view the computed metrics.

Let’s add a check so we know when this metric falls outside an acceptable range. For our purposes, let’s say we only want to *automatically* rebuild the model if the count of “No value” records stays below 260,000 for now. This will help us monitor the number of records while we are designing the Flow.

* Open the **Edit** subtab again.

* Click to view the **Checks** panel.

* Add a new check to check when a **Metric Value is in a Numeric Range**.

* Name it `Check Count of No Authorized Flag Value`.

* The metric to check is **Empty value count of authorized\_flag**.

* Set the **Soft maximum** to `200000` and the **Maximum** to `260000`.

* Click **Save**.

The check is now configured to warn, but not fail, when the value is above the soft maximum of 200,000; it fails only when the value is above the maximum of 260,000. However, when we create a step in our scenario to run checks, we can tell Dataiku to consider warnings as failures.
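
The metric and the check can be pictured as two small functions: one counts empty values in a column, and the other compares a number against a soft maximum and a hard maximum. This is a sketch of the logic only; the function name, the outcome strings, and the sample values are assumptions made for the example.

```python
def check_numeric_range(value, soft_maximum=None, maximum=None):
    """Return "ERROR" above the maximum, "WARNING" above the soft maximum, else "OK"."""
    if maximum is not None and value > maximum:
        return "ERROR"
    if soft_maximum is not None and value > soft_maximum:
        return "WARNING"
    return "OK"


# Hypothetical column values; None and "" count as empty.
authorized_flag = [1, None, 0, "", 1, None]
empty_value_count = sum(v is None or v == "" for v in authorized_flag)

outcomes = [
    check_numeric_range(n, soft_maximum=200_000, maximum=260_000)
    for n in (150_000, 230_000, 300_000)
]
```

With the thresholds from this section, a count of 150,000 passes, 230,000 produces a warning, and 300,000 produces an error.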

In the next section, we’ll create an automation scenario to prompt Dataiku to run the metrics and checks.

### Create an Automation Scenario

In this section, we’ll automate computing the metric and running the check we just created. When our check passes, our scenario will instruct Dataiku to rebuild the deployed model.

Note

**What are Scenarios?**

Automation scenarios are a set of actions that are scheduled to run when certain conditions are satisfied. For example, if you have new data that comes in regularly, such as once per day, you can create a scenario that runs the workflow once per day, or each time it detects a change in the dataset.

Together with metrics and checks, **scenarios** can be used to create validation feedback loops that automate many important parts of the project lifecycle. You can use scenarios to instruct Dataiku to initiate jobs using pre-defined triggers, such as a unit of time or modification of a dataset, or custom Python code. Examples of these jobs include rebuilding a dataset, retraining a model, or redeploying an application bundle. You can even add **reporters** to stay informed of scenario activity.

* Go to the **Flow**.

* From the **Jobs** dropdown in the top navigation bar, select **Scenarios**.

* Click **Create your first scenario**.

* Name your scenario `Build Deployed Model` and click **Create**.

#### Add a Trigger

Let’s start by adding a simple time-based trigger.

* Within the Triggers panel of the Settings tab, click the **Add Trigger** dropdown button.

* Add a **Time-based trigger**.

* Name it `Every 3 days`.

* Change **Repeat every** to 3 days.

* Set **Starting from** to the first of the month.

* Set **Run at** to **05:00 AM**.

* For the purposes of this tutorial, make sure the activity status toggle is **Off**.

Note

Depending on your use case, you could set the toggle to **On** so that the scenario runs automatically on this schedule.
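To see what an "every 3 days" cadence looks like, here is a minimal Python sketch that lists upcoming run times. It assumes runs fall exactly every three days from the start date at 05:00. Dataiku's scheduler may resolve dates differently, and the function name `next_runs` and the 2023 dates are illustrative only.

```python
from datetime import datetime, timedelta
from typing import List


def next_runs(start: datetime, every_days: int, after: datetime, count: int) -> List[datetime]:
    """List the next `count` run times strictly after `after`.

    Assumes runs fall on start, start + every_days, start + 2 * every_days, ...
    """
    step = timedelta(days=every_days)
    if after < start:
        run = start
    else:
        # Whole intervals elapsed since the start, plus one to move past `after`.
        intervals = (after - start) // step + 1
        run = start + intervals * step
    return [run + i * step for i in range(count)]


# Illustrative dates: start on the first of the month at 05:00.
start = datetime(2023, 1, 1, 5, 0)
now = datetime(2023, 1, 5, 12, 0)
for run in next_runs(start, every_days=3, after=now, count=3):
    print(run.isoformat())
```

In this example the upcoming runs fall on January 7, 10, and 13 at 05:00.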

Optionally, you can add reporters to your scenario. Click **Add Reporter** to view the different types of reporters available. Visit Reporting on scenario runs for more information.

#### Add Steps

Now that we have a trigger in place, we need to provide the steps that the scenario will take when a trigger is activated.

We’ll start by adding a step to compute dataset metrics.

* Navigate to the **Steps** tab.

* Click **Add Step** and select **Compute metrics** from the list.

* Name it `transactions_joined_prepared`.

* Add *transactions\_joined\_prepared* as the dataset to compute.

Now, we’ll add a step to run dataset checks.

* Click **Add Step** and select **Run checks**.

* Name it `transactions_joined_prepared`.

* Add *transactions\_joined\_prepared* as the dataset to check.

* Change the **Outcome on warnings** to **Failed** so that warnings are considered as failures.

Finally, let’s add a step to build the deployed model when all the dataset checks pass.

* Click **Add Step** and select **Build/Train**.

* Name it `deployed model`.

* Add the random forest model as the model to build.

Note

This step has **Build required datasets** as the build mode. This is the default build mode. To learn more, visit the Knowledge Base article, Can I control which datasets in my Flow get rebuilt during a scenario?

* Click **Run** to manually trigger the scenario.

* Navigate to the **Last runs** tab to see results of scenario runs.
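The logic of the three steps can be sketched in plain Python. This is a simulation, not the Dataiku scenario API: the helper names `run_scenario` and `make_steps` are hypothetical, and the sketch assumes that a failed step stops the scenario so later steps are skipped.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], str]]


def run_scenario(steps: List[Step]) -> List[Tuple[str, str]]:
    """Run steps in order and stop after the first step that fails."""
    results = []
    for name, action in steps:
        outcome = action()
        results.append((name, outcome))
        if outcome == "FAILED":
            break  # later steps, such as the model build, are skipped
    return results


def make_steps(empty_count: int, warnings_as_failures: bool = True) -> List[Step]:
    """Build the three tutorial steps for a given metric value (hypothetical helper)."""

    def compute_metrics() -> str:
        return "SUCCESS"  # stands in for recomputing the dataset metrics

    def run_checks() -> str:
        if empty_count > 260_000:  # above the maximum: the check fails
            return "FAILED"
        if empty_count > 200_000:  # above the soft maximum: the check warns
            return "FAILED" if warnings_as_failures else "WARNING"
        return "SUCCESS"

    def build_model() -> str:
        return "SUCCESS"  # stands in for rebuilding the deployed model

    return [
        ("Compute metrics", compute_metrics),
        ("Run checks", run_checks),
        ("Build/Train", build_model),
    ]


for count in (150_000, 230_000):
    print(count, run_scenario(make_steps(count)))
```

With warnings treated as failures, a count of 230,000 stops the run at the checks step, so the model is rebuilt only when the count stays at or below the soft maximum.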

## Join Datasets and Keep Unmatched Rows

In this section, we’ll bring in a new version of the cardholder info dataset whose customer IDs and reward programs differ from the existing one. We’ll then perform an inner join between the two datasets and capture all of the unmatched rows.

### Prerequisites

To complete this section of the tutorial, you’ll need the following:

* The cardholder\_info\_join\_scenario CSV Zip file.

* Dataiku version 11.3 or above. If you are using a previous version, you can perform the same type of join using the Join and keep unmatched plugin.

### Add a Flow Zone and Import Data

To keep our Flow organized, let’s create a new flow zone for this section of the course.

* In the Flow, create a new, empty flow zone called `Join and Keep Unmatched`.

* Right-click on the *cardholder\_info* dataset and move it to the new flow zone.

Next, import the new dataset that we’ll join with the cardholder info data.

* Download the cardholder\_info\_join\_scenario file.

* Back in the Flow, with the new flow zone selected, click **+Dataset** in the top right corner and select **Upload your files**.

* Drag and drop or select the *cardholder\_info\_join\_scenario* file in the space provided.

* Click **Create** to create the dataset and add it to the Flow.

This new dataset contains a *reward\_program* column with the value *online\_shopping*, which is not found in the existing *cardholder\_info* dataset. There may also be differences in customer IDs (*internal\_card\_mapping*) between the two datasets. We’ll capture unmatched rows from our join to find out.

### Join the Datasets

Next, create the join recipe with the two datasets and select the join conditions.

* In the Flow, select the *cardholder\_info* dataset and go to the right panel, then select **Join with…** under **Visual recipes**.

* In the **New join recipe** info window, add **cardholder\_info\_join\_scenario** as the second dataset, then select **Create recipe**.

Dataiku navigates to the join recipe settings, where it has preselected several columns to match on. We only want to keep two of these: *internal\_card\_mapping* and *reward\_program*.

* Click on the visual representation of the join conditions to open the conditions info window.

* Use the trashcan buttons to **delete** the join conditions *first\_active\_month*, *latitude*, and *longitude*.

* Check that the only two conditions left are *internal\_card\_mapping* and *reward\_program*.

* Click **OK**.

* Back in the **Join** step of the recipe settings, select **Inner join** from the join type dropdown menu, and select **Send unmatched rows to other output dataset(s)**.

You can save unmatched rows in output datasets when completing inner, left or right joins. Because this is an inner join, we have the option to save unmatched rows from both of our input datasets into two separate datasets. If this were a left join, we’d only be able to add a right unmatched dataset, and vice versa for right joins.

Let’s add two datasets to save unmatched rows from each input dataset.

* On the left, select **+Add dataset** and name the new dataset `cardholder_info_left_unmatched`, then click **Use dataset**.

* Do the same on the right side, naming the dataset `cardholder_info_right_unmatched`. The two datasets now appear in the join settings.

* Navigate to the **Selected columns** step of the recipe.

* On the right side, add the prefix `join` and select the first column *join\_internal\_card\_mapping* to include in the join results.

* **Save** and **Run** the recipe.

### Explore the Output Datasets

The flow zone now contains three output datasets from the join recipe. Let’s explore them.

* Open *cardholder\_info\_joined* to explore it. This dataset contains the matching rows from both datasets. More specifically, this dataset contains every row from both input datasets where a match was found on both the customer ID (*internal\_card\_mapping*) and the reward program. Therefore, this dataset does not contain any rows with new customer IDs or new reward programs.

* Analyze the *reward\_program* column.

* Return to the Flow.

* Open *cardholder\_info\_left\_unmatched* to explore it. This dataset contains all of the rows from the first dataset that did not match the second dataset on both *internal\_card\_mapping* and *reward\_program*. Therefore, this dataset represents the customer IDs that are no longer part of the new cardholder dataset, along with any existing customer IDs whose reward program differs in the new dataset.

* Return to the Flow.

* Open *cardholder\_info\_right\_unmatched* to explore it. This dataset contains all of the rows from the second dataset that did not match the first dataset on both *internal\_card\_mapping* and *reward\_program*. Therefore, this dataset represents all new customer IDs, along with any existing customer IDs that are part of the new reward program, that is, *online\_shopping*.
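The relationship between the three outputs can be reproduced on a few rows with a minimal Python sketch. The sample rows, the reward program values `cash_back` and `travel`, and the helper name `join_keep_unmatched` are made up for illustration. The sketch also keeps every right-hand column with a `join_` prefix, which is a simplification of the column selection made in the recipe.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]
KEYS = ("internal_card_mapping", "reward_program")


def join_keep_unmatched(left: List[Row], right: List[Row]) -> Tuple[List[Row], List[Row], List[Row]]:
    """Inner join on KEYS; also return the unmatched rows from each side."""

    def key(row: Row) -> Tuple[str, ...]:
        return tuple(row[k] for k in KEYS)

    left_keys = {key(row) for row in left}
    right_by_key: Dict[Tuple[str, ...], List[Row]] = {}
    for row in right:
        right_by_key.setdefault(key(row), []).append(row)

    joined = []
    for lrow in left:
        for rrow in right_by_key.get(key(lrow), []):
            # Prefix right-hand columns with "join_" to avoid name clashes.
            joined.append({**lrow, **{f"join_{k}": v for k, v in rrow.items()}})
    left_unmatched = [row for row in left if key(row) not in right_by_key]
    right_unmatched = [row for row in right if key(row) not in left_keys]
    return joined, left_unmatched, right_unmatched


# Made-up sample rows for illustration only.
old = [
    {"internal_card_mapping": "C1", "reward_program": "cash_back"},
    {"internal_card_mapping": "C2", "reward_program": "travel"},
    {"internal_card_mapping": "C3", "reward_program": "cash_back"},
]
new = [
    {"internal_card_mapping": "C1", "reward_program": "cash_back"},
    {"internal_card_mapping": "C2", "reward_program": "online_shopping"},
    {"internal_card_mapping": "C4", "reward_program": "travel"},
]
joined, left_unmatched, right_unmatched = join_keep_unmatched(old, new)
print(len(joined), len(left_unmatched), len(right_unmatched))
```

Here only C1 matches on both keys. C3 (dropped) and C2 (program changed) land in the left unmatched output, while C4 (new customer) and C2 (now in *online_shopping*) land in the right unmatched output.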

## What’s Next?

Congratulations! You have completed the tutorial! In a short amount of time, you were able to:

* connect to data by importing a CSV file;

* stack, join, and prepare data;

* create a lookup table;

* create a map and publish it to a dashboard;

* train a machine learning model and use it to make predictions on new data;

* collaborate with team members;

* configure an automation scenario; and

* find out new information about a dataset using a custom join.

Tip

To check your work or view the final project, visit the read-only completed version on the public Gallery.

This quick start tutorial is only the tip of the iceberg when it comes to the capabilities of Dataiku. To learn more, please visit the Academy, where you can find more courses, learning paths, and certifications to test your knowledge.

### Next Steps

**Managing Where Projects are Stored**

Now you have a Dataiku Academy project on your Dataiku instance. You may want to manage where it is stored, and be able to easily locate it later on.

To do this, you can move the Dataiku Academy project into a project folder. Anyone on the instance can find the project using the Global Search, which searches across the Dataiku instance.

### Go Further

**Replicating a Process for Less-Experienced Team Members**

By saving your project as a Dataiku application, you can let less-experienced team members, including those who are not Dataiku users, replicate a process. For example, let’s say team members would like to analyze “Average of Authorized Flag by Card Location” rather than “Merchant Location”. To get hands-on practice with creating your own Dataiku application, visit the Dataiku Applications Tutorial.

**Operationalization**

The lifecycle of a data or machine learning project doesn’t end once a Flow is complete. Workflows and models need to be continuously improved and fed new data. You discovered how automation allows you to do this more efficiently in the design of the Flow. You can learn more by visiting the Automation course on Dataiku Academy or by browsing these topics and articles on the Dataiku Knowledge Base.
