# Hands-On Tutorial: Column-Based Partitioning

## Let’s Get Started!

In this tutorial, we will modify a completed Flow dedicated to the detection of credit card fraud. Our goal is to be able to predict whether or not a transaction is fraudulent based on the transaction subsector (e.g., insurance, travel, or gas) and the purchase date.

### Prerequisites

Your Dataiku instance must already have a SQL connection defined in order to complete this tutorial. For an overview of which databases are supported by Dataiku, see the SQL databases reference documentation.

### Objectives

Our Flow should meet the following business objectives:

* Be able to create time-based computations.

* Train a partitioned model to test the assumption that fraudulent behavior is related to merchant subsector, which is discrete information.

To meet these objectives, we will work with both discrete and time-based partitioning. Specifically, by the end of the lesson, we will accomplish the following tasks:

* Partition a dataset for the purpose of optimizing a flow and creating targeted features.

* Create the following partitions:

  + Time-based partitioning. We will partition by purchase date for the purpose of targeting a specific date.

  + Discrete partitioning. We will partition by subsector (e.g., internet, gas, or travel) for the purpose of targeting a specific subsector.

* Propagate partitioning in a Flow.

* Stop partitioning in a Flow, collecting the partitions.

### Project Workflow Overview

The final pipeline in Dataiku is shown below.

## Create the Project

* From the Dataiku homepage, click **+New Project > DSS Tutorials > Advanced Designer > Advanced Partitioning: Column-Based (Tutorial)**.

**Note:** You can also download the starter project from this website and import it as a zip file.

### Explore the Flow

The Flow contains the following datasets:

* *cardholder\_info* contains information about the owner of the card used in the transaction process.

* *merchant\_info* contains information about the merchant receiving the transaction amount, including the merchant subsector (*subsector\_description*).

* *transactions\_* contains historical information about each transaction, including the *purchase\_date*.

### Change the Dataset Connections

To work with column-based partitioning, our datasets must have a SQL connection. In this step, we will change the connections of all our datasets from the local filesystem to the SQL connection defined on your Dataiku instance.

* Go to the Flow.

* Select all datasets except for the initial input datasets.

* Open the right panel, then click **Change connection**.

Dataiku displays the change connection dialog box.

* In **New connection**, select your SQL connection.

* Select **Drop data**. We won’t need the filesystem datasets.

* Click **Save**.

The Flow looks like this:

## Build a Time-Based Partitioned Dataset

In this section, we will partition the dataset, *transactions\_copy*, using the *purchase\_date* dimension. The column, *purchase\_date*, contains values representing the date of the credit card transaction. For the purposes of this hands-on lesson, the number of distinct purchase dates has been limited to 15 values.

### Create a Partitioned Output

In this section, we will create a partitioned dataset, *transactions\_partitioned\_by\_day*, without duplicating the data. To do this, we will create a new logical pointer to the same SQL table.
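As a rough mental model of what a "logical pointer" means here, the sketch below uses Python's built-in `sqlite3` module in place of your SQL connection. Two hypothetical dataset descriptors (`DatasetDef` is an illustration, not a Dataiku class) reference the same table. Only the partitioned one carries a partitioning column, and reading one of its partitions amounts to filtering on that column. The table layout and sample rows are made up.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetDef:
    """Hypothetical stand-in for a dataset definition (not a Dataiku class)."""
    name: str
    table: str
    partition_column: Optional[str] = None  # None means "not partitioned"


def read_rows(conn, dataset, partition=None):
    """Read a whole dataset, or a single partition of a partitioned dataset."""
    # Table and column names come from trusted definitions in this sketch.
    query = f"SELECT * FROM {dataset.table}"
    params = ()
    if partition is not None:
        if dataset.partition_column is None:
            raise ValueError(f"{dataset.name} is not partitioned")
        query += f" WHERE {dataset.partition_column} = ?"
        params = (partition,)
    return conn.execute(query, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions_copy (id INTEGER, purchase_date TEXT)")
conn.executemany(
    "INSERT INTO transactions_copy VALUES (?, ?)",
    [(1, "2017-12-20"), (2, "2017-12-20"), (3, "2017-12-21")],  # made-up rows
)

# Two definitions, one physical table: no data is copied.
plain = DatasetDef("transactions_copy", "transactions_copy")
by_day = DatasetDef("transactions_partitioned_by_day", "transactions_copy",
                    partition_column="purchase_date")

all_rows = read_rows(conn, plain)
one_day = read_rows(conn, by_day, partition="2017-12-20")
```

Because both definitions name the same table, anything written through one is immediately visible through the other.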

#### Create *transactions\_partitioned\_by\_day*

* Run the Sync recipe that was used to create *transactions\_copy* so that the dataset is built.

* Open *transactions\_copy* and view the **Settings** page.

* Copy the contents of the **Table** and **Schema** fields, then save the information for later.

* Return to the Flow.

* In the upper-right-hand corner, click **+Dataset**.

* From the **SQL databases** menu, select your SQL connection.

Dataiku displays the configuration page for the SQL connection you selected.

* Paste the information you saved into the **Table** and **Schema** fields.

* Click **Test Table**.

**Note:** If the table is undefined, you can locate it by getting the list of tables.

* Next to **Partitioned**, click **Activate Partitioning** to enable column-based partitioning for this SQL dataset.

* Configure the dimension as follows:

  + Name the dimension `purchase_date`.

  + Set the dimension type to **Time range**.

  + Set the range to **Day**.
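To make the **Day** range concrete: with a time-range dimension, each row belongs to the partition named after the calendar day of its `purchase_date`. The sketch below assumes identifiers of the form `YYYY-MM-DD` (the format used for partition values later in this tutorial) and ISO-formatted input strings. The `day_partition` helper and the sample values are hypothetical.

```python
from datetime import datetime


def day_partition(purchase_date: str) -> str:
    """Map an ISO date or timestamp string to a YYYY-MM-DD identifier."""
    return datetime.fromisoformat(purchase_date).date().isoformat()


# Made-up values: two purchases on the same day, one on the next day.
samples = ["2017-12-20T08:15:00", "2017-12-20T23:59:59", "2017-12-21"]
partitions = sorted({day_partition(value) for value in samples})
```

Here the three sample rows fall into two day partitions, since the first two share a calendar day.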

Let’s create this dataset.

* In the upper-right-hand corner, name the new dataset `transactions_partitioned_by_day`.

* Click **Create**.

#### Configure the Join Recipe

Configure the Join recipe to change the input dataset from *transactions\_copy* to *transactions\_partitioned\_by\_day*:

* Run the Sync recipe that was used to create *merchant\_info\_copy*.

* Run the Sync recipe that was used to create *cardholder\_info\_copy*.

* Open the Join recipe that was used to create *transactions\_joined*.

* Click to replace the *transactions\_copy* dataset.

* Replace the *transactions\_copy* dataset with *transactions\_partitioned\_by\_day*.

* Click **Replace Dataset**.

* In the **Input / Output** tab, set the partition dependency function type to “All available”.

* Save and run the recipe.

* Return to the Flow.

Our Flow now looks like this:

## Propagate the Time-Based Partition Dimension Across the Flow

Now that our dataset is partitioned, we can propagate the partitioning dimension across the Flow.

### Partition the *transactions\_joined* Dataset

Partition *transactions\_joined* and identify the partition dependencies using the following steps:

* Open the *transactions\_joined* dataset.

* View **Settings** > **Connection**.

* Click **Activate Partitioning**.

* Configure the dimension as follows:

  + Name the dimension `purchase_date`.

  + Set the dimension type to **Time range**.

  + Set the range to **Day**.

* Click **Save** and click **Confirm** to confirm that dependencies with this dataset will be changed.

The datasets *transactions\_partitioned\_by\_day* and *transactions\_joined* are now both partitioned by the *purchase\_date* dimension.

### Identify the Partition Dependencies

Now we must identify which partition dependency function type (e.g., All available, Equals, Time Range, etc.) we want to map between the two datasets, *transactions\_partitioned\_by\_day* and *transactions\_joined*.

To do this, modify the Join recipe:

* Open the Join recipe and view the **Input / Output** tab.

By default, Dataiku sets the partition mapping so that the output dataset is built, or partitioned, using all available partitions in the input dataset.

To optimize the Flow, we only want to target the specific partitions where we are asking the recipe to perform computations.

* In the partitions mapping, choose “Equals” as the partition dependency function type.

* View the **Settings** tab.

* Click the **Recipe run options** icon next to the **Run** button, and select **Specify explicitly**.

* Type the value `2017-12-20,2017-12-21,2017-12-22,2017-12-23`, then click **OK**.

* Without running the recipe, click **Save** and accept the schema change.

The recipe is now configured to compute the Join only on the rows whose purchase date falls on one of the four days from “2017-12-20” through “2017-12-23”.

* Do not run the recipe yet.
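Typing a long list of days by hand is error-prone. As a small convenience, the comma-separated partition value entered in the steps above can be generated from its first and last day. `day_list` is a hypothetical helper; its output is simply pasted into the dialog.

```python
from datetime import date, timedelta


def day_list(first: date, last: date) -> str:
    """Return every day from first to last (inclusive), comma-separated."""
    if last < first:
        raise ValueError("last must not be before first")
    days = (last - first).days + 1
    return ",".join((first + timedelta(days=i)).isoformat() for i in range(days))


spec = day_list(date(2017, 12, 20), date(2017, 12, 23))
print(spec)
```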

## Run a Partitioned Job

Let’s run a partitioned job.

* Return to the Flow.

* Right-click *transactions\_joined* and select **Build** from the menu.

* Choose a **Recursive** build.

* Select **Smart reconstruction**.

* Click **Build Dataset**.

### View Job and Activity Log

Let’s view the job activity log to reveal how partitioned SQL jobs are managed.

* View the most recent **Job**.

* View the job Activities.

The **Activity Log** for the partition “2017-12-20” contains the following queries:

* A query to delete rows from *transactions\_joined* where the purchase date is “2017-12-20”.

* A query to insert the rows belonging to the new partition into *transactions\_joined*.
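This delete-then-insert pattern is what makes rebuilding a single partition repeatable: building the same day twice leaves exactly one copy of its rows and does not touch other days. The sketch below reproduces the pattern with Python's `sqlite3` module. The simplified table layouts, the join on a `merchant_id` column, and the sample rows are assumptions for illustration; the SQL that Dataiku generates for your database will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions_copy (id INTEGER, merchant_id INTEGER, purchase_date TEXT);
    CREATE TABLE merchant_info_copy (merchant_id INTEGER, subsector_description TEXT);
    CREATE TABLE transactions_joined (id INTEGER, purchase_date TEXT, subsector_description TEXT);
    INSERT INTO transactions_copy VALUES
        (1, 10, '2017-12-20'), (2, 11, '2017-12-20'), (3, 10, '2017-12-21');
    INSERT INTO merchant_info_copy VALUES (10, 'gas'), (11, 'travel');
""")


def build_partition(conn, day):
    """Rebuild one day of transactions_joined: delete it, then insert it."""
    with conn:  # commit both statements together, or roll both back
        conn.execute("DELETE FROM transactions_joined WHERE purchase_date = ?", (day,))
        conn.execute(
            """
            INSERT INTO transactions_joined
            SELECT t.id, t.purchase_date, m.subsector_description
            FROM transactions_copy AS t
            LEFT JOIN merchant_info_copy AS m ON m.merchant_id = t.merchant_id
            WHERE t.purchase_date = ?
            """,
            (day,),
        )


build_partition(conn, "2017-12-20")
build_partition(conn, "2017-12-20")  # rebuilding does not duplicate rows
row_count = conn.execute("SELECT COUNT(*) FROM transactions_joined").fetchone()[0]
built_days = [r[0] for r in conn.execute(
    "SELECT DISTINCT purchase_date FROM transactions_joined")]
```

Only the requested day is present in the output table; the row for the other day stays in the input until its own partition is built.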

### View Partitions Count in the Flow

We can use a Flow view to discover whether or not the job is completed.

* Return to the Flow.

* From the Flow views, select **Partitions count** to visualize the number of partitions in each dataset.

The dataset *transactions\_partitioned\_by\_day* shows zero partitions. To update this count:

* Select *transactions\_partitioned\_by\_day*.

* In the right panel, click **Update status (count of records, file size)** to update the count.

* Wait while Dataiku updates the count of partitions, then refresh the page. The *transactions\_partitioned\_by\_day* dataset now shows 15 time-based partitions.

Close the **Partitions count** view to restore the default view.
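For a dataset partitioned by day on a SQL column, the partition count should correspond to the number of distinct days in that column. If you want to cross-check the figure shown in the Flow, a query along the following lines is a reasonable sanity check. The sketch uses `sqlite3` with made-up rows (three distinct days rather than the tutorial's 15), and it is not the query Dataiku itself runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions_copy (id INTEGER, purchase_date TEXT)")
conn.executemany(
    "INSERT INTO transactions_copy VALUES (?, ?)",
    [(1, "2017-12-20"), (2, "2017-12-20"), (3, "2017-12-21"), (4, "2017-12-22")],
)

# One partition per distinct day, with the number of rows in each.
per_day = conn.execute(
    "SELECT purchase_date, COUNT(*) FROM transactions_copy "
    "GROUP BY purchase_date ORDER BY purchase_date"
).fetchall()
partition_count = len(per_day)
```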

## Build a Non-Partitioned Output Dataset

We will now stop the partitioning of our *purchase\_date* dimension and create a non-partitioned dataset. To do this, Dataiku DSS uses a “partition collection” mechanism during runtime. This mechanism is triggered by the partition dependency function type, “All available”.

This process is referred to as partition collecting. Partition collecting can be thought of as the reverse of partition redispatching.

* Open the Prepare recipe that is used to create *transactions\_joined\_prepared*.

* In the **Input / Output** tab, ensure the partition dependency function type is set to “All available”.

* In the **Script** step, run the recipe.

“All available” means all of the available input partitions will be processed when we run the recipe. The output dataset, *transactions\_joined\_prepared*, is not partitioned.
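One way to picture the difference between the two dependency function types used so far is that each answers the question "to build this output, which input partitions must be read?" The functions below are a hypothetical model for illustration, not Dataiku's implementation.

```python
from typing import List, Optional


def equals(output_partition: str, available: List[str]) -> List[str]:
    """'Equals': an output partition depends on the same-named input partition."""
    return [p for p in available if p == output_partition]


def all_available(output_partition: Optional[str], available: List[str]) -> List[str]:
    """'All available': the output depends on every existing input partition.

    The output partition is ignored, which is why this also works when the
    output is not partitioned at all (partition collecting).
    """
    return list(available)


input_partitions = ["2017-12-20", "2017-12-21", "2017-12-22", "2017-12-23"]

needed_for_one_day = equals("2017-12-21", input_partitions)
needed_for_collected = all_available(None, input_partitions)
```

With "Equals", building one day reads one day. With "All available", a single build reads every day, which is how the non-partitioned output collects them.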

## Build a Discrete-Based Partitioned Dataset

### Create a Partitioned Output

Recall from the business objectives that we want to be able to predict whether or not a transaction is fraudulent based on the transaction subsector (e.g., insurance, travel, or gas) and the purchase date. Unlike the purchase date, the subsector is discrete information.

To meet this objective, we need to partition the Flow with a discrete dimension. We want the input dataset, *transactions\_joined\_prepared*, of the Window recipe to be partitioned by merchant subsector.

#### Create *transactions\_joined\_prepared\_windows*

The input dataset in our Window recipe, *transactions\_joined\_prepared*, is not partitioned. To partition this dataset, we can follow the same steps we followed when we created our time-based partitioned outputs. By creating a new, logical pointer, we can create a new dataset without duplicating the data.

* Open *transactions\_joined\_prepared* and view the **Settings** page.

* Copy the contents of the **Table** and **Schema** fields and save the information for later.

* Return to the Flow.

* In the upper-right-hand corner, click **+Dataset**.

* From the **SQL databases** menu, select your SQL connection.

* Paste the information you saved into the **Table** and **Schema** fields.

* Click **Test Table**.

* Next to **Partitioned**, click **Activate Partitioning** to enable discrete-based partitioning for this SQL dataset.

* Name the dimension `merchant_subsector_description`.

Let’s create this dataset.

* In the upper-right-hand corner, name the new dataset `transactions_partitioned_by_sector`.

* Click **Create**.
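Discrete partitioning follows the same idea as the time-based case, except that the partition identifier is the column value itself rather than a day. A minimal sketch of that grouping, using made-up rows and a hypothetical `partition_by` helper:

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group row dictionaries by the value of one column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


rows = [  # made-up sample rows
    {"id": 1, "merchant_subsector_description": "gas"},
    {"id": 2, "merchant_subsector_description": "travel"},
    {"id": 3, "merchant_subsector_description": "gas"},
]

by_sector = partition_by(rows, "merchant_subsector_description")
gas_ids = [row["id"] for row in by_sector["gas"]]
```

Each distinct subsector value becomes one partition, so a downstream recipe can target a single subsector without reading the others.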

Our Flow now looks like this:

#### Configure the Window Recipe

Configure the Window recipe to change the input dataset from *transactions\_joined\_prepared* to *transactions\_partitioned\_by\_sector*.

* Open the Window recipe.

* In the **Input / Output** tab, change the *transactions\_joined\_prepared* dataset to *transactions\_partitioned\_by\_sector*.

* Save and run the recipe.

* Return to the Flow.

Our Flow now looks like this:

### Recap

Let’s look at this in more detail.

Data is written into the SQL table, *${projectKey}\_transactions\_copy*. The dataset *transactions\_copy* is linked to this table.

A second dataset, *transactions\_partitioned\_by\_day*, is created as a pointer to the same table, *${projectKey}\_transactions\_copy*, but it is partitioned by the *purchase\_date* dimension. Therefore, downstream datasets can also be partitioned by *purchase\_date*.

Data is then written into the SQL table, *${projectKey}\_transactions\_joined\_prepared*. The dataset *transactions\_joined\_prepared* is linked to this table.

Finally, another dataset, *transactions\_partitioned\_by\_sector*, is created as a pointer to the table *${projectKey}\_transactions\_joined\_prepared*, but it is partitioned by merchant subsector. Therefore, downstream datasets can also be partitioned by merchant subsector.

### View Partitions Count in the Flow

* Return to the Flow.

* From the Flow views, select **Partitions count** to visualize the number of partitions in each dataset.

* Select *transactions\_partitioned\_by\_sector* and update the count of records.

* Wait while Dataiku updates the count of partitions, then refresh the page.

The *transactions\_partitioned\_by\_sector* dataset contains 9 partitions.

* Close the Flow view.

## Propagate the Discrete Partitioning through the Flow

Now you can apply what you’ve learned to propagate the discrete partitioning through the Flow. To do this, follow the same process you used to propagate the time-based partitioning through the Flow.

* Run the Split recipe.

* Partition the following datasets by the *merchant\_subsector\_description* dimension:

  + *transactions\_joined\_prepared\_windows*

  + *transactions\_known*

  + *transactions\_unknown*

  + *transactions\_unknown\_scored*

Congratulations! You have completed this hands-on lesson.

## What’s Next?

Visit the course, Partitioned Models, to learn more about training a machine learning model on the partitions of a dataset.
