# Hands-On Tutorial: Join Datasets

In this hands-on lesson, we’ll demonstrate a key visual recipe in Dataiku: the **Join** recipe, which allows you to enrich your data with columns from another dataset.

## Resume/Create Your Project

If you completed all of the steps in the Basics 102 project, you can resume the same project for this lesson. All you need to do is download a copy of the customers CSV file and upload it to the project.

Alternatively, you can create a starter project with these same steps completed. From the Dataiku homepage, click **+New Project > DSS Tutorials > Core Designer / Basics > Basics 103**.

Note

You can also download the starter project from this website and import it as a zip file.

In the top navigation bar, click on the **Flow** menu.

## Join Datasets

In Basics 102, we created a dataset of orders grouped by unique customers. We now also have a *customers* dataset with more information about each customer. We can use the **Join** recipe to enrich the *customers* dataset with the aggregated order information from the *orders\_by\_customer* dataset.

Hint

A screencast at the end of the page recaps the instructions described here.

### Create the Join Recipe

First, let’s create the recipe. To do so, follow the procedure below.

* Open the *customers* dataset by double-clicking on its icon in the **Flow**. Each row in this dataset represents a separate customer, and records:

  + the unique customer ID
  + the customer’s gender
  + the customer’s birthdate
  + the user agent most commonly used by the customer
  + the customer’s IP address
  + whether the customer is part of Haiku T-Shirts’ marketing campaign

Note

Take a few minutes to explore the *customers* dataset with tools like Analyze. Also, note the gray portion of the *gender* column’s data quality bar, which represents missing values.

* From the **Actions** tab in the right panel, choose **Join with…** from the list of visual recipes.

* Select **orders\_by\_customer** as the second input dataset.

Hint

Although only two datasets can be added in the Join recipe creation dialog, more datasets can be added at the **Join** step after creating the recipe.

* Change the name of the output dataset to `customers_orders_joined`.

* Click **Create Recipe**.

### Configure the Join Recipe

Let’s now configure the Join recipe.

As you can see, this recipe includes several steps (shown in the left navigation bar).

#### Define the Join Condition

The core step is the **Join** step, where you choose how to match rows between the datasets. In this case, we want to match rows from the *customers* and *orders\_by\_customer* datasets that have the same value of *customerID* and *customer\_id*, respectively.

Hint

Notice the **+** button at the top right of each dataset in the Join step. You can use this button to add more datasets to join with the *customers* and the *orders\_by\_customer* datasets.

* Click on **Add a condition** to tell Dataiku which columns to match.

This opens the Join conditions dialog, where Dataiku automatically recognizes the ID columns (*customerID* and *customer\_id*) as the join key, even though they have different names. This is the only condition we need to add here.

* Select **OK** and return to the Join recipe.
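For readers who think in code, the join condition above can be sketched in pandas. This is only an analogy, not how Dataiku executes the recipe. The sample rows and the *order\_count* column are made up for illustration; only the key column names *customerID* and *customer\_id* come from the tutorial.

```python
import pandas as pd

# Hypothetical sample rows; only the key column names come from the tutorial.
customers = pd.DataFrame({
    "customerID": ["c1", "c2", "c3"],
    "gender": ["F", None, "M"],
})
orders_by_customer = pd.DataFrame({
    "customer_id": ["c1", "c3", "c4"],
    "order_count": [2, 1, 5],  # assumed aggregate column
})

# Match rows where customers.customerID equals orders_by_customer.customer_id.
# how="left" mirrors the recipe's default join type.
joined = customers.merge(
    orders_by_customer,
    how="left",
    left_on="customerID",
    right_on="customer_id",
)
print(joined)
```

Because the key columns have different names, both appear in the result. Customer `c2` has no match, so its right-hand columns are empty.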

#### Set the Join Type

By default, the Join recipe performs a left join, which retains all rows in the left dataset, even if there is no matching information in the right. Since we only want to work with customers who have made at least one order, let’s modify the join type.

Note

**Types of joins**

There are multiple methods for joining two datasets; the method you choose will depend upon your data and your goals in analysis.

* **Left join** (default type) keeps all rows of the left dataset and adds information from the right dataset when there is a match. This is useful when you need to retain all the information in the rows of the left dataset, and the right dataset is providing extra, possibly incomplete, information.

* **Inner join** keeps only the rows that match in both datasets. This is useful when only the rows with complete information from both datasets will be useful downstream in the Flow.

* **Outer join** keeps all rows from both datasets, combining rows where there is a match. This is useful when you need to retain all the information in both datasets.

* **Right join** is similar to a left join, but keeps all rows of the right dataset and adds information from the left dataset when there is a match.

* **Cross join** is a Cartesian product that matches all rows of the left dataset with all rows of the right dataset. This is useful when you need to compare every row in one dataset to every row of another.

* **Advanced join** provides custom options for row selection and deduplication for when none of the other options are suitable.
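To make the first five join types concrete, here is a minimal pandas sketch that counts the output rows for each one. It is an analogy under stated assumptions: the data is invented, and the pandas `how` values are used as stand-ins for the recipe's join types. The advanced join has no direct pandas equivalent and is left out.

```python
import pandas as pd

# Hypothetical data: c2 has no orders, c4 has orders but no customer record.
left = pd.DataFrame({"customerID": ["c1", "c2", "c3"]})
right = pd.DataFrame({"customer_id": ["c1", "c3", "c4"], "order_count": [2, 1, 5]})

def join_rows(how):
    """Return the number of output rows for a given pandas join type."""
    if how == "cross":
        # A cross join takes no key: every left row pairs with every right row.
        return len(left.merge(right, how="cross"))
    return len(left.merge(right, how=how, left_on="customerID", right_on="customer_id"))

row_counts = {how: join_rows(how) for how in ["left", "inner", "outer", "right", "cross"]}
print(row_counts)
# {'left': 3, 'inner': 2, 'outer': 4, 'right': 3, 'cross': 9}
```

The inner join returns the fewest rows because it drops both the customer without orders and the orders without a customer record.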

**Keeping unmatched rows**

When performing a left, right, or inner join in Dataiku versions 11.3 and above, you can add a dataset to capture the unmatched rows.
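The idea of sending unmatched rows to their own dataset can be approximated in pandas with the `indicator` option of `merge`. This sketch only illustrates the concept and does not reproduce Dataiku's feature; the sample data and the `matched` and `unmatched` names are hypothetical.

```python
import pandas as pd

# Hypothetical data: customer c2 has no entry in orders_by_customer.
customers = pd.DataFrame({"customerID": ["c1", "c2", "c3"]})
orders_by_customer = pd.DataFrame({"customer_id": ["c1", "c3"], "order_count": [2, 1]})

# indicator=True adds a "_merge" column recording where each row was found.
merged = customers.merge(
    orders_by_customer,
    how="left",
    left_on="customerID",
    right_on="customer_id",
    indicator=True,
)

# Rows found in both inputs form the joined output.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
# Rows found only on the left are the unmatched customers.
unmatched = merged[merged["_merge"] == "left_only"][["customerID"]]
```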

* Click on the **Left Join** indicator.

* Select **Inner join**.

This will retain only the customers who have made an order, and remove the others from the output dataset.

#### Define the Columns to Keep in the Output Dataset

The next step is to choose which columns to retain from the input datasets. We want to carry over all columns from both datasets into the output dataset, with the exception of *customer\_id* (since the *customerID* column from the *customers* dataset should be sufficient).

* Navigate to the **Selected columns** step.

* Deselect the *customer\_id* column in the *orders\_by\_customer* dataset.

* Select **Run** to execute the recipe.

When the job finishes, click **Explore dataset customers\_orders\_joined** at the bottom of the screen to open the output dataset.
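As a rough recap, the configured recipe (inner join on the ID columns, with *customer\_id* deselected) corresponds to something like the following pandas sketch. The sample rows and the *order\_count* column are assumptions for illustration, and this is not the code Dataiku runs.

```python
import pandas as pd

# Hypothetical sample rows standing in for the two input datasets.
customers = pd.DataFrame({
    "customerID": ["c1", "c2", "c3"],
    "gender": ["F", None, "M"],
})
orders_by_customer = pd.DataFrame({
    "customer_id": ["c1", "c3"],
    "order_count": [2, 1],  # assumed aggregate column
})

customers_orders_joined = (
    customers
    # Inner join: keep only customers with at least one order.
    .merge(orders_by_customer, how="inner", left_on="customerID", right_on="customer_id")
    # Drop the redundant key from the right dataset; customerID is enough.
    .drop(columns="customer_id")
)
print(customers_orders_joined)
```

Customer `c2` is absent from the output because it has no matching row in the orders data.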

*The following video goes through what we just covered.*

## What’s Next?

So far, all of your work has been in the Flow. Now it’s time to learn about the Lab!
