# How to Export Preprocessed Data

To train a machine learning model, Dataiku transforms the input data you provide and trains on the transformed version, known as preprocessed data. You may want to export and inspect the preprocessed data, for example to investigate issues or perform quality checks.

In this article, we’ll show you how to export the preprocessed dataset using Python code in a Jupyter notebook.

Note

Dataiku comes with a complete set of Python APIs.

## Let’s Get Started!

In this article, we’ll work with an example of a project that contains a deployed model in the Flow. To follow along with the steps, you can use any project with a deployed model in the Flow.

## Create a Code Notebook

From your project, create a new Jupyter notebook.

### Load the Input Dataframe

We’ll start by using the Dataiku API to get the input dataset for our model, as a pandas dataframe.

* Replace `saved_model_input_dataset_name` with the name of your model’s input dataset.

§ import dataiku

§ # Load the dataframe for the input data

§ input_dataset = dataiku.Dataset("saved_model_input_dataset_name")

§ input_dataframe = input_dataset.get_dataframe(limit=100000)

### Load the Predictor API for Your Saved Model

Next, we’ll use the predictor API to preprocess the input dataframe. The predictor is a Dataiku object that allows you to apply the same pipeline as the visual model (preprocessing + scoring). For more information, visit Interaction with saved models.

* Replace `saved_model_id` with the ID of your saved model.

§ # Get the model and predictor

§ model = dataiku.Model("saved_model_id")

§ predictor = model.get_predictor()

### Preprocess the Input Dataframe

The model’s predictor has a `preprocess` method that performs the preprocessing steps and returns the preprocessed version of the data.

§ # Use the predictor to preprocess the data

§ preprocessed_data, preprocessed_data_index, is_empty = predictor.preprocess(input_dataframe)

### Examine the Dataframes

The original `input_dataframe` is a pandas dataframe containing the data from your input dataset. We can print it out to see this:

§ print(input_dataframe)

The `preprocessed_data` variable is a list of lists containing the preprocessed version of this input dataset:

§ print(preprocessed_data)

The names of these columns are the model features, which we can get from the predictor:

§ features = predictor.get_features()

§ print(features)

Each string in the `features` list corresponds to one column in the `preprocessed_data`. We can compare these features with the list of column names from the input dataset:

§ print(list(input_dataframe.columns))

#### Comparing the Preprocessed Data and Input Data

The number of features (and so the number of columns in `preprocessed_data`) might be different from the number of columns in the `input_dataframe`, and the names of some features might be different from the column names in the input dataset. This is because the feature handling settings of your model training can remove columns from the dataset and can add new features.
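One way to see which columns were removed and which features were added is to compare the two lists of names directly. The sketch below uses made-up names: the `country:FR` style feature names are hypothetical and only stand in for whatever your model produces. In your notebook, you would use `list(input_dataframe.columns)` and `predictor.get_features()` instead of the hard-coded lists.

```python
# Hypothetical stand-ins for list(input_dataframe.columns) and predictor.get_features()
input_columns = ["age", "country", "comment", "target"]
features = ["age", "country:FR", "country:US", "country:N/A"]

# Columns that appear under the same name in both lists
unchanged = [c for c in input_columns if c in features]

# Input columns with no feature of the same name (dropped, encoded, or the target)
not_in_features = [c for c in input_columns if c not in features]

# Features that have no input column of the same name (created by preprocessing)
new_features = [f for f in features if f not in input_columns]

print("Unchanged:", unchanged)
print("Not in features:", not_in_features)
print("New features:", new_features)
```

A column listed under "Not in features" has not necessarily been rejected: it may have been replaced by several derived features, as the hypothetical `country` column is here.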

In addition, the number of rows in the preprocessed data can be fewer than the number of rows in the input dataset. This is also caused by the feature handling settings used when training the model. For example, rows with an empty target value can be dropped.

The `preprocessed_data_index` returned by the `preprocess` method shows you which rows from the input dataset have been used to produce the preprocessed data.
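To illustrate how such an index can be used, here is a minimal sketch that splits an input dataframe into kept and dropped rows. It runs on small stand-in data and assumes that the index holds the row labels of the input dataframe that survived preprocessing. Check what `preprocessed_data_index` actually contains for your model before relying on this.

```python
import pandas as pd

# Stand-in for input_dataframe; the row labelled 1 has an empty target
input_dataframe = pd.DataFrame(
    {"age": [34, 51, 27, 45], "target": [1.0, None, 0.0, 1.0]}
)

# Hypothetical stand-in for the index returned by predictor.preprocess()
preprocessed_data_index = pd.Index([0, 2, 3])

# Rows of the input that were used to produce the preprocessed data
kept_rows = input_dataframe.loc[preprocessed_data_index]

# Rows of the input that were dropped during preprocessing
dropped_rows = input_dataframe.drop(index=preprocessed_data_index)

print(len(kept_rows), "rows kept,", len(dropped_rows), "rows dropped")
print(dropped_rows)
```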

Note

For more information about feature handling, see Concept: Feature Handling.

To make the preprocessed data easier to compare with the input data, we can turn it into a pandas dataframe with column headings:

§ import pandas as pd

§ preprocessed_dataframe = pd.DataFrame(preprocessed_data, columns=features)

§ print(preprocessed_dataframe)

If you just want to look at the preprocessed data or perform simple calculations on it, then these steps may be sufficient. However, if your goal is to perform complex analyses on the preprocessed data, you should export the preprocessed data to a new dataset in Dataiku. We’ll do this in the next section.
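As an example of the simple checks that can be done directly in the notebook, the sketch below counts missing values per feature and lists features that have the same value in every row. The data and feature names are invented for illustration; in your notebook, `preprocessed_dataframe` would be the dataframe built in the previous step.

```python
import pandas as pd

# Hypothetical stand-ins for predictor.get_features() and preprocessed_data
features = ["age", "income", "country:FR"]
preprocessed_data = [
    [0.5, -1.2, 1.0],
    [-0.3, 0.4, 1.0],
    [1.1, 0.8, 1.0],
]
preprocessed_dataframe = pd.DataFrame(preprocessed_data, columns=features)

# Number of missing values in each feature
missing_counts = preprocessed_dataframe.isna().sum()

# Features with a single distinct value carry no information for the model
constant_features = [
    name
    for name in preprocessed_dataframe.columns
    if preprocessed_dataframe[name].nunique() == 1
]

print(missing_counts)
print("Constant features:", constant_features)
```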

### Export the Preprocessed Data to a New Dataset

The previous steps allowed us to access the preprocessed data as a pandas dataframe in a Python notebook. This can be useful for many applications, but in order to use the full power of Dataiku to analyze the preprocessed data, we can export it to a new dataset in the Flow.

#### Create a New, Empty Dataset

First, we’ll create a new dataset. The following code snippet uses the Dataiku API to create a new dataset if it does not already exist. That way, we can re-run our code and overwrite the dataset with updated data.

* Replace `project_name` with the key of your project. The project key appears in the project URL and can differ from the project’s display name.

* Replace `my_preprocessed_data` with the name you choose for your new dataset.

§ # Create a new dataset (if necessary)

§ client = dataiku.api_client()

§ project = client.get_project("project_name")

§ preprocessed_dataset_name = "my_preprocessed_data"

§ preprocessed_dataset = project.get_dataset(preprocessed_dataset_name)

§ if not preprocessed_dataset.exists():

§     print("Creating new dataset:", preprocessed_dataset_name)

§     builder = project.new_managed_dataset(preprocessed_dataset_name)

§     builder.with_store_into("filesystem_managed")

§     dataset = builder.create()

§ else:

§     print("Overwriting existing dataset:", preprocessed_dataset_name)

#### Fill the Empty Dataset with the Preprocessed Data

Now that our empty dataset has been created, we can fill it with the preprocessed data.

§ # Write the preprocessed data to the dataset

§ preprocessed_dataset.get_as_core_dataset().write_with_schema(preprocessed_dataframe)

This creates a new dataset in the Flow containing the preprocessed data for our model.

Note

This new dataset is not linked to your model. If you modify your original dataset or retrain your model, you’ll need to re-run the code in your notebook to update the preprocessed dataset.

## What’s Next?

You can now use all the features of Dataiku to analyze this dataset. For example, you can:

* Explore this dataset using the Dataiku UI, to analyze columns, compute dataset statistics, and create charts.

* Use this dataset as part of a dashboard.

* Use this dataset as the input to a recipe.

To automate updating the preprocessed dataset, you could create a scenario and add a step to execute Python code. You could also create a code recipe from your code notebook.
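For either option, it can help to gather the notebook steps into a single function. The following is only a sketch: the function name `export_preprocessed_data` and its parameters are hypothetical, and it reuses the same Dataiku calls as the steps in this article. It must run inside Dataiku, where the `dataiku` package is available.

```python
def export_preprocessed_data(project_key, model_id, input_name, output_name, limit=100000):
    """Preprocess a saved model's input dataset and write the result to a dataset."""
    # Imported inside the function so that defining it does not require these packages
    import dataiku
    import pandas as pd

    # Load the input data and apply the saved model's preprocessing
    input_dataframe = dataiku.Dataset(input_name).get_dataframe(limit=limit)
    predictor = dataiku.Model(model_id).get_predictor()
    preprocessed_data, _, _ = predictor.preprocess(input_dataframe)
    preprocessed_dataframe = pd.DataFrame(
        preprocessed_data, columns=predictor.get_features()
    )

    # Create the output dataset on the first run; later runs overwrite its contents
    project = dataiku.api_client().get_project(project_key)
    output_dataset = project.get_dataset(output_name)
    if not output_dataset.exists():
        builder = project.new_managed_dataset(output_name)
        builder.with_store_into("filesystem_managed")
        builder.create()
    output_dataset.get_as_core_dataset().write_with_schema(preprocessed_dataframe)

    # Return the (rows, columns) shape so that the caller can log it
    return preprocessed_dataframe.shape
```

In a scenario step, you would then call the function with your own values, for example `export_preprocessed_data("project_name", "saved_model_id", "saved_model_input_dataset_name", "my_preprocessed_data")`.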

* To find out more about working with code notebooks in Dataiku, you can try the Hands-On Tutorial: Code Notebooks.

* To learn about Python recipes, visit Concept: Code Recipes in Dataiku.

* To learn about scenarios, visit Concept: Scenarios.
