# Recognizing Handwritten Digits

To demonstrate how to build a model from scratch, we’ll use the MNIST database of handwritten digits. The goal is a model that can identify which digit (0 through 9) a handwritten image represents.

## Preparing the Data

Download the PNG version of the database, then create a new project. In the Flow, create a new folder *mnist\_png* and populate it with the *mnist\_png.tar.gz* file.

**Tip:** You can also use the Download recipe to create the folder and populate it with the contents of the archive. To do this, go to the Flow and click the **+Recipe** button. Select **Visual** > **Download**. When configuring the recipe, select **+Add a First Source** and type the URL `https://github.com/myleott/mnist_png/raw/master/mnist_png.tar.gz`.

In order to train and test a deep learning model for these images, we need to create train and test datasets that contain the path to each image (so that the model can find it) and the label that identifies the digit each image represents.

As a first step, we’ll create a Python recipe that uses the *mnist\_png* folder as an input and a new dataset *mnist* as the output. The code of the recipe uses the Dataiku Python API to retrieve the list of paths of all images in the folder (and its subfolders) and writes the paths as a column in the output dataset.

**Note:** To use this code in your project, you’ll need to change “9672PoPB” to the identifier of the folder in your project.

```python
# -*- coding: utf-8 -*-
import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu

# Read recipe inputs
folder = dataiku.Folder("9672PoPB")

# Initialize data frame
df = pd.DataFrame(columns=['path'])

# Populate dataframe with paths
df['path'] = folder.list_paths_in_partition()

# Write recipe outputs
mnist = dataiku.Dataset("mnist")
mnist.write_with_schema(df)
```

The next step is to use a Prepare recipe to extract two values from each image path: the label (the digit the image represents) and whether the image belongs to the train or test sample.
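What the Prepare recipe computes can be sketched in plain Python. This is only an illustration, not the recipe itself: it assumes each path looks like `/mnist_png/training/5/1234.png`, so that the sample name and the digit are the two directory names just above the file. The function name `parse_image_path` and the example paths are hypothetical; check the paths in your own *mnist* dataset before relying on this layout.

```python
from pathlib import PurePosixPath

def parse_image_path(path):
    """Return (sample, label) for a path such as '/mnist_png/training/5/1234.png'.

    Assumption: the parent directory is the digit and the grandparent
    directory is the sample name. Adjust if your folder layout differs.
    """
    parts = PurePosixPath(path).parts
    sample, label = parts[-3], parts[-2]
    return sample, int(label)

# Hypothetical example paths, for illustration only.
example_paths = [
    "/mnist_png/training/5/1234.png",
    "/mnist_png/testing/0/42.png",
]
records = [(p, *parse_image_path(p)) for p in example_paths]
```

Each entry of `records` then holds the path, the sample name, and the integer label, which mirrors the columns the Prepare recipe adds.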

Then, we’ll use a Split recipe to split the records into the train and test datasets. The data is now ready for building a deep learning model for image classification.
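Conceptually, the Split recipe routes each record to one of two outputs based on the sample column. Here is a minimal sketch in plain Python, assuming records with hypothetical `path`, `sample`, and `label` fields and the sample values `training` and `testing`. Your column names and values depend on how you configured the Prepare recipe.

```python
# Hypothetical records, as they might look after the Prepare step.
records = [
    {"path": "/mnist_png/training/5/1234.png", "sample": "training", "label": 5},
    {"path": "/mnist_png/training/7/88.png", "sample": "training", "label": 7},
    {"path": "/mnist_png/testing/0/42.png", "sample": "testing", "label": 0},
]

def split_by_sample(rows, train_value="training"):
    """Route rows to (train, test) based on the value of the 'sample' field."""
    train = [r for r in rows if r["sample"] == train_value]
    test = [r for r in rows if r["sample"] != train_value]
    return train, test

train, test = split_by_sample(records)
```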

## The Deep Learning Model

Create a Visual Analysis for the training dataset (from the dataset’s Actions menu, Lab > Visual Analysis). In the Script tab of the Visual Analysis, change the Design Sample to ensure that all 10 digits are represented in the sample. Since the training sample is only 60,000 records, you can simply choose the first 60,000 records. Save and refresh the Design Sample.

Next, create a new model with:

* **Prediction** as the task

* *label* as the target variable

* **Deep learning** as the Expert mode

Then click **Create**.

This creates a new machine learning task and opens the Design tab for the task. On the Target panel, Dataiku DSS may identify this as a Regression type of ML task because *label* is a numeric column with many unique values. Change the prediction type to Multiclass classification.

Dataiku may display a message letting you know that no input feature is selected. We’ll resolve this error by configuring feature handling.

### Features Handling

On the Features Handling panel, turn on *path* as an input, and select **Image** as its variable type.

Select the folder that contains the image archive. *IMPORTANT:* the trained model will look for images in this directory. If we want to score new handwritten digits, they will need to be placed in this folder.

We won’t use the default code, so just remove all the code. Then, click on **{} Code Samples** on the top right and select the **Default Keras preprocessing for image** code sample.

Insert the Keras code, then change the resized width and height variables from `197` to `28`, since the MNIST images are 28 by 28 pixels.

### Deep Learning Architecture

We now have to create our network architecture in the `build_model()` function. We won’t use the default architecture, so just remove all the code. Then, click on **{} Code Samples** on the top right and search for “images”. Select the **CNN** architecture for image classification.

Insert the CNN code, then click on **Display inputs** on the top left. You should see that the “main” feature is empty because we are only using the image data, which is in the input `path_preprocessed`.

In order to build the model, we need to make a few changes to the code.

* In the line that defines `image_shape`, change `197, 197, 3` to `28, 28, 3`.

* In the line that defines `image_input_name`, change `name_of_your_image_input_preprocessed` to `path_preprocessed`.

* The code sample defines a fairly complex CNN with several hidden layers and a large number of nodes within each layer. Unless you have access to a GPU, erase all of the code in the `build_model()` function between the comment `DEFINING THE ARCHITECTURE` and the `return model` line, and replace it with the following:

```python
x = Conv2D(32, kernel_size=3, padding='same', activation='relu')(image_input)
x = Conv2D(32, kernel_size=3, padding='same', activation='relu')(x)
x = MaxPooling2D(pool_size=(2, 2))(x)

x = Flatten()(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.2)(x)

predictions = Dense(n_classes, activation='softmax')(x)
model = Model(inputs=image_input, outputs=predictions)
```

This smaller network should be sufficient to build a good model for these images without taking all night to train on a laptop.
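To get a rough sense of why this network is laptop-friendly, we can count its trainable parameters by hand. The sketch below assumes the `28, 28, 3` input shape set above, `n_classes = 10` (one class per digit), and the default Keras behavior of adding one bias per filter or unit. The helper names are hypothetical.

```python
def conv2d_params(kernel_size, in_channels, filters):
    # Each filter has kernel_size * kernel_size * in_channels weights plus one bias.
    return (kernel_size * kernel_size * in_channels + 1) * filters

def dense_params(in_units, out_units):
    # One weight per input/output pair plus one bias per output unit.
    return (in_units + 1) * out_units

height, width, channels = 28, 28, 3
n_classes = 10  # assumption: digits 0 through 9

conv1 = conv2d_params(3, channels, 32)
conv2 = conv2d_params(3, 32, 32)

# padding='same' keeps the 28x28 size; 2x2 max pooling halves it to 14x14.
flattened = (height // 2) * (width // 2) * 32

dense1 = dense_params(flattened, 128)
output = dense_params(128, n_classes)

total = conv1 + conv2 + dense1 + output
print(conv1, conv2, dense1, output, total)
```

Under these assumptions, almost all of the roughly 814,000 parameters sit in the first dense layer. The pooling, flatten, and dropout layers add none.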

## Model Results

Click **Train**. When training completes, deploy the model to the Flow, create an evaluation recipe from the model, and evaluate it on the test data. In our example, the model has an accuracy of about 98.7%. Your results will vary from this example.
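Accuracy here is the share of test images whose predicted digit matches the true label. Here is a minimal sketch of that calculation, using made-up labels and predictions rather than real model output:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up values for illustration only; not results from the model above.
true_labels = [7, 2, 1, 0, 4, 1, 4, 9]
predicted_labels = [7, 2, 1, 0, 4, 1, 4, 4]

score = accuracy(true_labels, predicted_labels)
```

For scale, an accuracy of about 98.7% on the 10,000 images of the standard MNIST test set corresponds to roughly 130 misclassified images.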
