The Segmentation zone is where the clustering takes place.

![Screenshot 2023-06-01 at 11.57.18.png](OzpKUVZoMlBS)

In the initial [Join Recipe](recipe:compute_customer_full_data), all four prepared datasets are joined on customer_id and reference_date to create the [customer_full_data](dataset:customer_full_data) dataset. A post-filter keeps only the customers that are active at a given reference_date, that is, the rows where nb_subscribed_products is defined. The size of this dataset varies with the number of product types and with the number of additional columns in the customer data and additional information datasets. A [Split Recipe](recipe:split_customer_full_data) is then used to separate historical data from the reference data, i.e., the rows where reference_date equals the project_reference_date defined in the [Dataiku Application](article:11).
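The logic of the join, post-filter, and split can be sketched in plain Python. This is only an illustration, not the recipes themselves: the sample rows, the `left_join` and `split_on_reference_date` helpers, the choice of a left join, and the treatment of every non-reference row as historical are all assumptions made for the example.

```python
from datetime import date

# Hypothetical miniature versions of two of the prepared datasets,
# both keyed on (customer_id, reference_date).
customers = [
    {"customer_id": 1, "reference_date": date(2023, 4, 1), "income": 42000},
    {"customer_id": 1, "reference_date": date(2023, 5, 1), "income": 42000},
    {"customer_id": 2, "reference_date": date(2023, 5, 1), "income": 58000},
]
products = [
    {"customer_id": 1, "reference_date": date(2023, 4, 1), "nb_subscribed_products": 2},
    {"customer_id": 1, "reference_date": date(2023, 5, 1), "nb_subscribed_products": 3},
    # Customer 2 has no product row on 2023-05-01, so it is not active then.
]

KEYS = ("customer_id", "reference_date")


def left_join(left, right, keys=KEYS):
    """Left join two lists of row dicts on the given key columns."""
    index = {tuple(row[k] for k in keys): row for row in right}
    right_cols = {c for row in right for c in row if c not in keys}
    joined = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        merged = dict(row)
        for col in right_cols:
            merged[col] = match.get(col)  # None when there is no match
        joined.append(merged)
    return joined


def split_on_reference_date(rows, project_reference_date):
    """Return (reference_rows, historical_rows)."""
    reference = [r for r in rows if r["reference_date"] == project_reference_date]
    historical = [r for r in rows if r["reference_date"] != project_reference_date]
    return reference, historical


full = left_join(customers, products)
# Post-filter: keep only rows where nb_subscribed_products is defined.
active = [r for r in full if r["nb_subscribed_products"] is not None]
reference, historical = split_on_reference_date(active, date(2023, 5, 1))
```

Here customer 2 is dropped by the post-filter, and the two remaining rows of customer 1 end up on opposite sides of the split.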

# Clustering

## Preprocessing

In the [analysis](analysis:MNRNuU9A), a preprocessing step is applied via a Script that takes the log of the income, customer behavior, revenue, and balance variables. This transformation reduces the influence of outliers, since most of these variables follow a roughly log-normal distribution, with very high density around low values and a few very large values. Without it, the clustering can be very difficult to interpret, with one huge cluster containing all the low values and a separate, very small cluster consisting of outliers. Clustering is not an exact science, and the way features are preprocessed and included in the analysis significantly impacts the results. Great care must be taken to ensure business knowledge is well reflected in the model.
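The effect of the log transform on a skewed variable can be illustrated with a few made-up values. The `income` list and both helper functions are hypothetical, and the use of `log1p` (log of 1 + x, so that zeros remain defined) is an assumption rather than the exact formula used in the Script.

```python
import math
import statistics

# Hypothetical, right-skewed income values: many small, one very large.
income = [1200.0, 1500.0, 1800.0, 2100.0, 2500.0, 250000.0]


def log_transform(values):
    """Apply log(1 + x) to every value; the ordering is preserved."""
    return [math.log1p(v) for v in values]


def max_to_median_ratio(values):
    """How many times larger the largest value is than the median."""
    return max(values) / statistics.median(values)


raw_ratio = max_to_median_ratio(income)
log_ratio = max_to_median_ratio(log_transform(income))
```

On the raw scale the largest value is more than a hundred times the median, whereas after the transform it is less than twice the median, so a distance-based algorithm is far less dominated by that single observation.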

## Features Handling

All features are handled similarly depending on whether they are numerical or categorical:

 - Categorical variables are dummy encoded
 - Numerical variables are rescaled using standard rescaling: the mean is subtracted from each variable and the result is divided by the standard deviation. Rescaling is mandatory so that distances are comparable across all variables. Standard rescaling preserves outliers, because an observation can still lie several standard deviations away from the mean. Min-max rescaling would instead reduce the variability of distances between observations, because the distance between two observations along any single variable is capped at 1. As a consequence, clusters would usually be less variable in size, but could also become less meaningful.
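The two rescaling options and the dummy encoding can be compared on a small example. The functions and sample values below are hypothetical and only mirror the definitions given above; the use of the population standard deviation and the one-column-per-category encoding are assumptions, not a description of the visual ML internals.

```python
import statistics


def standard_rescale(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std: an assumption
    return [(v - mean) / std for v in values]


def min_max_rescale(values):
    """Map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def dummy_encode(values):
    """One 0/1 column per distinct category, in sorted order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical numerical feature with a single outlier.
balances = [10.0, 12.0, 11.0, 13.0, 9.0, 11.0, 10.0, 12.0, 11.0, 100.0]
z = standard_rescale(balances)
m = min_max_rescale(balances)

# Hypothetical categorical feature; columns are ordered (branch, web).
channels = dummy_encode(["web", "branch", "web"])
```

With standard rescaling the outlier stays well beyond one standard deviation from the mean, while with min-max rescaling every value, including the outlier, is confined to [0, 1].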

Choices can also be made on the set of variables that are included in the model. For instance, segmentation could be done by focusing mostly on revenues and ignoring demographics. Similarly, segments could be built by using behavioral data and dismissing revenues.
 
## Algorithm Definition

The K-means algorithm is configured in the algorithm menu, with the number of clusters set programmatically from the value entered in the [Dataiku Application](article:11). This number can be adjusted depending on how diverse the customers are. In the outlier detection screen, we choose not to detect outliers, because we do not want any customer to be left out of the segments. The preprocessing and feature handling choices made above should make outlier detection unnecessary, but if the user obtains, for instance, clusters with very few observations, it could make sense to allow outliers.
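For readers unfamiliar with K-means, the following is a minimal sketch of the standard Lloyd iterations in plain Python. It is not the implementation used by the visual ML: the `k_means` and `nearest` functions, the sample points, and the naive seeding with the first `k` points are assumptions chosen to keep the example short and deterministic.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, k, n_iter=20):
    """Plain Lloyd iterations; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: every point joins its nearest centroid,
        # so no observation is set aside as an outlier.
        labels = [nearest(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical rescaled observations forming two obvious groups.
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
n_clusters = 2  # in the project, this value comes from the Dataiku Application
centroids, labels = k_means(points, n_clusters)
```

Because every point is always assigned to its nearest centroid, each customer ends up in exactly one segment, which matches the choice of not detecting outliers.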

# Analysis

The screens inside the results view of the visual ML are also exported in the [dashboard](article:10) to be directly consumed. Please refer to the linked article for more details.

In the summary view, clusters are automatically named after the variables that define them most. The user can replace these with more human-readable names using the edit button, and then rebuild the graphs using the [scenario](scenario:REBUILD_GRAPHS) defined [here](article:12).

# Deployment

The model is deployed to the Flow using the Deploy button, which creates a [saved model](saved_model:cw5jnbmG). This saved model is then used to score the two datasets coming from the split. Scoring the reference data produces the [customer_reference_data_scored](dataset:customer_reference_data_scored) dataset, which additionally contains a field with the segment name. Scoring the historical data produces the [customer_historical_data_scored](dataset:customer_historical_data_scored) dataset, which assigns a segment to every historical row and so extends the analysis to data that was not used to train the model.
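Conceptually, scoring with a fitted K-means model means assigning each row to the segment whose centroid is closest. The sketch below illustrates this under several assumptions: the `segments` dictionary, its names and centroids, the `score` function, and the sample rows are all hypothetical, and the features are taken to be already preprocessed in the same way as the training data.

```python
import math

# Hypothetical fitted segments: name -> centroid in rescaled feature space.
segments = {
    "low_balance": (0.0, 0.0),
    "high_balance": (5.0, 5.0),
}


def score(rows, segments):
    """Return copies of rows with a 'segment' field added."""
    scored = []
    for row in rows:
        # Pick the segment whose centroid is nearest to the row's features.
        name = min(segments, key=lambda s: math.dist(row["features"], segments[s]))
        scored.append({**row, "segment": name})
    return scored


# Hypothetical outputs of the split: one reference row, two historical rows.
reference_rows = [{"customer_id": 1, "features": (4.8, 5.1)}]
historical_rows = [
    {"customer_id": 1, "features": (0.3, 0.2)},
    {"customer_id": 1, "features": (4.5, 4.9)},
]

reference_scored = score(reference_rows, segments)
historical_scored = score(historical_rows, segments)
```

In this made-up case, the historical rows show the same customer moving from one segment to the other over time, which is the kind of analysis the scored historical dataset enables.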


