In this article, we walk through each of the steps needed to create an initial set of segmentation insights with this project. We use a demonstration dataset as input, configure the analysis in the [Project Setup](article:11), and use the [dashboards](article:10) to understand our model. The project you are currently viewing is a completed version, showing all results and outputs discussed below. If you wish to create your own analysis using your own data, please create an instance of the Dataiku Application.

# User Story

As a Data Scientist or Data Analyst, I want to bring together existing business knowledge and machine-learning-generated segments, so that marketing and sales teams can immediately benefit from the incremental value of an ML approach while preserving continuity with established methodologies and analytics.

## Unsupervised Learning

[Unsupervised learning](https://en.wikipedia.org/wiki/Unsupervised_learning) (clustering) is a type of machine learning approach that attempts to find patterns in unlabelled data. 'Unlabelled' means there are no provided 'true' or 'known' values indicating which cluster each customer should belong to. The unsupervised learning algorithm attempts to group observations that are statistically similar. Clustering can be achieved in Dataiku without writing a single line of code by using [Visual ML](https://doc.dataiku.com/dss/latest/machine-learning/unsupervised/index.html). With this approach, the project performs a purely data-oriented segmentation of customers.

### K-Means clustering

One of the most popular algorithms for clustering is [K-means clustering](https://en.wikipedia.org/wiki/K-means_clustering). This algorithm requires the user to set the desired number of clusters (based on best practice, or an informed sense of how many clusters seem plausible), and then optimizes the assignment of observations to that number of clusters. The algorithm works iteratively: an initial mean is first defined for each cluster, and the two steps below are then run one after the other until convergence:

- Assignment: each observation is assigned to the cluster associated with the closest mean.
- Update: the mean of each cluster is recomputed from the observations assigned to it.

The algorithm may fall into a local optimum, resulting in less useful outputs, but usually yields quick and sensible results. Feel free to experiment with different numbers of clusters, or different input variables.
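
The assignment and update steps described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used by Dataiku Visual ML: the function names `k_means` and `squared_distance`, the random initialization from the data points, and the toy `customers` values are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Illustrative K-means: returns (labels, means) for a list of tuples."""
    rng = random.Random(seed)
    # Initialization (assumption): pick k distinct observations as starting means.
    means = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment: each observation goes to the cluster with the closest mean.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        # Update: recompute each mean from the observations assigned to it.
        new_means = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_means.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the previous mean if a cluster ends up empty.
                new_means.append(means[j])
        # Convergence: stop once the means no longer change.
        if new_means == means:
            break
        means = new_means
    return labels, means


# Hypothetical toy data: two features per customer, two obvious groups.
customers = [
    (1.0, 1.2), (0.7, 1.0), (1.3, 0.6),
    (8.0, 8.2), (7.5, 7.7), (8.4, 8.9),
]
labels, means = k_means(customers, k=2)
print(labels, means)
```

Changing `seed` changes the starting means; on less clearly separated data, different starting points can lead to different local optima, which is why several runs are often compared.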

### Cluster analysis

Performance metrics for clustering problems are not as straightforward as for prediction tasks, since there is no ground truth to compare against. Business expertise should therefore be brought in to assess the quality of the clustering and whether the resulting segments make sense.
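
One simple way to support such a review is to summarize each cluster by its size and the average value of each input variable, so that business experts can judge whether the segments look meaningful. The sketch below assumes cluster labels are already available; the function `cluster_profiles`, the feature names `monthly_revenue` and `products_held`, and the sample values are hypothetical and not taken from the project.

```python
from collections import defaultdict


def cluster_profiles(rows, labels, feature_names):
    """Return the size and per-feature mean of each cluster."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    profiles = {}
    for label, members in grouped.items():
        profiles[label] = {
            "size": len(members),
            # zip(*members) turns the rows of a cluster into feature columns.
            "means": {
                name: sum(column) / len(members)
                for name, column in zip(feature_names, zip(*members))
            },
        }
    return profiles


# Hypothetical example: four customers already assigned to two clusters.
rows = [(100.0, 2), (120.0, 2), (900.0, 5), (1100.0, 7)]
labels = [0, 0, 1, 1]
profiles = cluster_profiles(rows, labels, ["monthly_revenue", "products_held"])

for label, profile in sorted(profiles.items()):
    print(label, profile)
```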



# Data

This use case starts with five already prepared datasets covering 16 months of historical banking data that include:
 - A revenue dataset of more than 600k rows.
 - A balance dataset of more than 400k rows.
 - A product holdings dataset of more than 50k rows including 4 distinct product types and 10 distinct products.
 - A customer dataset of around 22k distinct customers.
 - An additional information dataset of more than 300k rows.

The [Data Model](article:5) article describes the necessary format criteria to be met.

To configure and execute the process, we will follow the steps in the [Project Setup](article:11).

The case study data is already in filesystem format, so we can skip the upload or connection steps and move straight to the prediction and analysis parameters. If you're using your own data, you'll need to upload or connect it before proceeding.

# Insights

 - Integrate easily into existing customer analytics.
 - Discover new insights with machine learning, without specialized data science skills.
 - Understand current and historical trends quickly using interactive visualizations.
