# Value Clustering & Train/Test Split

Train a clustering algorithm on average monthly value and apply it to the current and future CLV.

![train:test_split.png](laOhYSzBkH5w)

Because the two CLVs cover different durations (the lookback window for current_clv, and the lookback plus the forward window for future_clv), clustering on the raw value would not be consistent: the future value accumulates over a longer period and is therefore shifted toward higher values.

To solve this and keep the system resilient to asymmetrical windows, we apply the clustering to the average monthly value per customer. Using the average also adds a dynamic aspect to the value of a customer: it distinguishes customers whose value is increasing at a high rate from those whose value is increasing at a low rate.
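As a rough illustration of why the monthly average makes the two windows comparable, here is a minimal sketch. The function name, the window lengths (12 and 6 months) and the amounts are hypothetical and are not taken from the Dataiku application.

```python
def average_monthly_value(total_value: float, window_months: float) -> float:
    """Normalise a cumulative value by the length of the window it covers."""
    if window_months <= 0:
        raise ValueError("window_months must be positive")
    return total_value / window_months


# Hypothetical windows: 12-month lookback, 6-month forward.
lookback_months, forward_months = 12, 6

# Hypothetical customer spending at a steady rate in both windows.
lookback_value, forward_value = 600.0, 300.0

# current_clv covers the lookback window only.
current_avg = average_monthly_value(lookback_value, lookback_months)

# future_clv covers the lookback and the forward window together.
future_total = lookback_value + forward_value
future_avg = average_monthly_value(future_total, lookback_months + forward_months)

# A second hypothetical customer whose spending accelerates in the forward window.
growing_future_avg = average_monthly_value(600.0 + 600.0, lookback_months + forward_months)
```

In this sketch the raw future total (900) is higher than the raw current total (600) even though the customer has not changed behaviour, whereas the two monthly averages are equal. Only the accelerating customer ends up with a higher future average.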

Two clustering methods are available, K-means and quantiles: [Clustering](article:3)
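The sketch below shows the general idea of fitting each method once and then applying the same fitted boundaries to both the current and the future average monthly value. It is a simplified, standard-library illustration, not the implementation used by the solution: the helper names and data are hypothetical, and it assumes the clustering is fitted on the current averages.

```python
import bisect
import statistics


def fit_kmeans_1d(values, k, n_iter=50):
    """Tiny 1-D Lloyd's algorithm; returns the centroids sorted ascending."""
    ordered = sorted(values)
    # Initialise centroids at evenly spaced positions of the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[nearest].append(v)
        # Keep the old centroid if a group happens to be empty.
        updated = [statistics.fmean(g) if g else c for g, c in zip(groups, centroids)]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def predict_kmeans_1d(centroids, value):
    """Cluster index of the nearest centroid (0 = lowest value cluster)."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def fit_quantile_edges(values, k):
    """Return the k - 1 cut points that split the values into k quantile groups."""
    return statistics.quantiles(values, n=k)


def predict_quantile(edges, value):
    """Index of the quantile group; a value equal to an edge goes to the lower group."""
    return bisect.bisect_left(edges, value)


# Hypothetical average monthly values per customer.
current_avg = [5.0, 6.0, 7.0, 48.0, 50.0, 52.0, 195.0, 200.0, 205.0]
future_avg = [6.0, 55.0, 190.0]

# Fit once (assumed here: on the current averages), then reuse for both.
centroids = fit_kmeans_1d(current_avg, k=3)
edges = fit_quantile_edges(current_avg, k=3)

current_kmeans = [predict_kmeans_1d(centroids, v) for v in current_avg]
future_kmeans = [predict_kmeans_1d(centroids, v) for v in future_avg]
future_quantile = [predict_quantile(edges, v) for v in future_avg]
```

Because the boundaries are fitted only once, a given average monthly value always maps to the same cluster, whether it comes from the current or the future CLV.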


We then perform a split stratified on current_clv_cluster so that the train and test datasets keep the same cluster distribution.
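A stratified split can be sketched as follows: records are grouped by cluster, and the same share of each group is sent to the test set. This is an illustrative standard-library version with hypothetical data and a hypothetical `test_ratio`. In a scikit-learn workflow the same effect is usually obtained with the `stratify` argument of `train_test_split`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(records, key, test_ratio=0.2, seed=0):
    """Split records so that each value of `key` keeps the same share in train and test."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for record in records:
        by_cluster[record[key]].append(record)

    train, test = [], []
    for cluster_records in by_cluster.values():
        shuffled = cluster_records[:]
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_ratio)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Hypothetical, deliberately imbalanced clusters: 60 / 30 / 10 customers.
customers = [
    {"customer_id": i, "current_clv_cluster": 0 if i < 60 else 1 if i < 90 else 2}
    for i in range(100)
]

train, test = stratified_split(customers, key="current_clv_cluster", test_ratio=0.2)

train_counts = Counter(r["current_clv_cluster"] for r in train)
test_counts = Counter(r["current_clv_cluster"] for r in test)
```

With these inputs, each cluster contributes 20% of its customers to the test set, so the 60/30/10 proportions are preserved on both sides of the split.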

For performance purposes and to manage possible scaling issues, you can set the number of records in the train/test datasets in the Dataiku application. The entire dataset is then randomly sampled before the clustering algorithm is trained and the split is performed.
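The sampling step amounts to drawing a uniform random subset before any fitting or splitting happens. A minimal sketch, where `max_records` and `seed` are hypothetical parameter names rather than the names used in the application:

```python
import random


def sample_records(records, max_records, seed=0):
    """Return at most `max_records` records, drawn uniformly without replacement."""
    if len(records) <= max_records:
        return list(records)
    return random.Random(seed).sample(records, max_records)


# Hypothetical dataset of 1,000 customer ids, capped at 100 records.
all_customers = list(range(1000))
sampled = sample_records(all_customers, max_records=100)

# A dataset already below the cap is returned unchanged.
small = sample_records([1, 2, 3], max_records=100)
```

Sampling first means that both the clustering boundaries and the train/test split are computed on the reduced dataset.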

When using quantiles, groups are not always balanced because of repeated values (especially customers with 0 future value).
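The effect of repeated values on quantile groups can be reproduced with a small, hypothetical sample in which most customers have a future value of 0. The sketch uses the standard library only, and assumes that a value equal to a cut point is assigned to the lower group.

```python
import bisect
import statistics
from collections import Counter

# Hypothetical future average monthly values: 6 of 10 customers are at 0.
future_avg = [0.0] * 6 + [10.0, 20.0, 30.0, 40.0]

# Three cut points for four quantile groups.
edges = statistics.quantiles(future_avg, n=4)

# A value equal to a cut point goes to the lower group.
labels = [bisect.bisect_left(edges, v) for v in future_avg]
sizes = Counter(labels)
```

Here the first two cut points both fall on 0, so all zero-value customers land in a single group of six, one group stays empty, and the remaining groups hold two customers each instead of the two or three that an even split would give.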