The Subscription prediction zone computes the predicted subscription probability. Full details about the model are available [here](saved_model:S2oMnaYP).

![Screenshot 2023-10-05 at 11.18.11.png](DDMjBXlhC9Yy)

**Note**: Subscription datasets are typically large and imbalanced. In some cases, this volume of data makes class rebalance sampling in Visual ML difficult. To improve the solution's performance, a series of visual recipes undersamples the data before the model is created.

The [split recipe](recipe:split_train_data) separates the observations that resulted in a subscription from those that did not, creating two datasets. The following [filter recipe](recipe:compute_train_data_false_sample) randomly selects 100,000 rows from the dataset without subscriptions, and these rows are then [stacked](recipe:compute_train_data_stacked) with the dataset that contains the subscriptions.
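The split, sample, and stack steps above can be sketched in pandas. This is only an illustration of the undersampling idea, not the recipes' actual implementation; the function name, the `subscription` target column name, and the toy data are assumptions made for the example.

```python
import pandas as pd

def undersample_negatives(df, target="subscription", n_negatives=100_000, seed=0):
    """Keep every positive row and a random sample of the negative rows."""
    positives = df[df[target] == 1]   # observations that led to a subscription
    negatives = df[df[target] == 0]   # observations that did not
    # Sample without replacement, capped at the number of available negatives.
    n = min(n_negatives, len(negatives))
    sampled = negatives.sample(n=n, random_state=seed)
    # Stack the two parts back into a single training table.
    return pd.concat([positives, sampled], ignore_index=True)

# Toy data (hypothetical): 5 subscriptions among 100 observations.
toy = pd.DataFrame({"subscription": [1] * 5 + [0] * 95, "x": range(100)})
balanced = undersample_negatives(toy, n_negatives=20)
```

All positive rows are kept because they are the scarce class; only the majority class is reduced.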

The [classification model](saved_model:S2oMnaYP) first orders the training dataset by the observation date column. The model is then trained on the first 80% of the data points in the [train dataset](dataset:train_data) and tested on the remaining 20%. 
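A time-ordered 80/20 split of this kind can be sketched as follows. The column name `observation_date` and the helper `time_ordered_split` are assumptions for illustration; the model's own split is configured in Visual ML rather than in code.

```python
import pandas as pd

def time_ordered_split(df, date_col="observation_date", train_ratio=0.8):
    """Sort by date, then use the earliest rows for training and the latest for testing."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    cutoff = int(len(ordered) * train_ratio)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]

# Toy data (hypothetical): ten daily observations, deliberately stored newest first.
toy = pd.DataFrame({
    "observation_date": pd.date_range("2023-01-01", periods=10, freq="D")[::-1],
    "subscription": [0, 1] * 5,
})
train, test = time_ordered_split(toy)
```

Splitting on time rather than at random means the model is always tested on observations that are more recent than those it was trained on.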

The [score recipe](recipe:score_to_predict_data_1) is used to apply the classification model to the dataset containing the values to predict ([to_predict_data](dataset:to_predict_data)) and generate the predicted values. 

## Hyperparameter and model evaluation

Model hyperparameters are optimized on the **recall metric**. Recall, also known as the true positive rate (TPR), is the share of observations in the subscription class (subscription = 1) that the model correctly identifies as belonging to that class.
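As a small worked example of this definition, recall is the number of true positives divided by the sum of true positives and false negatives. The function and the labels below are purely illustrative.

```python
def recall(y_true, y_pred):
    """Recall = TP / (TP + FN), computed for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Return 0.0 when there are no positive observations at all.
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical labels: four actual subscribers, three of them found by the model.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
score = recall(y_true, y_pred)  # 3 / 4 = 0.75
```

The false positive in the fifth position does not affect recall, which only looks at the actual subscribers.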

In binary classification, most models don’t output a single binary answer but a continuous “score of being positive.” You then need to select a threshold on this score, above which a sample is considered positive. In this solution, the threshold for the target class is optimized according to the **cost matrix**, which gives a large weight to observations of the subscription class (subscription = 1) that are correctly classified and a much smaller weight to those that are misclassified.
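The idea of choosing a threshold from a cost matrix can be sketched as a simple search: for each candidate threshold, count the four outcome types, weight them, and keep the threshold with the highest total. The gain values, function name, and toy scores below are assumptions for illustration and are not the values configured in this solution.

```python
import numpy as np

def best_threshold(y_true, scores, gains, thresholds=None):
    """Return the (threshold, total gain) pair that maximizes the cost-matrix gain."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 101)
    best_t, best_gain = None, -np.inf
    for t in thresholds:
        # Here a sample is treated as positive when its score reaches the threshold.
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        total = gains["tp"] * tp + gains["tn"] * tn + gains["fp"] * fp + gains["fn"] * fn
        if total > best_gain:
            best_t, best_gain = t, total
    return float(best_t), float(best_gain)

# Hypothetical cost matrix: finding a subscriber is worth far more than the other outcomes.
gains = {"tp": 5, "tn": 0, "fp": -1, "fn": -2}
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
threshold, gain = best_threshold(y_true, scores, gains, thresholds=[0.2, 0.5, 0.9])
```

With these illustrative weights the lowest candidate threshold wins, because accepting one false positive costs less than missing a subscriber.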

## Features handling

Most columns from the [train dataset](dataset:train_data) are used as input in the classification model. Only the following features are automatically removed from the analysis: 
- **customer_id** (unique identifier)
- **creation_date** (replaced by the account_age_at_observation_date feature)
- **birth_date** (replaced by the age_at_observation_date feature)
- **first_name** (irrelevant)
- **last_name** (irrelevant)
- **start_date** (empty if subscription = 0)
- **end_date** (empty if subscription = 0)
- **product_id** (irrelevant)
- **revenue** (empty if subscription = 0)
- **balance** (empty if subscription = 0)
- **has_product** (always 0)
- **product_type** (correlated with the product column)
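Outside Visual ML, the same exclusion could be reproduced by dropping these columns before training. The sketch below assumes a pandas DataFrame; the helper name and the toy data are hypothetical, and the target column would still need to be separated from the inputs.

```python
import pandas as pd

# Columns excluded from the model inputs, as listed above.
EXCLUDED_FEATURES = [
    "customer_id", "creation_date", "birth_date", "first_name", "last_name",
    "start_date", "end_date", "product_id", "revenue", "balance",
    "has_product", "product_type",
]

def select_model_inputs(df):
    """Drop the excluded columns; columns missing from df are silently ignored."""
    return df.drop(columns=EXCLUDED_FEATURES, errors="ignore")

# Toy data (hypothetical) with two excluded columns and two retained ones.
toy = pd.DataFrame({
    "customer_id": [1, 2],
    "first_name": ["a", "b"],
    "age_at_observation_date": [30, 41],
    "account_age_at_observation_date": [2, 5],
})
inputs = select_model_inputs(toy)
```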
