# Flow zone presentation

The [real_estate_pricing](flow_zone:xyEwdSK) zone is dedicated to training the machine learning model that predicts the price of any property.

![real-estate-pricing.png](4pyZFK83rD1s)

The recipe [split_properties_with_geospatial_features_windows](recipe:split_properties_with_geospatial_features_windows) splits the data coming from [properties_windows_prepared](dataset:properties_windows_prepared) into [train_set](dataset:train_set) (80% of the data) and [test_set](dataset:test_set) (the remaining 20%). The data is first sorted by ascending transaction date and then split, so the test set only contains transactions that are more recent than those in the train set. We do this because we want our model to make accurate predictions on subsequent, unseen data.
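
As an illustration only, a time-ordered 80/20 split can be sketched in Python with pandas. This is not the code of the split recipe itself: the `time_ordered_split` helper, the `transaction_date` and `price` column names, and the sample values are all hypothetical.

```python
import pandas as pd


def time_ordered_split(df, date_col, train_ratio=0.8):
    """Sort rows by date, then cut: oldest rows for training, newest for testing."""
    ordered = df.sort_values(date_col, kind="stable").reset_index(drop=True)
    cut = round(len(ordered) * train_ratio)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Hypothetical sample: ten transactions, deliberately stored out of order.
properties = pd.DataFrame(
    {
        "transaction_date": pd.to_datetime(
            [
                "2021-03-01", "2021-01-15", "2021-07-20", "2021-02-10", "2021-05-05",
                "2021-09-30", "2021-04-12", "2021-08-08", "2021-06-18", "2021-10-25",
            ]
        ),
        "price": [210000, 185000, 320000, 150000, 275000,
                  410000, 198000, 305000, 260000, 390000],
    }
)

train_set, test_set = time_ordered_split(properties, "transaction_date")
```

With such a split, every row of `test_set` is dated after every row of `train_set`, which mimics scoring future transactions.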


The model is trained on [train_set](dataset:train_set). We want our model to be accurate no matter the price of the property, so we trained it to optimize the **[mean absolute percentage error](https://en.wikipedia.org/wiki/Mean_absolute_percentage_error) (MAPE)**, which measures errors relative to the actual price. Once trained, the model was deployed in the flow as the [real_estate_pricing_model](saved_model:GxEFXycR). It is used by the evaluate recipe [evaluate_on_test_set](recipe:evaluate_on_test_set), which is in charge of:
- Scoring the [test_set](dataset:test_set), which produces the [test_set_scored](dataset:test_set_scored) dataset.
- Recording the history of the [test_set](dataset:test_set) scoring metrics in the [test_set_prediction_metrics](dataset:test_set_prediction_metrics) dataset. The idea is that if the flow were fed with new data, we could use this evaluate recipe to see whether the model performance is drifting. This is what the last two recipes do:
  - The Python recipe [compute_first_and_last_metrics](recipe:compute_first_and_last_metrics) aggregates the data into a single row containing the model performance at the first scoring date and at the last one.
  - The prepare recipe [compute_model_drift](recipe:compute_model_drift) then computes the drift between the first and the last deployment. The metric we wanted to track here is the **MAPE**: we observe a drift of 2.47% between the first and the last model deployment, which means our current **MAPE** is 2.47% higher on the [test_set](dataset:test_set) than it was at the first recipe evaluation. The model therefore performs slightly worse on this metric, although some investigation of its behavior showed that it is better than the previous ones.
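
To make the metric and the drift computation concrete, here is a minimal Python sketch. It is not the code of the recipes above: the `mape` and `relative_drift` helpers and the sample prices are hypothetical, and the sketch assumes the drift is the relative change between the first and the last MAPE, whereas the actual recipe may compute a plain difference instead.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def relative_drift(first_metric, last_metric):
    """Relative change of a metric between the first and the last evaluation.

    Assumption: drift is (last - first) / first. A plain difference
    (last - first) would be an equally plausible definition.
    """
    return (last_metric - first_metric) / first_metric


# Hypothetical actual prices and predictions from two model deployments.
actual_prices = [100000.0, 200000.0, 400000.0]
first_predictions = [90000.0, 220000.0, 400000.0]
last_predictions = [90000.0, 220000.0, 360000.0]

first_mape = mape(actual_prices, first_predictions)
last_mape = mape(actual_prices, last_predictions)
drift = relative_drift(first_mape, last_mape)
```

Because each error is divided by the actual price, a 10,000 error on a 100,000 property weighs as much as a 40,000 error on a 400,000 property, which is why this metric suits a model that should be accurate across all price ranges. A positive `drift` means the MAPE has increased, that is, the latest deployment is less accurate on this metric.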