The [demography_feature_engineering](flow_zone:7yZk0IS) Flow Zone aggregates the census data for each store's trade area. The census data covers the following categories:
- Age
- Gender
- Nationality
- Occupation or Employment
- Income
- Workplace
- Household Composition

Please refer to the [Data Model](article:15) article for more details about this data.

Each category is represented by numerical variables, and each variable counts the people in one sub-population. For example, the `gender_male` variable holds the number of people identifying as male.

The first step in building a dataset where each store has sub-population information is to change the data type of the `census_polygon` column in the [census_data_prepared](dataset:census_data_prepared) dataset from a string to a geometry type.

This conversion is required to join the census data ([census_data_prepared](dataset:census_data_prepared)) with the store data ([store_trade_areas](dataset:store_trade_areas)) using the geojoin recipe [compute_stores_with_census_data](recipe:compute_stores_with_census_data). The join matches the `store_trade_area` column on the store side with the `census_polygon` column on the census side. The Project Setup offers three options for the matching condition:
- Join census areas if they are contained in the trade area.
- Join census areas if they intersect the trade area.
- Join census areas if they are contained in or intersect the trade area.

Here is the view of the Project Setup:
![geojoin parameters.png](gj09ptDrEqtu)
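
To illustrate how the matching options differ, here is a minimal sketch in plain Python. It is not the geojoin recipe itself: it assumes axis-aligned rectangles as stand-ins for real polygons, and the names `Box`, `contains`, `intersects`, and `match_census_areas`, as well as the mode labels, are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle, a simplified stand-in for a polygon."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def contains(outer: Box, inner: Box) -> bool:
    """True when `inner` lies entirely inside `outer`."""
    return (outer.min_x <= inner.min_x and outer.min_y <= inner.min_y
            and outer.max_x >= inner.max_x and outer.max_y >= inner.max_y)


def intersects(a: Box, b: Box) -> bool:
    """True when the two rectangles share at least one point."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


def match_census_areas(trade_area: Box, census_areas: Dict[str, Box], mode: str) -> List[str]:
    """Return the ids of census areas matched to the trade area.

    `mode` is a hypothetical label for the chosen option.
    """
    predicates = {
        "contained": contains,
        "intersects": intersects,
        "contained_or_intersects": lambda t, c: contains(t, c) or intersects(t, c),
    }
    predicate = predicates[mode]
    return [cid for cid, box in census_areas.items() if predicate(trade_area, box)]


# Made-up example: A is inside the trade area, B overlaps its edge, C is far away.
trade_area = Box(0, 0, 10, 10)
census_areas = {
    "A": Box(1, 1, 2, 2),
    "B": Box(8, 8, 12, 12),
    "C": Box(20, 20, 21, 21),
}
contained_ids = match_census_areas(trade_area, census_areas, "contained")
intersecting_ids = match_census_areas(trade_area, census_areas, "intersects")
```

With these simplified definitions, the containment option keeps fewer census areas than the intersection option, because a census area that only overlaps the edge of the trade area is excluded.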

We then retain only the `store_id` information from the [store_trade_areas](dataset:store_trade_areas) dataset and all census information from the [census_data_prepared](dataset:census_data_prepared) dataset.

At this point, the join recipe output dataset, [stores_with_census_data](dataset:stores_with_census_data), contains one row per store for each census area matched to that store's trade area.

To get only one row per store, we use the group recipe [compute_stores_with_census_data_aggregated](recipe:compute_stores_with_census_data_aggregated) to aggregate the data for each trade area, sum the sub-population counts, and calculate an average income. This gives us the [stores_with_census_data_aggregated](dataset:stores_with_census_data_aggregated) dataset.
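
The aggregation can be sketched in plain Python as follows. This is only an illustration of the grouping logic, not the group recipe itself. The rows are made up, the `gender_female` and `income` column names are assumptions, and the average income is assumed to be a plain unweighted mean over the matched census areas.

```python
from collections import defaultdict

# Made-up rows shaped like stores_with_census_data: one row per (store, census area).
rows = [
    {"store_id": "S1", "gender_male": 120, "gender_female": 130, "income": 30000.0},
    {"store_id": "S1", "gender_male": 80, "gender_female": 70, "income": 40000.0},
    {"store_id": "S2", "gender_male": 50, "gender_female": 60, "income": 25000.0},
]

# Hypothetical list of sub-population count columns.
COUNT_COLUMNS = ["gender_male", "gender_female"]


def aggregate_by_store(rows, count_columns, income_column="income"):
    """Sum the count columns and average the income column for each store_id."""
    sums = defaultdict(lambda: dict.fromkeys(count_columns, 0))
    incomes = defaultdict(list)
    for row in rows:
        store = row["store_id"]
        for col in count_columns:
            sums[store][col] += row[col]
        incomes[store].append(row[income_column])
    return {
        store: {**sums[store], income_column: sum(incomes[store]) / len(incomes[store])}
        for store in sums
    }


aggregated = aggregate_by_store(rows, COUNT_COLUMNS)
```

After this step, `aggregated` holds exactly one entry per store, which mirrors the one-row-per-store shape of the aggregated dataset.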

At this stage, the Project Setup gives the user two choices for aggregating the census data:
- Compute the sum of the sub-population counts.
- Calculate a ratio of the sub-population to the total population in the trade area.

In the first case, the Flow Zone looks like this, and we can directly join the [stores_with_census_data_aggregated](dataset:stores_with_census_data_aggregated) dataset to the [distinct_stores_prepared](dataset:distinct_stores_prepared) dataset.
![demographic-feature-eng-zone.png](49YM7G8bQgfP)

In the second case, the Flow Zone looks like this, and we calculate the sub-population ratios with the [compute_stores_with_census_data_aggregated_percentage](recipe:compute_stores_with_census_data_aggregated_percentage) Python recipe. We then join the output dataset to the [distinct_stores_prepared](dataset:distinct_stores_prepared) dataset.
![demography feature engineering ratio.png](wQl9RlcrChYz)
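
The ratio calculation can be sketched as below. This is not the code of the Python recipe, which may differ. It assumes a hypothetical `total_population` column holding the trade area's total population, and the function name `to_ratios` is also hypothetical.

```python
def to_ratios(aggregated_row, count_columns, total_column="total_population"):
    """Replace each sub-population count with its share of the total population.

    Returns a new dict; columns not listed in `count_columns` are left unchanged.
    A ratio is set to None when the total is zero, to avoid dividing by zero.
    """
    total = aggregated_row[total_column]
    result = dict(aggregated_row)
    for col in count_columns:
        result[col] = aggregated_row[col] / total if total else None
    return result


# Made-up aggregated row for one store.
store_row = {
    "store_id": "S1",
    "total_population": 400,
    "gender_male": 200,
    "gender_female": 200,
    "income": 35000.0,
}
ratio_row = to_ratios(store_row, ["gender_male", "gender_female"])
```

Expressing sub-populations as ratios makes trade areas of different sizes comparable, whereas raw sums also reflect how populous each trade area is.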

Thus, the final dataset [stores_with_all_census_data](dataset:stores_with_all_census_data) is ready to be clustered in the next Flow Zone.

