# Introduction
The Solution runs only one segmentation strategy at a time, so only the datasets required by the strategy you choose are mandatory. This page describes the data model.

N.B.: This Solution does not support spaces in column names or string values.
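The constraint above can be checked before loading data. The following is a minimal sketch, assuming the rows are available as a list of Python dictionaries; the function name `find_space_violations` and the sample rows are hypothetical and not part of the Solution.

```python
def find_space_violations(rows):
    """Return readable problems for column names or string values containing a space."""
    problems = []
    # Check every column name that appears in any row.
    columns = sorted({column for row in rows for column in row})
    for column in columns:
        if " " in column:
            problems.append(f"column name {column!r} contains a space")
    # Check every string value; non-string values are ignored.
    for index, row in enumerate(rows):
        for column, value in row.items():
            if isinstance(value, str) and " " in value:
                problems.append(
                    f"row {index}, column {column!r}: value {value!r} contains a space"
                )
    return problems


# Hypothetical sample data with one bad column name and one bad value.
sample_rows = [
    {"store_id": "S001", "store name": "Main_Street"},
    {"store_id": "S 002", "store name": "Harbour"},
]
problems = find_space_violations(sample_rows)
for problem in problems:
    print(problem)
```

An empty list means no spaces were found in the rows that were checked.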

Two datasets are mandatory, regardless of which segmentation strategy you use:
- [stores](dataset:stores)
- [transactions](dataset:transactions)

One additional dataset is mandatory for the demographic segmentation branch:
- [census_data](dataset:census_data)

One additional dataset is mandatory for the sales per category segmentation branch:
- [products](dataset:products)
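
The dataset requirements listed above can be summarized as a small lookup. This is only an illustrative sketch: the strategy keys `"demographic"` and `"sales_per_category"` and the function `required_datasets` are hypothetical names chosen here, not identifiers used by the Solution.

```python
# Datasets needed whatever the strategy is.
ALWAYS_REQUIRED = ["stores", "transactions"]

# Extra dataset per strategy (keys are hypothetical labels).
STRATEGY_DATASETS = {
    "demographic": ["census_data"],
    "sales_per_category": ["products"],
}


def required_datasets(strategy):
    """Return the list of mandatory dataset names for one strategy."""
    if strategy not in STRATEGY_DATASETS:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return ALWAYS_REQUIRED + STRATEGY_DATASETS[strategy]


print(required_datasets("demographic"))
print(required_datasets("sales_per_category"))
```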

# Stores
The stores should all be situated in the same country.  
**One row** corresponds to **one store**.

![stores dataset.png](Y29tiLE4lTt6)

- The `store_id` is stored as a `string`.
- The `latitude` and `longitude` are stored as `double`, given in decimal degrees from the Geographic Coordinate System. A positive latitude is north of the equator, and a negative latitude is south of the equator. A positive longitude is east of the Prime Meridian, and a negative longitude is west of the Prime Meridian.
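
The type and range rules above can be expressed as a row-level check. The sketch below assumes each store is a Python dictionary; `validate_store` and the sample rows are hypothetical. The bounds used are the standard ones for decimal degrees: latitude from -90 to 90 and longitude from -180 to 180.

```python
def validate_store(row):
    """Return a list of problems found in one row of the stores dataset."""
    errors = []
    if not isinstance(row.get("store_id"), str):
        errors.append("store_id must be a string")
    latitude = row.get("latitude")
    longitude = row.get("longitude")
    # Positive latitude is north of the equator, negative is south.
    if not isinstance(latitude, float) or not -90.0 <= latitude <= 90.0:
        errors.append("latitude must be a double between -90 and 90")
    # Positive longitude is east of the Prime Meridian, negative is west.
    if not isinstance(longitude, float) or not -180.0 <= longitude <= 180.0:
        errors.append("longitude must be a double between -180 and 180")
    return errors


# Hypothetical rows: the first is valid, the second has two problems.
good_store = {"store_id": "S001", "latitude": 48.8566, "longitude": 2.3522}
bad_store = {"store_id": 42, "latitude": -33.87, "longitude": 251.2}

print(validate_store(good_store))
print(validate_store(bad_store))
```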

# Transactions
**One row** corresponds to **one product bought in one transaction**. We consider a transaction to be all the products that a customer bought at a certain time in a certain store.

![transactions dataset.png](S2AwI6TFgJ6g)
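
Because one row is one product within one transaction, rows that share a `transaction_id` belong to the same basket. The sketch below illustrates this by summing `product_purchase_price` per transaction; the function `basket_totals` and the sample rows are hypothetical, and the rows are assumed to be Python dictionaries.

```python
from collections import defaultdict


def basket_totals(rows):
    """Sum product_purchase_price over all rows sharing a transaction_id."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["transaction_id"]] += row["product_purchase_price"]
    return dict(totals)


# Hypothetical rows: transaction T1 contains two products, T2 contains one.
rows = [
    {"transaction_id": "T1", "store_id": "S001", "product_id": "101", "product_purchase_price": 2.5},
    {"transaction_id": "T1", "store_id": "S001", "product_id": "205", "product_purchase_price": 4.0},
    {"transaction_id": "T2", "store_id": "S001", "product_id": "101", "product_purchase_price": 1.25},
]
totals = basket_totals(rows)
print(totals)
```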

- The `transaction_date` should be in ISO-8601 format with a timezone indicator, following the pattern `YYYY-MM-DDTHH:MM:SS.SSSZ` (for example, `2023-01-15T14:30:00.000Z`).
- The `store_id` should be the same as in the [stores](dataset:stores) dataset.
- The `product_id` should be a `string` even if it is just a number.
- The `product_purchase_price` should be a `double`.
- The `transaction_id` should be a `string` even if it is just a number.
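
The `transaction_date` requirement can be tested with the Python standard library. This is a minimal sketch, not the Solution's own parsing logic; `parse_transaction_date` is a hypothetical helper. It rejects values that lack a timezone indicator.

```python
from datetime import datetime


def parse_transaction_date(text):
    """Parse an ISO-8601 string and require a timezone indicator."""
    # Python versions before 3.11 do not accept a trailing "Z",
    # so it is rewritten as the equivalent "+00:00" offset.
    normalised = text[:-1] + "+00:00" if text.endswith("Z") else text
    parsed = datetime.fromisoformat(normalised)
    if parsed.tzinfo is None:
        raise ValueError(f"{text!r} has no timezone indicator")
    return parsed


parsed = parse_transaction_date("2023-01-15T14:30:00.000Z")
print(parsed.isoformat())

try:
    parse_transaction_date("2023-01-15T14:30:00")
except ValueError as error:
    print(error)
```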

# Census Data

This dataset is mandatory for the demographic segmentation strategy. The census data covers a single country, the one where the stores are located. The census data required by the Solution is available for the following countries: Canada, USA, England, and France. For other countries, please check availability. All columns here are mandatory. Census data is usually collected by small areas that together cover the entire country.  
**One row** corresponds to one **census area**.

![censusdata part 1.png](ptnfUdI7XjdO)
![censusdata part 2.png](MBtNryvB9yKC)

- The Dataiku meaning of `census_polygon` should be `Geometry` and the type should be `string`. Its value must follow the Well-Known Text (WKT) format: `MULTIPOLYGON(((longitude1 latitude1, longitude2 latitude2, … )))`. The Coordinate Reference System must be **EPSG:4326**. If your data uses a different coordinate system, you can convert it with the "Change coordinate system" step of the Prepare recipe.
- The `income` is an average per household per census area.
- The `age_A_B` column represents an age group defined by the bounds A and B. It is up to the user to define the groups. In the demo data, each age group spans 5 years. The name of the column has to start with "age_".
- The `occupation_X` column can store any kind of sub-population information regarding a job, job domain, job status, etc. It is up to the user to define it. The name of the column has to start with "occupation_".
- The `native` column stores the sub-population that holds citizenship of the country.
- The `non-native` column stores the sub-population that does not hold citizenship of the country.
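
Some of the rules above lend themselves to quick sanity checks. The sketch below is an assumption-laden illustration, not a full WKT parser: `check_census_polygon` only tests the `MULTIPOLYGON` prefix, balanced parentheses, and whether coordinates fall within longitude/latitude bounds (a hint that the data may not be in EPSG:4326), and `group_columns` sorts column names by the required prefixes. Both function names and all sample values are hypothetical.

```python
import re

# Matches "longitude latitude" pairs written as plain decimal numbers.
NUMBER_PAIR = re.compile(r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)")


def check_census_polygon(wkt):
    """Return a list of problems found in one census_polygon value."""
    problems = []
    text = wkt.strip()
    if not text.upper().startswith("MULTIPOLYGON"):
        problems.append("value does not start with MULTIPOLYGON")
    if text.count("(") != text.count(")"):
        problems.append("unbalanced parentheses")
    for lon_text, lat_text in NUMBER_PAIR.findall(text):
        lon, lat = float(lon_text), float(lat_text)
        if not (-180 <= lon <= 180 and -90 <= lat <= 90):
            # Large values usually indicate a projected CRS rather than EPSG:4326.
            problems.append("coordinates outside longitude/latitude bounds")
            break
    return problems


def group_columns(columns):
    """Group column names by the prefixes required for age and occupation."""
    return {
        "age": [c for c in columns if c.startswith("age_")],
        "occupation": [c for c in columns if c.startswith("occupation_")],
    }


degrees = "MULTIPOLYGON(((-0.1 51.5, -0.1 51.6, 0.0 51.6, -0.1 51.5)))"
projected = "MULTIPOLYGON(((530000 180000, 530100 180000, 530100 180100, 530000 180000)))"
print(check_census_polygon(degrees))
print(check_census_polygon(projected))

columns = ["census_polygon", "income", "age_0_4", "age_5_9", "occupation_student", "native", "non-native"]
print(group_columns(columns))
```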

You can view the [census_data](dataset:census_data) dataset to see an example of a census dataset that can be used for Store Segmentation.

We do not recommend having too many age and occupation groups, as it could overly complicate the analysis.

# Products

This dataset is mandatory for the sales per category segmentation strategy.  
**One row** corresponds to one **product**.

![products dataset.png](iiK5Kwr2G3uu)

- The `product_id` values should match those in the [transactions](dataset:transactions) dataset.
- The `target_category` is the product category level at which you want to analyze sales performance across the stores. For example, suppose "PERSONA" is the name of a category level with the following values: "MEN", "WOMEN", and "KIDS". If the user is a category manager for the category "MEN", then the `target_category` should contain each product's value at the "PERSONA" level.
- The `sub_category_1` corresponds to the first sub-category level under the target category.
- The `sub_category_X` corresponds to other sub-category levels.
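
Since `product_id` has to match between the two datasets, it can be useful to list the identifiers that appear in transactions but not in products. The following is a minimal sketch with hypothetical names (`missing_product_ids` and the sample rows), assuming rows are Python dictionaries.

```python
def missing_product_ids(transaction_rows, product_rows):
    """Return product_id values used in transactions but absent from products."""
    known = {row["product_id"] for row in product_rows}
    used = {row["product_id"] for row in transaction_rows}
    return sorted(used - known)


# Hypothetical sample data: product "999" is sold but not described.
product_rows = [
    {"product_id": "101", "target_category": "MEN", "sub_category_1": "SHIRTS"},
    {"product_id": "205", "target_category": "KIDS", "sub_category_1": "SHOES"},
]
transaction_rows = [
    {"transaction_id": "T1", "product_id": "101"},
    {"transaction_id": "T1", "product_id": "999"},
    {"transaction_id": "T2", "product_id": "205"},
]
missing = missing_product_ids(transaction_rows, product_rows)
print(missing)
```

The same pattern applies to `store_id` between the transactions and stores datasets.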

You can view the [products](dataset:products) dataset to see an example of a product dataset that can be used for Store Segmentation.

We do not recommend having too many sub-category columns, as it could overly complicate the analysis.

# Data Sources

The census data provided for the demo comes from the [Nomis website](https://www.nomisweb.co.uk/sources/census_2021_bulk), which provides official census and labor market statistics.
