# Project Goal

The goal of this project is to extract various POI (Points of Interest) from open data sources (OSM and Foursquare) in order to build a relevant geographical segmentation of Manhattan.

# How do we do this

<p  class="text-center"> <a href="flow/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />FLOW</a> </p>

We have two data sources: OpenStreetMap (OSM) and Foursquare.

- We assume that the **[Ways](datasets/ways/explore/)** and **[Nodes](datasets/nodes/explore/)** tables from OSM are already available. They contain all the information about streets and buildings in Manhattan, along with their locations.

- We retrieve **[Foursquare data](datasets/foursquare_iris/explore/)** from the Foursquare public API.

We use the **[census block dataset](http://data.beta.nyc/dataset/2010-census-block-groups-polygons/resource/9cd1a0a0-d75a-482a-a91d-763857260651)** from the New York City data portal as a grid for the borough of Manhattan.

Every POI retrieved from OSM and Foursquare can be associated with a specific **[census block location](datasets/NYC_census_blocks_prepared/explore/)** in Manhattan.

We compute **[features](datasets/heatmap_all_poi_nyc/explore/)** for each census block of Manhattan by aggregating the POI we retrieved. Businesses and locations are aggregated by type. For example, each block gets a count of food-related locations from both OSM and Foursquare.
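
As a rough illustration of this aggregation, the sketch below counts POI per census block, source, and category. The record layout, the block identifiers, and the category names are assumptions made for the example, not the project's actual schema.

```python
from collections import Counter, defaultdict

# Hypothetical POI records: (census block id, source, category).
pois = [
    ("block_A", "osm", "food"),
    ("block_A", "foursquare", "food"),
    ("block_A", "osm", "shop"),
    ("block_B", "foursquare", "culture"),
]


def count_poi_by_block(records):
    """Return {block_id: Counter({"<source>_<category>": count})}."""
    features = defaultdict(Counter)
    for block_id, source, category in records:
        features[block_id][f"{source}_{category}"] += 1
    return features


features = count_poi_by_block(pois)
for block_id, counts in sorted(features.items()):
    print(block_id, dict(counts))
```

Each `<source>_<category>` key then becomes one feature column for the block.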

We then create a segmentation with **[K-means](savedmodels/3CPb8bSE/versions/)** clustering.

# Explore the project
- Start with the flow to understand the overall structure of the project.

<p  class="text-center"> <a href="flow/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />FLOW</a> </p>

**Tag: block_data**
- Census block data is retrieved from an online source, and we use a preparation recipe to recover polygon information for Manhattan only, clean it and generate some geopoints.
<p  class="text-center"> <a href="datasets/NYC_census_blocks_prepared/explore/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />NYC census block data</a> </p>

## Foursquare part of the Flow
**Tag: foursquare_data**

- You can have a look at how we extract Foursquare data with their public API. (At the beginning of the code, you need to specify a public and a private client ID, and the directory where you want to store the data.)

<p  class="text-center"> <a href="recipes/compute_foursquare_nyc/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Get Foursquare data</a> </p>
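
To give an idea of what such an extraction involves, here is a minimal sketch that only builds a request URL. It assumes the legacy Foursquare v2 `venues/search` endpoint and its `client_id`, `client_secret`, `ll`, `radius`, `limit`, and `v` parameters; the function name, the default values, and the coordinates are hypothetical, and the actual recipe may work differently.

```python
from urllib.parse import urlencode

# Assumption: the legacy Foursquare v2 venue search endpoint.
SEARCH_URL = "https://api.foursquare.com/v2/venues/search"


def build_search_url(lat, lon, client_id, client_secret,
                     radius_m=250, version="20160101"):
    """Build a venue search URL around a point (no request is sent)."""
    params = {
        "ll": f"{lat},{lon}",       # latitude,longitude of the search center
        "radius": radius_m,         # search radius in meters
        "limit": 50,                # number of venues requested per call
        "client_id": client_id,     # public credential
        "client_secret": client_secret,  # private credential
        "v": version,               # API version date, YYYYMMDD
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


# Placeholder credentials; replace with your own before sending the request.
url = build_search_url(40.7831, -73.9712, "YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET")
print(url)
```

Covering all of Manhattan would mean repeating such a call over a grid of search centers and storing each response in the chosen directory.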

- You can [look at the raw data](datasets/foursquare_nyc_complete/explore/) or have a look at the steps we use to clean it in a prepare recipe:

<p  class="text-center"> <a href="recipes/compute_foursquare_iris_prepared/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Cleaning</a> </p>

- We then aggregate the data at the zone level with a visual grouping recipe.

<p  class="text-center"> <a href="recipes/group_poi_foursquare/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Aggregation: grouping recipe </a> </p>

## OSM part of the Flow
**Tag: osm_data**

- You can have a look at how we extract POI from OSM data. It is a rather complicated SQL script that checks the tags in the ways and nodes tables and assigns each element to one of several predefined categories.

<p  class="text-center"> <a href="recipes/compute_poi_osm/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Extract POI from OSM</a> </p>
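
The idea behind the tag-based categorization can be sketched in a few lines of Python. This is not the project's SQL script: the rule table and the category names below are assumptions chosen for illustration, built on common OSM tag keys such as `amenity`, `shop`, and `tourism`.

```python
# Hypothetical rules: (tag key, accepted values or None for "any value", category).
CATEGORY_RULES = [
    ("amenity", {"restaurant", "cafe", "fast_food", "bar"}, "food"),
    ("amenity", {"university", "college", "school"}, "education"),
    ("amenity", {"theatre", "cinema", "arts_centre"}, "culture"),
    ("tourism", {"museum", "gallery"}, "culture"),
    ("shop", None, "shop"),
]


def categorize(tags):
    """Return the category of the first matching rule, or None if no rule matches."""
    for key, values, category in CATEGORY_RULES:
        if key in tags and (values is None or tags[key] in values):
            return category
    return None


print(categorize({"amenity": "cafe", "name": "Example Cafe"}))
print(categorize({"shop": "bakery"}))
print(categorize({"highway": "residential"}))
```

Elements that match no rule (streets, for instance) are simply not treated as POI in this sketch.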

- Then, we do a geo-join between the POI we retrieved and the census block locations, and perform the same aggregation as for the Foursquare data. All of this is done in a single SQL query using **[PostGIS](http://www.postgis.net/)**, a very powerful PostgreSQL extension for geographical queries.

<p  class="text-center"> <a href="recipes/compute_heatmap_iris_osm/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Geo Join and Aggregation</a> </p>
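
Conceptually, the geo-join asks, for each POI, which census block polygon contains it. The project does this in PostGIS; the pure-Python sketch below is only a stand-in that uses a ray-casting point-in-polygon test. The square "blocks" and the POI coordinates are toy values, not real Manhattan geometry, and all names are hypothetical.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_block(poi, blocks):
    """Return the id of the first block containing the POI, or None."""
    for block_id, polygon in blocks.items():
        if point_in_polygon(poi["lon"], poi["lat"], polygon):
            return block_id
    return None


# Toy data: two adjacent unit squares and three points.
blocks = {
    "block_A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "block_B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
pois = [
    {"name": "cafe", "lon": 0.5, "lat": 0.5},
    {"name": "museum", "lon": 1.5, "lat": 0.25},
    {"name": "pier", "lon": 3.0, "lat": 0.5},
]
assignments = {poi["name"]: assign_block(poi, blocks) for poi in pois}
print(assignments)
```

A spatial database avoids this nested loop by using a spatial index, which is one reason to run the join in PostGIS rather than in application code.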

## Segmentation
**Tag: clustering data and model**

- We start with a visual join between our Foursquare and OSM data.

<p  class="text-center"> <a href="recipes/join_heatmap_nyc_osm/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Visual join</a> </p>
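
In code, this step amounts to joining two per-block feature tables on the block identifier. The sketch below assumes a full outer join with missing counts filled with zero, and assumes the two sources use distinct column names; the join type and the columns in the actual recipe may differ.

```python
def outer_join(left, right, fill=0):
    """Full outer join of two {block_id: {feature: value}} tables."""
    left_cols = sorted({col for row in left.values() for col in row})
    right_cols = sorted({col for row in right.values() for col in row})
    joined = {}
    for block_id in sorted(set(left) | set(right)):
        # Missing blocks or features are filled with the default value.
        row = {col: left.get(block_id, {}).get(col, fill) for col in left_cols}
        row.update({col: right.get(block_id, {}).get(col, fill) for col in right_cols})
        joined[block_id] = row
    return joined


# Hypothetical per-block counts from each source.
osm = {
    "block_A": {"osm_food": 3, "osm_shop": 1},
    "block_B": {"osm_food": 0, "osm_shop": 4},
}
fsq = {
    "block_A": {"fsq_food": 5},
    "block_C": {"fsq_food": 2},
}
joined = outer_join(osm, fsq)
for block_id, row in joined.items():
    print(block_id, row)
```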

- The data we will use for clustering is now complete:
<p  class="text-center"> <a href="datasets/heatmap_all_poi_nyc/explore/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />See the final dataset</a> </p>

- We perform clustering on this data. After several tries, we chose a K-means algorithm with 7 clusters. Using the Numerical Heatmap, we can characterize the clusters and rename them accordingly.

We find the following clusters:
1. Residential: fewer shops and restaurants than elsewhere
2. Residential with services: well connected, with shops and restaurants
3. Cultural & going out places: well connected, with lots of cultural places and clubs
4. Activity centers: high activity and a large number of events
5. Shopping areas: more shops and fewer cultural/going out places
6. Universities: well connected, with public services and university facilities
7. Randall's Island: alone in its cluster, characterized by parking lots and entertainment

<p  class="text-center"> <a href="savedmodels/3CPb8bSE/versions/" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Model</a> </p>
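
For readers unfamiliar with K-means, here is a minimal sketch of the standard Lloyd iteration in plain Python. It is not the model trained in this project: it uses two clusters on toy two-dimensional points instead of 7 clusters on the real features, applies no feature scaling, and all names are hypothetical.

```python
import math
import random


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Toy feature vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

On real data, the features would typically be rescaled first so that no single POI count dominates the distance, and the number of clusters would be chosen by comparing several runs, as was done here to arrive at 7.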

## Visualization

- After deploying our model in the flow and scoring the data, we are able to visualize the results on a chart. If you know Manhattan a little bit, you can have a look at the map we published on the dashboard to see how relevant this clustering is.

<p  class="text-center"> <a href="dashboards/IrII3VD_manhattan-clustering-map/view/ql2p0qS" class="btn
btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project"
class="btn-cta-big-mod-icon" />Dashboard</a> <br><br> We believe you'll find it to be pretty accurate!</p>