# Hands-On Tutorial: Working with Shapefiles and US Census Data in Dataiku

This brief tutorial introduces working with shapefiles and US Census data in Dataiku. It covers importing shapefiles, enriching them with demographic data from the US Census, and mapping the results.

The final project Flow is shown below. A completed version of the project, including the final choropleth, can be found on the Dataiku gallery.

## Technical Requirements

* The Reverse Geocoding / Admin maps plugin for producing maps with administrative boundaries

* The Census USA plugin for downloading US Census data

## Supporting Data

* The example data for this exercise are TIGER/Line Shapefiles from the US Census Bureau. They contain the official 2019 US county boundaries, along with some other attributes, and can be downloaded as a zip archive here.

We will also download US Census data through the Census USA plugin.

## The Shapefile Format

When working with spatial or geographic data, you will encounter many different file formats, such as **.geojson**, **.gpkg**, **.csv**, and **.tiff**. One of the most common, though, is the shapefile, a format originally developed by Esri.

Although often referred to as a single file, a shapefile is actually a collection of files, typically four (.shp, .shx, .dbf, and .prj), sometimes accompanied by others. The .shp file stores the geometries, .shx is a positional index, .dbf holds the attribute table, and .prj describes the coordinate reference system.

Together, these files can spatially describe vector features such as points, lines and polygons.
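Because a shapefile is really several files, it can be worth confirming that a zip archive contains all the expected parts before uploading it. The following is a minimal sketch using only the Python standard library; the function name `missing_shapefile_parts` is hypothetical, and the set of expected extensions is simply the four listed above.

```python
import zipfile
from pathlib import PurePosixPath

# Extensions named in this tutorial as the typical parts of a shapefile.
EXPECTED_EXTENSIONS = {".shp", ".shx", ".dbf", ".prj"}


def missing_shapefile_parts(archive):
    """Return the expected extensions that are absent from a zip archive.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        # Compare on the final suffix only, ignoring case and folder names.
        present = {PurePosixPath(name).suffix.lower() for name in zf.namelist()}
    return EXPECTED_EXTENSIONS - present
```

An empty set means every expected part is present. The check only inspects file names, so it says nothing about whether the contents are valid.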

## Importing Shapefiles

Dataiku provides built-in support for the shapefile format.

In a new blank project (we’ve named ours `Working with Shapefiles`), create a new dataset by uploading the downloaded zip archive to Dataiku.

* On the Format/Preview tab, ensure the selected Type is **Shapefile**; select it if it is not already set.

* Create the new dataset with the name `us_counties`.

## Exploring Shapefiles

After importing the dataset, the Explore tab shows a preview of the data in a tabular format.

The first column, *the\_geom*, specifies the dataset’s geometry.

* It is stored as a string, but Dataiku detects its meaning as Geometry. Each row represents a county, and its shape is stored as a multipolygon.

The second column, *shp\_srs*, specifies the dataset’s Spatial Reference System (SRS), also known as a Coordinate Reference System (CRS).

* A spatial reference system defines how the spatial elements of the data relate to the Earth’s surface. In this case, the dataset uses one of the most common geographic SRSs, EPSG:4269 (NAD83).
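To get a feel for what a geometry stored as a string contains, the sketch below computes the bounding box of a multipolygon. It assumes the string is in Well-Known Text (WKT) form with two-dimensional `x y` coordinate pairs, where x is longitude and y is latitude. The function name `wkt_bounding_box` is hypothetical, and the regular expression is a simplification rather than a full WKT parser.

```python
import re

# One signed decimal number, optionally with an exponent.
_NUM = r"-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
# An "x y" coordinate pair; pairs are separated by commas in WKT.
_PAIR = re.compile(rf"({_NUM})\s+({_NUM})")


def wkt_bounding_box(wkt):
    """Return (min_x, min_y, max_x, max_y) for a 2D WKT geometry string."""
    pairs = [(float(x), float(y)) for x, y in _PAIR.findall(wkt)]
    if not pairs:
        raise ValueError("no coordinate pairs found in WKT string")
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    return min(xs), min(ys), max(xs), max(ys)
```

For a dataset in a geographic SRS such as EPSG:4269, the result is expressed in degrees of longitude and latitude.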

## Shapefiles in a Visual Recipe

Once imported, a shapefile can be manipulated in Dataiku like any other dataset. Let’s use ours in a visual recipe.

* From the *us\_counties* dataset, initiate a **Sample/Filter** recipe.

* Name the output dataset `nj_counties`.

* Filter the dataset to keep only rows where *STATEFP* equals `34`, the FIPS code for the state of New Jersey.

After running the recipe, note that the output dataset now has 21 rows, one for each county in New Jersey.

Note

The same result could be achieved with a **Filter rows/cells on value** processor in a Prepare recipe.
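For readers who prefer to see the logic spelled out, the filter amounts to a simple comparison on the *STATEFP* column. The sketch below works on plain Python dictionaries rather than on a Dataiku dataset, and the function name `filter_by_state` is hypothetical. It compares FIPS codes as zero-padded text, since a state such as California has the code `06` and would lose its leading zero if treated as a number.

```python
def filter_by_state(rows, state_fips="34"):
    """Keep only the rows whose STATEFP matches the given two-digit FIPS code.

    `rows` is an iterable of dictionaries, one per county.
    """
    return [
        row
        for row in rows
        # Normalise to two-character text so that 6 and "06" compare equal.
        if str(row.get("STATEFP", "")).zfill(2) == state_fips
    ]
```

Rows without a *STATEFP* value are dropped rather than raising an error.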

## Downloading US Census Data

We now have a dataset where each row holds the shape of a county in New Jersey. So far, however, no demographic data is attached to the counties.

The Census USA plugin has a number of features relating to Census data, including an easy way to download data from the US Census Bureau.

* From the Flow, select **+ New Dataset > Census USA > US Census dataset**.

* For the “State”, provide `nj`. Ensure “State format” is **state\_2letters**.

* Select **ACS5YR2017** as “Census content” and **County** as “Census level”.

* The “Census field” is a string of variables (without spaces). Add `B00001_001E,B19013_001E` to retrieve data for total population and median household income, respectively.

* Click “Test & Get Schema”.

* Name the output dataset `nj_demo`.

Once this dataset has been created, we have the population and an estimate of the median household income for each county in the state.

Note

There are a number of ways to find out the code for a particular census variable. One way is by building the **US Census metadata** dataset in the Census USA plugin.
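Since the “Census field” value must be a comma-separated string with no spaces, a small helper can assemble it from a list of variable codes and catch typos early. This is only a sketch: the function name `build_census_field` is hypothetical, and the pattern is an assumption that covers codes shaped like the two used in this tutorial (a table ID, an underscore, a three-digit line number, and a letter suffix), not every kind of Census variable.

```python
import re

# Assumed shape of a variable code such as B19013_001E.
_CODE = re.compile(r"^[BC]\d{5}[A-Z]*_\d{3}[A-Z]{1,2}$")


def build_census_field(codes):
    """Join variable codes into a single comma-separated string without spaces."""
    cleaned = [code.strip().upper() for code in codes]
    invalid = [code for code in cleaned if not _CODE.match(code)]
    if invalid:
        raise ValueError(f"unexpected variable code(s): {invalid}")
    return ",".join(cleaned)
```

Stray whitespace and lowercase letters are normalised, while anything that does not look like a variable code raises an error instead of being passed on silently.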

## Enriching Shapefiles with Census Data

Before we can join our spatial and demographic data, we’ll do a few brief preparation steps on the demographic data.

* Initiate a **Prepare** recipe on *nj\_demo*.

* Rename the variable code columns so they are easier to remember.

  + Rename *B00001\_001E* to `population`.

  + Rename *B19013\_001E* to `med_household_income`.

* Edit the schema in the Settings tab so that *GEOID\_DKU* is stored as a string. Although its values look numeric, they are identifiers rather than quantities, so calculations on them would be meaningless. Moreover, the *GEOID* column of *nj\_counties* is stored as a string, and we’ll use these two columns as the join key.
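A practical reason to keep such identifiers as strings is that leading zeros carry meaning. A county GEOID is the two-digit state FIPS code followed by the three-digit county FIPS code, so a county in a state whose code starts with zero would be silently shortened if parsed as a number. The sketch below shows one way to normalise a value back to five characters; the function name `to_county_geoid` is hypothetical.

```python
def to_county_geoid(value):
    """Return a county identifier as a five-character, zero-padded string."""
    text = str(value).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county GEOID: {value!r}")
    # Restore any leading zero that was lost when the value was read as a number.
    return text.zfill(5)
```

New Jersey’s codes all start with `34`, so nothing is lost in this tutorial, but the same workflow applied to a state such as Alabama (`01`) would otherwise produce join keys that no longer match.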

We can now join the datasets of shapefiles and demographic information as we would any two datasets.

* In a **Join** recipe, left join *nj\_counties* with *nj\_demo\_prepared*, keeping *nj\_counties* as the left dataset.

* In the Join step, use the *GEOID* column of *nj\_counties* and the *GEOID\_DKU* column of *nj\_demo\_prepared* as the key.

* In the Selected columns step, retain only *population* and *med\_household\_income* from *nj\_demo\_prepared*.

Running the recipe should produce an output dataset with two additional columns, the population and the median household income for each county.

In this case, you can ignore the INPUT\_DATA\_VERY\_LONG warning. Dataiku is simply warning that the column holding the geometry of our shapefile is unusually large compared to a typical column.
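Conceptually, the left join described in this section keeps every county row and attaches the selected demographic columns wherever the keys match. The following plain-Python sketch illustrates that logic on lists of dictionaries. It is not how Dataiku implements the Join recipe; the function name `left_join` is hypothetical, and it assumes each key appears at most once in the right-hand dataset.

```python
def left_join(left_rows, right_rows, left_key, right_key, keep):
    """Left join two lists of dictionaries, retaining only `keep` from the right."""
    # Index the right-hand rows by their key for constant-time lookups.
    lookup = {row[right_key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = lookup.get(row[left_key], {})
        # Unmatched left rows are kept, with None in the added columns.
        extra = {column: match.get(column) for column in keep}
        joined.append({**row, **extra})
    return joined
```

Because the keys are compared for exact equality, a string key on one side and a numeric key on the other would never match, which is why the join columns are aligned to the same type beforehand.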

## Mapping Shapefiles

Now we can visualize the distribution of our demographic variables on a map.

* On the Charts tab of the *nj\_counties\_joined* dataset, create a **Filled Administrative** map.

* Drag *the\_geom* column to the Geo field. Adjust the level of detail from Country to Department/County.

* Drag the *population* or *med\_household\_income* column to the color droplet field.

* Adjust the color palette to your preference.
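A filled map like this one works by assigning each area a color according to its value. One common way to do that is to split the values into quantile classes, so that each color covers roughly the same number of counties. The sketch below illustrates the idea with the Python standard library; it is an assumption for illustration, not a description of how Dataiku assigns colors, and the function name `quantile_classes` is hypothetical.

```python
import bisect
import statistics


def quantile_classes(values, n_classes=4):
    """Assign each value a class index from 0 to n_classes - 1 by quantile."""
    # n_classes - 1 cut points split the data into n_classes groups.
    cuts = statistics.quantiles(values, n=n_classes)
    # The class index is the number of cut points at or below the value.
    return [bisect.bisect_right(cuts, value) for value in values]
```

Each class index can then be mapped to one color of a palette, with the lowest index for the smallest values and the highest for the largest.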

## What’s Next?

Congratulations! You’ve seen how to import, manipulate, and visualize shapefiles and US Census data in Dataiku.

Recall that you can find the completed version of this project in the Dataiku gallery.

Next, you might be interested in turning this map into a web app that lets end users toggle between variables.
