# Business Goal

The goal of the project is to analyse logs from [our website](http://www.dataiku.com). We want to generate the kind of reports that a typical web analytics tool (Google Analytics, for example) produces.

The original data was collected with the open-source web tracker [WT1](https://github.com/dataiku/wt1). It contains one month of data (March 2014). Note that the IP addresses have been anonymized (replaced with random values).

<br/>
# How we do this

To build this project, we have a single data source: the web logs stored in the [dataiku_com](dataset:DKU_LOGS.dataiku_com) dataset.

# Explore the project

We've uploaded our file-system dataset containing the original logs collected by WT1. The dataset is partitioned by day. Each line of the dataset is one page view from the website by one person. The "location" column contains the URL of the page visited.

<p class="text-center">
<a href="/projects/DKU_LOGS/datasets/dataiku_com/explore/"  class="btn btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project" class="btn-cta-big-mod-icon" />&nbsp;Input data</a><br/><br/>
</p>

After uploading the data, we parse the date, clean the data, and geo-locate the IP address column. We then clean the referrer column and create a 'category' column based on the URL.

<p class="text-center">
<a href="/projects/DKU_LOGS/recipes/compute_dataiku_com_cleaned/"  class="btn btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project" class="btn-cta-big-mod-icon" />&nbsp;Cleaning recipe</a><br/><br/>
</p>
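The cleaning steps above are done in a visual recipe, but the logic can be sketched in plain Python. This is only an illustration under assumptions: the field names `server_ts` and `referer`, the timestamp format, and the rule "category = first path segment of the URL" are hypothetical and not taken from the recipe itself. Only the `location` column is named in this project. Geo-location is left out because it needs an external IP database.

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def clean_row(row):
    """Return a cleaned copy of one raw page-view record (a dict)."""
    cleaned = dict(row)

    # Parse the timestamp string into a timezone-aware datetime.
    # The format below is an assumption about how the logs are stored.
    cleaned["server_ts"] = datetime.strptime(
        row["server_ts"], "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)

    # Clean the referrer: keep only its host name, or None when it is empty.
    referer = row.get("referer") or ""
    cleaned["referer_domain"] = urlparse(referer).netloc or None

    # Derive a category from the visited URL (hypothetical rule:
    # the first path segment, or "home" for the site root).
    segments = [s for s in urlparse(row["location"]).path.split("/") if s]
    cleaned["category"] = segments[0] if segments else "home"
    return cleaned


example = clean_row({
    "server_ts": "2014-03-01T10:15:30.000Z",
    "location": "http://www.dataiku.com/products/features/",
    "referer": "https://www.google.com/search?q=dataiku",
})
```

Here `example["category"]` is `"products"` and `example["referer_domain"]` is `"www.google.com"`.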

In the second step, we sync the data to a PostgreSQL database (so we can write SQL, and so our calculations and graphs run faster). Note that we also un-partition the dataset (it's easier for this project).

<p class="text-center">
<a href="/projects/DKU_LOGS/recipes/sync/"  class="btn btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project" class="btn-cta-big-mod-icon" />&nbsp;Sync recipe</a><br/><br/>
</p>

Then, our project splits into two branches. 

- In our first branch, we group our dataset by visitor_id using SQL, so we can get all the info for each of our visitors.

<p class="text-center">
<a href="/projects/DKU_LOGS/recipes/compute_agg_by_visitor/"  class="btn btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project" class="btn-cta-big-mod-icon" />&nbsp;SQL Group recipe</a><br/><br/>
</p>
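The project does this grouping in SQL. As a rough Python equivalent of a `GROUP BY visitor_id`, the sketch below keeps a page-view count and the first and last timestamps per visitor. The choice of aggregates and the `server_ts` field name are assumptions for illustration. The actual recipe may compute different columns.

```python
from collections import defaultdict


def aggregate_by_visitor(page_views):
    """Collapse page-view records into one summary row per visitor_id."""
    stats = defaultdict(
        lambda: {"page_views": 0, "first_seen": None, "last_seen": None}
    )
    for view in page_views:
        row = stats[view["visitor_id"]]
        row["page_views"] += 1  # COUNT(*)
        ts = view["server_ts"]
        if row["first_seen"] is None or ts < row["first_seen"]:
            row["first_seen"] = ts  # MIN(server_ts)
        if row["last_seen"] is None or ts > row["last_seen"]:
            row["last_seen"] = ts  # MAX(server_ts)
    return dict(stats)


# ISO 8601 strings compare correctly as plain text, so no parsing is needed here.
views = [
    {"visitor_id": "a", "server_ts": "2014-03-02T09:00:00"},
    {"visitor_id": "b", "server_ts": "2014-03-01T12:00:00"},
    {"visitor_id": "a", "server_ts": "2014-03-01T08:00:00"},
]
by_visitor = aggregate_by_visitor(views)
```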

Next, we use a visual Group recipe to get the first referrer per visitor_id. This tells us how each visitor first arrived on the site.

<p class="text-center">
<a href="/projects/DKU_LOGS/recipes/group_logs_on_sql_by_visitor_id/"  class="btn btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project" class="btn-cta-big-mod-icon" />&nbsp;Visual Group recipe</a><br/><br/>
</p>
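To make "first referrer per visitor" concrete, here is a minimal Python sketch of the same idea: order the page views by time and keep the referrer of the earliest one for each visitor. The `server_ts` and `referer` field names are assumptions. The visual recipe itself needs no code.

```python
def first_referrer_per_visitor(page_views):
    """Map each visitor_id to the referrer of that visitor's earliest page view."""
    first = {}
    # Walk the views in chronological order. setdefault only stores a value
    # the first time a visitor_id is seen, so later views are ignored.
    for view in sorted(page_views, key=lambda v: v["server_ts"]):
        first.setdefault(view["visitor_id"], view["referer"])
    return first


views = [
    {"visitor_id": "a", "server_ts": "2014-03-02T09:00:00", "referer": "twitter.com"},
    {"visitor_id": "a", "server_ts": "2014-03-01T08:00:00", "referer": "google.com"},
    {"visitor_id": "b", "server_ts": "2014-03-01T12:00:00", "referer": None},
]
entry_points = first_referrer_per_visitor(views)
```

A visitor whose first page view has no referrer (a direct visit) keeps `None`, even if later views have one.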

After this, we join those two datasets and clean the result so we can build a map from it (by creating one observation per geographic point).
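The join and the "one observation per geographic point" step can be sketched as follows. This is a simplified illustration, not the project's recipe: it assumes both inputs are keyed by `visitor_id` and that the per-visitor rows carry hypothetical `latitude` and `longitude` fields produced by the geo-location step.

```python
from collections import Counter


def visitors_per_point(visitor_stats, first_referrers):
    """Join two per-visitor mappings, then count visitors per (lat, lon) point."""
    # Left join on visitor_id: every visitor row gains a first_referer field.
    joined = {
        vid: {**stats, "first_referer": first_referrers.get(vid)}
        for vid, stats in visitor_stats.items()
    }

    # One observation per geographic point: drop rows without coordinates,
    # then count how many visitors share each point.
    points = Counter()
    for row in joined.values():
        if row.get("latitude") is None or row.get("longitude") is None:
            continue
        points[(row["latitude"], row["longitude"])] += 1
    return joined, dict(points)


stats = {
    "a": {"page_views": 3, "latitude": 48.85, "longitude": 2.35},
    "b": {"page_views": 1, "latitude": 48.85, "longitude": 2.35},
    "c": {"page_views": 2, "latitude": None, "longitude": None},
}
joined, points = visitors_per_point(stats, {"a": "google.com", "b": None})
```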

- In the other branch, we just group our data by day so we can get the number of visitors per day.

<p class="text-center">
<a href="/projects/DKU_LOGS/recipes/group_logs_visitors_per_day/"  class="btn btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project" class="btn-cta-big-mod-icon" />&nbsp;Visual Group recipe</a><br/><br/>
</p>
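As an illustration of this branch, the sketch below counts visitors per day in Python. It assumes that "number of visitors" means distinct `visitor_id` values and that the day can be read from the first ten characters of an ISO-formatted `server_ts` string. Both are assumptions made for the example.

```python
from collections import defaultdict


def visitors_per_day(page_views):
    """Count distinct visitor_id values for each calendar day."""
    seen = defaultdict(set)
    for view in page_views:
        day = view["server_ts"][:10]  # "YYYY-MM-DD" prefix of the timestamp
        seen[day].add(view["visitor_id"])
    # Return the days in chronological order.
    return {day: len(ids) for day, ids in sorted(seen.items())}


views = [
    {"visitor_id": "a", "server_ts": "2014-03-01T08:00:00"},
    {"visitor_id": "a", "server_ts": "2014-03-01T08:05:00"},
    {"visitor_id": "b", "server_ts": "2014-03-01T12:00:00"},
    {"visitor_id": "a", "server_ts": "2014-03-02T09:00:00"},
]
daily = visitors_per_day(views)
```

With this sample, visitor `a` is counted once on 1 March despite two page views, so `daily` holds 2 visitors for `2014-03-01` and 1 for `2014-03-02`.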

# Explore the insights

We created several charts from these datasets. All of them are available on the dashboard.

You can find:

- page views per location
- page views per day of the week
- top referrers
- daily visitors


<p class="text-center">
<a href="/projects/DKU_LOGS/dashboards/03cAXhk_web-logs-dashboard/view/xg5T4bv"  class="btn btn-datasets-color btn-cta-big-mod"><i class="icon-dku-sample_project" class="btn-cta-big-mod-icon" />&nbsp;Dashboard</a><br/><br/>
</p>

<br/>
# Related content

If you want to know more (we know you do), we recommend you look into these resources:

-  [Visual Window Analytic Functions](http://www.dataiku.com/learn/guide/visual/window/using-the-window-recipe.html)
-  [Enriching Web Logs](http://www.dataiku.com/learn/guide/visual/prepare/enrich-web-logs.html)
-  [Repartitioning a non-partitioned dataset](http://www.dataiku.com/learn/guide/other/partitioning/partitioning-redispatch.html)