# Web Logs Analysis

Contents

* Overview

* Create the Project

* Preparing the Web Logs Data

* Referrer Analysis

* Visitor Analysis

* Wrap-up

## Overview

### Business Case

The customer team at Dataiku is interested in using the website logs to perform two kinds of analysis:

* **Referrer Analysis**

+ determine how visitors get to our website

+ identify the top referrers, both in terms of volume (number of visitors) and depth (number of pages linked)

* **Visitor Analysis**

+ segment visitors according to how they engage with the website

+ map these segments to known customers in order to feed them into the right channels (Marketing, Prospective Sales, and Sales)

### Supporting Data

This use case is based on two input data sources. The downloadable archives are found below:

* Web Logs: The Dataiku website logs, spanning 2 months, that contain information about each individual pageview on the website.

* CRM: A simulated Customer Relationship Management (CRM) database containing transactional and demographic data about our clients.

### Workflow Overview

The final Dataiku DSS workflow should look like the image below. You can also follow along with the completed project in the Dataiku gallery.

You will go through the following high-level steps:

* Upload the datasets

* Clean up and enrich the log data

* Use visual grouping recipes

* Run a clustering model to build segments

* Join the CRM and segment data for known visitors

* Customize and split dataset by segments

### Prerequisites

You should be familiar with:

* The Basics courses

* Machine learning in Dataiku DSS

### Technical Requirements

* The Reverse Geocoding / Admin maps plugin is required to produce administrative maps.

## Create the Project

Create a new Dataiku DSS project and name it `Web Logs Analytics`.

## Preparing the Web Logs Data

Create a new UploadedFiles dataset from the web logs data (LogsDataiku.csv.gz) and name it `LogsDataiku`.

The dataset is already quite clean so the data preparation steps will focus mostly on enrichment and feature engineering. Create a **Prepare** recipe with *LogsDataiku* as the input and add the following steps to the script.

Hint

Clicking the down arrow to the right of the column header brings quick access to the most common data preparation steps for the column’s associated meaning. Many of the required steps below can be executed from this window.

* **Cleaning**

+ Remove four columns: *br\_width*, *br\_height*, *sc\_width* and *sc\_height*

+ Rename the column *client\_addr* to *ip\_address*

+ Clear invalid cells in the *ip\_address* column for the IP address meaning

+ Rename the column *location* to *url*

* **Feature engineering dates, locations, and user agents**

+ Parse the *server\_ts* column into a new Date column, *server\_ts\_parsed*, with format `yyyy-MM-dd'T'HH:mm:ss.SSS`

+ Extract four date components from *server\_ts\_parsed*, naming them *month*, *day*, *day\_of\_week*, and *hour*

+ Geo-locate the *ip\_address* column, extracting the country, city and GeoPoint with the *ip\_address\_* prefix

+ Classify, or enrich, the *user\_agent* column

+ Remove *user\_agent* and all of the enriched columns, with the exception of *user\_agent\_type*

Note

Note that parsing a date into the existing column will change the column’s meaning, but not its storage type. Parsing a date into a new column, on the other hand, changes both type and meaning. This will have consequences if you wish to perform mathematical operations on the column. For a more detailed guide on managing dates, please see the reference documentation.
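To make the date steps concrete, here is a minimal Python sketch of the same parsing and extraction outside of Dataiku. The sample timestamp is made up for illustration, and the ISO numbering used for *day\_of\_week* (Monday = 1) is an assumption that may not match the numbering the processor produces.

```python
from datetime import datetime

# Python equivalent of the Java-style pattern yyyy-MM-dd'T'HH:mm:ss.SSS
SERVER_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"


def extract_date_components(server_ts):
    """Parse a raw timestamp string and return the four derived components."""
    parsed = datetime.strptime(server_ts, SERVER_TS_FORMAT)
    return {
        "server_ts_parsed": parsed,
        "month": parsed.month,
        "day": parsed.day,
        # Assumption: ISO numbering, Monday=1 ... Sunday=7.
        "day_of_week": parsed.isoweekday(),
        "hour": parsed.hour,
    }


# Hypothetical sample value, not taken from the actual logs.
components = extract_date_components("2014-03-28T14:05:09.123")
print(components["month"], components["day"], components["day_of_week"], components["hour"])
```

Because the parsed value is a real date object rather than a string, arithmetic such as subtracting two timestamps becomes possible, which is the practical consequence the note above refers to.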

* **Feature engineering the Dataiku URLs**

+ Split URL in *url*, extracting only the path

+ Split *url\_path* on `/` and select **Truncate** so that it keeps only the first output column starting from the beginning

+ Fill empty cells of *url\_path\_0* with `home` using the “Fill empty cells with fixed value” processor

+ Create dummy columns from values of *url\_path\_0* using the Unfold processor. This step creates new columns representing the section of the website the visitor was on with a “1”.

+ Remove the *url\_path* and *url\_path\_0* columns

* **Feature engineering the referrer URLs**

+ Split URL in *referer*, extracting only the hostname

+ Use the Find and replace processor on *referer\_host*, replacing `t.co` with `twitter.com` and matching on the complete value of the string

+ In the same column, replace `www.` with an empty expression (i.e. no value), matching on substring

+ Once more for *referer\_host*, replace `\..*` with an empty expression, matching on regular expression. This step allows us to later put all traffic from the local Google domains under a single group.

+ Reduce clutter by removing eight more columns: *server\_ts*, *referer*, *type*, *visitor\_params*, *session\_params*, *event\_params*, *br\_lang*, and *tz\_off*

Run all 19 steps of the recipe, updating the schema to 21 columns. We now have a clean, enriched dataset containing information about our website visits!
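For readers who want to see what the URL steps compute, the following Python sketch approximates them with the standard library. It is an illustration under assumptions, not how the Prepare processors are implemented: the function names and sample URLs are hypothetical, and the sketch writes `0` in non-matching indicator cells, which may differ from how the Unfold processor fills them.

```python
import re
from urllib.parse import urlsplit


def url_section(url):
    """Return the first path segment of a URL, or 'home' when the path is empty."""
    path = urlsplit(url).path
    first_segment = path.strip("/").split("/")[0]
    return first_segment or "home"


def section_indicators(section, known_sections):
    """Build one indicator column per known section (similar in spirit to Unfold)."""
    return {name: 1 if name == section else 0 for name in known_sections}


def normalize_referer_host(referer):
    """Reduce a referrer URL to a short host label such as 'google' or 'twitter'."""
    host = urlsplit(referer).hostname or ""
    if host == "t.co":  # complete-value match
        host = "twitter.com"
    host = host.replace("www.", "")  # substring match
    return re.sub(r"\..*", "", host)  # regex match: drop everything from the first dot


# Hypothetical sample URLs for illustration only.
print(url_section("http://www.dataiku.com/blog/2014/some-post"))
print(url_section("http://www.dataiku.com/"))
print(section_indicators("blog", ["blog", "home", "products"]))
print(normalize_referer_host("https://www.google.fr/search?q=dataiku"))
print(normalize_referer_host("http://t.co/abc123"))
```

Dropping everything after the first dot is what lets hosts such as `google.fr` and `google.com` collapse into the same `google` group.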

### Charts for the Prepared Logs

Now that our data has been cleaned and enriched, three charts can help guide our initial analysis.

* **Distribution of visits by day of week**

+ Create a Histogram with *Count of records* on the Y-axis and *server\_ts\_parsed* on the X-axis, with “Day of week” as the date range.

+ The resulting chart shows that the number of visits is highest on Friday and Monday, lowest on the weekend, and visits steadily decline from Monday through Thursday before spiking on Friday.

* **Daily timeline of visits from France and the US**

+ Create a Lines chart with *Count of records* on the Y-axis and *server\_ts\_parsed* on the X-axis, with “Day” as the date range.

+ Drag *ip\_address\_country* to the “And” subgroup field; plus add a filter on *ip\_address\_country* with only France and the United States selected.

+ The resulting chart suggests that, during the available period, the number of visits on any given day is highly variable; visits from France tend to outnumber those from the US; and aside from a spike on 2014-03-28, the number of daily visits is under 400.

* **Choropleth of visits by country of origin**

+ Create a **Filled Administrative Map** with *ip\_address\_geopoint* providing the geographic information and *Count of records* providing the color details.

+ Changing the selected color palette to a multi-color palette, such as Viridis, and the color scale mode to Logarithmic can help differentiate the countries.

+ The resulting chart shows that after France and the United States, the most visitors come from the United Kingdom, India, and Canada.

## Referrer Analysis

We want to identify the top referrers to the Dataiku website in terms of volume (i.e. number of pageviews and distinct visitors), as well as their level of engagement (i.e. number of distinct Dataiku URLs). In order to achieve this, we need to group the dataset by unique values within the *referer\_host* column.

From the *LogsDataiku\_prepared* dataset, create a new **Group** recipe with *referer\_host* as the column to group by. Keep the default output name `LogsDataiku_prepared_by_referer_host`.

At the Group step, keep **Compute count for each group** selected and add the following per-field aggregations:

* For *server\_ts\_parsed*: Min, Max

* For *visitor\_id* and *url*: Distinct

Run the recipe, updating the schema to six columns.
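As a rough illustration of what this Group recipe computes, the aggregation can be sketched in plain Python. The sample rows are invented, and the output names *server\_ts\_parsed\_min* and *server\_ts\_parsed\_max* are assumed by analogy with *visitor\_id\_distinct* and *url\_distinct*.

```python
from collections import defaultdict


def group_by_referer_host(rows):
    """Aggregate pageview rows into one row per referer_host."""
    groups = defaultdict(lambda: {"count": 0, "ts": [], "visitors": set(), "urls": set()})
    for row in rows:
        group = groups[row["referer_host"]]
        group["count"] += 1                      # Compute count for each group
        group["ts"].append(row["server_ts_parsed"])
        group["visitors"].add(row["visitor_id"])  # feeds the Distinct aggregation
        group["urls"].add(row["url"])
    result = [
        {
            "referer_host": host,
            "count": group["count"],
            "server_ts_parsed_min": min(group["ts"]),
            "server_ts_parsed_max": max(group["ts"]),
            "visitor_id_distinct": len(group["visitors"]),
            "url_distinct": len(group["urls"]),
        }
        for host, group in groups.items()
    ]
    # Largest referrers first, as in the charts that follow.
    return sorted(result, key=lambda r: r["count"], reverse=True)


# Hypothetical sample rows; ISO-formatted strings sort chronologically.
sample_rows = [
    {"referer_host": "google", "server_ts_parsed": "2014-03-01T10:00:00.000", "visitor_id": "v1", "url": "/"},
    {"referer_host": "google", "server_ts_parsed": "2014-03-02T09:30:00.000", "visitor_id": "v2", "url": "/blog"},
    {"referer_host": "google", "server_ts_parsed": "2014-03-03T09:30:00.000", "visitor_id": "v1", "url": "/blog"},
    {"referer_host": "twitter", "server_ts_parsed": "2014-03-02T12:00:00.000", "visitor_id": "v3", "url": "/"},
]
by_referer = group_by_referer_host(sample_rows)
for row in by_referer:
    print(row)
```

Each output row carries exactly six fields, which matches the six-column schema expected after running the recipe.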

In the output dataset, click on the **Charts** tab and create the visualizations described below. Later, we could publish these charts to a dashboard.

* **Pageviews, distinct visitors and distinct URLs per referrer**.

+ Create a pivot table by dragging *referer\_host* to the rows and *count*, *visitor\_id\_distinct*, and *url\_distinct* as contents.

+ *referer\_host* should be sorted by descending order of *count*.

+ Keep the AVG aggregation for all of the contents.

* **Number of pageviews per referrer (excluding Dataiku, Google, and No value).**

+ Create a bar chart with *count* on the X-axis and *referer\_host* on the Y-axis.

+ *count* should have the SUM aggregation, and *referer\_host* should be sorted by descending sum of *count*.

+ Add *referer\_host* as a filter, excluding dataiku, google, and No value. Under the Display menu, check the box **Show horizontal axis**.

+ The resulting chart shows that “journaldunet” is the largest single referrer by a wide margin.

## Visitor Analysis

Now let’s attempt to segment visitors into categories and direct customers to the most appropriate channel. Our visitor analysis has the following high-level steps:

* Group visits by unique visitors

* Segment visitors using a clustering model

* Join the cluster labels with customer data

* Send the segmented data to appropriate channels for further engagement

### Grouping Visitors

Using a similar technique as for referrers, we will now examine the behavior of website visitors across time (sessions).

Returning to the *LogsDataiku\_prepared* dataset, create a new **Group** recipe with *visitor\_id* as the column to group by. Keep the default output name, `LogsDataiku_prepared_by_visitor_id`.

At the Group step, keep **Compute count for each group** selected, and add the following per-field aggregations:

* For *day*, *day\_of\_week*, *hour*, *session\_id*, and *url*: Distinct

* For *ip\_address\_country* and *user\_agent\_type*: First

* For *blog*, *applications*, *home*, *company*, *products*, and *data-science*: Sum

Run the recipe, updating the schema to 15 columns.

Now for each unique visitor, we know information such as their country (derived from the IP address), their number of pageviews and sessions, how many distinct pages and which sections of the website they visited, and their device type (browser vs. mobile).
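The First and Sum aggregations used here can be sketched in Python as follows. This is only an illustration with invented rows and a reduced set of section columns; the output names follow the `_first`, `_sum`, and `_distinct` suffixes that appear elsewhere in this use case.

```python
def group_by_visitor(rows, section_columns=("blog", "products")):
    """Aggregate pageview rows into one row per visitor_id."""
    visitors = {}
    sessions = {}  # visitor_id -> set of session ids, used for the Distinct count
    for row in rows:
        vid = row["visitor_id"]
        if vid not in visitors:
            visitors[vid] = {
                "visitor_id": vid,
                "count": 0,
                # First: keep the value from the first row seen for this visitor.
                "user_agent_type_first": row["user_agent_type"],
            }
            for col in section_columns:
                visitors[vid][col + "_sum"] = 0
            sessions[vid] = set()
        agg = visitors[vid]
        agg["count"] += 1
        sessions[vid].add(row["session_id"])
        for col in section_columns:
            # Sum: indicator cells may be missing, so treat them as 0.
            agg[col + "_sum"] += row.get(col) or 0
    for vid, agg in visitors.items():
        agg["session_id_distinct"] = len(sessions[vid])
    return list(visitors.values())


# Hypothetical sample rows for illustration only.
visit_rows = [
    {"visitor_id": "v1", "session_id": "s1", "user_agent_type": "browser", "blog": 1},
    {"visitor_id": "v1", "session_id": "s2", "user_agent_type": "browser", "products": 1},
    {"visitor_id": "v1", "session_id": "s2", "user_agent_type": "browser", "products": 1},
    {"visitor_id": "v2", "session_id": "s3", "user_agent_type": "mobile", "blog": 1},
]
by_visitor = group_by_visitor(visit_rows)
for row in by_visitor:
    print(row)
```

Summing the section indicators turns per-pageview flags into per-visitor counts, which is what makes columns such as *products\_sum* usable as engagement features later on.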

### Clustering Web Visitors

Let’s use this data to cluster visitors into certain categories that may help our colleagues in Marketing and Sales to be more targeted in their outreach efforts.

From the output dataset grouped by *visitor\_id*, go into the **Lab** and create a **Quick Model** under **Visual analysis**.

Then choose a **Clustering** task and, using the Quick models style, select a K-Means clustering model. Keep the default analysis name.

Click **Train** to train the first model. You do not need to provide a session name or description.

Dataiku will recognize the data types and use Machine Learning best practices when training a default clustering model. Expect the final cluster profiles to vary because of sampling effects, but there will be five primary clusters, plus a sixth cluster containing observations that do not fit any of the primary clusters.
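For intuition about what K-Means does with the grouped visitor data, here is a bare-bones Python sketch of the standard Lloyd's iteration (assign each point to its nearest centroid, then recompute centroids). It is not Dataiku's implementation: any feature preprocessing, handling of categorical columns, and the extra cluster for observations that fit no primary cluster are all left out, and the two-feature sample points are invented.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return labels, centroids


# Hypothetical (count, url_distinct) pairs: three casual and three heavy visitors.
visitors = [(1, 1), (2, 1), (1, 2), (40, 30), (42, 28), (38, 31)]
labels, centroids = kmeans(visitors, k=2)
print(labels)
print(centroids)
```

Cluster numbers are arbitrary and depend on the random initialization, which is one reason the cluster profiles you obtain may differ from run to run and need to be renamed by hand.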

Clicking on the model name, here **KMeans (k=5)**, brings up the Summary tab, where you can find a wide array of model metrics and information. In the **Summary** tab, click the pencil icon next to each default cluster name to rename the clusters according to the suggestions below:

| Cluster | Larger-than-usual numbers of: |

| --- | --- |

| US visitors | Americans |

| French visitors | French |

| Frequent visitors | Distinct visits |

| Engaged visitors | Distinct URLs visited |

| Sales prospects | Visits to the products page |

From the button at the top right, **Deploy** the model to the Flow as a retrainable model.

Apply it to the *LogsDataiku\_prepared\_by\_visitor\_id* dataset. Keep the default model name.

Selecting the model from the Flow, initiate an **Apply** recipe. Choose *LogsDataiku\_prepared\_by\_visitor\_id* as the input.

Name the output dataset `LogsDataiku_segmented`.

Create and run the recipe.

Now, one column, *cluster\_labels*, has been added to the output *LogsDataiku\_segmented*.

### Joining the Clusters and Customer Data

In order to map these segments to known customers, we’ll need our customer data.

If you have not already done so, create a new UploadedFiles dataset from the customer data (CRM.csv.gz) and name it `CRM`.

From *LogsDataiku\_segmented*, initiate a **Join** recipe, adding *CRM* as the second input. Keep the default output, *LogsDataiku\_segmented\_joined*.

In the **Join** step, *visitor\_id* should be automatically recognized as the join key. Change the type of join to an **Inner Join** to keep only records where a customer can be successfully matched with website visits.

After completing the join, expect an output of 5602 rows and 35 columns.
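The effect of choosing an inner join can be sketched in a few lines of Python. The rows and the CRM column name `company` are hypothetical; only *visitor\_id* and *cluster\_labels* come from this use case.

```python
def inner_join(left_rows, right_rows, key="visitor_id"):
    """Keep only left rows whose key also appears on the right, merging columns."""
    right_index = {}
    for row in right_rows:
        right_index.setdefault(row[key], []).append(row)
    joined = []
    for left in left_rows:
        # Visitors without a CRM match produce no output row at all.
        for right in right_index.get(left[key], []):
            joined.append({**right, **left})
    return joined


# Hypothetical sample data for illustration only.
segmented = [
    {"visitor_id": "v1", "cluster_labels": "US visitors"},
    {"visitor_id": "v2", "cluster_labels": "Sales prospects"},
    {"visitor_id": "v3", "cluster_labels": "Engaged visitors"},
]
crm = [
    {"visitor_id": "v1", "company": "Acme"},
    {"visitor_id": "v3", "company": "Globex"},
]
joined_rows = inner_join(segmented, crm)
for row in joined_rows:
    print(row)
```

Anonymous visitors such as `v2` in this sample are dropped, which is why the joined dataset is much smaller than the segmented one.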

### Customizing and Activating Segments

Now we want to feed these customers into the right channel: Marketing, Prospective Sales, or Sales.

Create a **Prepare** recipe from the *LogsDataiku\_segmented\_joined* dataset. Keep the default output name.

Add a new step with the Formula processor creating a new column, *Cluster\_Final*, using the expression defined below:

§ if(cluster_labels == "US visitors" || cluster_labels == "French visitors", "Marketing" +

§ if(user_agent_type_first == "browser", " - browser", " - mobile"),

§ if(applications_sum >= 2 || products_sum >= 2, "Sales", "Sales prospecting"))

Run the recipe, updating the schema to 36 columns.
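To check your understanding of the nested `if` expression, the same decision logic can be written as a small Python function. This is a sketch for reasoning about the rules, not a replacement for the Formula step, and it assumes the two `_sum` columns always hold numbers.

```python
def cluster_final(row):
    """Mirror the Formula step: map a joined row to its target channel."""
    if row["cluster_labels"] in ("US visitors", "French visitors"):
        # Geographic clusters go to Marketing, split by device type.
        device = " - browser" if row["user_agent_type_first"] == "browser" else " - mobile"
        return "Marketing" + device
    # Everyone else is routed by engagement with applications or products pages.
    if row["applications_sum"] >= 2 or row["products_sum"] >= 2:
        return "Sales"
    return "Sales prospecting"


# Hypothetical sample row for illustration only.
example = {
    "cluster_labels": "Engaged visitors",
    "user_agent_type_first": "browser",
    "applications_sum": 0,
    "products_sum": 3,
}
print(cluster_final(example))
```

Note that any *user\_agent\_type\_first* value other than `browser` falls into the mobile branch, so the expression yields exactly four possible values.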

Now let’s split the dataset on this newly created variable so we can ship the smaller subsets to the right teams.

From the output dataset, *LogsDataiku\_segmented\_joined\_prepared*, create a **Split** recipe that sends each of the four values of Cluster\_Final to individual datasets. Use the names:

* `Send_to_Sales`

* `Send_to_Sales_Prospecting`

* `Send_to_Marketing_browser`

* `Send_to_Marketing_mobile`

Choose **Map values of a single column** as the Splitting method.

Choose *Cluster\_Final* as the column on which to split.

Add the values and outputs according to the screenshot below.

Run the recipe.
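Conceptually, the Split recipe routes each row to one output based on its *Cluster\_Final* value. The Python sketch below illustrates this; the value-to-dataset mapping is an assumption inferred from the dataset names and the four values the formula can produce, so adjust it if your configuration differs.

```python
# Assumed mapping from Cluster_Final values to output dataset names.
OUTPUT_BY_VALUE = {
    "Sales": "Send_to_Sales",
    "Sales prospecting": "Send_to_Sales_Prospecting",
    "Marketing - browser": "Send_to_Marketing_browser",
    "Marketing - mobile": "Send_to_Marketing_mobile",
}


def split_by_cluster_final(rows):
    """Route each row to the output list that matches its Cluster_Final value."""
    outputs = {name: [] for name in OUTPUT_BY_VALUE.values()}
    for row in rows:
        # An unmapped value raises KeyError, which surfaces configuration gaps early.
        outputs[OUTPUT_BY_VALUE[row["Cluster_Final"]]].append(row)
    return outputs


# Hypothetical sample rows for illustration only.
prepared = [
    {"visitor_id": "v1", "Cluster_Final": "Sales"},
    {"visitor_id": "v2", "Cluster_Final": "Marketing - mobile"},
    {"visitor_id": "v3", "Cluster_Final": "Sales"},
]
split_outputs = split_by_cluster_final(prepared)
for name, rows in split_outputs.items():
    print(name, len(rows))
```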

These newly generated datasets are immediately usable by customer-facing team members to send targeted emails!

## Wrap-up

Congratulations! We created an end-to-end workflow to build datasets as collateral for colleagues in Sales and Marketing. Thank you for your time working through this use case.

Remember, you can always refer to the completed project in the Dataiku gallery.
