# Hands-On Tutorial: Using PySpark in Dataiku

Of the many ways to interact with Spark, the DataFrames API, introduced in Spark 1.3, offers a convenient way to do data science on Spark with Python (through the PySpark module), as it emulates several functions from the widely used Pandas package. This short tutorial shows how to use it in Dataiku.

## Prerequisites

You have access to an instance of Dataiku with Spark enabled, and a working **installation of Spark, version 1.4+**.

We’ll use the MovieLens 1M dataset, which has three parts: ratings, movies, and users. Start by downloading the files and creating the corresponding datasets in Dataiku, then parse them with a Visual Data Preparation script to make them suitable for analysis.

You should end up with 3 datasets:

* movies

* users

* ratings
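
For reference, the raw MovieLens 1M files separate fields with `::`. The plain-Python sketch below shows the kind of splitting the preparation step has to perform. The helper `parse_line` and the column names `gender`, `age`, and `zip_code` are assumptions made for illustration; only `user_id` and `occupation` appear later in this tutorial.

```python
def parse_line(line, columns, sep="::"):
    """Split one raw MovieLens line into a dict keyed by column name."""
    values = line.rstrip("\n").split(sep)
    if len(values) != len(columns):
        raise ValueError("expected %d fields, got %d" % (len(columns), len(values)))
    return dict(zip(columns, values))


# Hypothetical column names for the users file (an assumption, see above).
USER_COLUMNS = ["user_id", "gender", "age", "occupation", "zip_code"]

# A sample line in the users file layout.
sample = "1::F::1::10::48067\n"
record = parse_line(sample, USER_COLUMNS)
print(record["user_id"], record["occupation"])
```

All values come out as strings here; in DSS, column types are set in the dataset schema rather than in code.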

## Creating DataFrames using PySpark and DSS APIs

As with regular Python, you can use Jupyter, which is embedded directly in DSS, to analyze your datasets interactively.

Go to the Notebook section from the DSS top navbar, click **New Notebook**, and choose Python. In the modal window that appears, select **Template: Starter code for processing with PySpark**.

You are taken to a new Jupyter notebook, conveniently filled with starter code.

Let’s start with loading our DSS datasets into an interactive PySpark session, and store them in DataFrames.

DSS does the heavy lifting in terms of “plumbing”, so loading the Datasets into DataFrames is as easy as typing the following lines of code in your Jupyter notebook:

§ # Dataiku and Spark Python APIs

§ import dataiku

§ import dataiku.spark as dkuspark

§ import pyspark

§ from pyspark.sql import SQLContext

§ # Load PySpark

§ sc = pyspark.SparkContext()

§ sqlContext = SQLContext(sc)

§ # Point to the DSS datasets

§ users\_ds = dataiku.Dataset("users")

§ movies\_ds = dataiku.Dataset("movies")

§ ratings\_ds = dataiku.Dataset("ratings")

§ # And read them as Spark DataFrames

§ users = dkuspark.get\_dataframe(sqlContext, users\_ds)

§ movies = dkuspark.get\_dataframe(sqlContext, movies\_ds)

§ ratings = dkuspark.get\_dataframe(sqlContext, ratings\_ds)

The hardest part is done. You can now work with your DataFrames through the regular Spark API.

## Exploring your DataFrames

A Spark DataFrame has several useful methods for inspecting its content. For instance, let’s have a look at the number of records in each dataset:

§ print("DataFrame users has %i records" % users.count())

§ print("DataFrame movies has %i records" % movies.count())

§ print("DataFrame ratings has %i records" % ratings.count())

DataFrame users has 6040 records, DataFrame movies has 3883 records, and DataFrame ratings has 1000209 records.

You may also want to look at the actual content of your dataset using the `.show()` method:

§ users.show()

You may want to only check the column names in your DataFrame using the columns attribute:

§ dfs = [users, movies, ratings]

§ for df in dfs:

§     print(df.columns)

The printSchema() method gives more details about the DataFrame’s schema and structure:

§ users.printSchema()

Note that the DataFrame schema is inherited directly from the DSS Dataset schema, which comes in very handy when you need to manage your datasets and their metadata centrally.

## Analyzing your DataFrames

Let’s have a look now at more advanced functions. Let’s start with merging the datasets together to offer a consolidated view.

§ a = ratings\

§ .join(users, ratings['user\_id']==users['user\_id'], 'inner')\

§ .drop(users['user\_id'])

§ complete = a\

§ .join(movies, a['movie\_id']==movies['movie\_id'], 'inner')\

§ .drop(movies['movie\_id'])

§ print(complete.count())

§ complete.show()

Let’s assume you need to rescale each user’s ratings by subtracting that user’s average rating:

§ from pyspark.sql import functions as spfun

§ # Computing the average rating by user

§ avgs = complete.groupBy('user\_id').agg(

§ spfun.avg('rating').alias('avg\_rating')

§ )

§ # Join again with initial data

§ final = complete\

§ .join(avgs, complete['user\_id']==avgs['user\_id'])\

§ .drop(avgs['user\_id'])

§ # Create a new column storing the rescaled rating

§ df = final.withColumn('rescaled\_rating', final['rating'] - final['avg\_rating'])
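
To make the arithmetic concrete, here is a small plain-Python sketch of the same two steps (per-user average, then subtraction) on a handful of made-up ratings. It does not use Spark and is only meant to show what the `rescaled_rating` column contains.

```python
from collections import defaultdict

# Toy (user_id, rating) pairs; values are made up for illustration.
ratings = [(1, 5), (1, 3), (2, 4), (2, 4), (2, 1)]

# Step 1: average rating per user, the quantity the groupBy/agg step computes.
totals = defaultdict(lambda: [0, 0])  # user_id -> [sum of ratings, count]
for user_id, rating in ratings:
    totals[user_id][0] += rating
    totals[user_id][1] += 1
avg_rating = {u: s / float(n) for u, (s, n) in totals.items()}

# Step 2: subtract each user's average from each of that user's ratings.
rescaled = [(u, r - avg_rating[u]) for u, r in ratings]
print(rescaled)
# → [(1, 1.0), (1, -1.0), (2, 1.0), (2, 1.0), (2, -2.0)]
```

User 1 averages 4.0 and user 2 averages 3.0, so each user's rescaled ratings sum to zero, which removes the effect of users who rate systematically high or low.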

How do the rescaled ratings differ by occupation code? Don’t forget that you are in a regular Jupyter / Python session, meaning that you can use non-Spark functionality to analyze your data.

§ import matplotlib.style

§ matplotlib.style.use('ggplot')

§ # Spark DataFrame

§ stats = df.groupBy('occupation').avg('rescaled\_rating').toPandas()

§ # Pandas dataframe

§ stats.columns = ['occupation', 'rescaled\_rating']

§ stats = stats.sort\_values('rescaled\_rating', ascending=True)

§ stats.plot(

§ kind='barh',

§ x='occupation',

§ y='rescaled\_rating',

§ figsize=(12, 8)

§ )
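
The rename step matters because Spark names the aggregated column `avg(rescaled_rating)`. The pandas-only sketch below reproduces the rename and sort on a small made-up frame standing in for the result of `.toPandas()`. It uses `sort_values`, the sorting method available in current pandas releases.

```python
import pandas as pd

# Toy stand-in for the result of .toPandas(); the values are made up.
stats = pd.DataFrame({
    "occupation": [0, 1, 2],
    "avg(rescaled_rating)": [0.02, -0.05, 0.01],
})

# Rename the Spark-generated aggregate column, then sort ascending.
stats.columns = ["occupation", "rescaled_rating"]
stats = stats.sort_values("rescaled_rating", ascending=True)
print(stats)
```

After sorting, the occupation with the lowest average rescaled rating comes first, which is the order the horizontal bar chart is drawn in.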

## Creating PySpark recipes to automate your workflow

Finally, once your interactive session is over and you are happy with the results, you may want to automate your workflow.

First, download your Jupyter notebook as a regular Python file on your local computer, from the File => Download as… function in the Notebook menu.

Go back to the Flow screen, click the **ratings** dataset, and choose **PySpark** in the right pane.

Select the 3 MovieLens datasets as inputs, and create a new output dataset called *agregates* on the machine filesystem.

In the recipe code editor, copy/paste the content of the downloaded Python file, and add the output dataset:

§ # -\*- coding: utf-8 -\*-

§ import dataiku

§ import dataiku.spark as dkuspark

§ import pyspark

§ from pyspark.sql import SQLContext

§ from pyspark.sql import functions as spfun

§ # Load PySpark

§ sc = pyspark.SparkContext()

§ sqlContext = SQLContext(sc)

§ # Point to the DSS datasets

§ users\_ds = dataiku.Dataset("users")

§ movies\_ds = dataiku.Dataset("movies")

§ ratings\_ds = dataiku.Dataset("ratings")

§ # And read them as Spark DataFrames

§ users = dkuspark.get\_dataframe(sqlContext, users\_ds)

§ movies = dkuspark.get\_dataframe(sqlContext, movies\_ds)

§ ratings = dkuspark.get\_dataframe(sqlContext, ratings\_ds)

§ # Analysis

§ a = ratings\

§ .join(users, ratings['user\_id']==users['user\_id'], 'inner')\

§ .drop(users['user\_id'])

§ complete = a\

§ .join(movies, a['movie\_id']==movies['movie\_id'], 'inner')\

§ .drop(movies['movie\_id'])

§ avgs = complete.groupBy('user\_id').agg(

§ spfun.avg('rating').alias('avg\_rating')

§ )

§ final = complete\

§ .join(avgs, complete['user\_id']==avgs['user\_id'])\

§ .drop(avgs['user\_id'])

§ df = final.withColumn('rescaled\_rating', final['rating'] - final['avg\_rating'])

§ stats = df.groupBy('occupation').avg('rescaled\_rating')

§ # Output datasets

§ agregates = dataiku.Dataset("agregates")

§ dkuspark.write\_with\_schema(agregates, stats)

Click the green **Run** button. The Spark job launches and should complete successfully (check your job’s logs to make sure everything went fine).

Your flow is now complete.

Using PySpark and Spark’s DataFrame API in DSS is straightforward. This opens up great opportunities for doing data science in Spark and for creating large-scale, complex analytical workflows.
