# Using Snowpark Python in Dataiku: basics

Pre-requisites

* Dataiku >= 10.0.7

* A Snowflake connection with Datasets containing the NYC Taxi trip and zone data, referred to as `NYC_trips` and `NYC_zones`.

* A Python 3.8 code environment with the `snowflake-snowpark-python[pandas]` package installed.

## What is Snowpark?

Snowpark is a set of libraries to programmatically access and process data in Snowflake using languages like Python, Java or Scala. It allows the user to manipulate *DataFrames* similarly to Pandas or PySpark. The Snowflake documentation provides more details on how Snowpark works under the hood.

In this tutorial, you will work with the `NYC_trips` and `NYC_zones` Datasets to discover a few features of the Snowpark Python API and how they can be used within Dataiku to:

* Facilitate reading and writing Snowflake Datasets.

* Perform common data transformations.

* Leverage User Defined Functions (UDFs).

## Creating a Session

Whether using Snowpark Python in a Python recipe or notebook, you’ll first need to create a Snowpark Session.

A Session object is used to establish a connection with a Snowflake database. Normally, this Session would need to be instantiated with credentials, such as the user id and password, provided manually. However, the `get_session()` method reads all the necessary parameters from the Snowflake connection in DSS, so you do not have to handle credentials yourself.

Start by creating a Jupyter notebook with the code environment mentioned in the pre-requisites and instantiate your Session object:

§ from dataiku.snowpark import DkuSnowpark

§ sp = DkuSnowpark()

§ # Replace with the name of your Snowflake connection

§ session = sp.get_session(connection_name="YOUR-CONNECTION-NAME")

## Loading data into a DataFrame

Before working with the data, you first need to *load it from a Snowflake table into a Snowpark Python DataFrame*. With your `session` variable, create a Snowpark DataFrame in one of the following ways:

### Option 1: with the Dataiku API

The easiest way to obtain a Snowpark DataFrame is to call the `get_dataframe()` method and pass it a `dataiku.Dataset` object. The `get_dataframe()` method can optionally be given a Snowpark Session argument. If no session is passed, Dataiku reuses the session created above or creates a new one.

§ import dataiku

§ NYC_trips = dataiku.Dataset("NYC_trips")

§ df_trips = sp.get_dataframe(dataset=NYC_trips)

### Option 2: with a SQL query

Using the `session` object, a DataFrame can be created from a SQL query.

§ # Get the name of the dataiku.Dataset's underlying Snowflake table.

§ trips_table_name = NYC_trips.get_location_info().get('info', {}).get('table')

§ df_trips = session.sql(f"SELECT * FROM {trips_table_name}")

Unlike Pandas DataFrames, Snowpark Python DataFrames are lazily evaluated. This means that they, and any subsequent operation applied to them, are not immediately executed.

Instead, they are recorded in a Directed Acyclic Graph (DAG) that is evaluated only when certain methods are called (`collect()`, `take()`, `show()`, `toPandas()`).

This lazy evaluation minimizes traffic between the Snowflake warehouse and the client as well as client-side memory usage.
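The idea of lazy evaluation can be illustrated without a Snowflake connection. The following is a minimal sketch in plain Python: `LazyFrame` is a hypothetical toy class, not part of the Snowpark API, and it records a linear list of operations rather than a full DAG. Nothing is executed until `collect()` is called.

```python
class LazyFrame:
    """Toy lazily evaluated frame (hypothetical; NOT the Snowpark API)."""

    def __init__(self, rows, plan=()):
        self._rows = rows          # source data, a list of dicts
        self._plan = tuple(plan)   # recorded operations, not yet executed

    def filter(self, predicate):
        # Record the operation and return a new frame; no row is touched here.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def with_column(self, name, fn):
        # Same principle: only the intent to add a column is recorded.
        return LazyFrame(self._rows, self._plan + (("with_column", (name, fn)),))

    def collect(self):
        # Only now is the recorded plan executed, step by step.
        rows = [dict(row) for row in self._rows]
        for op, arg in self._plan:
            if op == "filter":
                rows = [row for row in rows if arg(row)]
            else:
                name, fn = arg
                rows = [{**row, name: fn(row)} for row in rows]
        return rows


calls = []  # records every time the predicate is actually evaluated


def is_expensive(row):
    calls.append(row["fare_amount"])
    return row["fare_amount"] > 10


lazy = LazyFrame([{"fare_amount": 5.0}, {"fare_amount": 15.0}]).filter(is_expensive)
calls_before_collect = list(calls)  # still empty: nothing has run yet
result = lazy.collect()             # triggers the evaluation
```

In Snowpark the recorded plan is translated to SQL and executed in the Snowflake warehouse, whereas this toy version runs locally; the point is only that building the frame and evaluating it are separate steps.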

## Retrieving rows

* The `take(n)` method pulls **n** rows from the Snowpark DataFrame and returns them as a list of `Row` objects. It is arguably not the most pleasant way of checking a DataFrame’s content.

§ # Retrieve 5 rows

§ df_trips.take(5)

* The `toPandas()` method converts the Snowpark DataFrame into a more readable Pandas DataFrame. Avoid this method if the data is too large to fit in memory; use the `to_pandas_batches()` method instead. Alternatively, apply a `limit()` before retrieving the results as a Pandas DataFrame.

§ df_trips.limit(5).toPandas()
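When the result is too large for `toPandas()`, an aggregate can be accumulated batch by batch. The sketch below assumes each batch can be indexed by column name and then iterated, which holds for the Pandas DataFrames yielded by `to_pandas_batches()` and for the plain dictionaries used here as stand-ins. `mean_over_batches` is a hypothetical helper, not a Dataiku or Snowpark function.

```python
def mean_over_batches(batches, column):
    """Compute the mean of `column` while holding one batch in memory at a time."""
    total = 0.0
    count = 0
    for batch in batches:
        for value in batch[column]:
            # Skip missing values: None, or NaN (the only value not equal to itself).
            if value is None or value != value:
                continue
            total += float(value)
            count += 1
    return total / count if count else None


# Stand-in batches; with Snowpark they would come from df_trips.to_pandas_batches().
batches = [
    {"fare_amount": [10.0, 20.0]},
    {"fare_amount": [30.0, None]},
]
mean_fare = mean_over_batches(batches, "fare_amount")
```

With real Pandas batches, the inner loop would more idiomatically be replaced by `batch[column].sum()` and `batch[column].count()`. For an aggregate this simple, computing it in Snowflake remains preferable.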

## Common operations

The following paragraphs illustrate a few examples of basic data manipulation using DataFrames:

### Selecting column(s)

Snowflake stores unquoted column names in uppercase. Be sure to use double quotes for case-sensitive column names. Using the `select` method returns a DataFrame:

§ from snowflake.snowpark.functions import col

§ fare_amount = df_trips.select([col('"fare_amount"'), col('"tip_amount"')])

§ # Shorter equivalent version:

§ fare_amount = df_trips.select(['"fare_amount"', '"tip_amount"'])
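The quoting rule can be summarized with two small helpers. This is a plain-Python sketch of Snowflake's default identifier resolution, not part of any library; `resolve_identifier` and `quote_identifier` are hypothetical names.

```python
def resolve_identifier(name: str) -> str:
    """Return the column name Snowflake looks up for `name` (default settings)."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        # Quoted identifier: case is preserved, doubled quotes are unescaped.
        return name[1:-1].replace('""', '"')
    # Unquoted identifier: resolved as uppercase.
    return name.upper()


def quote_identifier(name: str) -> str:
    """Wrap a case-sensitive name in double quotes, escaping embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


unquoted = resolve_identifier("fare_amount")
quoted = resolve_identifier('"fare_amount"')
round_trip = resolve_identifier(quote_identifier("tip_amount"))
```

This is why `col('fare_amount')` would look for a column named `FARE_AMOUNT`, while `col('"fare_amount"')` matches the lowercase column.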

### Computing the average of a column

Collect the mean `fare_amount`. This returns a one-element list of `snowflake.snowpark.row.Row` objects:

§ from snowflake.snowpark.functions import mean

§ avg_row = df_trips.select(mean(col('"fare_amount"'))).collect()

§ avg_row # results [Row(AVG("FARE_AMOUNT")=12.556332926005984)]

You can access the value as follows:

§ avg = avg_row[0].asDict().get('AVG("FARE_AMOUNT")')

### Creating a new column from a case expression

Leverage the `withColumn()` method to create a new column indicating whether a trip’s fare was above average. That new column is the result of a case expression (`when()` and `otherwise()`):

§ from snowflake.snowpark.functions import when

§ df_trips = df_trips.withColumn('"cost"', when(col('"fare_amount"') > avg, "high")\

§     .otherwise("low"))

§ # Check the first five rows

§ df_trips.select(['"cost"', '"fare_amount"']).take(5)

### Joining two tables

The `NYC_trips` Dataset contains a pick-up and a drop-off location id (*PULocationID* and *DOLocationID*). You can map those location ids to their corresponding zone names using the `NYC_zones` Dataset.

To do so, you will perform two consecutive joins on the *OBJECTID* column of the `NYC_zones` Dataset. Start by loading that Dataset into a DataFrame:

§ import pandas as pd

§ # Get the NYC_zones Dataset object

§ NYC_zones = dataiku.Dataset("NYC_zones")

§ df_zones = sp.get_dataframe(NYC_zones)

§ df_zones.toPandas()

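Finally, perform the two consecutive joins. Snowpark's `join()` defaults to an inner join, so trips without a matching zone are dropped; pass `how="left"` if you want to keep them. Note how you are able to chain different operations, including `withColumnRenamed()` to rename the *zone* column and `drop()` to remove other columns from the `NYC_zones` Dataset: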

§ df = df_trips.join(df_zones, col('"PULocationID"')==col('"OBJECTID"'))\

§     .withColumnRenamed(col('"zone"'), '"pickup_zone"')\

§     .drop([col('"OBJECTID"'), col('"PULocationID"'), col('"borough"')])\

§     .join(df_zones, col('"DOLocationID"')==col('"OBJECTID"'))\

§     .withColumnRenamed(col('"zone"'), '"dropoff_zone"')\

§     .drop([col('"OBJECTID"'), col('"DOLocationID"'), col('"borough"')])
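The choice of join type matters when some location ids have no match in the zones table. Snowpark's `join()` performs an inner join unless `how` is specified. The following plain-Python sketch mimics one of the two lookups on made-up sample rows to show the difference; `join_zone` is a hypothetical helper and the data is illustrative only.

```python
def join_zone(trips, zones, id_column, zone_column, how="inner"):
    """Map a location id to its zone name, mimicking an inner or left join."""
    lookup = {zone["OBJECTID"]: zone["zone"] for zone in zones}
    joined = []
    for trip in trips:
        key = trip[id_column]
        if key not in lookup and how == "inner":
            continue  # inner join: unmatched trips are dropped
        # Drop the id column and add the zone name (None when unmatched).
        row = {name: value for name, value in trip.items() if name != id_column}
        row[zone_column] = lookup.get(key)
        joined.append(row)
    return joined


# Made-up sample rows; location id 99 has no matching zone.
trips = [
    {"PULocationID": 1, "fare_amount": 12.5},
    {"PULocationID": 99, "fare_amount": 7.0},
]
zones = [
    {"OBJECTID": 1, "zone": "Newark Airport"},
    {"OBJECTID": 2, "zone": "Jamaica Bay"},
]

inner = join_zone(trips, zones, "PULocationID", "pickup_zone", how="inner")
left = join_zone(trips, zones, "PULocationID", "pickup_zone", how="left")
```

Here `inner` contains only the trip with a known pick-up zone, while `left` keeps both trips and sets the missing zone to `None`.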

### Group By

Count the number of trips by pickup zone among expensive trips. Use the `filter()` method to keep only the trips whose *cost* is `high`. Then use the `groupBy()` method to group by *pickup_zone*, `count()` the number of trips and `sort()` them in descending order. Finally, call the `toPandas()` method to store the results of the group by as a Pandas DataFrame.

§ results_count_df = df.filter((col('"cost"')=="high"))\

§     .groupBy(col('"pickup_zone"'))\

§     .count()\

§     .sort(col('"COUNT"'), ascending=False)\

§     .toPandas()

§ results_count_df

## User Defined Functions (UDF)

Snowpark would be of limited use without UDFs.

A User Defined Function (UDF) is a function that, for a single row, takes the values of one or several cells from that row and returns a new value.

UDFs effectively allow users to transform data using custom logic beyond what is practical in pure SQL, including logic that relies on Python packages.

To be used, UDFs first need to be *registered* so that at execution time they can be properly sent to the Snowflake servers. In this section, you will see a simple UDF example and how to register it.

### Registering a UDF

* The first option to register a UDF is to use either the `register()` or the `udf()` function. The following code block defines a simple UDF that computes the tip as a fraction of the fare amount and registers it both ways:

§ from snowflake.snowpark.functions import udf

§ from snowflake.snowpark.types import FloatType

§ def get_tip_pct(tip_amount, fare_amount):

§     return tip_amount/fare_amount

§ # Register with register()

§ get_tip_pct_udf = session.udf.register(get_tip_pct, input_types=[FloatType(), FloatType()],

§                                        return_type=FloatType())

§ # Register with udf()

§ get_tip_pct_udf = udf(get_tip_pct, input_types=[FloatType(), FloatType()],

§                       return_type=FloatType())

* An alternative way of registering the `get_tip_pct()` function as a UDF is to decorate it with `@udf`. In this case, you need to specify the input and output types as type hints in the function signature. The name `get_tip_pct` then refers to the registered UDF.

§ @udf

§ def get_tip_pct(tip_amount: float, fare_amount: float) -> float:

§     return tip_amount/fare_amount
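Until it is registered, a UDF is an ordinary Python function, so its logic can be checked locally first. The variant below is a sketch: `safe_tip_pct` is a hypothetical name, and it assumes that SQL `NULL` values reach a Python UDF as `None` and that a zero fare should yield `None` rather than a division error.

```python
from typing import Optional


def safe_tip_pct(tip_amount: Optional[float], fare_amount: Optional[float]) -> Optional[float]:
    """Return tip / fare, or None when the ratio is undefined."""
    if tip_amount is None or fare_amount is None or fare_amount == 0:
        return None
    return tip_amount / fare_amount


# Local checks on plain Python values, before any registration.
regular = safe_tip_pct(2.0, 10.0)
zero_fare = safe_tip_pct(2.0, 0.0)
missing = safe_tip_pct(None, 10.0)
```

Such a function could then be registered in the same way as `get_tip_pct`, for example with `session.udf.register(safe_tip_pct, input_types=[FloatType(), FloatType()], return_type=FloatType())`.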

### Applying a UDF

Now that the UDF is registered, you can use it to generate new columns in your DataFrame using `withColumn()`:

§ df = df.withColumn('"tip_pct"', get_tip_pct_udf('"tip_amount"', '"fare_amount"'))

After running this code, you should be able to see that the *tip_pct* column was created in the `df` DataFrame.

## Writing a DataFrame into a Snowflake Dataset

In a Python recipe, you will likely want to write a Snowpark DataFrame into a Snowflake output Dataset. We recommend using the `write_with_schema()` method of the `DkuSnowpark` class. This method runs the `saveAsTable()` Snowpark Python method to save the contents of a DataFrame into a Snowflake table.

§ output_dataset = dataiku.Dataset("my_output_dataset")

§ sp.write_with_schema(output_dataset, df)

Warning

You should avoid converting a Snowpark Python DataFrame to a Pandas DataFrame before writing the output Dataset. In the following example, the `toPandas()` method creates the Pandas DataFrame locally, which increases memory usage and can lead to resource shortage issues.

§ output_dataset = dataiku.Dataset("my_output_dataset")

§ # Loads the ENTIRE DataFrame in memory (NOT optimal!)

§ output_dataset.write_with_schema(df.toPandas())

## Wrapping up

Congratulations, you now know how to work with Snowpark Python within Dataiku! To go further, here are some useful links:

* Dataiku reference documentation on the Snowpark Python integration

* Snowpark Python reference
