# Datasets (reading and writing data)

Please see Datasets (introduction) for an introduction to interacting with datasets using the Dataiku Python API.

* Basic usage

* Typing of dataframes

* Chunked reading and writing with Pandas

* Encoding

* Sampling

    + head

    + random

    + random-column

    + Examples

* Getting a dataset as raw bytes

* Data interaction (dataikuapi variant)

    + Reading data (dataikuapi variant)

    + Reading data for a partition (dataikuapi variant)

## Basic usage

For starting code samples, please see Python recipes.

## Typing of dataframes

Applies when reading a dataframe.

By default, the dataframe is created without explicit typing: Pandas “guesses” the proper Pandas type for each column. The main advantage of this approach is that even if your dataset only contains “string” columns (which is the default on a newly imported dataset), a column that actually contains numbers will get a proper numerical type.

If you pass `infer_with_pandas=False` as an option to `get_dataframe()`, the exact dataset types will be passed to Pandas. Note that if your dataset contains invalid values, the whole `get_dataframe` call will fail.
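The two typing modes can be combined into a fallback pattern. The following is a minimal sketch, not part of the Dataiku API: `read_typed` is a hypothetical helper, and it assumes only that the dataset object exposes `get_dataframe()` with the `infer_with_pandas` option described above. The type of exception raised on invalid values is not specified here, so the sketch catches `Exception` broadly.

```python
def read_typed(dataset):
    """Try strict dataset typing first, then fall back to Pandas inference.

    `dataset` is assumed to be a dataiku.Dataset (or any object exposing
    the same get_dataframe() method).
    Returns a (dataframe, used_strict_types) pair.
    """
    try:
        # Strict mode: the dataset's declared column types are passed to Pandas.
        return dataset.get_dataframe(infer_with_pandas=False), True
    except Exception:
        # Assumption: invalid values surface as an exception of unspecified
        # type, so catch broadly and let Pandas guess the types instead.
        return dataset.get_dataframe(), False
```

In a recipe, this could be called as `df, strict = read_typed(dataiku.Dataset("myname"))`. Note that the fallback reads the dataset a second time, which may be costly on large inputs.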

## Chunked reading and writing with Pandas

When using `Dataset.get_dataframe()`, the whole dataset (or the selected partitions) is read into a single Pandas dataframe, which must fit in RAM on the DSS server.

This is sometimes inconvenient, so DSS provides a way to read the dataset in chunks:

§ from dataiku import Dataset

§ mydataset = Dataset("myname")

§ for df in mydataset.iter_dataframes(chunksize=10000):

§     # df is a dataframe of at most 10K rows.

§     pass  # process the chunk here

By doing this, you only need to load a few thousand rows at a time.
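As an illustration of chunked processing, the sketch below accumulates a result across chunks so that only one chunk is held in memory at a time. `count_rows` is a hypothetical helper name; it assumes only that the dataset object exposes `iter_dataframes(chunksize=...)` as shown above.

```python
def count_rows(dataset, chunksize=10000):
    """Count the rows of a dataset while holding only one chunk in memory."""
    total = 0
    for df in dataset.iter_dataframes(chunksize=chunksize):
        # Each chunk is a dataframe of at most `chunksize` rows.
        total += len(df)
    return total
```

The same loop shape works for any aggregate that can be updated incrementally, such as sums or per-key counters.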

Writing to a dataset can also be done in chunks of dataframes. To do so, you need to obtain a writer:

§ from dataiku import Dataset

§ inp = Dataset("input")

§ out = Dataset("output")

§ with out.get_writer() as writer:

§     for df in inp.iter_dataframes():

§         # Process the df dataframe ...

§         # Write the processed dataframe

§         writer.write_dataframe(df)

Note

When using chunked writing, you cannot set the schema for each chunk, so you cannot use `Dataset.write_with_schema`.

Instead, you should set the schema first on the dataset object, using `Dataset.write_schema_from_dataframe(first_output_dataframe)`.
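One way to follow this note is to take the first processed chunk, set the schema from it, and only then open the writer. The sketch below is an illustration under stated assumptions: `copy_in_chunks` and its `transform` parameter are hypothetical names, and the code relies only on the `iter_dataframes`, `write_schema_from_dataframe`, `get_writer` and `write_dataframe` calls mentioned in this section.

```python
def copy_in_chunks(inp, out, transform=lambda df: df, chunksize=10000):
    """Stream `inp` into `out` chunk by chunk.

    Returns the number of chunks written.
    """
    # Lazily apply the processing step to each input chunk.
    chunks = (transform(df) for df in inp.iter_dataframes(chunksize=chunksize))

    first = next(chunks, None)
    if first is None:
        return 0  # empty input: nothing is written, schema left untouched

    # Set the output schema once, from the first processed chunk,
    # before any data is written.
    out.write_schema_from_dataframe(first)

    written = 0
    with out.get_writer() as writer:
        writer.write_dataframe(first)
        written += 1
        for df in chunks:
            writer.write_dataframe(df)
            written += 1
    return written
```

This assumes every processed chunk has the same columns as the first one, since the schema is set only once.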

## Encoding

When dealing with both dataframes and row-by-row iteration, you must pay attention to str/unicode and encoding issues:

* DSS provides dataframes where the string content is utf-8 encoded str.

* When writing dataframes, DSS expects utf-8 encoded str.

* Per-line iterators provide string content as unicode objects.

* Per-line writers expect unicode objects.

For example, if you read from a dataframe but write row-by-row, you must decode your str values into unicode objects.
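The decoding step can be sketched as below. `decode_row` is a hypothetical helper, not a Dataiku function. The str/unicode distinction described here corresponds to Python 2, where `str` is a byte string; under Python 3, `str` is already Unicode, so the helper leaves such values unchanged and only decodes `bytes`.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of `row` with every byte string decoded to Unicode.

    Non-bytes values (numbers, already-decoded text, None) pass through as-is.
    """
    return [
        value.decode(encoding) if isinstance(value, bytes) else value
        for value in row
    ]
```

Each row taken from a dataframe would be passed through this helper before being handed to a per-line writer.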

## Sampling

All calls that read or iterate over the dataset (`get_dataframe`, `iter_dataframes`, `iter_rows` and `iter_tuples`) accept several arguments to configure sampling.

Sampling lets you retrieve only a selection of the rows of the input dataset. This is often useful when using Pandas if your dataset does not fit in RAM.

For more information about sampling methods, please see Sampling.

The `sampling` argument takes the following values.

### head

Returns the first rows of the dataset. Additional arguments:

* limit=X : number of rows to read

### random

Returns a random sample of the dataset. Additional arguments:

* ratio=X: ratio (between 0 and 1) to select.

* OR: limit=X: number of rows to read.

### random-column

Returns a column-based random sample. Additional arguments:

* sampling_column: column to use for sampling

* ratio=X: ratio (between 0 and 1) to select

### Examples

§ # Get a Dataframe over the first 3K rows

§ dataset.get_dataframe(sampling='head', limit=3000)

§ # Iterate over a random 10% sample

§ dataset.iter_tuples(sampling='random', ratio=0.1)

§ # Iterate over 27% of the values of column 'user_id'

§ dataset.iter_tuples(sampling='random-column', sampling_column='user_id', ratio=0.27)

§ # Get a chunked stream of dataframes over 100K randomly selected rows

§ dataset.iter_dataframes(sampling='random', limit=100000)

## Getting a dataset as raw bytes

In addition to retrieving a dataset as Pandas dataframes or iterators, you can also ask DSS for a streamed export, as formatted data.

Data can be exported by DSS in various formats: CSV, Excel, Avro, …

§ # Read a dataset as Excel, and dump to a file, chunk by chunk

§ #

§ # Very important: you MUST use a with() statement to ensure that the stream

§ # returned by raw_formatted_data is closed

§ with open(target_path, "wb") as ofl:

§     with dataset.raw_formatted_data(format="excel") as ifl:

§         while True:

§             chunk = ifl.read(32000)

§             if len(chunk) == 0:

§                 break

§             ofl.write(chunk)

## Data interaction (dataikuapi variant)

This section covers reading data using the `dataikuapi` package. We recommend using the `dataiku` package instead for reading data.

### Reading data (dataikuapi variant)

The data of a dataset can be streamed with the `iter_rows()` method. This call returns the raw data, so in most cases you first need to get the dataset’s schema with a call to `get_schema()`. For example, the first 10 rows can be printed with:

§ columns = [column['name'] for column in dataset.get_schema()['columns']]

§ print(columns)

§ row_count = 0

§ for row in dataset.iter_rows():

§     print(row)

§     row_count = row_count + 1

§     if row_count >= 10:

§         break

This outputs:

§ ['tube_assembly_id', 'supplier', 'quote_date', 'annual_usage', 'min_order_quantity', 'bracket_pricing', 'quantity', 'cost']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '1', '21.9059330191461']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '2', '12.3412139792904']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '5', '6.60182614356538']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '10', '4.6877695119712']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '25', '3.54156118026073']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '50', '3.22440644770007']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '100', '3.08252143576504']

§ ['TA-00002', 'S-0066', '2013-07-07', '0', '0', 'Yes', '250', '2.99905966403855']

§ ['TA-00004', 'S-0066', '2013-07-07', '0', '0', 'Yes', '1', '21.9727024365273']

§ ['TA-00004', 'S-0066', '2013-07-07', '0', '0', 'Yes', '2', '12.4079833966715']
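Because each row comes back as a plain list of raw string values, it can be convenient to pair the values with the column names from the schema. The sketch below shows one way to do this; `rows_as_dicts` is a hypothetical helper, and it assumes only the `get_schema()` and `iter_rows()` calls shown above.

```python
from itertools import islice


def rows_as_dicts(dataset, limit=10):
    """Yield up to `limit` rows as {column name: raw string value} dicts."""
    columns = [column["name"] for column in dataset.get_schema()["columns"]]
    # islice stops the stream after `limit` rows without a manual counter.
    for row in islice(dataset.iter_rows(), limit):
        yield dict(zip(columns, row))
```

The values stay as raw strings, as in the output above; converting them to numbers or dates according to the schema types is left to the caller.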

### Reading data for a partition (dataikuapi variant)

The data of a given partition can be retrieved by passing the appropriate partition spec as a parameter to `iter_rows()`:

§ row_count = 0

§ for row in dataset.iter_rows(partitions='partition_spec1,partition_spec2'):

§     print(row)

§     row_count = row_count + 1

§     if row_count >= 10:

§         break
