# Reading or writing a dataset with custom Python code

When you use a Python recipe to transform a dataset in Dataiku DSS, you generally use the DSS Python API to read and write to the dataset.

This DSS API provides an easy way to read or write datasets, regardless of their size or data store. This way, you don’t need to install specific packages for interacting with each data store, or learn specific APIs.

There are some cases, however, where the DSS API does not provide enough flexibility, and you want to use the specific API or package for your data store.

Some use cases could include:

* You want to read data stored in a MongoDB collection using a specific filter that cannot be expressed in the filter settings of the input dataset.

* You want to “upsert” data into the output dataset (i.e., insert, update, or remove records based on a primary key).

Using the DSS API is by no means mandatory: you can read and write data however you want. If you don’t call the `get_dataframe` or `iter_tuples` methods, DSS does not read any data or load anything into memory from the data store.

Similarly, you don’t have to use the `write_dataframe` or `get_writer` API to write data to the output. Even if you write with a tool that DSS does not know about (for example, the `pymongo` package for MongoDB), the recipe will work properly, and DSS will know that the dataset has been changed.
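As a rough illustration of the MongoDB use case above, the sketch below reads only the documents matching a custom filter, without going through the DSS reading API. It assumes a `pymongo`-style collection object whose `find()` method accepts a filter document. The helper name `read_filtered`, the connection URI, and the database, collection, and field names in the commented usage are hypothetical.

```python
def read_filtered(collection, query, max_docs=None):
    """Return the documents of `collection` that match `query`.

    `collection` is assumed to behave like a pymongo Collection,
    i.e. `collection.find(query)` returns an iterable of dicts.
    """
    documents = []
    for document in collection.find(query):
        # Stop once the requested number of documents has been collected.
        if max_docs is not None and len(documents) >= max_docs:
            break
        documents.append(document)
    return documents


# Hypothetical usage inside a Python recipe (all names are placeholders):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   orders = client["mydb"]["orders"]
#   rows = read_filtered(orders, {"status": "shipped", "amount": {"$gt": 100}})
```

Passing the collection in as a parameter keeps the connection details out of the helper, so they can come from wherever the recipe stores them.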

## Accessing info about datasets

You generally want to avoid hard-coding connection information, table names, etc. in your recipe code. DSS can give you some connection / location information about the datasets that you are trying to read or write.

For all datasets, you can use the `dataset.get_location_info()` method. It returns a structure containing an `info` dict, whose keys depend on the kind of dataset. Print the dict to see what is available (for example, from a Jupyter notebook). Here are a few examples:

```python
import dataiku

# myfs is a Filesystem dataset
dataset = dataiku.Dataset("myfs")
locinfo = dataset.get_location_info()
print(locinfo["info"])
# Example output:
# {
#   "path" : "/data/input/myfs"
# }
```

```python
import dataiku

# sql is a PostgreSQL dataset
dataset = dataiku.Dataset("sql")
locinfo = dataset.get_location_info()
print(locinfo["info"])
# Example output:
# {
#   "databaseType" : "PostgreSQL",
#   "schema" : "public",
#   "table" : "mytablename"
# }
```
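The `info` dict can feed code that talks to the data store directly. The sketch below, written for the PostgreSQL example above, builds a parameterized upsert statement (`INSERT ... ON CONFLICT ... DO UPDATE`) from the `schema` and `table` keys instead of hard-coding them. The helper names, the column names, and the `%s` placeholder style (as used by drivers such as `psycopg2`) are assumptions for illustration. The statement only inserts or updates rows, and the key columns must be covered by a unique constraint.

```python
def quote_identifier(name):
    """Quote a SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_upsert_statement(info, key_columns, value_columns):
    """Build a PostgreSQL upsert for the table described by `info`.

    `info` is the dict found under the "info" key of the structure
    returned by get_location_info() for a SQL dataset.
    """
    table = quote_identifier(info["schema"]) + "." + quote_identifier(info["table"])
    columns = list(key_columns) + list(value_columns)
    column_list = ", ".join(quote_identifier(c) for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    conflict_target = ", ".join(quote_identifier(c) for c in key_columns)
    # EXCLUDED refers to the row that was proposed for insertion.
    updates = ", ".join(
        "{0} = EXCLUDED.{0}".format(quote_identifier(c)) for c in value_columns
    )
    return (
        "INSERT INTO {table} ({columns}) VALUES ({values}) "
        "ON CONFLICT ({keys}) DO UPDATE SET {updates}"
    ).format(
        table=table,
        columns=column_list,
        values=placeholders,
        keys=conflict_target,
        updates=updates,
    )


# Example with the "info" dict shown above (column names are hypothetical).
info = {"databaseType": "PostgreSQL", "schema": "public", "table": "mytablename"}
statement = build_upsert_statement(info, ["id"], ["name", "amount"])
print(statement)
```

The resulting string would then be executed with the database driver of your choice, passing the row values as parameters rather than formatting them into the SQL text.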

In addition, for “Filesystem-like” datasets (Filesystem, HDFS, S3, etc.), you can use the `get_files_info()` method to get details about all files in a dataset (or partition).

```python
import dataiku

dataset = dataiku.Dataset("non_partitioned_fs")
fi = dataset.get_files_info()
for filepath in fi["globalPaths"]:
    # Path relative to the root path of the dataset.
    # The root path of the dataset is returned by get_location_info()
    print(filepath["path"])
    # Size in bytes of that file
    print(filepath["size"])
```

```python
import dataiku

dataset = dataiku.Dataset("partitioned_fs")
fi = dataset.get_files_info()
for (partition_id, partition_filepaths) in fi["pathsByPartition"].items():
    print(partition_id)
    for filepath in partition_filepaths:
        # Path relative to the root path of the dataset.
        # The root path of the dataset is returned by get_location_info()
        print(filepath["path"])
        # Size in bytes of that file
        print(filepath["size"])
```
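Because the paths returned by `get_files_info()` are relative to the dataset root, code that opens the files directly has to combine them with the root path from `get_location_info()`. Here is a minimal sketch, assuming a Filesystem dataset whose `info` dict exposes a `path` key as in the earlier example. The helper name and the sample values are hypothetical.

```python
import posixpath


def absolute_file_paths(root_path, file_entries):
    """Join the dataset root with each relative file path.

    `file_entries` is a list of dicts with "path" and "size" keys, such as
    fi["globalPaths"] or one of the lists in fi["pathsByPartition"].
    Returns (list of absolute paths, total size in bytes).
    """
    paths = []
    total_size = 0
    for entry in file_entries:
        # Strip a leading slash so the relative path cannot replace the root.
        relative = entry["path"].lstrip("/")
        paths.append(posixpath.join(root_path, relative))
        total_size += entry["size"]
    return paths, total_size


# Sample values standing in for what DSS would return (hypothetical).
root = "/data/input/myfs"
entries = [
    {"path": "/part-0.csv", "size": 1024},
    {"path": "2020/part-1.csv", "size": 2048},
]
paths, total = absolute_file_paths(root, entries)
print(paths)
print(total)
```

Inside a recipe, `root` would come from `dataset.get_location_info()["info"]["path"]` and `entries` from the structure returned by `get_files_info()`.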

## Partitioned datasets

If your recipe deals with partitioned datasets, in input or output, you need to be careful about reading and/or writing the correct data.

### Reading and Writing

When you read and write through the DSS API, you don’t need to specify the source or destination partitions in your code. DSS selects them for you.

To read the input partitions (as defined by the partition dependencies), use `get_dataframe()`. This automatically gives you only the relevant partitions. Likewise, writing through `write_dataframe()` or `get_writer()` targets the output partition being built.

### Other Purposes

For purposes other than reading or writing dataframes, you can get the name of the partition being built (as well as other variables) from the Python dictionary `dku_flow_variables`. This dictionary can be accessed using `dataiku.dku_flow_variables`, as described in the product documentation.

Note

`dataset.get_write_partition()` is deprecated.
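As a sketch of this lookup, the helper below reads the output partition value for one dimension from a flow-variables dictionary. The key naming used here (`DKU_DST_` followed by the dimension name) is an assumption and should be checked against the product documentation on flow variables. The helper name and the `country` dimension are hypothetical. Inside a recipe, the dictionary to pass would be `dataiku.dku_flow_variables`.

```python
def output_partition_value(flow_variables, dimension):
    """Return the value of `dimension` for the partition being built.

    Assumes keys of the form "DKU_DST_<dimension>"; verify this
    naming against the product documentation.
    """
    key = "DKU_DST_" + dimension
    if key not in flow_variables:
        raise KeyError(
            "No flow variable %r; available keys: %s"
            % (key, sorted(flow_variables))
        )
    return flow_variables[key]


# Stand-in for dataiku.dku_flow_variables (values are hypothetical).
sample_flow_variables = {"DKU_DST_country": "FR"}
print(output_partition_value(sample_flow_variables, "country"))
```

Raising an error that lists the available keys makes it easier to spot a misspelled dimension name when the recipe runs.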

## Related resources

* Python recipes

* Partitioning
