# Flow creation and management

## Programmatically building a Flow

The Flow, including datasets, recipes, and other items, can be fully created and managed programmatically.

Datasets can be created and managed using the methods detailed in Datasets (other operations).

Recipes can be created using the `dataikuapi.dss.project.DSSProject.new_recipe()` method. This follows a builder pattern: `new_recipe()` returns a recipe creator object, on which you add settings, and you then call its `create()` method to actually create the recipe.

The builder objects reproduce the functionality available in the recipe creation modals in the UI. For finer control over a recipe’s setup, it is often necessary to get its settings after creation, modify them, and save them again.
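The builder steps described above can be wrapped in a small helper. This is a minimal sketch, not part of the Dataiku API: `create_simple_recipe` is a hypothetical name, and it relies only on the `new_recipe()`, `with_input()`, `with_new_output()` and `create()` calls used on this page. It assumes `project` is a `DSSProject` handle and that the chosen recipe type has a builder exposing `with_new_output()`, as the sync and grouping builders on this page do.

```python
def create_simple_recipe(project, recipe_type, input_name, output_name, connection):
    """Create a single-input, single-output recipe and return its handle.

    Hypothetical helper: `project` is assumed to be a DSSProject handle.
    """
    # new_recipe() returns a recipe creator (builder) object
    builder = project.new_recipe(recipe_type)
    builder.with_input(input_name)
    # Request a new managed dataset for the output in the given connection
    builder.with_new_output(output_name, connection)
    # create() actually creates the recipe and returns it
    return builder.create()
```

A call such as `create_simple_recipe(project, "sync", "input_dataset_name", "output_dataset_name", "filesystem_managed")` would then return the new recipe, whose settings can be edited afterwards.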

### Creating a Python recipe

§ builder = project.new\_recipe("python")

§ # Set the input

§ builder.with\_input("myinputdataset")

§ # Create a new managed dataset for the output in the filesystem\_managed connection

§ builder.with\_new\_output\_dataset("grouped\_dataset", "filesystem\_managed")

§ # Set the code - builder is a PythonRecipeCreator, and has a ``with\_script`` method

§ builder.with\_script("""

§ import dataiku

§ from dataiku import recipe

§ input\_dataset = recipe.get\_inputs\_as\_datasets()[0]

§ output\_dataset = recipe.get\_outputs\_as\_datasets()[0]

§ df = input\_dataset.get\_dataframe()

§ df = df.groupby("something").count()

§ output\_dataset.write\_with\_schema(df)

§ """)

§ recipe = builder.create()

§ # recipe is now a ``DSSRecipe`` representing the new recipe, and we can now run it

§ job = recipe.run()

### Creating a Sync recipe

§ builder = project.new\_recipe("sync")

§ builder = builder.with\_input("input\_dataset\_name")

§ builder = builder.with\_new\_output("output\_dataset\_name", "hdfs\_managed", format\_option\_id="PARQUET\_HIVE")

§ recipe = builder.create()

§ job = recipe.run()

### Creating and modifying a grouping recipe

Recipe creation mostly handles setting up the inputs and outputs of the recipe. Most of the remaining setup has to be done by retrieving the recipe’s settings, altering and saving them, then applying schema changes to the output.

§ builder = project.new\_recipe("grouping")

§ builder.with\_input("dataset\_to\_group\_on")

§ # Create a new managed dataset for the output in the "filesystem\_managed" connection

§ builder.with\_new\_output("grouped\_dataset", "filesystem\_managed")

§ builder.with\_group\_key("column")

§ recipe = builder.create()

§ # After the recipe is created, you can edit its settings

§ recipe\_settings = recipe.get\_settings()

§ recipe\_settings.set\_column\_aggregations("myvaluecolumn", sum=True)

§ recipe\_settings.save()

§ # And you may need to apply new schemas to the outputs

§ # This will add the myvaluecolumn\_sum to the "grouped\_dataset" dataset

§ recipe.compute\_schema\_updates().apply()

§ # It should be noted that running a recipe is equivalent to building its output(s)

§ job = recipe.run()

### A complete example

This example shows a complete chain:

* Creating an external dataset

* Automatically detecting the settings of the dataset (see Datasets (other operations) for details)

* Creating a prepare recipe to clean up the dataset

* Then chaining a grouping recipe, setting an aggregation on it

* Running the entire chain

§ dataset = project.create\_sql\_table\_dataset("mydataset", "PostgreSQL", "my\_sql\_connection", "mytable", "myschema")

§ dataset\_settings = dataset.autodetect\_settings()

§ dataset\_settings.save()

§ # As a shortcut, we can call new\_recipe on the DSSDataset object. This way, we don't need to call "with\_input"

§ prepare\_builder = dataset.new\_recipe("prepare")

§ prepare\_builder.with\_new\_output("mydataset\_cleaned", "filesystem\_managed")

§ prepare\_recipe = prepare\_builder.create()

§ # Add a step to clean values in "doublecol" that are not valid doubles

§ prepare\_settings = prepare\_recipe.get\_settings()

§ prepare\_settings.add\_filter\_on\_bad\_meaning("DoubleMeaning", "doublecol")

§ prepare\_settings.save()

§ prepare\_recipe.compute\_schema\_updates().apply()

§ prepare\_recipe.run()

§ # Grouping recipe

§ grouping\_builder = project.new\_recipe("grouping")

§ grouping\_builder.with\_input("mydataset\_cleaned")

§ grouping\_builder.with\_new\_output("mydataset\_cleaned\_grouped", "filesystem\_managed")

§ grouping\_builder.with\_group\_key("column")

§ grouping\_recipe = grouping\_builder.create()

§ grouping\_recipe\_settings = grouping\_recipe.get\_settings()

§ grouping\_recipe\_settings.set\_column\_aggregations("myvaluecolumn", sum=True)

§ grouping\_recipe\_settings.save()

§ # Schema updates and runs are requested on the recipe itself, not on its settings

§ grouping\_recipe.compute\_schema\_updates().apply()

§ grouping\_recipe.run()

## Working with flow zones

### Creating a zone and adding items in it

§ flow = project.get\_flow()

§ zone = flow.create\_zone("zone1")

§ # First way of adding an item to a zone

§ dataset = project.get\_dataset("mydataset")

§ zone.add\_item(dataset)

§ # Second way of adding an item to a zone: pass the zone id (not its name)

§ dataset = project.get\_dataset("mydataset")

§ dataset.move\_to\_zone(zone.id)

§ # Third way of adding an item to a zone

§ dataset = project.get\_dataset("mydataset")

§ dataset.move\_to\_zone(zone)

### Listing and getting zones

§ # List zones

§ for zone in flow.list\_zones():

§     print("Zone id=%s name=%s" % (zone.id, zone.name))

§     print("Zone has the following items:")

§     for item in zone.items:

§         print("Zone item: %s" % item)

§ # Get a zone by id - beware, id not name

§ zone = flow.get\_zone("21344ZsQZ")

§ # Get the "Default" zone

§ zone = flow.get\_default\_zone()
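Because `get_zone()` expects an id rather than a name, a lookup by name can be sketched by scanning the result of `list_zones()`. This is a minimal sketch: `find_zone_by_name` is a hypothetical helper, not part of the Dataiku API, and it assumes that each zone exposes `name` as in the listing above and that returning the first match is acceptable.

```python
def find_zone_by_name(flow, name):
    """Return the first zone whose name matches `name`, or None.

    Hypothetical helper: `flow` is assumed to be the object returned
    by project.get_flow().
    """
    for zone in flow.list_zones():
        if zone.name == name:
            return zone
    # No zone carries this name
    return None
```

The returned zone can then be used wherever a zone object or its `id` is expected.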

### Changing the settings of a zone

§ flow = project.get\_flow()

§ zone = flow.get\_zone("21344ZsQZ")

§ settings = zone.get\_settings()

§ settings.name = "New name"

§ settings.save()

### Getting the zone of a dataset

§ dataset = project.get\_dataset("mydataset")

§ zone = dataset.get\_zone()

§ print("Dataset is in zone %s" % zone.id)

## Navigating the flow graph

DSS builds the Flow graph dynamically by enumerating datasets, folders, models and recipes and linking them together through the inputs and outputs of the recipes. Since navigating this graph can be complex, the `dataikuapi.dss.flow.DSSProjectFlow` class gives you access to helpers for it.

### Finding sources of the Flow

§ flow = project.get\_flow()

§ graph = flow.get\_graph()

§ for source in graph.get\_source\_computables(as\_type="object"):

§     print("Flow graph has source: %s" % source)

### Enumerating the graph in order

This method returns all items in the graph, “from left to right”. Each item is returned as a `DSSDataset`, `DSSManagedFolder`, `DSSSavedModel`, `DSSStreamingEndpoint` or `DSSRecipe`.

§ flow = project.get\_flow()

§ graph = flow.get\_graph()

§ for item in graph.get\_items\_in\_traversal\_order(as\_type="object"):

§     print("Next item in the graph is %s" % item)
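To make “from left to right” concrete, the following plain-Python sketch orders a small graph so that every node appears after all of its predecessors (a topological sort). It is an illustration only, not how DSS implements the traversal: `traversal_order` is a hypothetical function, and the `nodes` mapping with `predecessors` and `successors` lists is an assumed, simplified structure modeled on the `graph.nodes` entries used in the orphaned datasets example later on this page.

```python
from collections import deque


def traversal_order(nodes):
    """Return node names so that every node comes after its predecessors.

    `nodes` maps a node name to a dict with "predecessors" and
    "successors" lists (an assumed, simplified structure).
    """
    # Number of predecessors not yet emitted, for each node
    remaining = {name: len(props["predecessors"]) for name, props in nodes.items()}
    # Sources (no predecessors) are the leftmost items of the graph
    queue = deque(name for name, count in remaining.items() if count == 0)
    order = []
    while queue:
        name = queue.popleft()
        order.append(name)
        for successor in nodes[name]["successors"]:
            remaining[successor] -= 1
            # A node becomes ready once all its predecessors were emitted
            if remaining[successor] == 0:
                queue.append(successor)
    return order


# Tiny illustrative graph: dataset -> recipe -> dataset
example = {
    "raw": {"predecessors": [], "successors": ["prepare"]},
    "prepare": {"predecessors": ["raw"], "successors": ["cleaned"]},
    "cleaned": {"predecessors": ["prepare"], "successors": []},
}
print(traversal_order(example))  # ['raw', 'prepare', 'cleaned']
```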

### Replacing an input everywhere in the graph

This method replaces an input (a dataset, for example) in every recipe where it appears as an input.

§ flow = project.get\_flow()

§ flow.replace\_input\_computable("old\_dataset", "new\_dataset")

§ # Or to replace a managed folder

§ flow.replace\_input\_computable("oldid", "newid", type="MANAGED\_FOLDER")

## Schema propagation

When the schema of an input dataset is modified, or when the settings of a recipe are modified, you need to propagate this schema change across the flow.

This can be done from the UI, but can also be automated through the API.

§ flow = project.get\_flow()

§ # A propagation always starts from a source dataset and will move from left to right till the end of the Flow

§ propagation = flow.new\_schema\_propagation("sourcedataset")

§ future = propagation.start()

§ future.wait\_for\_result()

Many options are available for propagation; see `dataikuapi.dss.flow.DSSSchemaPropagationRunBuilder`.
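Since a propagation always starts from a single source dataset, a Flow with several sources needs one propagation per source. The sketch below runs them one after the other. `propagate_from_sources` is a hypothetical helper, not part of the Dataiku API; it uses only the `new_schema_propagation()`, `start()` and `wait_for_result()` calls shown above, and it assumes that running the propagations sequentially with default options is acceptable.

```python
def propagate_from_sources(flow, source_dataset_names):
    """Run one schema propagation per source dataset, sequentially.

    Hypothetical helper: `flow` is assumed to be the object returned
    by project.get_flow(). Returns a dict mapping each source dataset
    name to the result of its propagation.
    """
    results = {}
    for name in source_dataset_names:
        propagation = flow.new_schema_propagation(name)
        future = propagation.start()
        # Block until this propagation has finished before starting the next
        results[name] = future.wait_for_result()
    return results
```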

## Exporting a flow documentation

This sample shows how to generate and download Flow documentation from a template.

See Flow Document Generator for more information.

§ # project is a DSSProject object

§ flow = project.get\_flow()

§ # Launch the flow document generation by either

§ # using the default template by calling without arguments

§ # or specifying a managed folder id and the path to the template to use in that folder

§ future = flow.generate\_documentation(FOLDER\_ID, "path/my\_template.docx")

§ # Alternatively, use a custom uploaded template file

§ with open("my\_template.docx", "rb") as f:

§     future = flow.generate\_documentation\_from\_custom\_template(f)

§ # Wait for the generation to finish, retrieve the result and download the generated

§ # flow documentation to the specified file

§ result = future.wait\_for\_result()

§ export\_id = result["exportId"]

§ flow.download\_documentation\_to\_file(export\_id, "path/my\_flow\_documentation.docx")

## Detailed examples

This section contains more advanced examples on Flow-based operations.

### Delete orphaned Datasets

It can happen that after some operations on a Flow one or more Datasets end up not being linked to any Recipe and thus become disconnected from the Flow branches. In order to programmatically remove those Datasets from the Flow, you can list nodes that have neither predecessor nor successor in the graph using the following function:

§ def delete\_orphaned\_datasets(project, drop\_data=False, dry\_run=True):

§     """Delete datasets that are not linked to any recipe.

§     """

§     flow = project.get\_flow()

§     graph = flow.get\_graph()

§     cpt = 0

§     for name, props in graph.nodes.items():

§         if not props["predecessors"] and not props["successors"]:

§             print(f"- Deleting {name}...")

§             ds = project.get\_dataset(name)

§             if not dry\_run:

§                 ds.delete(drop\_data=drop\_data)

§                 cpt += 1

§             else:

§                 print("Dry run: nothing was deleted.")

§     print(f"{cpt} datasets deleted.")

Attention

Note that the function has additional flags whose default values are set to prevent accidental data deletion. Even so, we recommend remaining extra cautious when clearing or deleting Datasets.
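One way to stay cautious is to separate detection from deletion, so that the list of candidates can be reviewed before anything is removed. The sketch below isolates the detection step as a pure function that can be checked without a DSS instance. `find_orphaned_nodes` is a hypothetical helper, and it assumes `nodes` has the same shape as `graph.nodes` in the function above, that is, a mapping from node name to a dict with `predecessors` and `successors` entries.

```python
def find_orphaned_nodes(nodes):
    """Return the names of nodes with neither predecessor nor successor.

    `nodes` is assumed to have the same shape as `graph.nodes`:
    name -> {"predecessors": [...], "successors": [...]}.
    """
    return [
        name
        for name, props in nodes.items()
        if not props["predecessors"] and not props["successors"]
    ]
```

Calling `find_orphaned_nodes(project.get_flow().get_graph().nodes)` would then give the candidate names. The result may include isolated items that are not datasets, so it is worth reviewing before passing any name to `project.get_dataset()`.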

## Reference documentation

| Class | Description |
| --- | --- |
| `dataikuapi.dss.flow.DSSProjectFlow`(client, ...) |  |
| `dataikuapi.dss.flow.DSSProjectFlowGraph`(...) |  |
| `dataikuapi.dss.flow.DSSFlowZone`(flow, data) | A zone in the Flow. |
| `dataikuapi.dss.flow.DSSFlowZoneSettings`(zone) | The settings of a flow zone. |
| `dataikuapi.dss.flow.DSSSchemaPropagationRunBuilder`(...) | Do not create this directly, use `DSSProjectFlow.new_schema_propagation()` |
