# Visual Machine learning

Through the public API, the Python client lets you automate every aspect of the machine learning model lifecycle:

* Creating a visual analysis and ML task

* Tuning settings

* Training models

* Inspecting model details and results

* Deploying saved models to Flow and retraining them

## Concepts

In DSS, you train models as part of a *visual analysis*. A visual analysis consists of a preparation script and one or more *ML Tasks*.

An ML Task is an individual section in which you train models. An ML Task is either a prediction of a single target variable or a clustering.

The ML API allows you to manipulate ML Tasks, and use them to train models, inspect their details, and deploy them to the Flow.

Once deployed to the Flow, the *Saved model* can be retrained by the usual build mechanism of DSS.

An ML Task has settings, which control:

* Which features are active

* The preprocessing settings for each feature

* Which algorithms are active

* The hyperparameter settings (including grid searched hyperparameters) for each algorithm

* The settings of the grid search

* Train/Test splitting settings

* Feature selection and generation settings

## Usage samples

### The whole cycle

This example creates a prediction task, enables an algorithm, trains it, inspects the resulting models, and deploys one of them to the Flow.

§ # client is a DSS API client

§ p = client.get\_project("MYPROJECT")

§ # Create a new ML Task to predict the variable "target" from "trainset"

§ mltask = p.create\_prediction\_ml\_task(

§     input\_dataset="trainset",

§     target\_variable="target",

§     ml\_backend\_type='PY\_MEMORY', # ML backend to use

§     guess\_policy='DEFAULT' # Template to use for setting default parameters

§ )

§ # Wait for the ML task to be ready

§ mltask.wait\_guess\_complete()

§ # Obtain settings, enable GBT, save settings

§ settings = mltask.get\_settings()

§ settings.set\_algorithm\_enabled("GBT\_CLASSIFICATION", True)

§ settings.save()

§ # Start train and wait for it to be complete

§ mltask.start\_train()

§ mltask.wait\_train\_complete()

§ # Get the identifiers of the trained models

§ # There will be 3 of them because Logistic regression and Random forest were enabled by default

§ ids = mltask.get\_trained\_models\_ids()

§ for model\_id in ids:

§     details = mltask.get\_trained\_model\_details(model\_id)

§     algorithm = details.get\_modeling\_settings()["algorithm"]

§     auc = details.get\_performance\_metrics()["auc"]

§     print("Algorithm=%s AUC=%s" % (algorithm, auc))

§ # Let's deploy the first model

§ model\_to\_deploy = ids[0]

§ ret = mltask.deploy\_to\_flow(model\_to\_deploy, "my\_model", "trainset")

§ print("Deployed to saved model id = %s train recipe = %s" % (ret["savedModelId"], ret["trainRecipeName"]))

The methods for creating prediction and clustering ML tasks are defined at `dataikuapi.dss.project.DSSProject.create\_prediction\_ml\_task()` and `dataikuapi.dss.project.DSSProject.create\_clustering\_ml\_task()`.

### Obtaining a handle to an existing ML Task

When you create these ML tasks, the returned `dataikuapi.dss.ml.DSSMLTask` object has two fields, `analysis\_id` and `mltask\_id`, that can later be used to retrieve the same `DSSMLTask` object:

§ # client is a DSS API client

§ p = client.get\_project("MYPROJECT")

§ mltask = p.get\_ml\_task(analysis\_id, mltask\_id)
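
To reuse an ML task in a later session, the two identifiers have to be kept somewhere. The sketch below writes them to a JSON file and reads them back, assuming they are simple JSON-serializable values such as strings. The helper names `save_mltask_ref` and `load_mltask_ref`, the file name and the identifier values are all illustrative, and a `SimpleNamespace` stands in for a real `DSSMLTask`.

```python
import json
import os
import tempfile
from types import SimpleNamespace


def save_mltask_ref(mltask, path):
    """Store the identifiers needed to retrieve an ML task later."""
    with open(path, "w") as f:
        json.dump({"analysis_id": mltask.analysis_id, "mltask_id": mltask.mltask_id}, f)


def load_mltask_ref(path):
    """Return (analysis_id, mltask_id) as previously stored."""
    with open(path) as f:
        ref = json.load(f)
    return ref["analysis_id"], ref["mltask_id"]


# Stand-in for a DSSMLTask; the identifier values are made up
fake_mltask = SimpleNamespace(analysis_id="AbCd1234", mltask_id="XyZ98765")

ref_path = os.path.join(tempfile.mkdtemp(), "mltask_ref.json")
save_mltask_ref(fake_mltask, ref_path)
analysis_id, mltask_id = load_mltask_ref(ref_path)
print(analysis_id, mltask_id)
# With a real project: mltask = p.get_ml_task(analysis_id, mltask_id)
```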

### Tuning feature preprocessing

#### Enabling and disabling features

§ # mltask is a DSSMLTask object

§ settings = mltask.get\_settings()

§ settings.reject\_feature("not\_useful")

§ settings.use\_feature("useful")

§ settings.save()
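
When many features need to be toggled, it can help to drive the calls from a mapping and save only once. The following is a minimal sketch: `apply_feature_selection` is a hypothetical helper built on the `use_feature()`, `reject_feature()` and `save()` calls shown above, and `_RecordingSettings` is a stand-in that merely records calls so the snippet runs outside DSS.

```python
def apply_feature_selection(settings, selection):
    """Enable or reject features in bulk, then save once.

    `selection` maps a feature name to True (use) or False (reject).
    """
    for feature, enabled in selection.items():
        if enabled:
            settings.use_feature(feature)
        else:
            settings.reject_feature(feature)
    settings.save()


class _RecordingSettings:
    """Stand-in for the settings object that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def use_feature(self, name):
        self.calls.append(("use", name))

    def reject_feature(self, name):
        self.calls.append(("reject", name))

    def save(self):
        self.calls.append(("save", None))


fake_settings = _RecordingSettings()
apply_feature_selection(fake_settings, {"useful": True, "not_useful": False})
print(fake_settings.calls)
```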

#### Changing advanced parameters for a feature

§ # mltask is a DSSMLTask object

§ settings = mltask.get\_settings()

§ # Use impact coding rather than dummy-coding

§ fs = settings.get\_feature\_preprocessing("mycategory")

§ fs["category\_handling"] = "IMPACT"

§ # Impute missing with most frequent value

§ fs["missing\_handling"] = "IMPUTE"

§ fs["missing\_impute\_with"] = "MODE"

§ settings.save()
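
The same pattern extends to several features at once. In the sketch below, `apply_preprocessing_overrides` is a hypothetical helper that assumes, as in the sample above, that `get_feature_preprocessing()` returns a dictionary that is modified in place before `save()`. The `_FakeSettings` class and the initial parameter values are made up so that the snippet runs outside DSS.

```python
def apply_preprocessing_overrides(settings, overrides):
    """Update per-feature preprocessing parameters in place, then save once.

    `overrides` maps a feature name to a dict of parameters to set.
    """
    for feature, params in overrides.items():
        fs = settings.get_feature_preprocessing(feature)
        fs.update(params)
    settings.save()


class _FakeSettings:
    """Stand-in for the settings object, holding made-up preprocessing dicts."""

    def __init__(self, per_feature):
        self.per_feature = per_feature
        self.saved = False

    def get_feature_preprocessing(self, name):
        return self.per_feature[name]

    def save(self):
        self.saved = True


# The initial values below are placeholders, not real DSS defaults
fake = _FakeSettings({"mycategory": {"category_handling": "PLACEHOLDER"}})
apply_preprocessing_overrides(fake, {
    "mycategory": {
        "category_handling": "IMPACT",
        "missing_handling": "IMPUTE",
        "missing_impute_with": "MODE",
    }
})
print(fake.per_feature["mycategory"])
```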

### Tuning algorithms

#### Global parameters for hyperparameter search

This sample shows how to modify the parameters of the search to be performed on the hyperparameters.

§ # mltask is a DSSMLTask object

§ settings = mltask.get\_settings()

§ hp\_search\_settings = settings.get\_hyperparameter\_search\_settings()

§ # Set the search strategy either to "GRID", "RANDOM" or "BAYESIAN"

§ hp\_search\_settings.strategy = "RANDOM"

§ # Alternatively use a setter, either set\_grid\_search

§ # set\_random\_search or set\_bayesian\_search

§ hp\_search\_settings.set\_random\_search(seed=1234)

§ # Set the validation mode to "KFOLD" or "SHUFFLE" (or their

§ # "TIME\_SERIES"-prefixed counterparts), or to "CUSTOM"

§ hp\_search\_settings.validation\_mode = "KFOLD"

§ # Alternatively use a setter, either set\_kfold\_validation, set\_single\_split\_validation

§ # or set\_custom\_validation

§ hp\_search\_settings.set\_kfold\_validation(n\_folds=5, stratified=True)

§ # Save the settings

§ settings.save()

#### Algorithm specific hyperparameter search

This sample shows how to modify the settings of the Random Forest Classification algorithm, where two kinds of hyperparameters (multi-valued numerical and single-valued) are introduced.

§ # mltask is a DSSMLTask object

§ settings = mltask.get\_settings()

§ rf\_settings = settings.get\_algorithm\_settings("RANDOM\_FOREST\_CLASSIFICATION")

§ # rf\_settings is an object representing the settings for this algorithm.

§ # The 'enabled' attribute indicates whether this algorithm will be trained.

§ # Other attributes are the various hyperparameters of the algorithm.

§ # The precise hyperparameters for each algorithm are not all documented, so let's

§ # print the dictionary keys to see available hyperparameters.

§ # Alternatively, tab completion will provide relevant hints to available hyperparameters.

§ print(rf\_settings.keys())

§ # Let's first have a look at rf\_settings.n\_estimators which is a multi-valued hyperparameter

§ # represented as a NumericalHyperparameterSettings object

§ print(rf\_settings.n\_estimators)

§ # Set multiple explicit values for "n\_estimators" to be explored during the search

§ rf\_settings.n\_estimators.definition\_mode = "EXPLICIT"

§ rf\_settings.n\_estimators.values = [100, 200]

§ # Alternatively use the set\_values setter

§ rf\_settings.n\_estimators.set\_values([100, 200])

§ # Set a range of values for "n\_estimators" to be explored during the search

§ rf\_settings.n\_estimators.definition\_mode = "RANGE"

§ rf\_settings.n\_estimators.range.min = 10

§ rf\_settings.n\_estimators.range.max = 100

§ rf\_settings.n\_estimators.range.nb\_values = 5  # Only relevant for grid-search

§ # Alternatively, use the set\_range setter

§ rf\_settings.n\_estimators.set\_range(min=10, max=100, nb\_values=5)

§ # Let's now have a look at rf\_settings.selection\_mode which is a single-valued hyperparameter

§ # represented as a SingleCategoryHyperparameterSettings object.

§ # The object stores the valid options for this hyperparameter.

§ print(rf\_settings.selection\_mode)

§ # Feature selection mode is not multi-valued so it's not actually searched during the

§ # hyperparameter search

§ rf\_settings.selection\_mode = "sqrt"

§ # Save the settings

§ settings.save()

The next sample shows how to modify the settings of the Logistic Regression classification algorithm, where a new kind of hyperparameter (multi-valued categorical) is introduced.

§ # mltask is a DSSMLTask object

§ settings = mltask.get\_settings()

§ logit\_settings = settings.get\_algorithm\_settings("LOGISTIC\_REGRESSION")

§ # Let's have a look at logit\_settings.penalty which is a multi-valued categorical

§ # hyperparameter represented as a CategoricalHyperparameterSettings object

§ print(logit\_settings.penalty)

§ # List currently enabled values

§ print(logit\_settings.penalty.get\_values())

§ # List all possible values

§ print(logit\_settings.penalty.get\_all\_possible\_values())

§ # Set the values for the "penalty" hyperparameter to be explored during the search

§ logit\_settings.penalty = ["l1", "l2"]

§ # Alternatively use the set\_values setter

§ logit\_settings.penalty.set\_values(["l1", "l2"])

§ # Save the settings

§ settings.save()

### Exporting a model documentation

This sample shows how to generate model documentation from a template and download the result.

See Model Document Generator for more information.

§ # mltask is a DSSMLTask object and model\_id is the id of one of its trained models

§ details = mltask.get\_trained\_model\_details(model\_id)

§ # Launch the model document generation by either

§ # using the default template for this model by calling without argument

§ # or specifying a managed folder id and the path to the template to use in that folder

§ future = details.generate\_documentation(FOLDER\_ID, "path/my\_template.docx")

§ # Alternatively, use a custom uploaded template file

§ with open("my\_template.docx", "rb") as f:

§     future = details.generate\_documentation\_from\_custom\_template(f)

§ # Wait for the generation to finish, retrieve the result and download the generated

§ # model documentation to the specified file

§ result = future.wait\_for\_result()

§ export\_id = result["exportId"]

§ details.download\_documentation\_to\_file(export\_id, "path/my\_model\_documentation.docx")

### Using a model in a Python recipe or notebook

Once a Saved Model has been deployed to the Flow, the usual way to use it is through scoring recipes.

However, you can also use the `dataiku.Model` class in a Python recipe or notebook to directly score records.

This method has a number of limitations:

* It cannot be used together with containerized execution

* It is not compatible with Partitioned models

§ import dataiku

§ m = dataiku.Model(my\_model\_id)

§ my\_predictor = m.get\_predictor()

§ predicted\_dataframe = my\_predictor.predict(input\_dataframe)

## Detailed examples

This section contains more advanced examples using ML Tasks and Saved Models.

### Deploy best MLTask model to the Flow

After training several models in an ML Task, you can programmatically deploy the best one by creating a new Saved Model or updating an existing one. In the following example:

* The `deploy\_with\_best\_model()` function creates a new Saved Model with the input MLTask’s best model.

* The `update\_with\_best\_model()` function updates an existing Saved Model with the MLTask’s best model.

Both functions rely on `dataikuapi.dss.ml.DSSMLTask` and `dataikuapi.dss.savedmodel.DSSSavedModel`.

§ def get\_best\_model(project, analysis\_id, ml\_task\_id, metric):

§     analysis = project.get\_analysis(analysis\_id)

§     ml\_task = analysis.get\_ml\_task(ml\_task\_id)

§     trained\_models = ml\_task.get\_trained\_models\_ids()

§     trained\_models\_snippets = [ml\_task.get\_trained\_model\_snippet(m) for m in trained\_models]

§     # Assumes that for your metric, "higher is better"

§     best\_model\_snippet = max(trained\_models\_snippets, key=lambda x: x[metric])

§     best\_model\_id = best\_model\_snippet["fullModelId"]

§     return ml\_task, best\_model\_id

§ def deploy\_with\_best\_model(project,

§                            analysis\_id,

§                            ml\_task\_id,

§                            metric,

§                            saved\_model\_name,

§                            training\_dataset):

§     """Create a new Saved Model in the Flow with the 'best model' of an MLTask.

§     """

§     ml\_task, best\_model\_id = get\_best\_model(project,

§                                             analysis\_id,

§                                             ml\_task\_id,

§                                             metric)

§     ml\_task.deploy\_to\_flow(best\_model\_id,

§                            saved\_model\_name,

§                            training\_dataset)

§ def update\_with\_best\_model(project,

§                            analysis\_id,

§                            ml\_task\_id,

§                            metric,

§                            saved\_model\_name,

§                            activate=True):

§     """Update an existing Saved Model in the Flow with the 'best model'

§     of an MLTask.

§     """

§     ml\_task, best\_model\_id = get\_best\_model(project,

§                                             analysis\_id,

§                                             ml\_task\_id,

§                                             metric)

§     training\_recipe\_name = f"train\_{saved\_model\_name}"

§     ml\_task.redeploy\_to\_flow(model\_id=best\_model\_id,

§                              recipe\_name=training\_recipe\_name,

§                              activate=activate)
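
`get_best_model()` assumes that a higher metric value is better, which does not hold for error metrics. The sketch below isolates the selection step as a pure function over the snippet dictionaries. The `select_best_snippet` helper, its `higher_is_better` flag and the sample metric keys are assumptions made for illustration; only `fullModelId` comes from the example above. Snippets that lack the metric are skipped.

```python
def select_best_snippet(snippets, metric, higher_is_better=True):
    """Return the snippet with the best value of `metric`.

    Snippets that do not carry the metric (missing key or None) are ignored.
    Raises ValueError when no snippet has a usable value.
    """
    candidates = [s for s in snippets if s.get(metric) is not None]
    if not candidates:
        raise ValueError("No trained model reports metric %r" % metric)
    pick = max if higher_is_better else min
    return pick(candidates, key=lambda s: s[metric])


# Made-up snippets; the metric keys are illustrative
sample_snippets = [
    {"fullModelId": "model-a", "auc": 0.90, "rmse": 4.2},
    {"fullModelId": "model-b", "auc": 0.95, "rmse": 5.0},
    {"fullModelId": "model-c", "auc": None, "rmse": 3.1},
]

print(select_best_snippet(sample_snippets, "auc")["fullModelId"])
print(select_best_snippet(sample_snippets, "rmse", higher_is_better=False)["fullModelId"])
```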

### List details of all Saved Models

You can retrieve, for each Saved Model in a Project, the algorithm and performance metrics of its current version. In the following example, the `explore\_saved\_models()` function returns a Python list with one dictionary per Saved Model, giving its version ids and details on its active version.

§ def explore\_saved\_models(client=None, project\_key=None):

§     """List saved models of a project and give details on the active versions.

§     Args:

§         client: A handle on the target DSS instance

§         project\_key: A string representing the target project key

§     Returns:

§         smdl\_list: A list of dicts, one per saved model, with its version ids

§             and the perf + algorithm of its active version.

§     """

§     smdl\_list = []

§     prj = client.get\_project(project\_key)

§     smdl\_ids = [x["id"] for x in prj.list\_saved\_models()]

§     for smdl in smdl\_ids:

§         data = {}

§         obj = prj.get\_saved\_model(smdl)

§         data["version\_ids"] = [m["id"] for m in obj.list\_versions()]

§         active\_version\_id = obj.get\_active\_version()["id"]

§         active\_version\_details = obj.get\_version\_details(active\_version\_id)

§         data["active\_version"] = {"id": active\_version\_id,

§                                   "algorithm": active\_version\_details.details["actualParams"]["resolved"]["algorithm"],

§                                   "performance\_metrics": active\_version\_details.get\_performance\_metrics()}

§         smdl\_list.append(data)

§     return smdl\_list
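
The nested structure built by `explore_saved_models()` is convenient to inspect but awkward to tabulate. As a sketch, the hypothetical `flatten_saved_models` helper below turns it into one flat dictionary per Saved Model. It relies only on the `version_ids` and `active_version` keys built in the function above; the sample input is made up.

```python
def flatten_saved_models(smdl_list):
    """Turn the nested output into one flat row per saved model."""
    rows = []
    for entry in smdl_list:
        active = entry["active_version"]
        row = {
            "active_version_id": active["id"],
            "algorithm": active["algorithm"],
            "n_versions": len(entry["version_ids"]),
        }
        # Prefix each metric so that it cannot clash with the other columns
        for name, value in active["performance_metrics"].items():
            row["metric_%s" % name] = value
        rows.append(row)
    return rows


# Made-up input with the same shape as the output of explore_saved_models()
sample = [
    {
        "version_ids": ["v1", "v2"],
        "active_version": {
            "id": "v2",
            "algorithm": "RANDOM_FOREST_CLASSIFICATION",
            "performance_metrics": {"auc": 0.91, "f1": 0.78},
        },
    }
]
rows = flatten_saved_models(sample)
print(rows)
```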

### List version details of a given Saved Model

This code snippet allows you to retrieve a summary of all versions of a given Saved Model (algorithm, hyperparameters, performance, features) using `dataikuapi.dss.savedmodel.DSSSavedModel`.

§ import copy

§ def export\_saved\_model\_metadata(project, saved\_model\_id):

§     """Return a list of dicts, one per version of the Saved Model, summarizing

§     its algorithm, hyperparameters, performance, lineage and training time.

§     """

§     model = project.get\_saved\_model(saved\_model\_id)

§     output = []

§     for version in model.list\_versions():

§         version\_details = model.get\_version\_details(version["id"])

§         version\_dict = {}

§         # Retrieve algorithm and hyperparameters

§         resolved = copy.deepcopy(version\_details.get\_actual\_modeling\_params()["resolved"])

§         version\_dict["algorithm"] = resolved["algorithm"]

§         del resolved["algorithm"]

§         del resolved["skipExpensiveReports"]

§         for (key, hyperparameters) in resolved.items():

§             for (hyperparameter\_key, hyperparameter\_value) in hyperparameters.items():

§                 version\_dict["hyperparameter\_%s" % hyperparameter\_key] = hyperparameter\_value

§         # Retrieve test performance

§         for (metric\_key, metric\_value) in version\_details.get\_performance\_metrics().items():

§             version\_dict["test\_perf\_%s" % metric\_key] = metric\_value

§         # Retrieve lineage

§         version\_dict["training\_target\_variable"] = version\_details.details["coreParams"]["target\_variable"]

§         split\_desc = version\_details.details["splitDesc"]

§         version\_dict["training\_train\_rows"] = split\_desc["trainRows"]

§         version\_dict["training\_test\_rows"] = split\_desc["testRows"]

§         training\_used\_features = []

§         for (key, item) in version\_details.get\_preprocessing\_settings()["per\_feature"].items():

§             if item["role"] == "INPUT":

§                 training\_used\_features.append(key)

§         version\_dict["training\_used\_features"] = ",".join(training\_used\_features)

§         # Retrieve training time

§         ti = version\_details.get\_train\_info()

§         version\_dict["training\_total\_time"] = int((ti["endTime"] - ti["startTime"])/1000)

§         version\_dict["training\_preprocessing\_time"] = int(ti["preprocessingTime"]/1000)

§         version\_dict["training\_training\_time"] = int(ti["trainingTime"]/1000)

§         output.append(version\_dict)

§     return output
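
Because different algorithms expose different hyperparameters, the dictionaries returned by `export_saved_model_metadata()` do not necessarily share the same keys. The sketch below uses only the standard `csv` module to write such a list as CSV, with the union of all keys as the header. `versions_to_csv` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def versions_to_csv(version_dicts):
    """Serialize a list of dicts with heterogeneous keys to CSV text."""
    # Build the header as the union of keys, keeping first-seen order
    fieldnames = []
    for d in version_dicts:
        for key in d:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    # restval fills the cells of keys that a given row does not have
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(version_dicts)
    return buffer.getvalue()


# Made-up rows shaped like the output of export_saved_model_metadata()
sample_versions = [
    {"algorithm": "RANDOM_FOREST_CLASSIFICATION", "hyperparameter_max_tree_depth": 8, "test_perf_auc": 0.91},
    {"algorithm": "LOGISTIC_REGRESSION", "hyperparameter_penalty": "L2", "test_perf_auc": 0.88},
]
csv_text = versions_to_csv(sample_versions)
print(csv_text)
```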

### Retrieve linear model coefficients

You can retrieve the list of coefficient names and values from a Saved Model version for compatible algorithms.

§ def get\_model\_coefficients(project, saved\_model\_id, version\_id=None):

§     """

§     Returns a dictionary with key="coefficient name" and value=coefficient.

§     If version\_id is None, the active version is used.

§     """

§     model = project.get\_saved\_model(saved\_model\_id)

§     if version\_id is None:

§         version\_id = model.get\_active\_version().get('id')

§     details = model.get\_version\_details(version\_id)

§     details\_lr = details.details.get('iperf', {}).get('lmCoefficients', {})

§     rescaled\_coefs = details\_lr.get('rescaledCoefs', [])

§     variables = details\_lr.get('variables', [])

§     coef\_dict = {var: coef for var, coef in zip(variables, rescaled\_coefs)}

§     if len(coef\_dict) == 0:

§         print(f"Model {saved\_model\_id} and version {version\_id} does not have coefficients")

§     return coef\_dict
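
A common next step is to rank the coefficients by magnitude. The following sketch sorts a dictionary shaped like the one returned by `get_model_coefficients()`; the helper name `top_coefficients` and the sample values are made up for illustration.

```python
def top_coefficients(coef_dict, n=3):
    """Return the n (name, coefficient) pairs with the largest absolute value."""
    ranked = sorted(coef_dict.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:n]


# Made-up coefficients
sample_coefs = {"age": 0.42, "income": -1.30, "tenure": 0.05, "visits": 0.77}
for name, coef in top_coefficients(sample_coefs, n=2):
    print("%s: %+.2f" % (name, coef))
```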

### Export model

You can programmatically export the best version of a Saved Model as either a Python function or an MLflow model. In the following example, the `get\_best\_classifier\_version()` function returns the best version id of the classifier.

Pass that id to the `dataikuapi.dss.savedmodel.DSSSavedModel.get\_version\_details()` method to get a `dataikuapi.dss.ml.DSSTrainedPredictionModelDetails` handle.

Then use either `get\_scoring\_python()` or `get\_scoring\_mlflow()` to download the model archive, in Python or MLflow format respectively, to a given file name.

§ import dataiku

§ PROJECT\_KEY = 'YOUR\_PROJECT\_KEY'

§ METRIC = 'auc' # or any other classification metric of interest

§ SAVED\_MODEL\_ID = 'YOUR\_SAVED\_MODEL\_ID'

§ FILENAME = 'path/to/model-archive.zip'

§ def get\_best\_classifier\_version(project, saved\_model\_id, metric):

§     """

§     This function returns the best version id of a

§     given DSS classifier model in a project.

§     """

§     model = project.get\_saved\_model(saved\_model\_id)

§     outcome = []

§     for version in model.list\_versions():

§         version\_id = version.get('id')

§         version\_details = model.get\_version\_details(version\_id)

§         perf = version\_details.get\_raw\_snippet().get(metric)

§         outcome.append((version\_id, perf))

§     # Get the best version id. Use reverse=False if

§     # a lower metric means better

§     best\_version\_id = sorted(

§         outcome, key=lambda x: x[1], reverse=True)[0][0]

§     return best\_version\_id

§ client = dataiku.api\_client()

§ project = client.get\_project(PROJECT\_KEY)

§ model = project.get\_saved\_model(SAVED\_MODEL\_ID)

§ best\_version\_id = get\_best\_classifier\_version(project, SAVED\_MODEL\_ID, METRIC)

§ version\_details = model.get\_version\_details(best\_version\_id)

§ # Export in Python format

§ version\_details.get\_scoring\_python(FILENAME)

§ # Export in MLflow format (use a different file name to keep both archives)

§ version\_details.get\_scoring\_mlflow(FILENAME)
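
In `get_best_classifier_version()`, `perf` is `None` whenever the snippet does not contain the requested metric, and sorting values that mix `None` with numbers raises a `TypeError` in Python 3. The sketch below shows a more defensive selection step over the `(version_id, perf)` pairs. The helper name `best_version_from_outcome`, its `higher_is_better` flag and the sample data are assumptions made for illustration.

```python
def best_version_from_outcome(outcome, higher_is_better=True):
    """Pick the best version id from (version_id, perf) pairs, ignoring missing perfs."""
    scored = [(version_id, perf) for version_id, perf in outcome if perf is not None]
    if not scored:
        return None  # no version reports the metric
    scored.sort(key=lambda pair: pair[1], reverse=higher_is_better)
    return scored[0][0]


# Made-up pairs; "v2" has no value for the metric
sample_outcome = [("initial", 0.87), ("v2", None), ("v3", 0.92)]
print(best_version_from_outcome(sample_outcome))
```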

## Reference documentation

### Interaction with a ML Task

| Class | Description |
| --- | --- |
| `dataikuapi.dss.ml.DSSMLTask`(client, ...) |  |

### Manipulation of settings

| Class | Description |
| --- | --- |
| `dataikuapi.dss.ml.DSSMLTaskSettings`(client, ...) | Object to read and modify the settings of a ML task. |
| `dataikuapi.dss.ml.PredictionSplitParamsHandler`(...) | Object to modify the train/test splitting params. |

### Exploration of results

| Class | Description |
| --- | --- |
| `dataikuapi.dss.ml.DSSTrainedPredictionModelDetails`(...) | Object to read details of a trained prediction model |
| `dataikuapi.dss.ml.DSSTrainedClusteringModelDetails`(...) | Object to read details of a trained clustering model |

### MLflow models

| Class | Description |
| --- | --- |
| `dataikuapi.dss.savedmodel.MLFlowVersionSettings`(...) | Handle for the settings of an imported MLflow model version |

### dataiku.model
