# Model Evaluation Stores

The Python client lets you evaluate models through the public API. These are typically models trained in the Lab and then deployed to the Flow as Saved Models (see Visual Machine learning for additional information). They can also be external models.

## Concepts

### With a DSS model

In DSS, you can *evaluate* a *version* of a *Saved Model* using an *Evaluation Recipe*. An Evaluation Recipe takes as input a Saved Model and a *Dataset* on which to perform this evaluation. An Evaluation Recipe can have three outputs:

* an *output* dataset,

* a *metrics* dataset, or

* a *Model Evaluation Store* (MES).

By default, the *active* version of the Saved Model is evaluated. This can be configured in the Evaluation Recipe.

If a MES is configured as an output, a *Model Evaluation* (ME) will be written in the MES each time the MES is built (or each time the Evaluation Recipe is run).

A Model Evaluation is a container for metrics of the evaluation of the Saved Model Version on the Evaluation Dataset. Those metrics include:

* all available *performance* metrics,

* the *Data Drift* metric.

The Data Drift metric is the accuracy of a model trained to tell apart lines:

* from the evaluation dataset

* from the *train time test dataset* of the configured version of the Saved Model.

The higher this metric, the better the model separates lines of the evaluation dataset from lines of the train time test dataset, and therefore the more the evaluation data differs from the train time data.
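As a rough illustration of the idea, the sketch below trains a deliberately simple nearest-centroid classifier to tell reference rows from evaluation rows and reports its held-out accuracy. It is not the drift model DSS uses, which this page does not specify; `drift_accuracy`, the nearest-centroid rule, and the synthetic Gaussian data are all assumptions made for the example.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def drift_accuracy(reference, evaluation, seed=0):
    """Held-out accuracy of a nearest-centroid classifier separating the two origins."""
    rng = random.Random(seed)
    # Label each row with its origin: 0 = reference, 1 = evaluation.
    labeled = [(row, 0) for row in reference] + [(row, 1) for row in evaluation]
    rng.shuffle(labeled)
    split = len(labeled) // 2
    train, test = labeled[:split], labeled[split:]
    c0 = centroid([row for row, origin in train if origin == 0])
    c1 = centroid([row for row, origin in train if origin == 1])
    # A row is predicted as "evaluation" when it is closer to the evaluation centroid.
    hits = sum(
        1 for row, origin in test
        if (sq_dist(row, c1) < sq_dist(row, c0)) == (origin == 1)
    )
    return hits / len(test)


# Synthetic one-column datasets (hypothetical data, for illustration only).
data_rng = random.Random(42)
reference = [[data_rng.gauss(0, 1)] for _ in range(500)]
same = [[data_rng.gauss(0, 1)] for _ in range(500)]
shifted = [[data_rng.gauss(3, 1)] for _ in range(500)]

print("same distribution:", drift_accuracy(reference, same))
print("shifted distribution:", drift_accuracy(reference, shifted))
```

With identically distributed samples the accuracy is expected to stay near 0.5, and it moves toward 1.0 as the evaluation data shifts away from the reference data.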

Detailed information and other tools, including a binomial test, univariate data drift, and feature drift importance, are available in the *Input Data Drift* tab of a Model Evaluation. Note that this tool is interactive and that displayed results are not persisted.
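This page does not state how the binomial test is set up. A plausible reading, assumed here, is that with balanced samples a drift model that has learned nothing is right about half of the time, so the test asks how likely it is to see at least the observed number of correct predictions by chance. The function name `binomial_pvalue` and the one-sided form are illustrative choices, not DSS's implementation.

```python
from math import exp, lgamma, log


def binomial_pvalue(correct, total, chance=0.5):
    """One-sided tail probability P(X >= correct) for X ~ Binomial(total, chance)."""
    log_p, log_q = log(chance), log(1.0 - chance)
    tail = 0.0
    for k in range(correct, total + 1):
        # Work in log space so that large sample sizes do not overflow.
        log_pmf = (
            lgamma(total + 1) - lgamma(k + 1) - lgamma(total - k + 1)
            + k * log_p + (total - k) * log_q
        )
        tail += exp(log_pmf)
    return min(tail, 1.0)


# 520 correct out of 1000 is weak evidence of drift, 600 out of 1000 is strong.
print(binomial_pvalue(520, 1000))
print(binomial_pvalue(600, 1000))
```

A small p-value means the drift model beats chance by more than sampling noise would explain, which is consistent with the evaluation data differing from the reference data.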

### With an external model

In DSS, you can also evaluate an *external model* using a *Standalone Evaluation Recipe* (SER). A SER takes as input a dataset containing labels, predictions, and (optionally) weights. A SER has a single output: a Model Evaluation Store.

Like the Evaluation Recipe, the Standalone Evaluation Recipe writes a Model Evaluation to the configured Model Evaluation Store each time it runs. In this case, however, the Data Drift cannot be computed, as there is no reference data.

### How evaluation is performed

The Evaluation Recipe and its counterpart for external models, the Standalone Evaluation Recipe, perform the evaluation on a sample of the Evaluation Dataset. The sampling parameters are defined in the recipe. Note that the sample will contain at most 20,000 lines.

Performance metrics are then computed on this sample.

Data drift can be computed in three ways:

* at evaluation time, between the evaluation dataset and the train time test dataset;

* using the API, between the samples of a Model Evaluation, a Saved Model Version (sample of train time test dataset) or a Lab Model (sample of train time test dataset);

* interactively, in the “Input data drift” tab of a Model Evaluation.

In all cases, to compute the Data Drift, the sample of the Model Evaluation and a sample of the reference data are concatenated. To balance the data, both samples are truncated to the length of the smaller one. For example, if the reference sample is larger than the ME sample, the reference sample is truncated.

So:

* at evaluation time, the inputs are the sample of the Model Evaluation (at most 20,000 lines) and a sample of the train time test dataset;

* interactively, the inputs are the sample of the reference Model Evaluation and:

  + if the other compared item is an ME, its sample;

  + if the other compared item is a Lab Model or an SMV (Saved Model Version), a sample of its train time test dataset.
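The balancing step described above can be sketched in a few lines. This is a minimal illustration, assuming that truncation keeps the first rows of each sample (the page does not say which rows are kept); `balance_and_concat` and the origin labels are hypothetical names.

```python
def balance_and_concat(reference_sample, current_sample):
    """Truncate both samples to the smaller length, then concatenate them.

    Returns the concatenated rows and a parallel list of origin labels
    (0 = reference, 1 = current) that a drift model could be trained on.
    """
    n = min(len(reference_sample), len(current_sample))
    rows = list(reference_sample[:n]) + list(current_sample[:n])
    origin = [0] * n + [1] * n
    return rows, origin


# The reference sample has 5 rows, the current one 3: the reference is truncated.
rows, origin = balance_and_concat([0, 1, 2, 3, 4], ["a", "b", "c"])
print(rows)
print(origin)
```

Because both origins end up with the same number of rows, a drift model that cannot tell them apart scores an accuracy of about 0.5.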

### Limitations

Model Evaluation Stores cannot be used with:

* clustering models,

* ensembling models,

* partitioned models.

Compatible prediction models have to be Python models.

## Usage samples

### Create a Model Evaluation Store

```python
# client is a DSS API client
p = client.get_project("MYPROJECT")
# the call returns a handle on the new store, not a bare id
mes = p.create_model_evaluation_store("My Mes Name")
```

Note that the display name of a Model Evaluation Store (in the above sample *My Mes Name*) is distinct from its unique id.

### Retrieve a Model Evaluation Store

```python
# client is a DSS API client
p = client.get_project("MYPROJECT")
# "mes_id" stands for the unique id of the store
mes = p.get_model_evaluation_store("mes_id")
```

### List Model Evaluation Stores

```python
# client is a DSS API client
p = client.get_project("MYPROJECT")
stores = p.list_model_evaluation_stores(as_type="objects")
```

### Create an Evaluation Recipe

See `dataikuapi.dss.recipe.EvaluationRecipeCreator`

### Build a Model Evaluation Store and retrieve the performance and data drift metrics of the just computed ME

```python
# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
# building the store runs the evaluation and writes a new Model Evaluation
mes.build()
me = mes.get_latest_model_evaluation()
full_info = me.get_full_info()
metrics = full_info.metrics
```

### List Model Evaluations from a store

```python
# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
me_list = mes.list_model_evaluations()
```

### Retrieve an array of creation date / accuracy from a store

```python
# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
me_list = mes.list_model_evaluations()
res = []
for me in me_list:
    full_info = me.get_full_info()
    creation_date = full_info.creation_date
    accuracy = full_info.metrics["accuracy"]
    res.append([creation_date, accuracy])
```
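The loop above raises a `KeyError` as soon as one evaluation lacks the requested metric, for example when a store mixes prediction types. A slightly more defensive variant is sketched below. `metric_history` is a hypothetical helper, and it assumes that `metrics` behaves like a dict and that creation dates are comparable. The `fake_me` stand-ins only mimic the two attributes used here so that the sketch runs without a DSS instance.

```python
from types import SimpleNamespace


def metric_history(model_evaluations, metric_name):
    """Return (creation_date, value) pairs sorted by date, skipping missing metrics."""
    history = []
    for me in model_evaluations:
        full_info = me.get_full_info()
        value = full_info.metrics.get(metric_name)
        if value is None:
            continue  # this evaluation does not expose the metric
        history.append((full_info.creation_date, value))
    history.sort(key=lambda pair: pair[0])
    return history


def fake_me(creation_date, metrics):
    """Stand-in for a model evaluation handle (illustration only, not a DSS class)."""
    info = SimpleNamespace(creation_date=creation_date, metrics=metrics)
    return SimpleNamespace(get_full_info=lambda: info)


evaluations = [
    fake_me(1700000200000, {"accuracy": 0.91}),
    fake_me(1700000100000, {"accuracy": 0.93}),
    fake_me(1700000300000, {"precision": 0.80}),  # no accuracy: skipped
]
history = metric_history(evaluations, "accuracy")
print(history)
```

With a real store, the list returned by `mes.list_model_evaluations()` would take the place of `evaluations`.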

### Retrieve an array of label value / precision from a store

The date of creation of a model evaluation might not be the best way to key a metric. In some cases, it might be more interesting to use the labeling system, for instance to tag the version of the evaluation dataset.

If the user created a label “myCustomLabel:evaluationDataset”, they can retrieve an array of label value / precision pairs from a store with the following snippet:

```python
# client is a DSS API client
p = client.get_project("MYPROJECT")
mes = p.get_model_evaluation_store("M3s_1d")
me_list = mes.list_model_evaluations()
res = []
for me in me_list:
    full_info = me.get_full_info()
    # first label entry whose key matches (raises StopIteration if none does)
    label_value = next(x for x in full_info.user_meta["labels"] if x["key"] == "myCustomLabel:evaluationDataset")
    precision = full_info.metrics["precision"]
    res.append([label_value, precision])
```
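Calling `next()` on a generator without a default, as in the snippet above, raises `StopIteration` for any evaluation that does not carry the label. The sketch below shows one way to make the lookup tolerant. `find_label` is a hypothetical helper, and the `"value"` field in the sample data is an assumption about the shape of a label entry; only the `"key"` field appears in the snippet above.

```python
def find_label(labels, key, default=None):
    """Return the first label entry whose "key" matches, or `default` if none does."""
    return next((label for label in labels if label.get("key") == key), default)


# Sample label entries (hypothetical content, for illustration only).
labels = [
    {"key": "myCustomLabel:evaluationDataset", "value": "2023-Q4"},
    {"key": "myCustomLabel:owner", "value": "team-a"},
]

found = find_label(labels, "myCustomLabel:evaluationDataset")
missing = find_label(labels, "myCustomLabel:unknown")
print(found)
print(missing)
```

Inside the loop, an evaluation for which `find_label` returns `None` can then be skipped instead of aborting the whole listing.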

### Compute data drift of the evaluation dataset of a Model Evaluation with the train time test dataset of its base DSS model version

```python
# me1 is a model evaluation; using its base SMV as the reference is implicit
drift = me1.compute_data_drift()

drift_model_result = drift.drift_model_result
drift_model_accuracy = drift_model_result.drift_model_accuracy
print("Value: {} < {} < {}".format(drift_model_accuracy.lower_confidence_interval,
                                   drift_model_accuracy.value,
                                   drift_model_accuracy.upper_confidence_interval))
print("p-value: {}".format(drift_model_accuracy.pvalue))
```
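The snippet above prints a confidence interval around the drift model accuracy, but this page does not say how DSS derives it. For intuition only, the sketch below computes a Wilson score interval, one standard way to bound a proportion, and checks whether the chance level of 0.5 lies inside it. `wilson_interval` and the 95% confidence level are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(correct, total, confidence=0.95):
    """Wilson score interval for the proportion correct / total."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    p_hat = correct / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2.0 * total)) / denom
    half_width = (
        z * sqrt(p_hat * (1.0 - p_hat) / total + z * z / (4.0 * total * total)) / denom
    )
    return center - half_width, center + half_width


for correct in (500, 600):
    lower, upper = wilson_interval(correct, 1000)
    # If 0.5 falls inside the interval, the drift model is not clearly better than chance.
    print(correct, round(lower, 3), round(upper, 3), "chance inside:", lower <= 0.5 <= upper)
```

An interval that sits entirely above 0.5 suggests the drift model separates the two samples better than chance would.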

### Compute data drift, display results and adjust parameters

```python
from dataikuapi.dss.modelevaluationstore import DataDriftParams, PerColumnDriftParamBuilder

# me1 and me2 are two compatible model evaluations (having the same prediction type) from any store
drift = me1.compute_data_drift(me2)

drift_model_result = drift.drift_model_result
drift_model_accuracy = drift_model_result.drift_model_accuracy
print("Value: {} < {} < {}".format(drift_model_accuracy.lower_confidence_interval,
                                   drift_model_accuracy.value,
                                   drift_model_accuracy.upper_confidence_interval))
print("p-value: {}".format(drift_model_accuracy.pvalue))

# check sample sizes
print("Reference sample size: {}".format(drift_model_result.get_raw()["referenceSampleSize"]))
print("Current sample size: {}".format(drift_model_result.get_raw()["currentSampleSize"]))

# check columns handling
per_col_settings = drift.per_column_settings
for col_settings in per_col_settings:
    print("col {} - default handling {} - actual handling {}".format(col_settings.name, col_settings.default_column_handling, col_settings.actual_column_handling))

# recompute, with Pclass set as CATEGORICAL
drift = me1.compute_data_drift(me2,
                               DataDriftParams.from_params(
                                   PerColumnDriftParamBuilder().with_column_drift_param("Pclass", "CATEGORICAL", True).build()
                               ))
...
```

## Reference documentation

There are two main parts related to the handling of model evaluation stores in Dataiku’s Python APIs:

* `dataiku.core.model_evaluation_store.ModelEvaluationStore` and `dataiku.core.model_evaluation_store.ModelEvaluation` in the `dataiku` package. They were initially designed for usage within DSS.

* `dataikuapi.dss.modelevaluationstore.DSSModelEvaluationStore` and `dataikuapi.dss.modelevaluationstore.DSSModelEvaluation` in the `dataikuapi` package. They were initially designed for usage outside of DSS.

Both sets of classes have fairly similar capabilities.

For more details on the two packages, please see Concepts and examples.

### dataiku package API

### dataikuapi package API

| Class | Description |
| --- | --- |
| `dataikuapi.dss.modelevaluationstore.DSSModelEvaluationStore`(...) | A handle to interact with a model evaluation store on the DSS instance. |
| `dataikuapi.dss.modelevaluationstore.DSSModelEvaluationStoreSettings`(...) | A handle on the settings of a model evaluation store. |
| `dataikuapi.dss.modelevaluationstore.DSSModelEvaluation`(...) | A handle on a model evaluation. |
| `dataikuapi.dss.modelevaluationstore.DSSModelEvaluationFullInfo`(...) | A handle on the full information on a model evaluation. |
| `dataikuapi.dss.modelevaluationstore.DataDriftParams`(data) | Object that represents parameters for data drift computation. |
| `dataikuapi.dss.modelevaluationstore.PerColumnDriftParamBuilder`() | Builder for a map of per column drift params settings. |
| `dataikuapi.dss.modelevaluationstore.DataDriftResult`(data) | A handle on the data drift result of a model evaluation. |
| `dataikuapi.dss.modelevaluationstore.DriftModelResult`(data) | A handle on the drift model result. |
| `dataikuapi.dss.modelevaluationstore.UnivariateDriftResult`(data) | A handle on the univariate data drift. |
| `dataikuapi.dss.modelevaluationstore.ColumnSettings`(data) | A handle on column handling information. |
| `dataikuapi.dss.modelevaluationstore.DriftModelAccuracy`(data) | A handle on the drift model accuracy. |
