# Experiment Tracking with the PythonModel module

The MLflow library provides many functions to log/save and load different flavors of ML models. For example, to log a scikit-learn model, you can call the `mlflow.sklearn.log_model()` function with your fitted model.

To log more exotic ML libraries or custom models, MLflow lets you wrap them in a Python class that inherits from the `mlflow.pyfunc.PythonModel` class.

This wrapper is particularly convenient when a model consisting of multiple frameworks needs to be logged as a single MLflow-compliant object. The most common use-case for this is when one needs a given framework to pre-process the data and another one for the ML algorithm itself. In DSS, models logged to and subsequently deployed from the Experiment Tracking interface need to be single objects capable of handling both the data pre-processing and scoring part.

In this tutorial, you will build an example that wraps an XGBoost classifier with a scikit-learn preprocessing layer and saves them in the MLflow format, ready to be visualized in the Experiment Tracking interface and ultimately deployed. This tutorial is based on an example provided in the MLflow documentation.

Pre-requisites

* A Python code environment containing the following libraries (see the supported versions in the documentation):

  + mlflow

  + scikit-learn

  + xgboost

* A DSS project with a managed folder in it.

## Wrapping an XGBoost classifier alongside a scikit-learn pre-processing layer

The following class inherits from `mlflow.pyfunc.PythonModel` and defines two key methods:

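§ # Standalone import for this snippet; in DSS the handle comes from project.setup_mlflow()

§ import mlflow

§ class XGBWrapper(mlflow.pyfunc.PythonModel):

§     def load_context(self, context):

§         # Runs once at load time: deserialize the artifacts needed by predict()

§         from cloudpickle import load

§         with open(context.artifacts["xgb_model"], 'rb') as f:

§             self.model = load(f)

§         with open(context.artifacts["preprocessor"], 'rb') as f:

§             self.preprocessor = load(f)

§     def predict(self, context, model_input):

§         # Select the feature columns by name, preprocess them, then score

§         model_input = model_input[['sepal length (cm)', 'sepal width (cm)',

§                                    'petal length (cm)', 'petal width (cm)']]

§         model_input = self.preprocessor.transform(model_input)

§         return self.model.predict_proba(model_input)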

The `load_context()` method runs at load time and lets you load all the artifacts needed by the `predict()` method. The `context` parameter is a `PythonModelContext` instance that is created implicitly when the model is logged or saved. This parameter contains an `artifacts` dictionary whose values are paths to the serialized objects. For example:

§ artifacts = {

§     "xgb_model": "path/to/xgb_model.pkl",

§     "preprocessor": "path/to/preprocessor.pkl"

§ }

Here, `xgb_model.pkl` and `preprocessor.pkl` would be a fitted XGBoost model and a fitted scikit-learn preprocessor, serialized with `cloudpickle` and saved to a local directory.
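To make the mechanism concrete, the following sketch mimics what happens around `load_context()`: objects are serialized to disk, an `artifacts` dictionary records their paths, and the paths are read back at load time. It is only an illustration under stated assumptions: the standard-library `pickle` stands in for `cloudpickle`, plain dictionaries stand in for the fitted model and preprocessor, and `SimpleNamespace` is a hypothetical stand-in for the `PythonModelContext` that MLflow builds for you. The helper `load_artifact()` is also made up for this example.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Stand-ins for the fitted objects (assumption: real code would use a fitted
# XGBoost model and a fitted scikit-learn preprocessor).
fake_model = {"kind": "model", "max_depth": 5}
fake_preprocessor = {"kind": "preprocessor", "strategy": "median"}

# Serialize both objects to a temporary directory and record their paths.
workdir = tempfile.mkdtemp()
artifacts = {
    "xgb_model": os.path.join(workdir, "xgb_model.pkl"),
    "preprocessor": os.path.join(workdir, "preprocessor.pkl"),
}
for key, obj in (("xgb_model", fake_model), ("preprocessor", fake_preprocessor)):
    with open(artifacts[key], "wb") as f:
        pickle.dump(obj, f)

# Hypothetical stand-in for the context object: it only exposes `artifacts`.
context = SimpleNamespace(artifacts=artifacts)


def load_artifact(context, key):
    """Read back one serialized object from the path stored in the context."""
    with open(context.artifacts[key], "rb") as f:
        return pickle.load(f)


restored_model = load_artifact(context, "xgb_model")
restored_preprocessor = load_artifact(context, "preprocessor")
```

The keys of the dictionary are the only link between the code that saves the objects and the code that reloads them, so they must match exactly on both sides.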

The `predict()` method of the `XGBWrapper` class scores whatever data is passed through the `model_input` parameter, which is expected to be a `pandas.DataFrame`. For a classification problem, we recommend that `predict()` return the output of the model’s `predict_proba()` so that the class probabilities are available along with the class prediction. An added benefit of returning `predict_proba()` is that all the classifier insights can be visualized once the model is deployed in the Flow.
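As a rough illustration of why class probabilities are enough to recover a class prediction, the sketch below takes a hypothetical probability matrix (one row per sample, one column per class, in the same order as the class list) and picks the most probable class for each row. The numbers and the helper `to_labels()` are made up for the example; they are not produced by the model in this tutorial.

```python
# Hypothetical class list, in the same order as the probability columns.
classes = ["setosa", "versicolor", "virginica"]

# Hypothetical output of predict_proba(): each row sums to 1.
probas = [
    [0.90, 0.07, 0.03],
    [0.10, 0.60, 0.30],
    [0.05, 0.25, 0.70],
]


def to_labels(probas, classes):
    """Return the most probable class for each row of probabilities."""
    labels = []
    for row in probas:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(classes[best_index])
    return labels


predicted = to_labels(probas, classes)
```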

Note that the `predict()` method can also take the same `context` parameter as the `load_context()` method. However, it is more efficient to load artifacts only once, in `load_context()`.
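The benefit of loading once can be sketched with a small stand-in class that counts how many times its artifact is loaded. `CountingWrapper` is hypothetical and does not inherit from `mlflow.pyfunc.PythonModel`; it only reproduces the two method names to show the call pattern, and the "model" is a placeholder function.

```python
class CountingWrapper:
    """Hypothetical stand-in for a PythonModel subclass; it counts artifact loads."""

    def __init__(self):
        self.load_count = 0
        self.model = None

    def _load_model(self):
        # Stand-in for an expensive deserialization from disk.
        self.load_count += 1
        return lambda rows: [len(row) for row in rows]

    def load_context(self, context):
        # Called once, when the model is loaded.
        self.model = self._load_model()

    def predict(self, context, model_input):
        # Reuses the object cached by load_context(); nothing is reloaded here.
        return self.model(model_input)


wrapper = CountingWrapper()
wrapper.load_context(context=None)
outputs = [wrapper.predict(None, [[1, 2], [3, 4, 5]]) for _ in range(3)]
```

After three calls to `predict()`, `wrapper.load_count` is still 1. Had the artifact been loaded inside `predict()`, the count would grow with every scoring call.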

## Full example

In this example, you will be using the Iris dataset available through the scikit-learn datasets module.

### 1. Preparing the experiment

Start by setting the following variables, handles and function for the experiment:

§ import dataiku

§ from datetime import datetime

§ # Replace these constants with your own values

§ PREDICTION_TYPE = "MULTICLASS"

§ EXPERIMENT_FOLDER_ID = ""          # Replace with your Managed Folder id

§ EXPERIMENT_NAME = ""               # Replace with your experiment name

§ MLFLOW_CODE_ENV_NAME = ""          # Replace with your code environment name

§ def now_str():

§     # Timestamp suffix used to give each experiment a unique name

§     return datetime.now().strftime("%Y%m%d%H%M%S")

§ # Get the current project handle

§ client = dataiku.api_client()

§ project = client.get_default_project()

§ # Create a mlflow_extension object to easily log information about our models

§ mlflow_extension = project.get_mlflow_extension()

§ # Get a handle on a Managed Folder to store the experiments.

§ mf = project.get_managed_folder(EXPERIMENT_FOLDER_ID)

### 2. Loading the data

If you have not already done so, create a Python notebook in your DSS project and set its kernel to the code environment listed in the pre-requisites above.

Run the following code:

§ import dataiku

§ import pandas as pd

§ from sklearn import datasets

§ from sklearn.model_selection import train_test_split

§ # Load the Iris dataset into a DataFrame with named feature columns

§ iris = datasets.load_iris()

§ features = iris.feature_names

§ target = 'species'

§ df = pd.DataFrame(iris.data, columns=features)

§ # Map the integer targets to their species names

§ mapping = {k: v for k, v in enumerate(iris.target_names)}

§ df[target] = [mapping.get(val) for val in iris.target]

§ # Hold out 20% of the rows for testing

§ df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

§ X_train = df_train.drop(target, axis=1)

§ y_train = df_train[target]

§ X_test = df_test.drop(target, axis=1)

§ y_test = df_test[target]

### 3. Preprocessing the data

In this step:

* Specify the target and the features.

* Set up a scikit-learn Pipeline to impute potential missing values and rescale continuous variables.

§ from sklearn.pipeline import Pipeline

§ from sklearn.impute import SimpleImputer

§ from sklearn.preprocessing import StandardScaler

§ from cloudpickle import dump, load

§ preprocessor = Pipeline([

§     ('imp', SimpleImputer(strategy='median')),

§     ('sts', StandardScaler()),

§ ])

§ # Fit on the training set only, then apply the same transformation to the test set

§ X_train = preprocessor.fit_transform(X_train)

§ X_test = preprocessor.transform(X_test)

§ artifacts = {

§     "xgb_model": "xgb_model.pkl",

§     "preprocessor": "preprocessor.pkl"

§ }

§ # pickle and save the preprocessor

§ with open(artifacts.get("preprocessor"), 'wb') as f:

§     dump(preprocessor, f)
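To give an intuition of what the two pipeline steps compute, here is a minimal standard-library sketch applied to a single hypothetical feature column. It is not how scikit-learn is implemented; it only reproduces the arithmetic of median imputation followed by standardization (subtract the mean, divide by the population standard deviation). The values in `column` are invented for the example.

```python
import statistics

# Hypothetical feature column with one missing value (None).
column = [5.1, 4.9, None, 6.3, 5.8]

# Step 1: median imputation, the arithmetic behind SimpleImputer(strategy='median').
observed = [v for v in column if v is not None]
median = statistics.median(observed)
imputed = [median if v is None else v for v in column]

# Step 2: standardization, the arithmetic behind StandardScaler:
# subtract the mean and divide by the population standard deviation.
mean = statistics.fmean(imputed)
std = statistics.pstdev(imputed)
scaled = [(v - mean) / std for v in imputed]
```

In the tutorial's pipeline, the median, mean and standard deviation are learned from the training set by `fit_transform()` and then reused as-is on the test set by `transform()`.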

### 4. Training and logging the model

Finally, train the XGBoost classifier. Log the hyperparameters, performance metrics and the classifier itself into your Experiment run.

§ import xgboost as xgb

§ from sklearn.metrics import precision_score

§ hparams = {

§     "max_depth": 5,

§     "n_estimators": 50

§ }

§ with project.setup_mlflow(mf) as mlflow:

§     experiment_id = mlflow.create_experiment(f'{EXPERIMENT_NAME}_{now_str()}')

§     class XGBWrapper(mlflow.pyfunc.PythonModel):

§         def load_context(self, context):

§             from cloudpickle import load

§             with open(context.artifacts["xgb_model"], 'rb') as f:

§                 self.model = load(f)

§             with open(context.artifacts["preprocessor"], 'rb') as f:

§                 self.preprocessor = load(f)

§         def predict(self, context, model_input):

§             model_input = model_input[['sepal length (cm)', 'sepal width (cm)',

§                                        'petal length (cm)', 'petal width (cm)']]

§             model_input = self.preprocessor.transform(model_input)

§             return self.model.predict_proba(model_input)

§     with mlflow.start_run(experiment_id=experiment_id) as run:

§         print(f'Starting run {run.info.run_id} ...\n{hparams}')

§         model = xgb.XGBClassifier(**hparams)

§         model.fit(X_train, y_train)

§         # pickle and save the model

§         with open(artifacts.get("xgb_model"), 'wb') as f:

§             dump(model, f)

§         # Compute one precision value per class on the test set

§         preds = model.predict(X_test)

§         precision = precision_score(y_test, preds, average=None)

§         run_metrics = {f'precision_{k}': v for k, v in zip(model.classes_, precision)}

§         # Save the MLflow Model, hyper params and metrics

§         mlflow_pyfunc_model_path = f"xgb_mlflow_pyfunc-{run.info.run_id}"

§         mlflow.pyfunc.log_model(

§             artifact_path=mlflow_pyfunc_model_path, python_model=XGBWrapper(),

§             artifacts=artifacts)

§         mlflow.log_params(hparams)

§         mlflow.log_metrics(run_metrics)

§         mlflow_extension.set_run_inference_info(run_id=run.info.run_id,

§                                                 prediction_type=PREDICTION_TYPE,

§                                                 classes=list(model.classes_),

§                                                 code_env_name=MLFLOW_CODE_ENV_NAME)

§         print(f'Run {run.info.run_id} done\n{"-"*40}')
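The run logs one precision metric per class, named `precision_<class>`. The sketch below shows the arithmetic behind such a dictionary on a handful of made-up labels, without scikit-learn: for each class, precision is the number of correct predictions of that class divided by the number of times that class was predicted. The function `per_class_precision()` and the sample labels are hypothetical, and returning 0.0 for a class that is never predicted is a convention chosen for this sketch.

```python
def per_class_precision(y_true, y_pred, classes):
    """Precision per class: true positives / all predictions of that class."""
    scores = {}
    for cls in classes:
        # True labels of the samples that were predicted as `cls`.
        predicted_as_cls = [t for t, p in zip(y_true, y_pred) if p == cls]
        true_positives = sum(1 for t in predicted_as_cls if t == cls)
        # Assumption of this sketch: 0.0 when the class is never predicted.
        scores[f"precision_{cls}"] = (
            true_positives / len(predicted_as_cls) if predicted_as_cls else 0.0
        )
    return scores


# Made-up labels for illustration only.
y_true = ["setosa", "setosa", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "virginica", "virginica", "versicolor"]

example_metrics = per_class_precision(
    y_true, y_pred, ["setosa", "versicolor", "virginica"]
)
```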

You’re done! You can now go to the Experiment Tracking interface to check the performance of your model. You can also deploy the model, either from that interface or using the Dataiku API. In both cases, you will need an evaluation dataset that has the same schema as your training dataset (although the order of the columns does not matter).
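One plausible reason the column order is not a concern with this tutorial's wrapper is that `predict()` selects the feature columns by name before preprocessing. The sketch below illustrates the idea with plain dictionaries standing in for DataFrame rows; `select_features()` and the sample rows are hypothetical.

```python
# Feature names, in the order used by the wrapper's predict() method.
FEATURES = [
    "sepal length (cm)", "sepal width (cm)",
    "petal length (cm)", "petal width (cm)",
]


def select_features(row, features=FEATURES):
    """Return the feature values in the order the model expects."""
    return [row[name] for name in features]


# Two made-up rows with the same values but different column orders.
row_a = {
    "sepal length (cm)": 5.1, "sepal width (cm)": 3.5,
    "petal length (cm)": 1.4, "petal width (cm)": 0.2,
    "species": "setosa",
}
row_b = {
    "species": "setosa",
    "petal width (cm)": 0.2, "petal length (cm)": 1.4,
    "sepal width (cm)": 3.5, "sepal length (cm)": 5.1,
}

ordered_a = select_features(row_a)
ordered_b = select_features(row_b)
```

Both rows yield the same ordered feature vector, and the extra `species` column is simply ignored by the selection.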

## Conclusion

In this tutorial, you learned how to wrap two frameworks in a single MLflow-compliant object and log it along with key metadata. Having to combine multiple frameworks is common in many areas of ML. For example:

* In Natural Language Processing (NLP), one may want to pre-process the data using a pre-trained embedding layer (say `spaCy`) before scoring it using a `PyTorch` classifier.

* In the field of Computer Vision, the `Pillow` library can be used to decode image bytes from base64 encoding before being scored using `Keras` or some other library.
