## Introduction

Supervised machine-learning models derive their predictive power from exploiting statistical patterns in the space of the predictor and target variables. Models operate under the implicit assumption that the patterns found during training also exist in the prediction data, or (in more technical terms) that the training and prediction data are drawn from the same distribution.

In many situations that assumption is only approximately true, and often just for a limited time. It is likely that the data will drift gradually, thereby changing the patterns that were originally learned by the model. Eventually this invalidates the model, and it has to be retrained on fresh data.

Detecting such data drift and alerting that a model is operating under outdated assumptions is an important aspect of using machine learning models.

For instance, a risk-assessment model for mortgage loan applicants involves quantities like the applicant's income and home value. Such quantities are affected by mechanisms like inflation and asset-price boom-and-bust cycles. These mechanisms modify the underlying patterns in the data, rendering the conditions under which the model was trained increasingly outdated and hence diminishing its usefulness.

There is a trade-off between doing nothing (and risking lost money or credibility) and re-training too often (which can also be costly). Ideally, we want to optimize this balance.

Monitoring statistical properties of the data is therefore important. In addition, the way the model responds to the new data is also crucial - after all, some data changes might happen within the tolerance of the model, whereas others could completely alter the model's predictions. Combining the pure changes in data distributions with the changes in the way the model reacts to the new data, we can set some criteria that signal the need for re-training the model.

The ml_metadata and ml_drift modules of the spotfire-dsml Python package will help you do that. These Python modules support classification and regression models. You can choose to rely on knowing the ground-truth (labels) of the new data (explicit drift detectors), or to handle the situation in which the new labels are simply not yet available, and build a set of 'clues' that are distilled into a small number of drift measures (implicit drift detectors). The generated drift measures can be tracked in time, to gain an appreciation of when drift becomes substantial and sustained. In the description that follows, we will concentrate on building implicit drift detectors, assuming that, because of the data volume or type of task, we cannot rely on getting hold of the new labels in a timely manner.

We designed the blueprint of the workflow this way: we initially establish a *baseline*, so that every time a new dataset comes our way, we can compare the new data and the model's predictions on the new data to the baseline. We do this at set points in time, to obtain a view of what is happening over a period of time.

## Metadata

We assume that the whole training dataset might not always be available. Maybe it is very big, maybe there are security concerns. We therefore distill information into what we call *metadata*. Metadata is made of statistical information about predictor variables, plus information about how the model acts on the data at the initial stage.

The metadata is generated using the ml_metadata module. It includes, for instance, a binned version of each numeric or date-time predictor, the frequency distribution of each categorical variable, and the pairwise linear correlation of numeric variables. Additionally, it can store the initial predictions of the model and the initial variable importance (calculated as the sensitivity to permutations of the values of each predictor variable in turn).

The resulting metadata objects (one for data-specific and one for model-specific metadata) are returned as JSON strings.
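To make the idea concrete, here is a minimal sketch of the kind of data-specific metadata described above, built with pandas and NumPy. The function name, column names and JSON layout are purely illustrative; the actual ml_metadata module computes its own, richer representation.

```
import json
import numpy as np
import pandas as pd

def sketch_data_metadata(df: pd.DataFrame, n_bins: int = 10) -> str:
    """Distill a dataset into a JSON string of statistical metadata (illustrative)."""
    meta = {"bins": {}, "frequencies": {}, "correlations": {}}
    numeric = df.select_dtypes(include=np.number)
    # A binned version of each numeric predictor: bin edges plus counts.
    for col in numeric.columns:
        counts, edges = np.histogram(numeric[col].dropna(), bins=n_bins)
        meta["bins"][col] = {"edges": edges.tolist(), "counts": counts.tolist()}
    # The frequency distribution of each categorical variable.
    for col in df.select_dtypes(exclude=np.number).columns:
        meta["frequencies"][col] = df[col].value_counts(normalize=True).to_dict()
    # The pairwise linear correlation of numeric variables.
    meta["correlations"] = numeric.corr().round(4).to_dict()
    return json.dumps(meta)

# Example with a tiny synthetic dataset (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "income": rng.normal(50_000, 10_000, 200),
    "home_value": rng.normal(300_000, 50_000, 200),
    "region": rng.choice(["north", "south"], 200),
})
baseline_json = sketch_data_metadata(df)
```

Distilling the data into summaries like these is what allows the baseline comparison to work later without keeping the full training dataset around.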

## Drift Measures

Once the baseline metadata is established, we look for ways to quantify drift in the data and model response. We can do this at regular intervals, when we have a new batch of data. Since we are examining statistical properties such as distributions, this new batch must be of a reasonable size, so that this kind of statistical analysis makes sense.

We selected a number of metrics that are not too sensitive to data size. Also, in order to minimize false positives, we do not want to declare data drift every time a distribution changes a little. In some cases, we have more than one metric that we can combine into a single drift measure. In other cases, we want to keep the metrics separate, as they work on different scales. The metrics we use also depend on the data structures we are comparing. For instance, we use metrics such as the Cosine and Cramér's V distances when comparing distributions, or a functional mapping from direct value change to a 0-1 scale using logistic curves when analysing e.g. Variable Importance tables.
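As an illustration of the kind of metrics involved, the sketch below computes a cosine distance between two binned distributions and a logistic mapping from a raw change onto a 0-1 scale. The helper names and parameter values are illustrative only, not the ml_drift module's actual API.

```
import numpy as np

def cosine_distance(p, q):
    """Cosine distance between two binned frequency vectors (0 = identical shape)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 1.0 - np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))

def logistic_drift(delta, midpoint=0.5, steepness=10.0):
    """Map a raw value change (e.g. in a variable-importance score) onto 0-1.
    Small changes stay near 0; changes past the midpoint saturate towards 1."""
    return 1.0 / (1.0 + np.exp(-steepness * (abs(delta) - midpoint)))

baseline_counts = [40, 30, 20, 10]   # bin counts at baseline
new_counts = [10, 20, 30, 40]        # same bins for a drifted batch
drift = cosine_distance(baseline_counts, new_counts)
```

The logistic mapping is one way to avoid over-reacting to small fluctuations: the curve stays flat near zero change, so minor wobbles in a variable-importance score barely register.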

These quantities are calculated in the ml_drift module. The drift calculation for each new dataset takes as input the baseline metadata, the model and the new dataset, and returns a table with the calculated drift measures.

## Code examples

A basic call to calculate the metadata for the initial dataset looks like this:

```
from spotfire_dsml.ml_metadata import metadata_data as mdd

# df: the initial dataset; target: the name of the target variable;
# parms: additional optional parameters directing how the metadata is calculated
data_json, *outputs = mdd.calculate_data_metadata(df, target, *parms)
```

where the first result is the JSON string containing all the metadata information. Optionally, a number of output data frames (representing the various types of metadata that were calculated) can be returned for display purposes. The input parameters include the initial dataset, the name of the target variable and additional optional parameters to direct how the metadata should be calculated.

Similarly, the metadata describing the results of applying the model to the initial dataset (df) can be calculated this way - in this example we have a classification model:

```
from spotfire_dsml.ml_metadata import metadata_model as mdm

# model: the trained classification model; df and target as before
model_json, *outputs = mdm.calculate_classification_model_metadata(model, df, target, *parms)
```

Every time a new dataset (new_df) is available, the drift metrics can be evaluated using:

```
from spotfire_dsml.ml_drift import drift_data as dd

# Compare the new batch (new_df) against the baseline metadata and model
drift_metrics = dd.calculate_drift_entry_point(data_json, model_json, model, new_df, target, *parms)
```

## Application Example

In this example we have trained three different predictive model pipelines on the Bank Churners dataset to perform binary classification. The model pipelines (using Logistic Regression, Random Forest and eXtreme Gradient Boosting) were trained via the ml_modeling module in spotfire-dsml. This module is dedicated to training and evaluating pipeline models. For details, see the Community Article 'Training Machine Learning Models with DSML Toolkit for Python'.

The resulting Python objects were saved to the file system and then simply reloaded into the drift-detection Spotfire DXP.

We then generated 24 additional simulated datasets. Each of these new datasets has some distortion, designed to mimic realistic changes occurring at 24 fictitious time steps. Selecting each predictive model in turn, the drift detector was applied to these 24 datasets to produce drift profiles over time.

In the simulated datasets, some changes in the data were introduced slowly. Other changes were abrupt one-off shifts that could, for instance, be attributed to changes in data-collection practices. Throughout the data simulation, predictors expressing total amounts keep slowly changing because of inflation. The total number of times a customer contacted support also drifts gently, as the bank hires more reps and more calls get through. At step 16 we introduced a more sudden change: variables that represent counts taken over a period of time now count over a period of 18 months rather than 12, resulting in a jump in values for these variables (for instance, the total count of customer contacts).
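A distortion scheme of this kind can be sketched roughly as follows. The column names (`total_amount`, `contact_count`) and rates are stand-ins for illustration, not the actual Bank Churners schema or the simulation used in this example.

```
import numpy as np
import pandas as pd

def simulate_batch(baseline: pd.DataFrame, step: int,
                   inflation_rate: float = 0.01, jump_step: int = 16) -> pd.DataFrame:
    """Distort a baseline dataset to mimic drift at a given time step (illustrative)."""
    batch = baseline.copy()
    # Monetary amounts grow slowly with inflation at every step.
    batch["total_amount"] *= (1 + inflation_rate) ** step
    # Contact counts drift gently as more calls get through...
    batch["contact_count"] = batch["contact_count"] + 0.1 * step
    # ...then jump when the counting window changes from 12 to 18 months.
    if step >= jump_step:
        batch["contact_count"] *= 18 / 12
    return batch

# Hypothetical baseline and 24 simulated time steps.
rng = np.random.default_rng(1)
baseline = pd.DataFrame({
    "total_amount": rng.normal(1000, 100, 500),
    "contact_count": rng.poisson(3, 500).astype(float),
})
batches = [simulate_batch(baseline, t) for t in range(1, 25)]
```

Feeding each batch through the drift calculation in turn is what produces the drift profiles over time shown in the figures below.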

The following figures show some of the results for drift detection. In Figures 1 and 2, we show a more detailed plot (the bar chart) alongside a line chart displaying the corresponding drift measure as a function of time (from 0, the original dataset, to 24). Time grows from top left to bottom right in the trellised bar charts. Both the slow drift and the more sudden drift (around time 16) are detected in the measures.

Figure 1. Pure data drift. Individual variable drift in time (left) and corresponding total variable drift measure (right).

In Figure 1, each bar represents a variable and its height is the corresponding variable drift, purely due to changes in the data distributions for individual columns. The line chart on the right shows the variable drift measure, summed over all variables at each step.

Figure 2. Model prediction drift. Model drift measure (left) and corresponding change in prediction distribution shape (right) for Random Forest pipeline.

In Figure 2, we are now looking at how the model's predictions change when the model is applied to the simulated datasets. The distributions on the right are the probabilities of customer churn for each time step. In the case of this particular Random Forest model, the shape becomes markedly flatter around time 16, reflected in a jump in the corresponding drift measure. Figure 3 below shows the same distribution drift profile for a different model (Logistic Regression). In this case, although the expanded scale on the right reveals an increase in the drift measure around step 16, the changes in the data clearly have a much smaller effect on the model response than they do for Random Forest.

Figure 3. Model prediction drift. Model drift measure (left) on the scale of Figure 2, and same data on an expanded vertical scale (right) for Logistic Regression.

Based on our simple experiment, it is interesting to note that although Random Forest initially performed better than Logistic Regression (as can easily be seen in the initial metadata), the latter appears more robust to data change in this specific application example. When we also consider changes in variable importance (not shown here), measured as the sensitivity of the model's initial predictions to re-shuffling each variable in turn, it becomes apparent that the Random Forest model grows increasingly sensitive to the values of one of the variables (the total count of customer contacts), much more so than the other two models.

## Takeaway

There is no precise threshold determining if and when to retrain. It depends on the number and relevance of predictors, on the specific model, and on collecting the evidence over time. There are a number of indicators that may alert us to diminished predictive power. These may come purely from the data, signalling when the predictor space shifts towards new patterns, away from the initial blueprint. They can also come from looking at the response of the model when applied to new data. Model monitoring with the purpose of finding the sweet spot for re-training a model is different from anomaly detection, in that we are not looking for a small percentage of short-lived fluctuations (or data anomalies) but for the onset of a 'new normal', when changes are marked and sustained. It is in the latter case that we can presume that the model no longer detects the typical data patterns.
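The distinction between a short-lived fluctuation and a sustained 'new normal' can be made operational with a simple rule, for example requiring the drift measure to stay above a threshold for several consecutive time steps. This is a sketch of one possible criterion, not part of the spotfire-dsml API; the threshold and window values are placeholders to be tuned per application.

```
import numpy as np

def sustained_drift(measures, threshold=0.2, window=3):
    """Return the first time step at which the drift measure has exceeded
    `threshold` for `window` consecutive steps, or None if it never does.
    A lone spike resets the count: we flag a 'new normal', not a blip."""
    above = np.asarray(measures) > threshold
    run = 0
    for t, flag in enumerate(above):
        run = run + 1 if flag else 0
        if run >= window:
            return t
    return None

# A one-off spike (step 3) is ignored; the shift starting at step 8
# triggers an alert once it has persisted for `window` steps.
series = [0.02, 0.03, 0.02, 0.45, 0.03, 0.04, 0.05, 0.06, 0.30, 0.35, 0.40, 0.38]
alert_step = sustained_drift(series)
```

Tracking the drift measures over time, as in the figures above, is what makes such a rule meaningful: a single out-of-range batch carries far less evidence than a run of them.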

The different reactions of different models to the same data drift are not something we can generalize from a single test, but it is a sobering thought that the initial performance of a model is not all that counts: we may want to add to our model-selection arsenal a way of performing initial experiments to decide how robust each of the candidate models is to data drift.

More details about the *spotfire-dsml* Python Toolkit can be found in the Community article 'Python toolkit for data science and machine learning in Spotfire'. Example Spotfire applications can be downloaded from the Exchange page 'DSML Toolkit for Python - Documentation and Spotfire® Examples'.
