## Introduction

With the advent of Python Data Functions in Spotfire the door has been opened for training machine learning models using the scikit-learn Python ecosystem in a Spotfire application.

While scikit-learn, together with other packages such as pandas, offers all the functionality needed to train machine learning models, it is helpful to have a set of functions that align with higher-level *logical* tasks rather than the lower-level *technical* tasks addressed by scikit-learn. The ml_modeling module of the spotfire-dsml Python package (which is part of the Data Science and Machine Learning Toolkit for Python, or DSML Toolkit) contains such higher-level functions, written in terms of functions from scikit-learn, pandas, tensorflow, xgboost, etc. In this article we demonstrate how to write Python code to train, evaluate, save and load machine learning models using the ml_modeling module of the spotfire-dsml package.

Central to ml_modeling is the notion that all models are defined as pipelines that include data preprocessing steps to deal with non-numeric data and with missing data, as well as the actual machine learning algorithms. The module contains functions to define such pipelines, train them, either on a hyperparameter-grid or with fixed hyperparameters, evaluate their performances using a large variety of metrics, save and load them, compute feature importances, and compute learning curves.

This allows for the development of robust, concise machine learning code in an efficient manner. With just a few calls to the functions in the module, mature pipeline models can be produced. These models can be saved as one single pipeline object, and loaded into any Python environment for prediction purposes. For example, model-training could be done in a Spotfire DXP by a Spotfire Data Function and model-predictions could be made in a different Spotfire DXP, or even outside any Spotfire application.

In this document we describe the currently available functions, and give code examples of how to use them.

## Pipeline-Models

More often than not, data needs to be preprocessed before it can be used as input for a machine learning algorithm for regression or classification. The same preprocessing also needs to be applied to the data that the trained model uses to make predictions. Hence it makes sense to package the preprocessing and the regressor or classifier into one object. Moreover, the regressors and classifiers are not the only trained objects: most preprocessing steps, such as imputers, encoders and scalers, are trained as well, and all of them need to be saved and made available in the prediction environment.

The Pipeline class of scikit-learn offers the functionality to define a complete pipeline-model, and to train, evaluate, save and load it as one object. The default pipeline architecture in ml_modeling is shown in Fig 1.

Fig 1. Architecture of the default pipeline, here shown for a random forest classifier.

The incoming data is split into numeric and non-numeric columns. The numeric columns undergo mean-imputation and standard scaling. The non-numeric columns undergo encoding and imputation, with different encodings for low- and high-cardinality columns. The preprocessed data is recombined and sent to the proper machine learning algorithm.
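To make the architecture concrete, the following is a minimal sketch of a comparable pipeline written directly in scikit-learn. It is not the ml_modeling implementation: the column names, the toy data, the step names, the imputation strategy for non-numeric data, and the order of imputation and encoding are all assumptions made for illustration, and the high-cardinality branch is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative data set; the column names are made up for this sketch.
X = pd.DataFrame({
    "age": [25, 40, np.nan, 51, 33, 62, 29, 47],
    "balance": [1200.0, 300.0, 5600.0, np.nan, 80.0, 9100.0, 450.0, 2300.0],
    "gender": ["F", "M", "M", np.nan, "F", "M", "F", "M"],
})
y = pd.Series([0, 1, 0, 0, 1, 0, 1, 0])

numeric_cols = ["age", "balance"]
categorical_cols = ["gender"]

# Numeric branch: mean imputation followed by standard scaling.
numeric_branch = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# Low-cardinality non-numeric branch: imputation followed by one-hot encoding.
# (Assumption: most-frequent imputation; the source does not name the strategy.)
categorical_branch = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Recombine the preprocessed branches into one feature matrix.
preprocessor = ColumnTransformer([
    ("numeric", numeric_branch, numeric_cols),
    ("categorical", categorical_branch, categorical_cols),
])

# Preprocessing and classifier packaged as one trainable object.
sketch_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])

sketch_pipeline.fit(X, y)
churn_probability = sketch_pipeline.predict_proba(X)[:, 1]
```

Because the imputers, scaler, encoder and classifier live in a single object, one `fit` call trains all of them and one `predict_proba` call applies all of them to new data.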

## Functions in the ml_modeling module of the spotfire-dsml package

The functions in the package align with high-level tasks. That is: they align with the *logical* structure of machine learning modeling, as opposed to aligning with various *technical* subtasks, as do the scikit-learn functions. First we describe the available functions, and then we demonstrate their usage in code examples.

### Currently available functions

CreateClassificationPipeline

CreateRegressionPipeline

Pipelines for binary classification and regression are defined by these functions. These pipelines split the incoming training data into numeric features, low-cardinality non-numeric features, and high-cardinality non-numeric features. The user can specify the maximum cardinality of low-cardinality data and the maximum cardinality of high-cardinality data. Features that exceed the maximum cardinality of high-cardinality data are automatically dropped.
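The column grouping described above can be sketched in a few lines of pandas. The helper name `split_columns_by_cardinality` is hypothetical and not part of ml_modeling, and treating both thresholds as inclusive upper bounds is an assumption of this sketch.

```python
import pandas as pd


def split_columns_by_cardinality(df, max_low_cardinality, max_high_cardinality):
    """Group columns into numeric, low-cardinality, high-cardinality and dropped."""
    groups = {"numeric": [], "low_cardinality": [], "high_cardinality": [], "dropped": []}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            groups["numeric"].append(col)
            continue
        # Count distinct non-missing values of the non-numeric column.
        n_unique = df[col].nunique(dropna=True)
        if n_unique <= max_low_cardinality:
            groups["low_cardinality"].append(col)
        elif n_unique <= max_high_cardinality:
            groups["high_cardinality"].append(col)
        else:
            groups["dropped"].append(col)
    return groups


example = pd.DataFrame({
    "age": [25, 40, 31, 51],
    "gender": ["F", "M", "M", "F"],                    # 2 distinct values
    "card_type": ["Blue", "Gold", "Silver", "Blue"],   # 3 distinct values
    "customer_id": ["a1", "b2", "c3", "d4"],           # 4 distinct values
})
groups = split_columns_by_cardinality(example, max_low_cardinality=2, max_high_cardinality=3)
print(groups)
```

With these thresholds, `gender` lands in the low-cardinality group, `card_type` in the high-cardinality group, and the identifier-like `customer_id` column is dropped.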

For the numeric data, the pipeline trains a mean-imputer and a standard scaler. For the non-numeric data, the pipeline trains an imputer, and two encoders: the low-cardinality features are one hot encoded, and the high cardinality features are target-encoded (also known as impact-encoding).
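As an illustration of target-encoding, the sketch below replaces each category by the mean of the target observed for that category in the training data. The function names and the fallback to the global mean for unseen categories are assumptions of this sketch; encoders used in practice often add smoothing, and the encoder used inside ml_modeling may differ.

```python
import pandas as pd


def fit_target_encoding(categories, target):
    """Learn the mean target value per category, plus the global mean as a fallback."""
    global_mean = target.mean()
    category_means = target.groupby(categories).mean()
    return category_means, global_mean


def apply_target_encoding(categories, category_means, global_mean):
    """Replace each category by its learned mean; unseen categories get the global mean."""
    return categories.map(category_means).fillna(global_mean)


train_city = pd.Series(["A", "A", "B", "B", "B", "C"])
train_y = pd.Series([1, 0, 1, 1, 0, 0])
means, fallback = fit_target_encoding(train_city, train_y)

new_city = pd.Series(["A", "B", "D"])  # "D" was not seen during training
encoded = apply_target_encoding(new_city, means, fallback)
print(encoded.tolist())
```

Unlike one-hot encoding, this produces a single numeric column regardless of the number of categories, which is why it suits high-cardinality features.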

The numeric and non-numeric pipelines are combined to create the input for the binary classification algorithm or the regression algorithm. A choice can be made from various algorithms: random forest, xgboost, and neural network for binary classification and regression, and also logistic regression for binary classification and ridge regression for regression. The user can supply a set of hyperparameters for each of these algorithms, or just use the default values.

TrainClassifier

TrainRegressor

ClassificationGridSearchCV

RegressionGridSearchCV

Training of the pipeline can be done directly with fixed hyperparameters, or using a grid-search with cross-validation. The default grid for a number of hyperparameters can be used, or the user can specify a grid. All hyperparameters that are documented for the classifiers and regressors from scikit-learn and xgboost can be used. The grid-search returns the optimal trained pipeline. Predictions can be made by invoking the standard predict or predict_proba pipeline methods.
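For comparison, this is roughly what a cross-validated grid search over a pipeline looks like in plain scikit-learn. It is a sketch, not the code behind ClassificationGridSearchCV: the synthetic data, the step names, the scoring metric and the number of folds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Grid keys follow scikit-learn's "<step name>__<hyperparameter>" convention.
param_grid = {
    "classifier__n_estimators": [10, 30],
    "classifier__min_samples_leaf": [1, 3],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="roc_auc")
search.fit(X, y)

# The best hyperparameter combination, refitted on all of X and y.
best_pipeline = search.best_estimator_
print(search.best_params_)
```

The refitted `best_estimator_` is itself a complete pipeline, so it can be used for `predict` and `predict_proba` just like a directly trained one.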

EvaluateClassifier

EvaluateRegressor

The trained pipeline is evaluated, using a large number of metrics. Binary classification pipelines are evaluated with ROC- and Precision Recall-curves. AUC and Average Precision are computed. The f1-optimal decision threshold is determined, and the confusion matrix is computed using the f1-optimal decision threshold. Precision, recall, f1-score, and accuracy are derived from the confusion matrix. Regression pipelines are evaluated using root mean squared error, mean absolute error, r-squared, and explained variance. Also, a data frame with test-set target-values and predicted values is returned.
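The idea of an f1-optimal decision threshold can be illustrated with a short plain-Python sketch. The function name and the toy scores are made up, and scanning every observed score as a candidate threshold is an assumption of this sketch rather than a description of how EvaluateClassifier works internally.

```python
def f1_optimal_threshold(y_true, y_score):
    """Return (threshold, f1, confusion) for the candidate threshold with the highest F1."""
    best_threshold, best_f1, best_confusion = None, -1.0, None
    for threshold in sorted(set(y_score)):
        tp = fp = fn = tn = 0
        for truth, score in zip(y_true, y_score):
            predicted = 1 if score >= threshold else 0
            if predicted == 1 and truth == 1:
                tp += 1
            elif predicted == 1 and truth == 0:
                fp += 1
            elif predicted == 0 and truth == 1:
                fn += 1
            else:
                tn += 1
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        # Keep the first threshold that strictly improves F1.
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
            best_confusion = {"tp": tp, "fp": fp, "fn": fn, "tn": tn}
    return best_threshold, best_f1, best_confusion


y_true = [0, 0, 1, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.20, 0.80, 0.70]
threshold, f1, confusion = f1_optimal_threshold(y_true, y_score)

# Precision and recall follow directly from the confusion matrix.
precision = confusion["tp"] / (confusion["tp"] + confusion["fp"])
recall = confusion["tp"] / (confusion["tp"] + confusion["fn"])
print(threshold, round(f1, 3), confusion, precision, recall)
```

For this toy data the best threshold is 0.35, which captures all three positives at the cost of one false positive.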

ComputeFeatureImportances

The model-agnostic permutation-importances method is used to compute feature importances of the pipeline-models.
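Permutation importance measures how much a model's score drops when the values of one feature are shuffled. scikit-learn provides this method as `sklearn.inspection.permutation_importance`; the sketch below uses it on synthetic data. Whether ComputeFeatureImportances calls this exact function is an assumption, and the data and model settings are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the two informative features in the first two columns.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the score drop.
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=0)

for index, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {index}: {mean:.3f} +/- {std:.3f}")
```

Because the method only needs predictions and a score, it works for any fitted estimator, including a full preprocessing pipeline.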

ComputeLearningCurve

Usually, you train on all available training data (after a train-test split). This function addresses the question whether better models could be trained if more data were available, or whether training with less data would result in equally good models. Models are trained on 10%, 20%, . . . , 90% of the training data, and the performances of these models are compared. If the performance does not saturate when using a larger fraction of the data, gathering more training data will potentially be beneficial.
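A comparable learning curve can be computed with scikit-learn's `learning_curve` function, as in the sketch below. This is not the ComputeLearningCurve implementation: the use of cross-validation, the ridge regressor, the r-squared score and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic regression data, for illustration only.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Fractions 0.1, 0.2, ..., 0.9 of the data available for training in each fold.
train_sizes, train_scores, valid_scores = learning_curve(
    Ridge(), X, y,
    train_sizes=np.linspace(0.1, 0.9, 9),
    cv=5, scoring="r2",
)

# Average the validation score over the folds for each training-set size.
for n_samples, score in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"{n_samples:4d} training rows: mean validation r2 = {score:.3f}")
```

If the validation score is still rising at the largest training size, more data is likely to help; if it has flattened, a smaller training set may be sufficient.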

LoadModel

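SaveModel

A trained pipeline is saved as one single object, and can be loaded into any Python environment for prediction purposes.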
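The benefit of saving the whole pipeline as one object can be shown with Python's standard `pickle` module. The source does not state which serialization format SaveModel and LoadModel use, so the use of pickle, the file name and the toy pipeline below are assumptions of this sketch.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Train a small pipeline on synthetic data, for illustration only.
X, y = make_classification(n_samples=100, n_features=4, random_state=0)
trained = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_model.pkl")
    # Save the fitted scaler and classifier together as one file.
    with open(path, "wb") as file_handle:
        pickle.dump(trained, file_handle)
    # Load the complete pipeline back, as a prediction environment would.
    with open(path, "rb") as file_handle:
        restored = pickle.load(file_handle)

same_predictions = bool((restored.predict(X) == trained.predict(X)).all())
print(same_predictions)
```

Pickle files can execute code when loaded, so they should only be loaded from trusted sources, and the loading environment should have compatible package versions installed.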

## Code examples

Rather than extensively discuss all the arguments that can be passed into each of these functions, we will provide some code-examples from which the flexibility of the functions can easily be inferred.

### A. Training, evaluation, saving and loading of a random forest binary classifier

We train and evaluate a binary classification model to predict churn of bank customers.

import the spotfire-dsml package and a few other packages

```
from spotfire_dsml.ml_modeling import ml_modeling as smo
import pandas as pd
from sklearn.model_selection import train_test_split
```

set up the classification problem, as usual, and specify the positive class 1

```
df = pd.read_csv('bankchurners.csv')
target = 'Attrition_Flag'
dfX = df.drop(target, axis=1)
dfy = df[target].replace({'Attrited Customer':1,'Existing Customer':0}).astype(int)
```

do the stratified train-test split, as usual

`dfX_train, dfX_test, dfy_train, dfy_test = train_test_split(dfX, dfy, stratify=dfy)`

define a classification pipeline with a random forest classifier with default hyperparameters, other choices are algo='logisticregression' and algo='xgboost'

`pipeline = smo.CreateClassificationPipeline(dfX_train, max_low_cardinality=2, max_high_cardinality=8, algo='randomforest')`

or, alternatively, define a classification pipeline with a random forest classifier with user-specified hyperparameters (all hyperparameters documented in scikit-learn or xgboost can be passed in)

`pipeline = smo.CreateClassificationPipeline(dfX_train, max_low_cardinality=2, max_high_cardinality=8, algo='randomforest', classifier_args={'n_estimators':200, 'min_samples_leaf':7})`

train the classification pipeline

`model = smo.TrainClassifier(pipeline, dfX_train, dfy_train)`

evaluate the trained pipeline-model on the test set

`df_precision_recall, df_roc, df_f1_threshold, df_y_pred, df_confusion_matrix, df_scores = smo.EvaluateClassifier(model, dfX_test, dfy_test)`

compute feature importances

`df_importances = smo.ComputeFeatureImportances(model, dfX_test, dfy_test)`

compute a learning curve

`df_learning_curve = smo.ComputeLearningCurve(pipeline, dfX_train, dfy_train)`

save the model

`smo.SaveModel(model, 'pathname_of_saved_model')`

load the model

`loaded_model = smo.LoadModel('pathname_of_saved_model')`

make predictions, i.e. compute the probabilities of churn, here for simplicity on the test data

`Y_pred = loaded_model.predict_proba(dfX_test)[:,1]`

### B. Training, evaluation, saving and loading of an xgboost regressor with early stopping

We train and evaluate a regression model to predict the cooling loads of buildings.

import the spotfire-dsml package and a few other packages

```
from spotfire_dsml.ml_modeling import ml_modeling as smo
import pandas as pd
from sklearn.model_selection import train_test_split
```

set up the regression problem, as usual

```
df = pd.read_csv('ENB2012.csv')
target = 'Cooling Load'
dfX = df.drop([target,'Heating Load'], axis=1)
dfy = df[target]
```

do the train-test split, as usual

`dfX_train, dfX_test, dfy_train, dfy_test = train_test_split(dfX, dfy)`

define a regression pipeline with an xgboost regressor with default hyperparameters and default early stopping parameters; other choices are algo='randomforest' and algo='ridge'

`pipeline = smo.CreateRegressionPipeline(dfX_train, max_low_cardinality=2, max_high_cardinality=8, algo='xgboost')`

or, alternatively, specify the early stopping parameters when defining the pipeline

`pipeline = smo.CreateRegressionPipeline(dfX_train, max_low_cardinality=2, max_high_cardinality=8, algo='xgboost', classifier_args={'n_estimators':2000,'early_stopping_rounds':10})`

train the regression pipeline, pass in the test data to facilitate early stopping

`model = smo.TrainRegressor(pipeline, dfX_train, dfy_train, dfX_test, dfy_test)`

evaluate the trained pipeline-model on the test set

`df_regression_evaluation, df_scores = smo.EvaluateRegressor(model, dfX_test, dfy_test)`

compute feature importances

`df_importances = smo.ComputeFeatureImportances(model, dfX_test, dfy_test)`

compute a learning curve

`df_learning_curve = smo.ComputeLearningCurve(pipeline, dfX_train, dfy_train)`

save the model

`smo.SaveModel(model, 'pathname_of_saved_model')`

load the model

`loaded_model = smo.LoadModel('pathname_of_saved_model')`

make predictions for the cooling load, here for simplicity on the test data

`Y_pred = loaded_model.predict(dfX_test)`

### C. Training and evaluation of a random forest classifier using a grid search

We train a binary classification model to predict churn of bank customers using a grid search.

import the spotfire-dsml package and a few other packages

```
from spotfire_dsml.ml_modeling import ml_modeling as smo
import pandas as pd
from sklearn.model_selection import train_test_split
```

set up the classification problem, as usual, and specify the positive class 1

```
df = pd.read_csv('bankchurners.csv')
target = 'Attrition_Flag'
dfX = df.drop(target, axis=1)
dfy = df[target].replace({'Attrited Customer':1,'Existing Customer':0}).astype(int)
```

do the stratified train-test split, as usual

`dfX_train, dfX_test, dfy_train, dfy_test = train_test_split(dfX, dfy, stratify=dfy)`

define a classification pipeline with a random forest classifier with default hyperparameters

`pipeline = smo.CreateClassificationPipeline(dfX_train, max_low_cardinality=2, max_high_cardinality=8, algo='randomforest')`

train the classification pipeline with a grid search with a user-specified grid (all hyperparameters documented in scikit-learn or xgboost can be used)

`model = smo.ClassificationGridSearchCV(pipeline, dfX_train, dfy_train, classifier_grid_args={'n_estimators':[50,100,200],'min_samples_leaf':[1,3]})`

evaluate the trained pipeline-model on the test set

`df_precision_recall, df_roc, df_f1_threshold, df_y_pred, df_confusion_matrix, df_scores = smo.EvaluateClassifier(model, dfX_test, dfy_test)`

## Visualizing the evaluation results in Spotfire

Fig 2. Evaluation of a binary classifier

Fig 3. Evaluation of a regressor

## Conclusion

With the ml_modeling module of the spotfire-dsml Python package, pipeline-models can be trained, evaluated, saved and loaded with just a few function calls that follow the logical structure of the modeling task. The model-evaluation functions return a collection of data frames with performance metrics, which can either be used to create Spotfire visualizations or be visualized using Python packages like matplotlib and seaborn. Likewise, for making predictions the models can be used in Spotfire Python Data Functions or in any other Python environment.

Please note that the DSML Toolkit also contains modules for model-explainability and for model-monitoring for the pipeline-models trained by ml_modeling functionality. For details, see the Community Articles 'Introducing XWIN Explainability' and 'Model Monitoring with DSML Toolkit for Python'.

More details about the spotfire-dsml package can be found in the Community article Python toolkit for data science and machine learning in Spotfire. Example Spotfire applications can be downloaded from the Exchange page DSML Toolkit for Python - Documentation and Spotfire® Examples.
