Out-of-the-box AI: An example with customer churn in Telecom


    When data scientists prepare models to create predictions, this typically means manually writing extensive code that is specific to the data used to train the model. At each step in the modeling process, the data scientist makes many decisions that might not be obvious to anyone without domain knowledge. TIBCO has two intelligent products that can help with the different steps in the data science process: Spotfire X has an advanced feature, AI-Powered Suggested Visualisations, and TIBCO Data Science has AutoML, an extension that automatically creates a machine learning pipeline.

    You can find the TIBCO Data Science AutoML extension here.

    Spotfire X's AI-Powered Suggestions

    The first step in a sound data science approach is exploratory data analysis. This is where the dataset is examined to check assumptions, looking at variable distributions, correlations, and any other patterns that can offer insight into the dataset. Spotfire's AI-Powered Suggestions examines a column of interest and produces a set of visualizations of the variables that most strongly influence it, ranked in decreasing strength.
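Spotfire's ranking algorithm is proprietary, but the general idea of scoring every other column's influence on a column of interest can be sketched with an off-the-shelf measure such as mutual information. The snippet below is a minimal illustration of that idea (not Spotfire's actual algorithm), using made-up feature names and simulated data:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Simulated target and two stand-in features: one informative, one pure noise
churn = rng.integers(0, 2, n)
prop_calls = churn * 0.4 + rng.normal(0, 0.1, n)  # correlated with churn
noise = rng.normal(0, 1, n)                        # unrelated to churn
X = np.column_stack([prop_calls, noise])

# Score each feature's dependence on the target, then rank in decreasing strength
scores = mutual_info_classif(X, churn, random_state=0)
ranking = sorted(zip(["PropInCallsFromChurner", "Noise"], scores),
                 key=lambda t: -t[1])
print(ranking)
```

The informative feature ends up at the top of the ranking, which mirrors how the suggested visualizations are ordered by decreasing influence.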


    The AutoML Extension for Spotfire Data Science Team Studio

    The AutoML extension is a set of operators for Team Studio, the component of Spotfire Data Science that provides a low/no-code web environment where users can drag and drop pre-configured nodes into pipelines used to train machine-learning models. The main operator is the AutoML orchestrator. This node links up to the data source and takes the name of the target variable to be predicted; it then generates fully parameterized workflows for each step in the machine learning pipeline, as well as models in an exportable format ready to be deployed into production.


    Predicting Customer Churn for Telco: A Synthetic Dataset

    Attached is a synthetic dataset on customers for a fictitious telecom company. The dataset consists of the features shown in the data dictionary below. In the telecom industry, churners are known to receive incoming calls from other churners before leaving. Charges might also be important; we have all received that spike in a bill after an offer or contract was up, which had us shopping around for other providers!

    telcochurn.txt

    Feature Name             Description
    CustomerStatus           Churn vs Active customer
    InitialChannel           Channel of acquisition
    Handset                  Type of mobile
    ExtraChargeForTexting    Extra charges for texts
    ExtraChargeForCalls      Extra charges for minutes
    ExtraChargeForData       Extra charges for data
    ChargeLast12Months       Charges for the last 12 months
    ChargeLast3Months        Charges for the last 3 months
    ChargeLastMonth          Charge for the last month
    TotalInCalls             Total number of incoming calls
    TotalOutCalls            Total number of outgoing calls
    PropInCallsFromChurner   Proportion of incoming calls from other churners
    PropOutCallsFromChurner  Proportion of outgoing calls to other churners
    DataUpload               Amount of data uploaded
    DataDownload             Amount of data downloaded
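For anyone who wants to poke at a dataset like this outside of Spotfire, pandas is a natural choice. The rows below are illustrative stand-ins (telcochurn.txt is the attached file), and only a few of the dictionary's columns are shown:

```python
import io
import pandas as pd

# A few made-up rows in the shape of the data dictionary above;
# values are purely illustrative, not taken from telcochurn.txt.
sample = io.StringIO(
    "CustomerStatus,ChargeLastMonth,PropInCallsFromChurner\n"
    "Churn,210.0,0.60\n"
    "Active,95.5,0.10\n"
    "Active,143.5,0.05\n"
)
df = pd.read_csv(sample)
print(df["CustomerStatus"].value_counts().to_dict())
```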

    CustomerStatus will be the target of interest. I will try out Spotfire's AI Suggestions to see which variables its suggested visualizations flag as most strongly related to the target. I will then try out the AutoML orchestrator to predict which of those customers will churn (i.e. leave the provider), and compare the feature importance ranking from running AutoML with the recommended visualizations from Spotfire.


    Spotfire X's AI Suggestions Engine

    The recommendations engine in Spotfire is an advanced feature that helps to make sense of the dataset no matter its size. After specifying the column of interest, in this case, the target, Spotfire runs a specialized algorithm over all the other columns and selects the variables that most strongly relate to it. The first visualization is of the target, followed by other variables in decreasing rank.


    Spotfire X: Getting the Suggested Visualisations

    After importing the Telco dataset, the Spotfire dashboard shows three ways to start. Selecting "Start from Data" gives a dropdown menu to select the target, and once selected, the menu extends to show the top visualizations.

    For a video walkthrough of how the suggested visualizations work, see the animation here.


    Spotfire X: Telco's Suggested Visualisations


    Focusing on the first four suggested visualizations, the features most strongly correlated with our column of interest, CustomerStatus, seem to be PropInCallsFromChurner and ExtraChargeForCalls. Notice:

    • The first visualization is a distribution of CustomerStatus. Though the classes aren't balanced, there is some representation for both, suggesting that a machine learning model could classify each well. 

    • The proportion of calls received from a churner is considerably higher for a churning customer than for one that is not. 

    • The proportion of calls received from a churner for an active customer is not affected by the device of the user.

    • Extra charges for calls seem to be higher in churning customers, and from the scatter plot we can see that the proportion of calls received from a churner is also higher. All the outliers are churners; very high extra charges and a very high proportion of calls from other churners seem important. These outliers are a small set of data points that may not add much to modeling, but help with understanding the data, which Spotfire has captured in the visualization.

    Through the different Spotfire suggested visualizations, the dataset is better understood before using it in the modeling step.
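The patterns the visualizations surface, such as churners receiving proportionally more calls from other churners and racking up higher extra charges, are the kind of summary a simple group-by would confirm. A sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical rows mirroring the patterns the suggested visualizations show
df = pd.DataFrame({
    "CustomerStatus": ["Churn", "Churn", "Active", "Active", "Active"],
    "PropInCallsFromChurner": [0.55, 0.70, 0.05, 0.10, 0.08],
    "ExtraChargeForCalls": [40.0, 55.0, 5.0, 0.0, 8.0],
})

# Average each candidate predictor within the two target classes
summary = df.groupby("CustomerStatus")[
    ["PropInCallsFromChurner", "ExtraChargeForCalls"]
].mean()
print(summary)
```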


    AutoML: Running the Orchestrator

    After using Spotfire to do a little exploratory analysis and to gain a better understanding of the dataset, I am ready for the modeling step. AutoML can be used to create a model and predict whether a customer will churn. AutoML is meant to alleviate some of the manual drudgery of assembling a machine learning pipeline from scratch while also incorporating best practices. It uses common techniques to automatically:

    • generate features

    • select candidate features for modeling

    • execute different modeling algorithms, including a hyper-parameter search

    • select the "best" model based on common scoring criteria

    Because the generated workflows are fully customizable, parameters can be changed as requirements change, or as more domain knowledge becomes available. The output can also be examined to see which features of the model are most important.

    I run the AutoML orchestrator in four simple steps: 

    1. Load in the data source.

    2. Connect the AutoML orchestrator to the data source.

    3. Configure the name of the target (dependent) column and choose whether model complexity should be shallow or deep. The complexity refers to the hyperparameter tuning: shallow runs a smaller grid search and deep a larger one. I will run both to examine the difference in the best model provided. 

    4. Press run.
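Conceptually, the shallow-versus-deep setting behaves like running grid searches of different sizes over the same pipeline. The following is a rough scikit-learn analogy on synthetic data, not the AutoML implementation itself; the grid values are arbitrary:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A small pipeline standing in for the generated workflow
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# "shallow" = small grid, "deep" = larger grid over the same hyperparameter
grids = {
    "shallow": {"clf__C": [0.1, 1.0]},
    "deep": {"clf__C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]},
}
for depth, grid in grids.items():
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy").fit(X, y)
    print(depth, search.best_params_, round(search.best_score_, 3))
```

The deep grid simply evaluates more candidate configurations, which is why it takes longer to complete.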

    See the animation here.


    AutoML: Model Leaderboard and Results

    The shallow run takes less time to complete as it has fewer hyperparameters to try. After running both, I select the orchestrator to view the results of the different models. Interestingly, the logistic regression model has been ranked "best" by both the shallow and deep runs based on its accuracy, with the following parameters:

    Penalizing Parameter (λ): 1.2678037866758405E-4

    Elastic Parameter (α):    1.0

    An elastic parameter of α = 1 means that pure Lasso (L1) regularisation gave the highest model accuracy out of all the different configurations the orchestrator tried, in both the shallow and the deep run. This could be an indication that we have reached the maximum possible accuracy with the given dataset, or that we may need to explore the hyperparameters further around the selected value.


    AutoML: Explainability

    AutoML also creates a workflow called 'Explainability'. This workflow is used to populate a Spotfire template that is meant to be used interactively to give insight into the resulting model.

    The explainability template in Spotfire is shown below. On the right, a chart of the top features as ranked by the model is displayed. Compared with the suggestions from Spotfire in the previous step, some similar features appear at the top of the rankings: ChargeLastMonth is highest, followed by ChargeLast12Months, ChargeLast3Months and PropInCallsFromChurner. High charges in the last bill potentially signal a churning customer. 
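One common way a linear model ranks features, which is the kind of ranking the template visualizes, is by the magnitude of the fitted coefficients. A small illustration with hypothetical feature names on synthetic data (not the orchestrator's actual importance computation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical names standing in for a few of the dataset's columns
names = ["ChargeLastMonth", "ChargeLast12Months", "ChargeLast3Months",
         "PropInCallsFromChurner", "DataUpload"]

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by absolute coefficient magnitude (one simple importance proxy)
order = np.argsort(-np.abs(model.coef_[0]))
for i in order:
    print(names[i], round(abs(model.coef_[0][i]), 3))
```

Note that coefficient magnitudes are only comparable when the features are on comparable scales, which is one reason scaling usually precedes the modeling step.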

    The next page shows an interactive and deeper view of the predictors. Selecting the top one, in this case ChargeLastMonth, shows that the distribution of data available for the predictor looks skewed. The prediction seems to be driven by the most frequent values: a charge of 143.50 in the last month aligns with the global prediction of probably being an active customer. The view of the second predictor's distribution also looks quite skewed in a similar manner. The two highest ranked variables are not evenly supported by the data. There are few data points to support the higher predictions for the top two variables, indicating that more data may be required, or perhaps a different approach to engineering the features to mitigate the lack of data.

    See the animation here.


    Spotfire AI Suggested Visualisations and AutoML's top predictors

    Spotfire suggested two top predictors, PropInCallsFromChurner followed by ExtraChargeForCalls, that are different from the two top predictors of the AutoML orchestrator, ChargeLastMonth and ChargeLast12Months. Interestingly, the explainability DXP did include PropInCallsFromChurner and ExtraChargeForCalls in the top 5 predictors, so they can be examined for more information.


    In the explainability template, PropInCallsFromChurner shows a more even distribution than the higher ranked predictors and a probability of churn that looks more linear across the different bins. This suggests that the proportion of calls received from a churner has a clear relationship to whether the customer churns. ExtraChargeForCalls also has a skewed distribution, but a clear pattern in the predictions, suggesting some kind of relationship. Spotfire ranked these two variables higher because it has better confidence in their correlation to the target, whereas the machine-learning model might be biased by the lack or abundance of data in different regions. The explainability template from AutoML is important for understanding the predictors better and can reveal potential flaws in the data used for training with the AutoML orchestrator.
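The binned view of a predictor against churn probability can be approximated with pandas: cut the predictor into bins and take the mean churn rate per bin. This sketch uses simulated data where the probability of churn rises with the predictor, mimicking the roughly linear pattern described above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Simulated predictor and a churn flag whose probability rises with it
prop = rng.uniform(0, 1, n)
churn = (rng.uniform(0, 1, n) < prop).astype(int)

df = pd.DataFrame({"PropInCallsFromChurner": prop, "Churned": churn})

# Bin the predictor and compute the observed churn rate per bin
df["bin"] = pd.cut(df["PropInCallsFromChurner"], bins=5)
rate = df.groupby("bin", observed=True)["Churned"].mean()
print(rate)
```

A roughly monotone increase across the bins is the signature of the clear, near-linear relationship noted for PropInCallsFromChurner.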

    The tools offer different capabilities that can be used in conjunction in the machine learning pipeline: Spotfire's strength is in visualizing a variable's correlation to other columns, and AutoML's is to produce a prediction via a fully automated machine learning pipeline, including a deployable model. Each would have its place in the data scientist's toolkit.


    A New Kind of Data Science Workflow

    Spotfire and AutoML are different but complementary tools that took little technical skill to use, and within minutes I had done the exploratory data analysis, found the 'best' model (in a deployable format), and produced two sets of predictor rankings to compare. Used together, the tools are accessible to someone with domain knowledge but little coding ability. For all the citizen data scientists, developers, and others without a strong mathematical or statistical background, I would recommend using the tools in conjunction and examining their output in depth to understand the data.

    For a more in-depth tutorial on how to use AutoML, click on the Reference Info tab here.

    Happy no coding!

    Noora Husseini is a data scientist in the TIBCO Data Science team, based in London.  Her interest in data science and artificial intelligence runs the gamut, from the ethical implications of how we use data to natural language processing to experimenting with the latest open source libraries. She likes wearing and designing obscure fashion labels, making friends with animals (especially cats) and creating the best playlists to dance to.

