Market Basket Analysis Python Data Function for Spotfire® - Documentation

This article supports the Market Basket Analysis data function, which can be downloaded from the Community Exchange.

Introduction

Product recommendations are an important tool for marketing and sales teams. This article explains how you can use Market Basket Analysis (MBA), also referred to as affinity analysis, to understand customer buying behavior. MBA analyzes the buying habits of customers and identifies combinations of products that are frequently bought together. Its output lets you suggest products that a customer is more likely to buy. In this demonstration of the Market Basket Analysis Python Data Function (free download on this Exchange page), we look at all of the purchases in the dataset and generate measures that indicate the strength of the relationships between the different products.

MBA uses association rules, a data mining technique that finds regularities between different products. Different rules can be generated by varying the threshold measures passed to the association rule algorithm.

Before going deeper, here are some terms that will help your understanding of the analysis:

Itemset: A set of items purchased together. In MBA, we are mostly interested in itemsets of two or more items.
Frequent Itemset: An itemset that is bought together frequently. Such itemsets have high support among customers, so their items can be suggested to the customer to buy together.
Transaction: A purchase made at a point in time. Each transaction is associated with a transaction ID. In MBA, we are mostly interested in transactions with more than one item.
Support: The support of an itemset describes the popularity of that itemset. It is calculated as

Support (A,B) = Number of transactions containing (A,B) / Total number of transactions (N)

where (A,B) is the itemset.
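As a worked illustration of the support formula, the following minimal sketch counts how many baskets contain an itemset. The toy transactions and the helper name support are assumptions made for this example; they are not part of the data function.

```python
# Toy transactions (made up for illustration): each basket is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


# {bread, butter} appears in 3 of the 5 baskets.
print(support({"bread", "butter"}, transactions))  # 0.6
```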

Confidence: Confidence describes how likely one item (or group of items) is to be purchased when another has been purchased. For example, if A and B are two items, what is the probability that B will also be purchased when A is purchased?

Confidence (A⇒B) = Number of transactions containing both A and B / Number of transactions containing A

The value of confidence ranges from 0 to 1. A higher value means that B is more likely to be purchased when A is purchased. If the value is 1, every transaction that contains A also contains B.

Here, A is known as the antecedent and B as the consequent. The antecedent is the item or group of items on the "if" side of the rule, and the consequent is the item or group of items whose purchase the rule predicts.
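To make the direction of a rule concrete, here is a small sketch of the confidence formula on made-up baskets. The data and the helper names count and confidence are illustrative assumptions. Note that confidence is not symmetric: swapping antecedent and consequent can change the value.

```python
# Toy transactions (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


def confidence(antecedent, consequent, transactions):
    """Share of baskets containing the antecedent that also contain the consequent.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = count(set(antecedent) | set(consequent), transactions)
    return both / count(antecedent, transactions)


print(confidence({"bread"}, {"butter"}, transactions))  # 0.75 (3 of 4 bread baskets)
print(confidence({"butter"}, {"bread"}, transactions))  # 1.0 (3 of 3 butter baskets)
```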

Lift: Lift compares the probability of the two items being purchased together with the probability expected if the items were purchased independently. A lift of 1 means the items are independent of each other. A lift above 1 indicates a positive relationship between the two items, suggesting that customers are likely to purchase them together, while a lift below 1 means they are bought together less often than expected.

Lift (A⇒B) = Probability(A and B) / (Probability(A) * Probability(B))

In this equation, "A" and "B" represent the items being analyzed.
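The lift formula can be checked by hand on a few made-up baskets. In this sketch the probabilities are estimated as supports; the data and the helper names are assumptions for illustration only.

```python
# Toy transactions (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket) / len(transactions)


def lift(a, b, transactions):
    """P(A and B) divided by P(A) * P(B), with probabilities taken as supports."""
    joint = support(set(a) | set(b), transactions)
    return joint / (support(a, transactions) * support(b, transactions))


# Above 1: bread and butter co-occur more often than independence would predict.
print(round(lift({"bread"}, {"butter"}, transactions), 3))  # 1.25
# Below 1: milk and jam co-occur less often than independence would predict.
print(round(lift({"milk"}, {"jam"}, transactions), 3))  # 0.833
```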

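Conviction: Conviction compares how often A would be expected to appear without B if the two were independent with how often A actually appears without B. A value of 1 means that A and B are unrelated. A high conviction value indicates that the two items are often purchased together and that B is rarely missing when A is purchased. When the confidence is 1, the denominator is zero and the conviction is infinite.

Conviction (A⇒B) = (1 - Support(B)) / (1 - Confidence(A⇒B))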
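A short sketch of the conviction formula follows, again on made-up baskets. Returning infinity when the confidence equals 1 is a convention chosen for this example; the helper names are hypothetical.

```python
# Toy transactions (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


def conviction(a, b, transactions):
    """(1 - support(B)) / (1 - confidence(A => B))."""
    support_b = count(b, transactions) / len(transactions)
    conf = count(set(a) | set(b), transactions) / count(a, transactions)
    if conf >= 1.0:
        # The rule never fails in the data, so the ratio is unbounded.
        return float("inf")
    return (1 - support_b) / (1 - conf)


print(round(conviction({"bread"}, {"butter"}, transactions), 3))  # 1.6
print(conviction({"butter"}, {"bread"}, transactions))  # inf
```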

Maximum Length: The maximum number of items that can be included in an itemset, that is, the combined number of items in the antecedent and the consequent.

Algorithms Used

To find the frequent itemsets, users can choose either of the following algorithms.

Apriori

The Apriori algorithm works by first identifying the items whose support meets the minimum support given by the user. It then uses this information to identify larger groups of items that are also frequently purchased together, relying on the fact that a group can only be frequent if all of its subsets are frequent. The algorithm continues to combine items in this way until it has identified all of the groups of items that are frequently purchased together. These groups are the frequent itemsets. The association rules derived from them can be used to predict which items are likely to be purchased together and to optimize product displays and sales strategies.

For very large datasets or limited memory resources, set the "Large Dataset/Low memory?" parameter in Spotfire to True to help Apriori cope with the available resources.
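The level-wise search that Apriori performs can be sketched in plain Python as follows. This is a simplified teaching version under stated assumptions, not the implementation used by the data function (which relies on mlxtend): the function name apriori_sketch, its parameters, and the toy baskets are all made up for illustration.

```python
from itertools import combinations

# Toy transactions (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


def apriori_sketch(transactions, min_support, max_len=None):
    """Return {frozenset: support} for every itemset meeting `min_support`."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for basket in transactions if itemset <= basket) / n

    # Level 1: frequent single items.
    items = {item for basket in transactions for item in basket}
    current = {frozenset([i]): support(frozenset([i])) for i in items}
    current = {s: v for s, v in current.items() if v >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        if max_len is not None and k > max_len:
            break
        # Join: merge pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune: a candidate survives only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
    return frequent


frequent = apriori_sketch(transactions, min_support=0.3)
print(len(frequent))  # 6: four single items plus {bread, butter} and {bread, jam}
```

The prune step is what keeps the search tractable: {bread, butter, jam} is never counted against the data because its subset {butter, jam} is not frequent.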

FP-growth

The FP-growth algorithm is an efficient and scalable alternative to the Apriori algorithm. It is most useful when the dataset is very large and contains many transactions. In these cases, Apriori can become computationally intensive and take a long time to run, while FP-growth can still identify frequently purchased groups of items quickly.

If the dataset contains many rare items that are not frequently purchased together, finding relationships between them requires a low minimum support, and Apriori may then be too slow or memory-intensive to find those relationships. In that scenario, FP-growth is an alternative.
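FP-growth gets much of its efficiency from compressing the transactions into a prefix tree (the FP-tree) instead of generating candidate itemsets. The sketch below shows only that tree-building step, not the full mining procedure. The class and function names, as well as the toy baskets, are assumptions made for this example.

```python
from collections import Counter

# Toy transactions (made up for illustration).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
    {"bread", "butter", "jam"},
]


class Node:
    """One FP-tree node: an item plus the number of baskets passing through it."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_count):
    """Insert each basket into a prefix tree, most frequent items first."""
    counts = Counter(item for basket in transactions for item in basket)
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    root = Node(None)
    for basket in transactions:
        # Drop infrequent items; order by descending count, ties alphabetically.
        ordered = sorted(
            (item for item in basket if item in frequent),
            key=lambda item: (-frequent[item], item),
        )
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1
    return root


def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())


root = build_fp_tree(transactions, min_count=2)
print(count_nodes(root))  # 7 nodes represent all 12 item occurrences
```

Because baskets that start with the same frequent items share a path, the tree is smaller than the raw data, and the counts on each node are enough to recover itemset supports without rescanning every transaction.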

Open Source Libraries Used

The necessary Python libraries are pandas, NumPy, and mlxtend. Install them with Spotfire's built-in Python Tools, available from the Tools menu (Figure 1).

Visit Using Python packages in Spotfire to learn more details regarding Python setup.

Figure 1: Installation dialogue box in TIBCO Spotfire.

Data

The data used in the analysis is taken from Kaggle. The dataset contains non-store online retail transactions for various house products between 01/12/2009 and 09/12/2011. To build the final dataset, the invoice dates were shifted forward to near the present time and the product category variables were defined manually. The default data in the Spotfire template is a public dataset that may be used for demo purposes. It contains information about customer transactions: the product purchased, the quantity, its category, and price, as well as the date of purchase.
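Transaction data of this kind usually arrives as one row per purchased product, whereas the analysis works on one basket per transaction. The following sketch shows one way such rows could be grouped into baskets and turned into a true/false matrix. The field names invoice_id and product and the sample rows are hypothetical; they are not the column names of the template dataset.

```python
from collections import defaultdict

# Hypothetical line items: one row per product on an invoice.
rows = [
    {"invoice_id": "1001", "product": "bread"},
    {"invoice_id": "1001", "product": "butter"},
    {"invoice_id": "1001", "product": "bread"},  # repeated line, same product
    {"invoice_id": "1002", "product": "milk"},
    {"invoice_id": "1003", "product": "bread"},
    {"invoice_id": "1003", "product": "jam"},
]


def rows_to_baskets(rows, id_key="invoice_id", item_key="product"):
    """Group line items into one set of products per transaction ID."""
    baskets = defaultdict(set)
    for row in rows:
        baskets[row[id_key]].add(row[item_key])
    return dict(baskets)


def one_hot(baskets):
    """Return the sorted item list and a True/False row per transaction."""
    items = sorted({item for basket in baskets.values() for item in basket})
    matrix = {tid: [item in basket for item in items] for tid, basket in baskets.items()}
    return items, matrix


baskets = rows_to_baskets(rows)
# Keep only transactions with more than one distinct item.
multi_item = {tid: basket for tid, basket in baskets.items() if len(basket) > 1}
items, matrix = one_hot(multi_item)
print(items)           # ['bread', 'butter', 'jam']
print(matrix["1001"])  # [True, True, False]
```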

Analysis

The analysis produces a set of metrics (lift, support, and confidence) that can be visualized in different ways. In the table Market Basket Analysis Result in Figure 2, the rules with the highest lift are at the top. A lift higher than 1 means that the consequent is a good product to recommend: if people are buying one product (the antecedent), you can recommend the next product (the consequent). Support measures how often these combined purchases occur in the dataset, and confidence measures how often the rule holds, that is, how often the consequent is bought when the antecedent is bought.

There are different settings that you can choose to run the analysis. You can set a minimum lift, a minimum support, and a maximum group size to ensure that the outcome of the analysis is of appropriate quality. The data scientist who designs and maintains this tool can customize these settings to meet the specific needs of the business users.
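The effect of these settings can be sketched as a simple filter over a list of rules, followed by sorting on lift so that the strongest recommendations come first. The rule values below are made up, and the function name select_rules and its parameters are hypothetical; they only mirror the three settings described above.

```python
# Made-up rules for illustration; the metric values are not real results.
rules = [
    {"antecedent": {"bread"}, "consequent": {"butter"}, "support": 0.60, "lift": 1.25},
    {"antecedent": {"butter"}, "consequent": {"bread"}, "support": 0.60, "lift": 1.25},
    {"antecedent": {"bread", "butter"}, "consequent": {"jam"}, "support": 0.35, "lift": 1.40},
    {"antecedent": {"milk"}, "consequent": {"jam"}, "support": 0.20, "lift": 0.83},
]


def select_rules(rules, min_lift=1.0, min_support=0.0, max_len=None):
    """Keep rules that pass all thresholds, strongest lift first."""
    kept = []
    for rule in rules:
        size = len(rule["antecedent"]) + len(rule["consequent"])
        if rule["lift"] < min_lift or rule["support"] < min_support:
            continue
        if max_len is not None and size > max_len:
            continue
        kept.append(rule)
    return sorted(kept, key=lambda r: r["lift"], reverse=True)


top = select_rules(rules, min_lift=1.0, min_support=0.3, max_len=2)
for rule in top:
    print(sorted(rule["antecedent"]), "=>", sorted(rule["consequent"]), rule["lift"])
```

With a maximum group size of 2, the three-item rule is dropped even though it has the highest lift; raising max_len to 3 would bring it back to the top of the list.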

Figure 3 shows a Network Chart Mod for Spotfire®. This visual is very useful because it displays the relationships between all the data points in the dataset, which makes it a good way of exploring where the strong relationships are. The darkest color indicates the strongest relationship, and the size of a node reflects its support (the popularity of the item).

Figure 2: MBA rules and relationships between different measures.

Figure 3: Network chart and scatter plot to visualize antecedents and consequents.

For further information about interactive visualization types such as the network chart, see this Spotfire Mods article. If you would like to learn more about MBA you can read this article.
