Decanting Wine Reviews into Insights with Spotfire

Introduction/Overview/Business Use Case

Unstructured data has long been a valuable source of information, and an equally challenging one to mine. Beyond the information it holds on its own, the insights extracted from it can be correlated with other structured data, which makes the combined dataset far more powerful. Here, we present such an analysis performed on wine data.

Spotfire was one of two vendors selected, along with OpenText Magellan, for a bake-off style showcase at the Forrester Data Strategy & Insights 2019 conference in Austin. The challenge was to use natural language processing (NLP), also known as text analytics, to extract key insights from the unstructured text in a collection of wine data. The aim was to demonstrate how NLP and analytics help users explore and investigate unstructured data and extract actionable insights from it.

Based on the wine dataset, we present a use case for marketing and merchandising in which wine attributes such as flavors and tannin levels are analyzed and then combined with additional external data for price and demand forecasting. We show how easy it is to go from review data and trends to insights at the end of the financial value chain.

Of course, the use case here pertains to wine reviews but it can be generalized to pretty much any kind of vertical as long as you have access to the data!

Background

To get started, we cleaned the data by removing duplicate entries and reviewed a few records manually to understand what information they contained. We found that most reviews described the flavor of the wine in terms of fruits, spices, nuts, etc., and also described its texture by mentioning tannin levels. Most of the descriptive words were polar, signifying either high strength or softness. The reviews also contained details on when the wine should be consumed, or until when it is ideal to consume it. We decided to extract two main insights from the reviews:

Flavors mentioned (Fruits, Spices, Nuts): Helps in judging the taste of the wine.
Tannin Levels (Strong, Balanced, Soft): Helps in understanding the texture of the wine.

In order to extract the insights, we used the Python data function capability in Spotfire along with NLP libraries spaCy and NLTK.

Additionally, to extract more insights, we connected the Wine Reviews data to an E&J Gallo Grape Pricing Dataset, which contains historical grape demand and prices from 1994 to 2017. We used TERR data functions inside Spotfire to leverage forecasting libraries from open-source CRAN and to gain more value from connecting the two datasets. In the following sections, we describe the data and data functions, walk through our methods of analyzing and presenting the data, and give an overall picture of how Spotfire can be used to extract insights from unstructured text and how little effort it takes to connect those insights to other data within the dataset or in another data source.

Summary of Software and Open Source Libraries Used

Software Prerequisites:

Spotfire Desktop 10.x and above
- Python via Spotfire's fully integrated Python data function in Spotfire version 10.7 and above (if you are using Spotfire version 10.6 or lower you will need to install the community Python Data Function)
- TERR (Enterprise R) which comes with all versions of Spotfire

Library Prerequisites:

Python - pandas, numpy, spacy, nltk
TERR - Rcpp, mgcv, plyr, SpotfireStats, forecast

Installation Instructions:

Use Spotfire's inbuilt Python and TERR Tools from the Tools menu to install the necessary libraries:

Python - spacy, nltk
TERR - Rcpp, mgcv, plyr, SpotfireStats, forecast

Figure 1 Installation Dialog Box in Spotfire

Data

The Wine Review dataset is available on Kaggle. It contains the review text, a score from 1-100, the wine price, variety, region of origin, etc. You can look at a snapshot of the data below:

Figure 2 Data Overview

Data Functions (R and Python)

There are 5 data functions used in the Spotfire DXP: three run in TERR for the forecasting and elasticity analysis, and two run in Python for the text analytics. They are described below, with the two elasticity data functions grouped under item 4.

1. Entity Extraction: This Python data function uses the spaCy library to tag each word in the wine reviews with its part of speech (NOUN, VERB, ADJ, etc.). The tagged words are then used to extract fruit flavors, spice flavors and nuttiness in the wine.

2. Get Previous Words: Given a word, get its previous word from all occurrences in a dataset. This data function is built using Python. We use the document property to define the given word as a dynamic input as you will see later. Thus you can configure the target word you want the previous words for by simply using a text box in Spotfire.

3. Demand Forecasting for Grapes: This TERR Data Function uses the Holt-Winters Algorithm to forecast demand for different grape varieties. We can use the document property to define the grape variety we want the forecast for and run the data function accordingly.

4. Price & Demand Elasticity Models/Calculations: These TERR data functions fit linear and GAM (generalized additive model) regressions to estimate the price and demand elasticity curves.

Figure 3 Data Functions in Spotfire

Wine Data Insights

Starting from the wine data, we first extracted entities using the Entity Extraction data function and then fuzzy matched all the extracted entities against a dictionary curated from Wikipedia. This gave us all the relevant fruits, spices and nuts found in each wine description in the dataset.

Figure 4 Overview of Entity Extraction
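To make the fuzzy matching step more concrete, here is a minimal sketch using Python's standard `difflib` module. It is not the data function used in the dashboard: the tiny `FLAVOR_DICTIONARY`, the `match_flavors` helper and the 0.7 similarity cutoff are all assumptions made for illustration, and the real dictionary was curated from Wikipedia.

```python
from difflib import get_close_matches

# Hypothetical mini-dictionary (term -> category); the Wikipedia-derived
# dictionary used in the article is not reproduced here.
FLAVOR_DICTIONARY = {
    "cherry": "fruit",
    "blackberry": "fruit",
    "plum": "fruit",
    "cinnamon": "spice",
    "pepper": "spice",
    "almond": "nut",
    "hazelnut": "nut",
}

def match_flavors(tokens, dictionary=FLAVOR_DICTIONARY, cutoff=0.7):
    """Map extracted tokens to dictionary terms via approximate string matching."""
    vocabulary = list(dictionary)
    matches = {}
    for token in tokens:
        # Keep only the single best candidate above the similarity cutoff.
        candidate = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
        if candidate:
            term = candidate[0]
            matches[term] = dictionary[term]
    return matches

# Example nouns as they might come out of a part-of-speech tagging step.
extracted_nouns = ["cherries", "Cinnamon", "tannins", "hazelnuts", "finish"]
flavors = match_flavors(extracted_nouns)
```

The cutoff trades recall for precision: a lower value catches more plurals and misspellings but also admits more false matches, so it is worth tuning against a sample of real reviews.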

To extract tannin levels, we used the Get Previous Words data function to get all the words that occur immediately before the word "tannin" in the dataset. We then removed stopwords from the list using the NLTK library in Python. Finally, we used the tagging feature in Spotfire to select the high-frequency words that describe the wine as soft, balanced or strong, and attached those tags to the dataset.

Figure 5 Tagging of words occurring before Tannin
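The idea behind Get Previous Words can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the data function shipped with the DXP: the `previous_words` helper, the sample reviews, the short `STOPWORDS` set (a stand-in for NLTK's English stopword list) and the `LEVELS` mapping (a stand-in for manual tagging in Spotfire) are all hypothetical.

```python
import re
from collections import Counter

# Small stand-in stopword list; the article uses NLTK's stopwords instead.
STOPWORDS = {"the", "and", "of", "a", "with", "its", "in", "are", "is"}

def previous_words(texts, target):
    """Return the word right before each occurrence of `target`.

    A trailing "s" is allowed so that "tannin" also matches "tannins".
    """
    pattern = re.compile(r"\b([a-z'-]+)\s+" + re.escape(target.lower()) + r"s?\b")
    found = []
    for text in texts:
        found.extend(pattern.findall(text.lower()))
    return found

# Made-up review snippets for illustration.
reviews = [
    "Ripe cherry fruit with firm tannins and a long finish.",
    "Soft tannins and the tannins are well integrated.",
    "Black pepper notes over firm tannins.",
]

# Drop stopwords, then count how often each descriptor precedes the target.
words = [w for w in previous_words(reviews, "tannin") if w not in STOPWORDS]
counts = Counter(words)

# Hypothetical descriptor-to-level mapping, mimicking tagging done by hand.
LEVELS = {"firm": "Strong", "soft": "Soft"}
tagged = {word: LEVELS.get(word, "Untagged") for word in counts}
```

In Spotfire the target word would come from a document property, so the same logic can be pointed at any other word of interest without touching the code.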

Now, we are done with using NLP to extract features from the data. We added all extracted features back to the original dataset and came up with the below dashboard.

Figure 6 Insights extracted using NLP

On the top left, we see a nice tree map (shown individually below) with the occurrence frequency of each aromatic descriptor we found in the reviews. You can click on one or more to see how they are distributed geographically or in terms of the tagged tannin levels in the dashboard below. The pie chart shows the tannin levels for the corresponding selection of the aromatic descriptor. Using all of these, we can extract insights into questions such as:

Which country uses a particular aromatic descriptor?
If I am a wine merchandiser wanting to explore the United States, what kind of tannin levels should I focus on? Should I look at strong tannin levels if I want to stock up on wines of this particular aromatic descriptor?
Based on the geographical distribution, which country would be the best as a new market based on my current wine inventory?

As you can see, the charts we built let us answer all these questions easily, based on requirements and data. To extract more insights and gain more actionable information from the reviews, we connected the wine reviews dataset with an E&J Gallo Winery Case Study, which had data on grape prices and demand from California from 1994 to 2017.

Connecting to E&J Gallo Price Dataset

We connected the datasets on the grape variety found in both of them, and this was done entirely in Spotfire using the Data Canvas. With this new data connected to our dashboard, we were able to build two distributions across grape varieties, one for frequency and one for price, to understand the demand and price of each of the common ones.
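The join itself is configured with clicks in the Data Canvas. For readers who think in code, a rough Python equivalent of joining on grape variety might look like the sketch below. The rows, the column names (`points`, `price_per_ton`) and the `normalize` and `inner_join` helpers are hypothetical and not taken from either dataset.

```python
from collections import defaultdict

# Hypothetical rows; values and column names are illustrative only.
reviews = [
    {"variety": "Malbec", "country": "Argentina", "points": 90},
    {"variety": "Chardonnay", "country": "US", "points": 88},
    {"variety": "malbec ", "country": "US", "points": 87},
    {"variety": "Riesling", "country": "Germany", "points": 91},
]
grape_prices = [
    {"variety": "Malbec", "year": 2017, "price_per_ton": 1500.0},
    {"variety": "Chardonnay", "year": 2017, "price_per_ton": 900.0},
]

def normalize(name):
    """Make variety names comparable across sources (case, whitespace)."""
    return name.strip().lower()

def inner_join(left, right, key):
    """Join two lists of dict rows on a normalized key column."""
    index = defaultdict(list)
    for row in right:
        index[normalize(row[key])].append(row)
    joined = []
    for row in left:
        for match in index.get(normalize(row[key]), []):
            # Right-hand values win, so the key keeps the price table's spelling.
            joined.append({**row, **match})
    return joined

combined = inner_join(reviews, grape_prices, "variety")
```

Normalizing the key before matching matters in practice, because variety names are rarely spelled identically across independent sources.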

As you can see above in Figure 6, we can look at the highlighted varieties which have the selected descriptors and get an idea of which are the most expensive or the most in-demand varieties.

From this, a wine merchandiser or marketer could answer questions such as:

What varieties, tannin levels and/or descriptors work for a particular country or a set of countries?
If I want to enter a new geography, which varieties should I stock up on?

Price Elasticity Insights

Now that we have a connection between the two datasets, we move on to more prescriptive and predictive analytics. Using the Gallo Case Study data, we came up with a Demand and Price Trends dashboard as shown below:

Figure 7 Pricing & Demand Trends

Using this dashboard, we can explore the trends in demand and price between different grape varieties, and with the slider in the lower left we can adjust the period we want to look at. Most of the trends follow an elastic curve: as demand increases, price decreases, and vice versa. In the bottom part, we can see year-by-year price trends for red and white wine varieties and can follow how the trends have changed over the years by hovering over the arrows.

On the bottom right, we can see pricing trends across red and white varieties of wine, and we see that the red varieties have a more strongly decreasing trend as price decreases. Surprisingly, demand stays similar across different price points, suggesting that either the elastic assumption does not hold here or the curve is non-linear. This can also suggest that demand for some varieties is insensitive to price and that higher prices are correlated with increased demand.

To go further into predictive analytics, we use Holt-Winters forecasting to project demand for the years 2018 to 2021. On the left, we can see a highly configurable pane to select different varieties and regions. We can also set different parameters for the Holt-Winters algorithm in this pane. To make this possible, we use a custom Demand Forecast data function rather than the built-in forecast option in Spotfire. Using this, we can look at the demand trend for future years and make decisions based on it.

Figure 8 Pricing & Demand Trends
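To give a feel for what the smoothing parameters control, here is a minimal pure-Python sketch of Holt's linear-trend method, which is Holt-Winters without a seasonal component. Dropping seasonality is an assumption made here for yearly data; it is not the TERR data function used in the dashboard, which relies on CRAN forecasting libraries. The `holt_forecast` name, the default `alpha` and `beta` values and the sample `history` are hypothetical.

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=4):
    """Forecast `horizon` steps ahead with Holt's linear-trend smoothing.

    alpha smooths the level, beta smooths the trend; both lie in (0, 1].
    """
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Initialise level and trend from the first two observations.
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # Extrapolate the final level along the final trend.
    return [level + step * trend for step in range(1, horizon + 1)]

# Made-up yearly demand figures; four steps ahead mirrors a 2018-2021 horizon.
history = [100.0, 110.0, 120.0, 130.0, 140.0]
forecast = holt_forecast(history, horizon=4)
```

Higher `alpha` and `beta` values make the forecast react faster to recent years, while lower values produce a smoother, more conservative projection.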

As we saw above, linear relationships sometimes fail to show the true trends in the data. To confirm or challenge this, we look at Price and Demand Elasticity Curves, as shown in the two images below. As expected, we see a reversal of the relationship between Price and Demand in the case of Malbec. The Coast Malbec grapes show increasing demand with increasing prices, except at the lowest prices. This suggests that the most expensive grapes should be considered a separate category.

We similarly look at Demand Elasticity. In both cases we can use the left pane to configure the time period, the grape variety as well as an option to choose a linear or a non-linear elasticity model (GAM, in our case). Both of these examples demonstrate the predictive capabilities of Spotfire and the endless customization that is possible by using Python and TERR Data Functions.

Figure 9 Price Elasticity

Figure 10 Demand Elasticity
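For the linear option, one common way to put a single number on elasticity is to regress the logarithm of demand on the logarithm of price and read off the slope. The sketch below does this with ordinary least squares in plain Python. The log-log form is an assumption chosen for illustration, since the exact model specification used in the TERR data functions is not described here, and the `log_log_elasticity` helper and sample numbers are hypothetical.

```python
import math

def log_log_elasticity(prices, quantities):
    """Estimate a constant elasticity as the OLS slope of ln(quantity) on ln(price)."""
    if len(prices) != len(quantities) or len(prices) < 2:
        raise ValueError("need two or more paired observations")
    xs = [math.log(p) for p in prices]
    ys = [math.log(q) for q in quantities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("prices must not all be equal")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data where quantity halves whenever price doubles (q = 4e6 / p).
prices = [500.0, 1000.0, 2000.0, 4000.0]
quantities = [8000.0, 4000.0, 2000.0, 1000.0]
elasticity, intercept = log_log_elasticity(prices, quantities)
```

A slope near -1 means demand falls roughly in proportion to price increases, while a positive slope corresponds to the reversed relationship described for Malbec. A GAM replaces the single slope with a smooth curve, which is why it can capture behavior that differs across price ranges.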

Lastly, based on the results of the time series on the demand forecasting sheet and the model we created above in Figure 7, we came up with a financial planning scenario in which the client can make plans (e.g. for grape purchases, financial expectations and results). All of these are calculated once we finalize and run the demand elasticity model shown above.

Figure 11 Cost Planning

Conclusion

From the above dashboards, we clearly see that Spotfire together with TERR and Python Data Functions is a strong combination for conveniently working with structured as well as unstructured data without leaving the software. This allowed for a unified experience for everything from data exploration to data analysis and predictive analytics. We went from customer sentiment data and trends all the way through the value chain to forecasting and insights vital to business success, all within Spotfire, even automatically combining data sets, and almost completely with clicks. Spotfire brings the power of advanced analytics, even of unstructured data, to non-technical business users as well as to data scientists and analysts.

While our example here focused on wine data for the Forrester bake-off, the techniques shown apply to use cases well beyond wine.
