This article shows how Spotfire 10.7 and later can be used for sentiment analysis and topic identification on text data, using the Python packages NLTK and Gensim. Spotfire combines visual analytics with Python's text analytics, which makes it straightforward to analyze unstructured text such as customer reviews, service requests, and social media comments. As the screenshot below shows, the full visual analytics capabilities of Spotfire are available for analyzing the data mined from the text. In this case, we identify topics for the entire data set and assign the best-matching topic and an estimated sentiment to each input row. This walkthrough is also available as a video.
Python packages used in this example
NLTK is a Python package used for a variety of text analytics tasks. We use it to pre-process the data and for sentiment analysis, that is, assessing whether a text is positive or negative.
Gensim is a Python package for topic modeling that includes an implementation of the Latent Dirichlet Allocation (LDA) method. We use it to identify topics in the input text and to extract the keywords from those topics.
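To make the topic identification step more concrete, here is a minimal sketch of how pre-processed text could be passed to Gensim's LDA implementation. It is not the code from the attached data function. The function names, the small stop word list (a stand-in for NLTK's stop word corpus), the number of keywords per topic, and the training settings are all assumptions made for illustration.

```python
import re
from typing import List, Sequence, Tuple

# Small stand-in stop word list (an assumption for this sketch);
# NLTK's stopwords corpus is the fuller alternative.
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "is", "it", "of", "to", "in", "for",
    "was", "this", "that", "with", "but", "not", "are", "has",
})


def tokenize(text: str) -> List[str]:
    """Lowercase the text, keep alphabetic words, drop stop words and short tokens."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 2]


def identify_topics(texts: Sequence[str], nr_of_topics: int,
                    top_n: int = 5, seed: int = 0) -> List[Tuple[int, str]]:
    """Return (topic id, topic keywords) of the best-matching topic for each text."""
    # Imported here so that tokenize() can be used without Gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    token_lists = [tokenize(t) for t in texts]
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]

    # passes and random_state are illustrative settings, not tuned values.
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=nr_of_topics, random_state=seed, passes=10)

    # Describe each topic by its most probable words.
    keywords = {
        topic_id: ", ".join(word for word, _ in lda.show_topic(topic_id, topn=top_n))
        for topic_id in range(nr_of_topics)
    }

    results = []
    for bow in corpus:
        # Pick the topic with the highest probability for this document.
        topic_id, _ = max(lda.get_document_topics(bow, minimum_probability=0.0),
                          key=lambda pair: pair[1])
        results.append((topic_id, keywords[topic_id]))
    return results
```

LDA is a probabilistic method, so the topics it finds can vary between runs; fixing the random seed, as in this sketch, keeps the results repeatable for a given input.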
Prerequisites: You need Spotfire 10.7 or later, with the Python packages NLTK and Gensim installed.
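One way to check the package prerequisite is a short script like the sketch below, run with the same Python interpreter that Spotfire is configured to use. The helper name missing_packages is hypothetical. Note that NLTK's sentiment analyzer also needs its lexicon data, which this check does not cover.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the import names that cannot be found by the current interpreter."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(["nltk", "gensim"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("NLTK and Gensim are both available.")
```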
The data and scenario
The data in the file is an example data set of customer comments about a fictitious product and company. It contains unstructured text and a location. We analyze the unstructured text with the Python data function, enrich it with an estimated sentiment and a topic, and use visual analytics to spot insights in the enriched data.
The text data looks like this:
Data function
Inputs
The Python data function takes two inputs:
inputText: the input text to be analyzed
nrOfTopics: the number of topics to identify using the Latent Dirichlet Allocation (LDA) method implemented in the Gensim package
The inputText input is mapped to the data column "text", and nrOfTopics is mapped to a document property that is shown, and can be edited, in a text area.
Outputs
estimatedSentiment: the estimated sentiment of the text in each input row, as estimated by the VADER sentiment analyzer included in NLTK
TopicKeyWords: the keywords describing the best-matching topic for each row of the input data, using the LDA method
topicID: a numerical identifier that is unique per identified topic. We don't use it in this demo; we use the topic keywords instead
We have configured all these outputs to be added as new columns to the original data.
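As an illustration of how the estimatedSentiment column could be computed, the sketch below scores each row with NLTK's VADER analyzer. It assumes that the value stored per row is VADER's compound score, which ranges from -1 (most negative) to 1 (most positive); the attached DXP file shows what the actual data function does. The function names are hypothetical, and the 0.05 cut-off used for labels is a common convention for VADER, not something taken from this demo.

```python
from typing import Callable, Iterable, List, Optional


def vader_compound_scorer() -> Callable[[str], float]:
    """Build a scorer that returns VADER's compound score for a text."""
    # Imported here so the other helpers can be used without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # fetches the lexicon if needed
    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def estimate_sentiment(texts: Iterable[object],
                       scorer: Optional[Callable[[str], float]] = None) -> List[float]:
    """Return one sentiment score per input row, in input order."""
    if scorer is None:
        scorer = vader_compound_scorer()
    # Non-text values (for example empty cells) are scored as empty text.
    return [scorer(t if isinstance(t, str) else "") for t in texts]


def sentiment_label(score: float, threshold: float = 0.05) -> str:
    """Map a compound score to a coarse label (threshold is an assumed convention)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"
```

In a data function, the list returned by estimate_sentiment(inputText) would be the value assigned to the estimatedSentiment output. The scorer is passed in as a parameter so that the row handling can be tried out without downloading the VADER lexicon.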
The data function executes on the input data and is configured to respond to filtering: it re-executes automatically when the filtering changes or the data is reloaded. Below is a schematic overview of the data and the data function, showing how the text is used as input and how new columns are added to the input data.
The new columns produced by the data function can be used to configure visualizations like any other columns in Spotfire. In this way, the data function translates the unstructured text into structured information that we can analyze in Spotfire. Below are two example visualizations: a histogram of the estimated sentiment, and a cross table of the identified topics, described by their keywords and sorted by the average sentiment per topic.
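Spotfire performs the cross table aggregation itself, but the calculation behind it is easy to sketch in plain Python: group the rows by topic keywords, average the estimated sentiment, and sort by that average. The function name is hypothetical, and any rows passed to it in an example are illustrative rather than taken from the demo data.

```python
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple


def average_sentiment_by_topic(
    rows: Iterable[Tuple[str, float]]
) -> List[Tuple[str, float, int]]:
    """Return (topic keywords, average sentiment, row count), lowest average first."""
    grouped = defaultdict(list)
    for topic_keywords, sentiment in rows:
        grouped[topic_keywords].append(sentiment)
    summary = [(topic, mean(scores), len(scores)) for topic, scores in grouped.items()]
    return sorted(summary, key=lambda item: item[1])
```

Sorting in ascending order puts the topics with the most negative average sentiment first, which is usually where a review of customer comments would start.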
The attached DXP file illustrates the details of how this is implemented.