  • Sentiment Analysis and Topic Identification using Python data functions in Spotfire


    This article shows how Spotfire 10.7 and later can be used for sentiment analysis and topic identification on text data, using the Python packages NLTK and Gensim. Spotfire combines visual analytics with Python's text analytics, which makes it straightforward to analyze unstructured text such as customer reviews, service requests and social media comments. As the screenshot below shows, the full visual analytics capabilities of Spotfire are available for analyzing the data mined from the text. In this case, we identify topics for the entire data set and assign the best matching topic and an estimated sentiment to each input row.

    [Screenshot: text and visual analytics with a density map]

    Python packages used in this example

    NLTK is a Python package for a variety of text analytics tasks. We use it for pre-processing the data and for sentiment analysis, that is, assessing whether a text is positive or negative.
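    As a rough illustration of the sentiment step, the sketch below scores each text with the VADER analyzer that ships with NLTK. The helper names `make_vader_scorer` and `estimate_sentiment` are hypothetical and not taken from the attached DXP file, and treating empty or missing text as neutral (0.0) is an assumption.

```python
def make_vader_scorer():
    """Build a text -> score function backed by NLTK's VADER analyzer.

    NLTK is imported here, not at module level, so the rest of the sketch
    can be tried with any other scoring function.
    """
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # lexicon is fetched once
    analyzer = SentimentIntensityAnalyzer()
    # "compound" is VADER's overall score, from -1 (negative) to +1 (positive).
    return lambda text: analyzer.polarity_scores(text)["compound"]


def estimate_sentiment(texts, scorer):
    """Score every row; empty or missing text gets a neutral 0.0 (assumption)."""
    return [
        scorer(text) if isinstance(text, str) and text.strip() else 0.0
        for text in texts
    ]


# Typical use, assuming NLTK is installed:
# scores = estimate_sentiment(["Great product!", "Terrible support."],
#                             make_vader_scorer())
```

    Passing the scorer in as an argument keeps the row handling separate from the library call, so the same loop can be checked with a simple stand-in function.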

    Gensim is a Python package that implements the Latent Dirichlet Allocation (LDA) method for topic identification. We use it to identify topics in the input text and to extract the keywords from those topics.
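    The topic step could look roughly like the sketch below, which fits a Gensim LDA model on already tokenized texts and then picks the best matching topic and its keywords. The helper names (`fit_lda`, `best_topic`, `topic_label`) and the settings `passes=10`, `seed=0` and `topn=5` are illustrative assumptions, not values from the attached DXP file.

```python
def fit_lda(tokenized_docs, num_topics, passes=10, seed=0):
    """Fit an LDA model on a list of token lists (requires Gensim)."""
    from gensim import corpora, models

    dictionary = corpora.Dictionary(tokenized_docs)
    # Each document becomes a bag of words: [(token_id, count), ...].
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(
        corpus=corpus,
        num_topics=num_topics,
        id2word=dictionary,
        passes=passes,
        random_state=seed,
    )
    return lda, corpus


def best_topic(doc_topics):
    """Return the topic id with the highest probability, or None if empty.

    doc_topics has the shape returned by LdaModel.get_document_topics:
    [(topic_id, probability), ...].
    """
    if not doc_topics:
        return None
    return max(doc_topics, key=lambda pair: pair[1])[0]


def topic_label(word_probs):
    """Join the words of LdaModel.show_topic output into one keyword string."""
    return ", ".join(word for word, _ in word_probs)


# Typical use, assuming Gensim is installed and docs is a list of token lists:
# lda, corpus = fit_lda(docs, num_topics=3)
# topic_ids = [best_topic(lda.get_document_topics(bow)) for bow in corpus]
# labels = {t: topic_label(lda.show_topic(t, topn=5)) for t in range(3)}
```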

    Prerequisites: You need Spotfire 10.7 or later, with the Python packages NLTK and Gensim installed.

    The data and scenario

    The data in the file is an example data set of customer comments about a fictitious product and company. It contains unstructured text and a location. We will analyze the unstructured text with the Python data function, enrich it with an estimated sentiment and a topic, and use visual analytics to spot insights in the enriched data.

    The text data looks like this:

    [Screenshot: the text data]

    Data function

    Inputs

    The Python data function takes two inputs:

    inputText: the input text to be analyzed

    nrOfTopics: the number of topics to identify using the Latent Dirichlet Allocation (LDA) method implemented in the Gensim package

    The inputText has been mapped to the data column "text", and the nrOfTopics has been mapped to a document property that is shown and editable in a text area.

    Outputs

    estimatedSentiment: the sentiment of the text in each input row, as estimated by the "Vader" sentiment analyzer

    TopicKeyWords: the keywords describing the best matching topic for each row of the input data, as found by the LDA method

    topicID: a numerical identifier that is unique per identified topic. We don't use it in this demo; we use the topic keywords instead.

    We have configured all these outputs to be added as new columns to the original data.
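    To make the input and output contract concrete, here is a minimal sketch of how the three output columns line up with the input rows. The function `analyze` and its three callable parameters are hypothetical stand-ins for the real sentiment and topic steps. In an actual Spotfire data function, `inputText` and `nrOfTopics` would be supplied by Spotfire and the outputs would be plain variables in the script, so wrapping everything in a function is only a convenience for trying it outside Spotfire.

```python
def analyze(inputText, nrOfTopics, score_text, topic_of, keywords_of):
    """Build the three output columns, one value per input row.

    score_text(text) -> float   sentiment of one text
    topic_of(text)   -> int     best matching topic id for one text
    keywords_of(id)  -> str     keywords describing a topic
    (all three callables are hypothetical placeholders)
    """
    if int(nrOfTopics) < 1:
        raise ValueError("nrOfTopics must be at least 1")

    # Missing values are replaced by empty strings so every row gets an output.
    texts = [t if isinstance(t, str) else "" for t in inputText]

    estimatedSentiment = [score_text(t) for t in texts]
    topicID = [topic_of(t) for t in texts]
    TopicKeyWords = [keywords_of(i) for i in topicID]
    return estimatedSentiment, topicID, TopicKeyWords
```

    Because each output has exactly one value per input row, the results can be added as new columns to the original data.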

    The data function executes on the input data. It is configured to respond to filtering and to execute automatically when the filtering changes or the data is reloaded. Below is a schematic overview of the data and the data function, showing how the text is used as input and how new columns are added to the input data.

    [Figure: schematic overview of the data function and data]

    [Screenshot: example visualizations for sentiment and topics]

    The new columns produced by the data function can be used to configure visualizations like any other columns in Spotfire. The data function has thus translated the unstructured text into structured information that we can analyze in Spotfire. The two example visualizations show the estimated sentiment in a histogram, and the identified topics, described by their keywords, in a cross table sorted by the average sentiment per topic.
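    In Spotfire this aggregation is done by the cross table itself. Purely as an illustration of what "sorted by the average sentiment per topic" computes, the sketch below does the same calculation in plain Python. The function name is hypothetical, and the ascending sort order (most negative topic first) is an assumption.

```python
from collections import defaultdict


def average_sentiment_by_topic(topic_keywords, sentiments):
    """Return [(topic, average sentiment), ...] sorted from lowest to highest."""
    totals = defaultdict(lambda: [0.0, 0])  # topic -> [sum of scores, row count]
    for topic, score in zip(topic_keywords, sentiments):
        totals[topic][0] += score
        totals[topic][1] += 1
    averages = {topic: total / count for topic, (total, count) in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1])
```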

    The attached DXP file illustrates the details of how this is implemented.

    sentiment_and_topic_id_short_demo_v7_with_embedded_data.dxp

