Introduction
Understanding natural language has been a crucial skill for as long as we have been communicating. In human interactions we are not only listening to and interpreting the words that are spoken; we are also analysing and drawing conclusions from the tone, volume, and body language that accompany them.
But what happens when language is communicated in written form? There are no tones, timings, or body language to analyse. This is why many forms of written communication imply an intent or tone, e.g. formal, humorous, or accusatory. With the explosion of mediums for written communication, such as blogs and social media apps, and the worldwide scale at which we now communicate, how can we interpret natural written text? Furthermore, given the sheer volume and variety of sources of this written text, we simply cannot cope with the time required to do this manually.
This is where data science again comes into play. Machine learning and artificial intelligence models can be trained on text and continue to learn from it. These models can provide various services, such as:
- Analysing the sentiment of any text, i.e. how positive or negative its intent is
- Extracting the keywords and phrases used
- Detecting references to known entities, e.g. products, names, places
- Instantly translating between languages
These services can play a major role in analysing text data for companies: assessing sentiment from customer interactions or social media such as Twitter posts, analysing keywords and sentiment in customer emails, or monitoring media publications on any topic. They can remove language barriers by making text instantly available in any language, or drive content to users based on their likes and interests.
Watch my Dr. Spotfire video on this topic on YouTube
Using Spotfire® with Sentiment Analysis and Beyond
Previously I wrote about using Spotfire® to produce interactive and highly visual tools for image recognition, using the Amazon Web Services (AWS) machine learning service Rekognition. In this blog I want to continue that theme but expand into natural language processing and the text analytics described above. This time I also wanted to compare and contrast the experience of using Microsoft's Azure services with Amazon's.
Again, in this blog we will be using the Spotfire Python data function, as described in my earlier blog: https://community.spotfire.com/articles/spotfire/image-recognition-tibco-spotfirer-using-python-and-aws/
Here is a short video of what we are going to build in our blog post today:
Our Example Data
For this blog I chose to use the Airbnb review data, which you can download for many cities here: http://insideairbnb.com/get-the-data.html . I used the Edinburgh dataset, being the most local to me, and downloaded the listings summary data as well as the reviews. Bringing this into Spotfire is incredibly simple, as you just add local data files:
From there I can use AI recommendations to get an overview of the data easily. For example, using AI recommendations I built this dashboard in very few clicks:
Dashboard built using AI recommendations (shown on left)
Setting up your Environment for Python and Spotfire
Follow these summary steps to set up your environment to run Python through Spotfire:
- Since this blog was written, Spotfire now includes the ability to run Python natively - here is the FAQ
- Install Python locally on your machine, or on the server you run Spotfire from (gotcha - make sure you add Python to your PATH variable!)
- Use pip to install any libraries you need (a quick import check follows this list):
- Pandas is the minimum required - pip install pandas
- Boto3 is the AWS Python library - pip install boto3
- Azure - pip install azure
- Azure - then needs an individual service library installed, depending on your service:
- pip install --upgrade azure-cognitiveservices-language-textanalytics
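A quick way to verify the environment is to confirm these libraries import cleanly in the Python installation that Spotfire will use. This is just a minimal sanity-check sketch; package versions will vary with your installation:

```python
# Minimal sanity check: confirm the required libraries are installed in the
# Python environment that Spotfire will use
import pandas as pd
import boto3
from azure.cognitiveservices.language.textanalytics import TextAnalyticsClient

print("pandas", pd.__version__)
print("boto3", boto3.__version__)
print("Azure Text Analytics client imported OK")
```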
In this blog we are using the Comprehend and Translate services from Amazon (a minimal translation sketch follows these links):
- https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/comprehend.html
- https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/translate.html
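To give a flavour of how lightweight these calls are, here is a hedged sketch of calling the Translate service with boto3. The region and example text are placeholders, and it assumes your AWS credentials are already configured locally:

```python
# Hedged sketch: translating text to English with the AWS Translate service
import boto3

translate = boto3.client(service_name='translate', region_name='eu-west-1')

result = translate.translate_text(
    Text="Bonjour tout le monde",   # placeholder example text
    SourceLanguageCode='auto',      # let AWS detect the source language
    TargetLanguageCode='en'
)
print(result['TranslatedText'], result['SourceLanguageCode'])
```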
For Azure we used Cognitive Services for both text and translations:
- https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/quickstarts/python-sdk
- https://learn.microsoft.com/en-us/azure/ai-services/translator/quickstart-text-sdk?pivots=programming-language-python
Building your Text and Sentiment Analytics Spotfire Tool
Here is the machine learning dashboard I built which calls two services in AWS and Azure covering sentiment, key phrase extraction, language detection and translation (to English):
Completed dashboard - calling text analytics in the cloud
Calling any cloud service from Spotfire using the Spotfire Python data function follows the same pattern:
Flow for calling cloud services from Spotfire
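In code terms, the pattern is: take the input table Spotfire passes in, call the cloud service for each row (or batch of rows), and hand back a results data frame as the output. Here is a minimal standalone skeleton of that flow; inputTable, idColumnName, textColumnName and call_service are stubbed purely for illustration, since in Spotfire the inputs arrive as data function parameters:

```python
# Generic skeleton of a Spotfire Python data function calling a cloud service.
# In Spotfire, inputTable, idColumnName and textColumnName arrive as data
# function inputs; they are stubbed here so the sketch runs standalone.
import pandas as pd

inputTable = pd.DataFrame({"review_id": [1, 2], "comments": ["Great stay!", "Too noisy."]})
idColumnName, textColumnName = "review_id", "comments"

def call_service(text):
    # Placeholder for the real cloud call, e.g. comprehend.detect_sentiment(...)
    return {"Score": 0.0}

results_list = []
for index, row in inputTable.iterrows():
    if not pd.isna(row[idColumnName]):
        result = call_service(row[textColumnName])
        result[idColumnName] = row[idColumnName]  # keep the id so results can be joined back in Spotfire
        results_list.append(result)

# Spotfire picks up this data frame as the data function output
outputTable = pd.DataFrame.from_dict(results_list, orient='columns')
print(outputTable)
```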
In Spotfire I register a new Python data function (Tools -> Register Data Function): one for AWS and another for Azure. You could combine these into one data function, but keeping them separate means you can call them simultaneously, have better control over when each is called, and manage the code more easily.
(Note that all the code examples and setup required are explained in more detail in this article: https://community.spotfire.com/s/article/text-analytics-sentiment-analysis-key-phrases-and-translations-spotfire-using-aws-and)
```python
# Copyright (c) 2017-2019 TIBCO Software Inc. All Rights Reserved.
from Python_Data_Function import *

# Put package imports here
# Please make sure you have the correct packages installed in your Python environment
import pandas as pd
import boto3

comprehend = boto3.client(service_name='comprehend', region_name='eu-west-1')

if __name__ == "__main__":
    ## Empty results list
    results_list = []

    ## Empty df to pass back if no results
    sentiment_results = pd.DataFrame(columns=(idColumnName, 'Mixed', 'Negative', 'Neutral', 'Positive', 'Sentiment'))

    ## Loop text in table - note AWS has a batch mode that may be more efficient to use
    for index, row in inputTable.iterrows():
        if not pd.isna(row[idColumnName]):
            ## Run text analytics
            text_results = comprehend.detect_sentiment(Text=row[textColumnName], LanguageCode='en')
            ## Flatten the response: add the overall sentiment and the row id to the per-class scores
            text_results['SentimentScore']['Sentiment'] = text_results['Sentiment']
            text_results['SentimentScore'][idColumnName] = int(row[idColumnName])
            results_list.append(text_results['SentimentScore'])

    if len(results_list) > 0:
        sentiment_results = pd.DataFrame.from_dict(results_list, orient='columns')
```
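As the comment in the code notes, Comprehend also offers a batch mode that can be more efficient for larger tables. Here is a hedged sketch of that variant, reusing the comprehend client and inputs from the code above and respecting the service's limit of 25 documents per batch call:

```python
# Hedged sketch: Comprehend's batch API instead of per-row calls
# (reuses comprehend, inputTable, idColumnName and textColumnName from above)
rows = [(int(row[idColumnName]), row[textColumnName])
        for _, row in inputTable.iterrows() if not pd.isna(row[idColumnName])]

batch_results = []
for i in range(0, len(rows), 25):  # AWS accepts up to 25 documents per call
    chunk = rows[i:i + 25]
    response = comprehend.batch_detect_sentiment(TextList=[text for _, text in chunk],
                                                 LanguageCode='en')
    for item in response['ResultList']:
        scores = item['SentimentScore']
        scores['Sentiment'] = item['Sentiment']
        scores[idColumnName] = chunk[item['Index']][0]  # map back to the row id
        batch_results.append(scores)
```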
In this instance I used the AWS CLI to configure my credentials locally, which means I do not need to expose them in the code. However, you can specify credentials in the code for both AWS and Azure. My inputs for this data function are defined as follows:
And the outputs:
Sending our review data to this function, specifying the review_id column as the idColumnName input and the comments column as the textColumnName input, we get a sentiment table from AWS such as this:
Here we can see Amazon gives you a score from 0 to 1 for each sentiment class (mixed, negative, neutral, and positive) as well as a final overall sentiment result.
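For reference, each detect_sentiment response is a JSON object shaped roughly like this (the scores here are invented for illustration):

```python
# Illustrative shape of a Comprehend detect_sentiment response (values invented)
{
    'Sentiment': 'POSITIVE',
    'SentimentScore': {
        'Positive': 0.95,
        'Negative': 0.01,
        'Neutral': 0.03,
        'Mixed': 0.01
    }
}
```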
Let's compare this to the Azure code and output.
Our Azure code is as follows:
```python
# Copyright (c) 2017-2019 TIBCO Software Inc. All Rights Reserved.
from Python_Data_Function import *

# Put package imports here
# Please make sure you have the correct packages installed in your Python environment
#-----------------------------------------------------------------------------
# Libraries
from azure.cognitiveservices.language.textanalytics import TextAnalyticsClient
from msrest.authentication import CognitiveServicesCredentials
import pandas as pd
import numpy as np

#-----------------------------------------------------------------------------
# Azure Text Analytics Endpoint Configuration
assert keyTextAnalytics

# Set credentials
credentials_text = CognitiveServicesCredentials(keyTextAnalytics)
text_analytics = TextAnalyticsClient(endpoint=endpointTextAnalytics, credentials=credentials_text)

#-----------------------------------------------------------------------------
if __name__ == "__main__":
    ## Empty results list
    sentiment_results_list = []

    ## Empty df to pass back if no results
    sentiment_results = pd.DataFrame(columns=('Sentiment', idColumnName, 'Language', 'Sentiment_Category'))

    ## Loop text in table
    for index, row in inputTable.iterrows():
        if not pd.isna(row[idColumnName]):
            ## Convert to the format required by Azure
            documents = [{"id": row[idColumnName], "text": row[textColumnName]}]

            ## Run Azure sentiment analysis
            text_sentiment_result = text_analytics.sentiment(documents=documents)
            sent_result_dict = {}
            sent_result_dict.update({"Sentiment": text_sentiment_result.documents[0].score,
                                     idColumnName: text_sentiment_result.documents[0].id,
                                     "Language": 'en'})

            ## Azure doesn't define sentiment categories so let's define our own
            conditions = [(sent_result_dict['Sentiment'] >= 0.6),
                          (sent_result_dict['Sentiment'] > 0.35) & (sent_result_dict['Sentiment'] < 0.6),
                          (sent_result_dict['Sentiment'] <= 0.35)]
            choices = ['positive', 'neutral', 'negative']

            ## Select the sentiment category
            sent_result_dict['Sentiment_Category'] = np.select(conditions, choices, default='')
            sentiment_results_list.append(sent_result_dict)

    if len(sentiment_results_list) > 0:
        sentiment_results = pd.DataFrame.from_dict(sentiment_results_list, orient='columns')
```
Note that this code sets the credentials within the script instead of relying on the Azure CLI (whereas with AWS we used the AWS CLI). This is purely so the two approaches can be compared; both Azure and AWS support either method. As I don't want to hard-code any credentials, I have added two extra input parameters to the Azure data function: the Azure key and the Azure service endpoint:
Running this code on some reviews returns a table from Azure that looks like this:
Here we do not have a score per sentiment category as we did with AWS. Whereas AWS returns a JSON object with a score per sentiment category plus an overall sentiment, Azure simply returns a single score from 0 to 1, and it is up to you to decide how to interpret it: the closer to 1, the more positive the sentiment. In our case the code above maps the score into three groups: negative (0 to 0.35), neutral (0.35 to 0.6), and positive (0.6 and above).
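These thresholds are our own choice rather than anything Azure defines. As a standalone sketch, the same bucketing can be written as a simple function (threshold values as in the code above):

```python
def sentiment_category(score: float) -> str:
    """Map Azure's 0-1 sentiment score to a label; the thresholds are our own choice."""
    if score >= 0.6:
        return 'positive'
    if score > 0.35:
        return 'neutral'
    return 'negative'

print(sentiment_category(0.82))  # positive
print(sentiment_category(0.10))  # negative
```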
Data Preparation
Of course, all good data science tasks involve data preparation to get the best out of the models. Jobs such as data cleansing, transformation, and standardisation are commonplace, and are often followed by feature engineering. Our data science task, however, is a simple one: analyse natural text. The data as supplied is already in a form we can send to AWS and Azure, so no transformations are needed in this case. That said, text can contain a lot of extra content that causes issues for either the text analytics or the Python code. For instance, you may want to remove stop words and punctuation, stem words, or use lemmatization. In our case we want to retain this information, as it may be important for sentiment analysis, so we simply clean out stray characters and new lines to prevent issues in the code and the cloud service calls.
Spotfire has an extensive expression language, so we can easily achieve this using regular expressions and creating new columns. Below is the calculated column I used in this example:
Substitute(RXReplace(RXReplace(RXReplace(RXReplace(Substitute([comments],"&amp;","&"),"https\\://.*","","g"),"[\\n\\t\\r]"," ","g"),"[^\\w\\s\\-\\?\\.]*","","g"),"\\?+","?","g"),"?.","?")
This removes non-alphanumeric characters, new lines, tabs, etc., as well as stripping out any URLs and standardising ampersands.
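If you want to prototype the cleaning outside Spotfire first, here is a rough Python equivalent of that calculated column. It approximates the expression above rather than matching it character for character:

```python
import re

def clean_text(comment: str) -> str:
    """Rough Python approximation of the Spotfire cleaning expression above."""
    text = comment.replace("&amp;", "&")         # standardise ampersands
    text = re.sub(r"https?://\S*", "", text)     # strip URLs
    text = re.sub(r"[\n\t\r]", " ", text)        # flatten new lines and tabs
    text = re.sub(r"[^\w\s\-\?\.]", "", text)    # drop other punctuation
    text = re.sub(r"\?+", "?", text)             # collapse repeated question marks
    return text.replace("?.", "?")

print(clean_text("Great stay!!\nSee https://example.com &amp; the pool?..."))
```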
Putting it All Together
Using just my AWS or Azure data functions, I can put together a simple sentiment dashboard tool in Spotfire, as shown below:
Simple Sentiment Dashboard
We use a map to select the property reviews to analyse for sentiment: when we select/mark properties on the map, their reviews are sent to AWS or Azure for analysis. We then have some KPI charts to summarise sentiment, and a cross table showing the individual reviews with their sentiment.
If you are wondering how I have shown satellite imagery on my Spotfire map, then watch this video by Neil Kanungo:
Expanding to Key Phrases and Translations
We can of course exploit additional services from AWS and Azure to extract key phrases and entities and to translate text between languages. We just have to expand the code we already have to handle these options and call the relevant services. Our end product in Spotfire is a tool such as this:
This compares Azure and AWS results side by side, as well as extracting key phrases and performing translations. Note that I have used property controls in Spotfire, added to a text area, to tell our Python data functions which cloud services to run:
Giving the user the option of which cloud services to run
We can then pass these to the Python data functions as true/false values and handle them as appropriate, as sketched below.
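For illustration, inside the data function these flags might be handled like this. Note that runKeyPhrases and runTranslation are hypothetical input parameter names, and the comprehend and translate clients are assumed to be set up as earlier in this post:

```python
# Hedged sketch: branching on boolean inputs set by Spotfire property controls.
# runKeyPhrases and runTranslation are hypothetical data function input names;
# comprehend and translate are boto3 clients set up as shown earlier.
for index, row in inputTable.iterrows():
    if pd.isna(row[idColumnName]):
        continue
    if runKeyPhrases:
        key_phrases = comprehend.detect_key_phrases(Text=row[textColumnName],
                                                    LanguageCode='en')
    if runTranslation:
        translation = translate.translate_text(Text=row[textColumnName],
                                               SourceLanguageCode='auto',
                                               TargetLanguageCode='en')
```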
The inputs and outputs are the same as for AWS, with the exception of the credentials and endpoint this script needs for authentication (see the description of authentication above).
You can also watch a live and full explanation of how these examples work on our YouTube channel:
Please feel free to ask any questions on the Spotfire community with a link to this blog.
License: TIBCO BSD-Style License