I used SAS Visual Text Analytics (VTA) to discover the main topics that people tweeted about on MLK Day by briefly analyzing tweets tagged #MLKDayofService. In this blog I show the steps I took to make that discovery, along with other valuable features available in VTA.
Martin Luther King Jr. Day is a US national holiday that honors Dr. King’s legacy in the battle for civil rights. Many of us have been moved by his “I Have a Dream” speech and his Nobel Peace Prize acceptance speech. Many Americans celebrate MLK Day by volunteering, as an answer to one of his famous quotes: “Life’s most persistent and urgent question is: What are you doing for others?”
SAS Visual Text Analytics (VTA) is the SAS offering designed to extract insights from unstructured data at scale. Offered on the SAS Viya architecture, VTA combines the power of Natural Language Processing (NLP), Machine Learning (ML) and Linguistic Rules. Currently, VTA supports 30 languages, and it has an open architecture supporting third-party programming interfaces.
With VTA, users can perform the full analytics lifecycle with large volumes of unstructured data: from accessing and preparing data, to building text models, to analyzing results, to deploying the models against new unstructured data. Since VTA is fully integrated with Visual Analytics (VA), one can visually analyze text data at the beginning and end of the analytics lifecycle, as is shown in this blog. One thing I found fascinating about VTA is that it automatically detects Topics and generates Categories. VTA has integrated capabilities that allow you to do contextual extraction, categorization and sentiment analysis to quickly produce insights. These insights can kick off immediate actions, or you can fine-tune them with subject-matter expertise.
In VTA, a pipeline is a process flow diagram whose nodes represent tasks in the Text Analysis Process. In this blog, I illustrate seven main steps that one can use when analyzing Twitter data. The first four steps and the final step are done in VA, while steps five and six are done in VTA. Together they illustrate the nice integration between VA and the VTA pipeline.
Twitter’s public API is used to import data into CAS. Use the Prepare Data application and its Import menu to bring Twitter data into VA.
You need to specify the search term(s) to import, the maximum number of tweets to import and whether or not to import retweets. The default maximum number of tweets is 2000. There are limitations on which data, and how much data, SAS can download using the Twitter public search API. To change users or remove authorization for the account, select Clear Authorization.
Twitter data already has a column, docid, that can serve as the unique row identifier. In VA, you can Run Profile to check this.
Convert docid from a Measure to a Category variable, and indicate that docid is the unique row identifier.
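Outside of VA, the same uniqueness check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample rows and the helper name duplicate_ids are hypothetical, and in practice Run Profile in VA does this check for you. Treating the identifier as a string mirrors the Measure-to-Category conversion described above.

```python
from collections import Counter


def duplicate_ids(rows, id_field="docid"):
    """Return identifier values (as strings) that occur more than once."""
    # Casting to str treats the id as a category label, not a number.
    counts = Counter(str(row[id_field]) for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Hypothetical sample rows; real tweets would come from the imported table.
tweets = [
    {"docid": 101, "body": "Volunteering today #MLKDayofService"},
    {"docid": 102, "body": "What are you doing for others?"},
    {"docid": 102, "body": "What are you doing for others?"},
]

print(duplicate_ids(tweets))  # ['102']
```

An empty result means the column is safe to use as the unique row identifier.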
Add a Text Topics object to the report, as shown in the photo below:
1. On the left side bar, click Objects
2. Select Text Topics and drop it into the report page
3. Select English
4. On the right side bar, click Roles
5. To the Document collection, add comments. Select English as the Data Source language and click OK
6. On the right side bar, click Options and select Analyze document sentiment
Note One: VTA will auto-generate categories if you add additional categorical variables to Document Details, but each variable added must have fewer than 400 values. My Twitter data didn’t have any variable that I could use at this step.
Note Two: If your data has emojis it is advisable to preprocess your data to remove them. Version 18W25 will not require this pre-processing.
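One possible way to do the emoji preprocessing mentioned in Note Two is a regular expression over common emoji code point ranges. The sketch below is an assumption on my part, not a SAS-provided routine: the ranges are approximate rather than exhaustive, and the function name strip_emojis is hypothetical.

```python
import re

# Approximate emoji ranges; an assumption, not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0000FE0F"             # variation selector-16
    "\U0000200D"             # zero-width joiner
    "]+"
)


def strip_emojis(text):
    """Replace emoji runs with a space, then collapse repeated whitespace."""
    without_emojis = EMOJI_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_emojis).strip()


# Escapes are used instead of literal emojis to keep the example ASCII-safe.
sample = "Serving today \U0001F64C #MLKDayofService \u2764\ufe0f"
print(strip_emojis(sample))  # Serving today #MLKDayofService
```

Replacing with a space rather than an empty string avoids gluing together words that were separated only by an emoji.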
This photo shows all the out-of-the-box results obtained in the preliminary data analysis done in VA. Save your data.
From the SAS Home menu, select the action Build Models, which takes you to SAS Model Studio. There, select New Project and enter the following information.
Once you click Save, you will see two main tabs, Data and Pipelines, and a message indicating “You must assign a variable to the Text role. Assign variable roles”. Click on that message and select “body” as your text variable.
By selecting the Pipelines tab, you see the default VTA pipeline. Working through these nodes, one does contextual extraction, sentiment analysis and categorization. Machine learning is used to generate the rules in the Topics node, and also in the Categories node for the automatic creation of category rules. With the push towards Artificial Intelligence, SAS will be applying more machine learning techniques, especially deep learning, in future VTA releases. One can add customization in the Concepts, Text Parsing and Categories nodes. Work with each node sequentially: right click on it and select Open.
Right click on the Concepts node, and select Open. Concepts are data elements or patterns (such as named entities or fact relationships) that you wish to extract from the larger text field because they match some specific context. VTA provides 9 predefined concepts, such as dates, people, places, measurements and mentions of currency, whose rules are already written to save development time.
Below are some examples of the comments matched with nlpDate.
In VTA, you can write rules for recognizing concepts that are important to you, thereby creating custom concepts. Close the Concepts Node and right click the Text Parsing Node. In the Text Parsing Node, unstructured text is parsed and transformed into the structured form of a vector by using NLP and other tools. It is useful to “drop” some terms that appear in the Kept Terms list because they are not useful for analysis. For example, I dropped http://t.co/, https://t.c, https, http, ‘s, 1942-2016, foxnews, ijessewilliams.
After selecting terms to drop, run the node to see the resulting Kept Terms list.
These are the terms that are related to mlkdayofservice.
This is the Term Map for the term volunteer. It shows the other terms that are most commonly found in tweets containing the term volunteer across the entire tweet collection.
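To make the idea of dropping terms and turning text into a term vector concrete, here is a minimal Python sketch. It is not how the Text Parsing Node is implemented: the tokenizer is a deliberately naive regular expression, the drop list is a subset of the terms I dropped above (with a bare "s" standing in for the possessive), and the names DROP_TERMS and kept_terms are hypothetical.

```python
import re
from collections import Counter

# Subset of the dropped terms above; "s" stands in for the possessive 's.
DROP_TERMS = {"http", "https", "s", "foxnews", "ijessewilliams"}


def kept_terms(text, drop=DROP_TERMS):
    """Return a term -> count vector for one tweet, minus dropped terms."""
    # Remove whole URLs first so their fragments do not become terms.
    without_urls = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z0-9]+", without_urls)
    return Counter(token for token in tokens if token not in drop)


vector = kept_terms("Volunteer with us https://t.co/abc via @foxnews")
print(sorted(vector))  # ['us', 'via', 'volunteer', 'with']
```

Each tweet becomes a sparse vector of term counts, which is the kind of structured form that downstream steps such as topic discovery can work with.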
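A rough way to think about what sits behind such a map is a co-occurrence count: for every tweet that contains the target term, tally the other terms in that tweet. The sketch below uses raw counts, which is my simplifying assumption and not necessarily the statistic VTA uses; the sample documents and the function name cooccurring_terms are hypothetical.

```python
from collections import Counter


def cooccurring_terms(documents, target):
    """Count terms that appear in the same document as the target term."""
    counts = Counter()
    for tokens in documents:
        unique_terms = set(tokens)  # count each term once per document
        if target in unique_terms:
            counts.update(unique_terms - {target})
    return counts


# Hypothetical tokenized tweets.
docs = [
    ["volunteer", "community", "service"],
    ["volunteer", "service"],
    ["dream", "speech"],
]

print(cooccurring_terms(docs, "volunteer").most_common(1))  # [('service', 2)]
```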
Close the Text Parsing Node and right click the Sentiment Analysis Node.
The Sentiment Analysis Node categorizes the sentiment and opinion expressed in documents about brands, people, organizations, and so on. VTA uses a set of proprietary rules that identify and analyze terms, phrases, and character strings that imply sentiment. You can specify a custom Sentiment model if you would like to use it instead of the default model.
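VTA’s sentiment rules are proprietary, so the following Python sketch is only a toy stand-in that shows the general idea of term-based sentiment scoring. The word lists and the function name document_sentiment are hypothetical, and a real model would also handle phrases, negation and context.

```python
import re

# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"honor", "proud", "inspiring", "love", "great"}
NEGATIVE = {"hate", "sad", "angry", "terrible"}


def document_sentiment(text):
    """Label a document by comparing positive and negative term counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(document_sentiment("Proud to honor Dr. King today"))  # positive
```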
Close the Sentiment Analysis Node and right click the Topics Node.
In the Topics Node one can see the automatically generated topics, merge topics and promote them as categories if desired.
This photo shows the auto-generated topics. Notice that up to this point there has not been any customization other than “dropping” terms that are not meaningful. Notice the richness of the auto-generated topics. The first two topics are selected; they can be merged and then promoted to be used as a Category. Merge the topics indicated, promote them and run the node.
Close the Topics Node and right click the Categories Node.
In the Categories Node, VTA assigns each tweet to one or more categories using linguistic rules, rather than the statistical weighting of terms used by Text Topics. As discussed in Note One above, in this Twitter project there will not be auto-generated categories. The photo below shows two categories promoted from the Topics Node, and the rules for the PersistentQuestions category.
It is possible to create new Categories using LITI and Boolean rules. How to create them will be explored in a future VTA blog. In this blog, I want to show how to export the results of your categorization model for exploration. Run the node. Right click on the Categories node, select “Save data table”, and save the data.
From SAS Home, open Explore and Visualize Data, and open the data saved in the previous step in the Categories node. Create a new data item of type Hierarchy, add your categories to it, and move them onto the canvas.
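The difference between rule-based categories and statistical topics can be illustrated with a small Boolean matcher in Python. This is a sketch and not VTA rule syntax: only the category name PersistentQuestions comes from this project, while the Volunteering category, the rule terms, and the names RULES and categorize are hypothetical.

```python
import re

# Hypothetical rules: a category matches when ALL terms in ANY one of its
# groups are present (an OR of ANDs).
RULES = {
    "PersistentQuestions": [["persistent", "question"], ["doing", "others"]],
    "Volunteering": [["volunteer"], ["service"]],
}


def categorize(text, rules=RULES):
    """Return every category whose Boolean rule matches the text."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(
        name
        for name, groups in rules.items()
        if any(all(term in tokens for term in group) for group in groups)
    )


print(categorize("What are you doing for others? Day of service!"))
# ['PersistentQuestions', 'Volunteering']
print(categorize("Happy Monday"))  # []
```

A tweet can land in several categories or in none; the tweets with no match correspond to the missing category in the saved output.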
For this data, it was useful to add a List Table to the report above in order to see the tweet text that corresponds to each category. To add the List Table, follow these steps:
1. Click on the List Table and drop it below the _category_ bar chart
2. Go to Roles on the right side bar and add the “body” column to the table
3. Go to the Actions menu on the right, select ‘New Action’ and add a filter; leave the default options checked and select OK
The photo below shows the “body” for the missing _category_. By analyzing it, one can select terms that could be used to improve the existing categories, or to develop new customized categories by using LITI and Boolean rules in the Categories Node.
Using SAS Visual Text Analytics (VTA), one can discover the main topics in a collection of tweets. In this blog I briefly analyzed tweets on #MLKDayofService. VTA combines the power of Natural Language Processing (NLP), Machine Learning (ML) and Linguistic Rules, and it is fully integrated with Visual Analytics. VTA has integrated capabilities that allow you to do contextual extraction, categorization and sentiment analysis to quickly produce insights.