Unstructured data, and text data in particular, is growing rapidly and is practically everywhere: blogs, customer comments, veteran claims, tweets, notes from doctors and nurses, and so on. Including text data in predictive models can improve their predictive power and provide richer insights.
In this article I show how to restructure the categories in the SAS Visual Text Analytics (VTA) pipeline. I also illustrate how easily one can do the following in SAS Visual Analytics (VA): 1) build a Decision Tree using the unstructured data obtained from the Topics node, and 2) analyze the restructured categories data.
I will use a data set that contains information on 1527 randomly selected movies: their titles, reviews, MPAA Ratings, Main Genre classifications and Viewer Ratings.
With VTA, users can manage the full analytics lifecycle with large volumes of unstructured data; they can access and prepare data, explore document collections, build text models, generate reports, and deploy the models against new unstructured data within their existing systems or processes.
Typically, importing text data leaves behind stray character strings created by the import process. My movie data set was originally in Excel, and when I explored it in VA I noticed that the string ‘_x000D_’ was attached to several terms or placed at the beginning of each Review. I decided to remove it.
In Viya 3.3, it is very easy to navigate between applications. In the SAS Data Studio – Prepare Data application, I used the code shown in the photo below to remove all occurrences of “_x000D_”.
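For readers who cannot see the photo, here is a minimal sketch of the same cleanup idea in Python. It is not the SAS code I used in Data Studio; the function name and the two sample reviews are made up for illustration. The marker is replaced with a space rather than deleted outright, so that two words joined only by the marker do not get glued together, and leftover whitespace is then collapsed.

```python
ARTIFACT = "_x000D_"  # the stray string left behind by the Excel import


def strip_artifact(text: str, artifact: str = ARTIFACT) -> str:
    """Remove every occurrence of the artifact and collapse leftover whitespace."""
    # Replace with a space so that words separated only by the artifact stay apart.
    cleaned = text.replace(artifact, " ")
    # split() with no arguments drops leading, trailing, and repeated whitespace.
    return " ".join(cleaned.split())


# Hypothetical sample reviews, only for illustration.
reviews = [
    "_x000D_A funny movie with great jokes.",
    "Slow start_x000D_ but a strong ending.",
]
cleaned_reviews = [strip_artifact(review) for review in reviews]
print(cleaned_reviews)
```

If the reviews need their original line breaks preserved, the whitespace-collapsing step would have to be dropped or adjusted.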
In the SAS Data Studio – Build Models application, I assigned variable roles as described in my previous VTA posts.
I used the default Text Analytics pipeline and restructured only the categories in the Categories node, as described in the next paragraphs.
Categories are organized in a logical, hierarchical structure, also known as a “taxonomy”. The purpose of the next steps is to show how to restructure categories; different people would choose different subcategories. The general idea is to create a new category by combining several subcategories.
I created three new categories: childrenMovies, ActionMovies and MysteryAndSuspense. After the changes, there are three main categories: mainGenre (which has as subcategory childrenMovies), ActionMovies and MysteryAndSuspense.
I created the new category ActionMovies and gave it several subcategories by moving the categories Action, Bond 007, Martial-Arts, Sports, and War under it.
I created the new category MysteryAndSuspense and gave it several subcategories by moving the categories Crime, Cult, Horror, Mystery, and Suspense under it.
To create subcategories (e.g., War) under a category (e.g., ActionMovies), these steps were taken:
The Categories node must be run for these changes to be applied.
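The restructuring itself is done by pointing and clicking in the Categories node, but the resulting taxonomy can be pictured as a simple parent-to-children mapping. The Python sketch below is only a mental model, not anything VTA exposes: the `move_under` helper is hypothetical, and the starting list contains just the genres named above (the real mainGenre category has more, and childrenMovies is left out because its subcategories are not listed here).

```python
from typing import Dict, List


def move_under(taxonomy: Dict[str, List[str]], parent: str, children: List[str]) -> None:
    """Move each child from wherever it currently sits to under `parent`."""
    for child in children:
        # Detach the child from its current parent, if it has one.
        for kids in taxonomy.values():
            if child in kids:
                kids.remove(child)
        # Create the parent category on first use, then attach the child.
        taxonomy.setdefault(parent, []).append(child)


# Starting point: the genres named in the text, all directly under mainGenre.
taxonomy: Dict[str, List[str]] = {
    "mainGenre": [
        "Action", "Bond 007", "Martial-Arts", "Sports", "War",
        "Crime", "Cult", "Horror", "Mystery", "Suspense",
    ]
}

move_under(taxonomy, "ActionMovies", ["Action", "Bond 007", "Martial-Arts", "Sports", "War"])
move_under(taxonomy, "MysteryAndSuspense", ["Crime", "Cult", "Horror", "Mystery", "Suspense"])

for parent, kids in taxonomy.items():
    print(parent, "->", kids)
```

The point of the sketch is that no document is recategorized by hand: only the parent of each existing category changes, which is why the node has to be rerun afterwards.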
To export the results of the categorization model in a format suitable for visualization and exploration, I saved two output data sets, one from the Topics node and one from the Categories node. The steps are the same for both; just select the correct node. Right-click the selected node (e.g., Categories) and then select “Save data table”, as shown in the photo below. Save it as, for example, VTA_MovieBlog2_Categories_Data.
In this section we use VA to build a predictive model from unstructured data, using the output data from the Topics node that was saved as described in the section Saving Output Datasets in VTA.
From SAS Home, select SAS Visual Analytics – Explore and Visualize Data and open VTA_MovieBlog2_Topics_Data.
From the Objects panel, select the Decision Tree object and drag it onto the canvas.
On the right, assign Roles: select ViewerRating as the Response, then click + Add to select the Predictors, and select the variables as shown in the photo below. Do not select variables with levels (e.g., _1_0_...).
The topic with the highest Viewer Rating is +comedy,+funny,+joke,+laugh,sandler. Interestingly, the topic with the lowest Viewer Rating is +science,+fiction,+alien,+science Fiction.
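To give a feel for what a tree does with topic variables, here is a minimal Python sketch of a single split chosen by reduction in squared error, the kind of step a regression tree repeats at each node. It rests on several assumptions: ViewerRating is treated as numeric, each topic is a 0/1 flag, the column names `topic_comedy` and `topic_scifi` are hypothetical, and the six rows are invented toy values, not results from the movie data. VA's Decision Tree object uses its own criteria and settings, so this is not a reproduction of it.

```python
from statistics import mean
from typing import Dict, List, Tuple


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty group)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows: List[Dict[str, float]], target: str,
               predictors: List[str]) -> Tuple[str, float]:
    """Return the 0/1 predictor whose split most reduces squared error, and that reduction."""
    total = sse([r[target] for r in rows])
    best_name, best_gain = "", 0.0
    for name in predictors:
        without_topic = [r[target] for r in rows if r[name] == 0]
        with_topic = [r[target] for r in rows if r[name] == 1]
        gain = total - (sse(without_topic) + sse(with_topic))
        if gain > best_gain:
            best_name, best_gain = name, gain
    return best_name, best_gain


# Invented toy rows: two hypothetical topic flags and a numeric rating.
rows = [
    {"topic_comedy": 1, "topic_scifi": 0, "ViewerRating": 4.5},
    {"topic_comedy": 1, "topic_scifi": 0, "ViewerRating": 4.0},
    {"topic_comedy": 0, "topic_scifi": 1, "ViewerRating": 2.0},
    {"topic_comedy": 0, "topic_scifi": 1, "ViewerRating": 2.5},
    {"topic_comedy": 0, "topic_scifi": 0, "ViewerRating": 3.0},
    {"topic_comedy": 1, "topic_scifi": 1, "ViewerRating": 3.5},
]

name, gain = best_split(rows, "ViewerRating", ["topic_comedy", "topic_scifi"])
print(name, round(gain, 3))
```

A full tree would apply the same search again inside each of the two resulting groups until a stopping rule is met.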
Continuing with the same report in Visual Analytics, add a new page and select the VTA_MovieBlog2_Categories_Data table saved in the section Saving Output Datasets in VTA. From the Data pane, select + New data item and then select Hierarchy.
Name the new hierarchy Movie Category Hierarchy and move all of the category_level data items to the Selected items panel. Click OK.
To analyze the category hierarchy with a bar chart, drag and drop the Movie Category Hierarchy onto the canvas, which automatically creates a bar chart showing the different categories. Explore the bar chart: to go down a level in the hierarchy, double-click a bar; to go up a level, use the breadcrumbs at the top left of the canvas. These are the bar charts produced; notice how nicely they relate to the new categories.
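Conceptually, each bar chart is a count of documents at one level of the hierarchy, and a double-click filters to one parent and counts the next level down. The Python sketch below illustrates that logic under stated assumptions: each document is reduced to a hypothetical (level 1, level 2) pair, and the five rows are invented for illustration rather than taken from the saved Categories table.

```python
from collections import Counter
from typing import List, Tuple

# Invented rows: (top-level category, subcategory) for each document.
docs: List[Tuple[str, str]] = [
    ("ActionMovies", "War"),
    ("ActionMovies", "Sports"),
    ("ActionMovies", "War"),
    ("MysteryAndSuspense", "Horror"),
    ("MysteryAndSuspense", "Crime"),
]


def top_level_counts(rows: List[Tuple[str, str]]) -> Counter:
    """Bar heights for the first chart: documents per top-level category."""
    return Counter(level1 for level1, _ in rows)


def drill_down(rows: List[Tuple[str, str]], parent: str) -> Counter:
    """Bar heights after double-clicking `parent`: documents per subcategory."""
    return Counter(level2 for level1, level2 in rows if level1 == parent)


print(top_level_counts(docs))
print(drill_down(docs, "ActionMovies"))
```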
In SAS Viya one can easily access and prepare data, explore document collections, build text models, and generate reports. In future articles we will see how to deploy the models against new unstructured data.