Movie Topics with the Highest Viewer Ratings: Restructuring Categories in Visual Text Analytics

Unstructured data, especially text, is growing rapidly and is practically everywhere: blogs, customer comments, veteran claims, tweets, notes from doctors and nurses, and so on. Including text data in predictive models can improve their predictive power and provide richer insights.

In this article I show how to restructure the categories in the Visual Text Analytics (VTA) pipeline. I also illustrate how easily one can, in Visual Analytics (VA): 1) build a Decision Tree using the unstructured data obtained from the Topics node, and 2) analyze the restructured categories data.

I will use a data set that contains information on 1527 randomly selected movies: their titles, reviews, MPAA Ratings, Main Genre classifications and Viewer Ratings.

With VTA, users can manage the full analytics lifecycle with large volumes of unstructured data; they can access and prepare data, explore document collections, build text models, generate reports, and deploy the models against new unstructured data within their existing systems or processes.

Data Preparation and Data Exploration

Importing text data often leaves behind stray character strings created by the import process. My movie dataset was originally in Excel, and when I explored it in VA I noticed the string ‘_x000D_’ attached to several terms and at the beginning of each Review. I decided to remove it.

In Viya 3.3, it is very easy to navigate between different applications. In the SAS Data Studio – Prepare Data application, I used the code shown in the photo below to remove all occurrences of “_x000D_”.
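For readers who prepare the data outside SAS Data Studio, the same cleanup can be sketched in a few lines of Python. This is not the code from the screenshot; it is a minimal sketch that assumes each review is available as a plain string, and the function name clean_review and the sample text are hypothetical.

```python
import re

# Excel's XML file format escapes a carriage return as the literal text "_x000D_".
ARTIFACT = "_x000D_"


def clean_review(text: str) -> str:
    """Remove every "_x000D_" and tidy the whitespace left behind."""
    # Replace with a space so that words on either side are not glued together.
    without_artifact = text.replace(ARTIFACT, " ")
    # Collapse runs of whitespace into one space and trim both ends.
    return re.sub(r"\s+", " ", without_artifact).strip()


# Hypothetical review text, for illustration only.
sample = "_x000D_A funny_x000D_ and warm family film."
print(clean_review(sample))
```

Replacing the string with a space rather than deleting it avoids joining two neighboring words when the artifact sits between them.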

Working in the VTA Pipeline

In the SAS Data Studio – Build Models application, I assigned variable roles as described in my previous VTA posts.

I used the default Text Analytics Pipeline and restructured only the categories in the Categories node, as described in the next paragraphs.

Categories are organized in a logical, hierarchical structure, also known as a “taxonomy”. The purpose of the next steps is to show how to restructure categories; different people would choose different subcategories. The general idea is to create a new category by combining several subcategories.

I created three new categories: childrenMovies, ActionMovies and MysteryAndSuspense. After the changes, there are three main categories: mainGenre (which has childrenMovies as a subcategory), ActionMovies and MysteryAndSuspense.

childrenMovies category

Working in the Categories node, I created the new childrenMovies category using this rule:
(OR,(AND,(NOT,(OR,"adults","adult","suitable for children","rated R","strip@","suck@","crude humor","gore","horror","murder","obscenity","drug use@")),"Wizard of Oz"),(AND,"pixar"),(AND,(OR,"animator","animators")),(AND,(OR,"voiced","voices","voicing","voice"),(OR,"cartoon","cartoons")),(AND,(OR,"cartoon characters","cartoon character")),(AND,(OR,"lesson","lessons"),"animated"),(AND,"live action"),(AND,"jeffrey",(OR,"features","feature")),(AND,"3-d"))
I deleted the original categories Family, Animation and Kids because, as explained in the previous article, the new childrenMovies category selected more appropriate children's movies than the original categories did.
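
To make the nesting of OR, AND and NOT in the rule above easier to follow, here is a minimal Python sketch of how such a rule can be evaluated against one review. It is not how VTA matches text: the sketch uses a plain case-insensitive substring test, ignores the `@` suffix and term boundaries, and works on an abbreviated fragment of the rule. The names matches and rule, and the sample sentences, are hypothetical.

```python
from typing import Union

# A rule is either a quoted term or a tuple: (operator, operand, ...).
Rule = Union[str, tuple]


def matches(rule: Rule, text: str) -> bool:
    """Evaluate a nested OR/AND/NOT rule against one document."""
    if isinstance(rule, str):
        # Simplification: case-insensitive substring test, not term matching.
        return rule.lower() in text.lower()
    operator, *operands = rule
    if operator == "OR":
        return any(matches(r, text) for r in operands)
    if operator == "AND":
        return all(matches(r, text) for r in operands)
    if operator == "NOT":
        (operand,) = operands  # NOT takes exactly one operand
        return not matches(operand, text)
    raise ValueError(f"unknown operator: {operator}")


# Abbreviated fragment of the childrenMovies rule, written as nested tuples.
rule = (
    "OR",
    ("AND", ("NOT", ("OR", "rated R", "horror")), "Wizard of Oz"),
    ("AND", "pixar"),
    ("AND", ("OR", "voiced", "voices"), ("OR", "cartoon", "cartoons")),
)

print(matches(rule, "A Pixar classic"))
print(matches(rule, "A horror take on the Wizard of Oz"))
```

Read this way, the rule is one large OR: a review falls into childrenMovies as soon as any one of the inner AND branches is satisfied, and the NOT branch excludes reviews that mention adult content even when they mention "Wizard of Oz".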

ActionMovies category

I created the new category ActionMovies and gave it several subcategories by moving the categories Action, Bond 007, Martial-Arts, Sports and War under it.

MysteryAndSuspense category

I created the new category MysteryAndSuspense and gave it several subcategories by moving the categories Crime, Cult, Horror, Mystery and Suspense under it.

To create subcategories (e.g., War) under a category (e.g., ActionMovies), I took these steps:

1. Right-click on ‘All Categories’ and select ‘Add new category’. Name it ‘ActionMovies’ and hit ENTER.
2. Right-click on the ‘War’ category, select Cut, and then Paste it under ‘ActionMovies’ as a subcategory.
3. Repeat step 2 for the subcategories Action, Bond 007, Martial-Arts, and Sports.

The Categories node must be run for these changes to be applied.
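
The cut-and-paste steps above amount to moving nodes inside a tree. As an illustration only, the following Python sketch models the taxonomy as nested dictionaries. The flat starting layout, the function names add_category and move_under, and the inclusion of Crime as an untouched category are assumptions made for the example, not a description of how VTA stores categories.

```python
def add_category(taxonomy: dict, name: str) -> None:
    """Add an empty top-level category (like 'Add new category')."""
    taxonomy.setdefault(name, {})


def move_under(taxonomy: dict, child: str, parent: str) -> None:
    """Cut a top-level category and paste it under another top-level one."""
    taxonomy[parent][child] = taxonomy.pop(child)


# Assumed starting point: every category sits at the top level.
taxonomy = {
    name: {}
    for name in ["Action", "Bond 007", "Martial-Arts", "Sports", "War", "Crime"]
}

add_category(taxonomy, "ActionMovies")
for name in ["War", "Action", "Bond 007", "Martial-Arts", "Sports"]:
    move_under(taxonomy, name, "ActionMovies")

print(taxonomy)
```

After the loop, only ActionMovies and Crime remain at the top level, and the five moved categories appear as children of ActionMovies.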

Saving Output Datasets in VTA

To get the results of the categorization model into a format suitable for visualization and exploration, I saved two output datasets: one from the Topics node and one from the Categories node. The steps are the same for both; just select the correct node. Right-click the node (e.g., Categories) and select “Save data table”, as shown in the photo below. Save it as, for example, VTA_MovieBlog2_Categories_Data.

Build a Decision Tree in VA

In this section we build a predictive model in VA from unstructured data, using the output data from the Topics node, which was saved as described in the section Saving Output Datasets in VTA.

From the SAS Home, select SAS Visual Analytics – Explore and Visualize Data and open the VTA_MovieBlog2_Topics_Data.

From the Object panel, select the Decision Trees object and drag it onto the canvas.

On the right, assign Roles: select ViewerRating as the Response, then click + Add to choose the Predictors, and select the variables as shown in the photo below. Do not select variables with levels (e.g. _1_0_...).

The topic with the highest Viewer Rating is +comedy,+funny,+joke,+laugh,sandler. Interestingly, the topic with the lowest Viewer Rating is +science,+fiction,+alien,+science Fiction.
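
To give a feel for what a tree does with topic flags, here is a minimal Python sketch of a single split (a "stump"). It assumes ViewerRating is numeric and each topic is a 0/1 flag, and it picks the topic whose split most reduces the sum of squared errors. This is a textbook simplification, not the algorithm VA uses, and the rows and the names topic_a and topic_b are made-up illustration values, not the movie data.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, target, predictors):
    """Return (predictor, gain, mean_if_1, mean_if_0) for the best single split."""
    base = sse([r[target] for r in rows])
    best = None
    for p in predictors:
        yes = [r[target] for r in rows if r[p] == 1]
        no = [r[target] for r in rows if r[p] == 0]
        if not yes or not no:
            continue  # a flag that never varies cannot split the data
        gain = base - (sse(yes) + sse(no))
        if best is None or gain > best[1]:
            best = (p, gain, mean(yes), mean(no))
    return best


# Made-up rows for illustration only.
rows = [
    {"topic_a": 1, "topic_b": 0, "ViewerRating": 4},
    {"topic_a": 1, "topic_b": 0, "ViewerRating": 5},
    {"topic_a": 0, "topic_b": 1, "ViewerRating": 2},
    {"topic_a": 0, "topic_b": 0, "ViewerRating": 3},
    {"topic_a": 0, "topic_b": 1, "ViewerRating": 2},
    {"topic_a": 0, "topic_b": 0, "ViewerRating": 3},
]

print(best_split(rows, "ViewerRating", ["topic_a", "topic_b"]))
```

A full tree repeats this choice inside each branch; the leaf with the highest mean rating then corresponds to the "topic with the highest Viewer Rating" reading used above.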

Visualize Restructured Categories in Visual Analytics

Continuing with the same report in Visual Analytics, add a new page to it and select the VTA_MovieBlog2_Categories_Data saved in the section Saving Output Datasets in VTA. From the Data pane, select + New data item and select Hierarchy.

Name the new hierarchy Movie Category Hierarchy and move all of the category_level data items to the Selected items panel. Click OK.

To analyze the category hierarchy using a bar chart, drag and drop the Movie Category Hierarchy onto the canvas, which automatically creates a bar chart of the different categories. Explore the bar chart: to go down a level in the hierarchy, double-click a bar; to go up a level, use the breadcrumbs at the top left of the canvas. These are the bar charts produced; notice how nicely they relate to the new categories.
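
The drill-down behavior of the hierarchy can be pictured as counting documents one level below a chosen path. The following Python sketch illustrates that idea only. It assumes each document's category is available as a tuple of levels; the function name bar_counts and the rows are made up for illustration and do not reflect the layout of the saved Categories table.

```python
from collections import Counter


def bar_counts(paths, prefix=()):
    """Count documents per category one level below `prefix`."""
    depth = len(prefix)
    counts = Counter()
    for path in paths:
        # Keep only paths that start with the prefix and go at least one level deeper.
        if len(path) > depth and tuple(path[:depth]) == tuple(prefix):
            counts[path[depth]] += 1
    return counts


# Made-up category paths for illustration only.
paths = [
    ("ActionMovies", "War"),
    ("ActionMovies", "Sports"),
    ("ActionMovies", "War"),
    ("MysteryAndSuspense", "Crime"),
    ("mainGenre", "childrenMovies"),
]

top_level = bar_counts(paths)                     # bars shown first
drill_down = bar_counts(paths, ("ActionMovies",))  # after double-clicking a bar
print(top_level)
print(drill_down)
```

Calling bar_counts with an empty prefix gives the top-level bars, and passing a longer prefix corresponds to double-clicking a bar to go down one level.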

Conclusion

In SAS Viya one can easily access and prepare data, explore document collections, build text models, and generate reports. In future articles we will see how to deploy the models against new unstructured data.