Let's go to the movies! Four ways to create Categories in SAS Visual Text Analytics.

The purpose of this post is to discuss four ways that Categories can be created using SAS Visual Text Analytics in Model Studio. Don’t worry, as the title indicates, we’ll discuss movies too! Category models are a useful and powerful feature in SAS Visual Text Analytics. A Category simply identifies a group of documents that share a common characteristic. The Categories node uses Boolean rules and proximity operators to identify that characteristic, and the resulting linguistic rules are often easy to interpret. A document falls into a category only if it satisfies that category’s linguistic rules. The Categories node provides several methods to define categories.

Categories can be based on:

a variable having a role of Category,
a Topic that is promoted to a Category,
a Concept,
user-supplied Boolean rules and/or proximity operators.

Some rules are system generated, and others are written by the user. Let’s walk through the four methods and note, for each one, whether the rules are system generated or must be manually coded. For our discussion we’ll focus on an application whose text variable contains summaries of movies. We’ll use the Categories node to investigate and classify some popular movie genres and to classify movies based on whether they made money. The data set is called Movies_plus and it is taken from the SAS Visual Text Analytics in SAS Viya training course. The data set is used solely for educational purposes, so the results and conclusions should be treated accordingly. The Text variable in the data set is overview. A character variable, Made_Money, has been assigned a role of Category and set as a display variable, and the variable title has also been set as a display variable. Here’s the data tab for the Visual Text Analytics Movies project:

The Categories node:

Before we discuss the methods for developing categories, let me briefly discuss the Categories node. In the default Visual Text Analytics pipeline, the Categories node comes last, but it can be placed anywhere within a pipeline. The properties pane is pretty light, with only a single property: Automatically generate categories and rules. The property can be turned on or off by selecting a check box. The property is turned on by default.

Once the node is run, the user has two options. First, the node can be opened, where users can explore or modify system-generated linguistic rules, create their own user-defined linguistic rules, and explore document matches for categories. Below is the opened Categories node for the Movies project, where a few other categories exist in addition to those from the Made_Money category variable. We’ll discuss all these categories soon.

The second option after the Categories node is run is for the user to view the Results window. The results include assessment measures (such as precision, recall, F-Measure) for automatically generated Categories, the score code, and a way to view the output data sets (Transactional and Modeling Ready). Here is a partial view of the Results window for the Movies project, showing Assessment plots of diagnostic counts and diagnostic metrics for the categories where linguistic rules were automatically generated:

Now let’s get into the four ways in which Categories are defined.

Automatically Generated Linguistic Rules:

A Category Variable:

On the data tab, any character variable can be assigned a role of Category. When you supply a variable with the role of Category, Boolean linguistic rules are automatically derived for each level of that categorical variable (assuming the default property, Automatically generate categories and rules, is left on). These rules can be easy to understand and can be edited for further customization or refinement. This method is beneficial when you have a predefined set of categories in your data and you want to generate rules based on these categories. As stated above, in our movie summary application, we have a Category variable Made_Money. Made_Money is a binary variable based on the profit a movie generated. Made_Money equals Yes if the movie made money (was profitable) and is No if the movie did not make money. The Categories node will automatically generate linguistic rules for BOTH levels of the category variable. Here is the system generated rule for the Made_Money=Yes category:

Keep in mind that this data is for educational, and not research or business, purposes. According to the system-generated rule, apparently movies about a “bond” between people, perhaps James “Bond”, or clumsy people tend to make money! The Documents window shows that 66 of the 2137 movie summaries satisfy the linguistic rule.

Of these 66 movies that satisfy the Made_Money=Yes category, 57 actually did make money. This gives a Precision of 86% (57 of 66). (These values are found in the Results window, which is shown above.)

A quick plot in SAS Visual Analytics reveals that in the raw data 1314 of the movies analyzed actually did make money.

Because 57 of the movies categorized as Made_Money=Yes are among the 1314 that did make money, the Recall is only 4% (57 of 1314). The user could adjust the system-generated linguistic rule to try to increase this metric.
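
To make the arithmetic behind these two metrics explicit, here is a small Python sketch that recomputes them from the counts quoted above (66 matched summaries, 57 of which made money, and 1314 money-making movies overall). The function name `assessment_metrics` is made up for illustration and is not part of SAS Visual Text Analytics. The F-measure shown is the standard balanced F1, which is an assumption about how the Results window defines it.

```python
def assessment_metrics(matched: int, true_positives: int, actual_positives: int):
    """Return (precision, recall, f1) from simple document counts.

    matched          -- documents that satisfy the category rule
    true_positives   -- matched documents that truly belong to the category
    actual_positives -- all documents that truly belong to the category
    """
    precision = true_positives / matched
    recall = true_positives / actual_positives
    # Balanced F1 (harmonic mean); assumed, not confirmed, to match the SAS F-Measure.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the Made_Money=Yes category discussed in this post.
precision, recall, f1 = assessment_metrics(
    matched=66, true_positives=57, actual_positives=1314
)
print(f"Precision: {precision:.0%}")  # Precision: 86%
print(f"Recall:    {recall:.0%}")     # Recall:    4%
print(f"F1:        {f1:.3f}")
```

The rule is rarely wrong when it fires, but it fires on very few of the movies that made money, which is why the two numbers are so far apart.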

Below is the system generated linguistic rule for the Made_Money=No category:

According to this system generated rule, if you want a movie to not make money, have it include a character named “Billy”! Keep in mind that although these rules can be interpretable, the resulting rules themselves may not be intuitive.

A Promoted Topic:

Topics in SAS Visual Text Analytics are derived from the document collection and represent the main themes or subjects in your text data. Topics are based on groups of terms that often appear together in the document collection. If you promote a topic to a category, then Boolean linguistic rules are also derived for the topic. This method is useful when you have identified a topic that is significant, and you want to create a category based on this topic. To promote a topic, simply select it in the Topics pane and click the Add topics as categories shortcut button.

In our Movies project, in addition to exploring movies that made money, we also want to explore popular movie genres. Police, murder, and crime movies are very popular and by looking at the system generated topics, we find one that fits this genre. There is a “police” topic where the top five terms that define it are: +killer, +cop, +police, +crime, serial killer. The police topic has been promoted to a Category:

Notice in the Documents pane that of the 2137 movie summaries, 231 of them fall into the police topic.

Classifying a new movie summary into the police topic is computationally expensive because document weights need to be calculated. For document scoring, it may be easier and more interpretable to classify movie summaries as police genre based on a category created from this topic. Below is the system-generated category rule for the promoted police topic:

This linguistic rule is easily interpretable and makes intuitive sense. Notice here that 227 documents satisfy the category based on the promoted police topic, compared with the 231 documents that fall into the topic itself. The category has a Precision of 73% and a Recall of 72%. (These values are found in the Results window, which is shown above.)

Assessment measures such as Precision and Recall can be calculated only for categories based on automatically generated rules. That is because in these situations there is a “truth”, or known categorical value, that the predicted category outcome can be compared against. Thus, counts such as True Positives and True Negatives can be calculated. For a promoted topic, the known true category is whether a document satisfies the topic in the Topics node.
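
As a rough illustration of how a known “truth” turns into diagnostic counts, the Python sketch below compares two sets of document IDs: the documents that truly belong to a category and the documents that the category rule matched. The IDs are toy values invented for this example, and `confusion_counts` is a hypothetical helper, not a SAS function.

```python
def confusion_counts(all_docs: set, truth: set, predicted: set) -> dict:
    """Count true/false positives and negatives from sets of document IDs."""
    return {
        "true_positive": len(truth & predicted),    # in the category and matched
        "false_positive": len(predicted - truth),   # matched but not in the category
        "false_negative": len(truth - predicted),   # in the category but missed
        "true_negative": len(all_docs - truth - predicted),
    }


# Toy collection of 10 documents (IDs are made up for illustration).
all_docs = set(range(1, 11))
truth = {1, 2, 3, 4}      # e.g. documents that satisfy a topic
predicted = {3, 4, 5}     # e.g. documents that satisfy the derived category rule

counts = confusion_counts(all_docs, truth, predicted)
print(counts)
```

Without the `truth` set there is nothing to compare `predicted` against, which is why categories built purely from user-written rules do not get these measures.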

User Created Linguistic Rules:

Based on a Concept:

Concepts in SAS Visual Text Analytics are entities, facts, events, or key pieces of information in the text data. Concepts are extracted from within documents. Although categories can be created for both predefined and custom concepts, I’ll focus on custom concepts in this post. You can define custom concepts using linguistic and Boolean rules in the Concepts node. The linguistic rules for creating custom concepts are in the form of LITI (Language Interpretation of Textual Information) code. These custom concepts can then be used in creation of custom category rules. This method is beneficial when you want to create categories based on specific concepts that are important in your text data. In our movie example, a popular movie genre of interest is war movies. A custom concept for war movies, called WAR_CONCEPT, was created to extract terms such as soldier, ammunition, action, terrorist, and bomb.

Note in the screen shot below that the custom WAR_CONCEPT concept contained matches in 1497 of the 2137 movie summaries.

In the Categories node, a single line of Boolean code is needed to create a custom category from a concept. It simply requires a Boolean OR statement and the full, case-sensitive name of the concept. In the screen capture below, the category based on the war concept is named War Movies.

The number of documents that satisfy the War Movies category is exactly the same as the number of documents that contain matches for the WAR_CONCEPT concept. This is because the category rule is based directly on the LITI code used in the custom concept. As a result, assessment measures are not calculated for this type of category. Note in the screen capture below that 1497 movie summaries satisfy the War Movies category.

User Created:

The final way to create a category is for the user to write a custom rule based on Boolean and proximity operators. Obviously, knowledge about the document collection is required to write a custom category. This method allows you to create custom categories based on specific criteria. For example, a category rule can be created that matches documents containing certain terms or phrases. These rules can be complex, involving AND, OR, and NOT operators, as well as proximity (i.e., distance) operators that consider the proximity of terms within the text. This method provides the most flexibility and control over the categorization process.
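
To give a feel for what a proximity (distance) operator checks, here is a simplified Python stand-in. It is not SAS category rule syntax and not how the Categories node is implemented; the function `within_distance`, its tokenization, and the sample sentence are all assumptions made for illustration.

```python
import re


def tokenize(text: str) -> list:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_distance(text: str, first: str, second: str, max_distance: int) -> bool:
    """True if some occurrence of `first` is within `max_distance` tokens of `second`."""
    tokens = tokenize(text)
    first_positions = [i for i, tok in enumerate(tokens) if tok == first]
    second_positions = [i for i, tok in enumerate(tokens) if tok == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )


sample = "The secret agent James Bond returns."  # made-up sample text
print(within_distance(sample, "agent", "bond", 2))  # True: the terms are 2 tokens apart
print(within_distance(sample, "agent", "bond", 1))  # False: not adjacent
```

A plain AND would match these two terms anywhere in the document, while a distance check only matches when they appear close together, which is the extra control that proximity operators add.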

Continuing in our Movies project, another genre to investigate is movies about the well-known British special agent 007. Bond, James Bond. Without much effort we can create a custom category to capture summaries of James Bond movies. The category is simply named James Bond Movies, and domain knowledge suggests that the relevant summaries will likely contain the term “bond” and either “james” or “agent”.
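
The logic of that rule, “bond” AND (“james” OR “agent”), can be sketched in a few lines of Python. This is only an approximation of the Boolean logic, not the rule syntax used in the Categories node; the function name `matches_james_bond` and the sample summaries are invented for this example and do not come from the Movies_plus data.

```python
import re


def matches_james_bond(summary: str) -> bool:
    """Approximate the rule: "bond" AND ("james" OR "agent"), ignoring case."""
    terms = set(re.findall(r"[a-z0-9']+", summary.lower()))
    return "bond" in terms and ("james" in terms or "agent" in terms)


# Made-up sample summaries, for illustration only.
samples = [
    "James Bond must stop a villain.",
    "A clumsy agent forms a bond with his new partner.",
    "Two brothers share a bond that is tested by war.",
]
for text in samples:
    print(matches_james_bond(text), "-", text)
```

The second sample satisfies the rule even though it is not about 007, which shows how a simple Boolean rule can pick up summaries from outside the intended genre.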

For a custom category it may be useful to use the Test Sample Text feature before running the node, especially if the document collection is large. Below we see the James Bond Movies category is tested on a simple, sample text where matches are made and thus the sample document would fall into the James Bond Movies category.

Below, we see that 25 movie summaries satisfy the James Bond Movies category. A manual perusal of the matched documents, guided by domain knowledge, reveals a small number of False Positives, that is, movie summaries that satisfy the linguistic rule for the James Bond Movies category but are not from actual James Bond movies. One such false positive, Johnny English Reborn, is seen below.

Categories provide powerful models to classify documents that share a common characteristic. Category rules are linguistic rules based on Boolean and proximity operators, and they can be easily interpreted. Category rules are either auto-generated or custom written by the user. I hope this post gets you on your way to using Categories for your next text analytics project! If not, you did at least learn that movie summaries about some guy named Billy may not lead you to making a fortune in Hollywood.

Additional resources:

Training in Text Analytics

SAS Visual Text Analytics documentation

VTA pipeline overview
