Three Ways to Generate Topics from Text Data using SAS Viya


In this post, I’ll cover three ways to derive Topics from text data using SAS Viya. This post assumes prior knowledge of text analytics such as that which can be gained from the course SAS Visual Text Analytics in SAS Viya.

Generating topics from text data is a common text analytics goal. Topics are themes that exist within a document collection, and they are based on terms that frequently appear together within documents. Topics apply to documents, and a single document can belong to more than one topic. Topics can be used in several ways. Discovering and interpreting topics might be the final goal of a text analytics project. In other applications, topics can be promoted to categories, which generates Boolean rules that can be used for document classification. Topics also produce numeric columns, derived from the underlying mathematical algorithm (singular value decomposition), and these columns can be used as new input variables for a supervised predictive model.
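To make the link between co-occurring terms and singular value decomposition concrete, here is a minimal Python sketch. It is not how SAS Viya computes topics. It only approximates the first right singular vector of a tiny, made-up document-by-term count matrix using power iteration, and it treats the largest weights in that vector as the terms of one topic. The terms, counts, and function name are all hypothetical.

```python
import math

# Hypothetical toy data: each row is a document, each column a term count.
terms = ["nausea", "headache", "dizzy", "sleep", "calm"]
doc_term = [
    [2, 1, 1, 0, 0],
    [1, 2, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 2, 1],
]


def dominant_topic(matrix, iterations=100):
    """Approximate the first right singular vector of `matrix` by power iteration."""
    n_terms = len(matrix[0])
    v = [1.0] * n_terms
    for _ in range(iterations):
        # Project every document onto the current term weights: A v
        doc_scores = [sum(a * w for a, w in zip(row, v)) for row in matrix]
        # Map the document scores back to term space: A^T (A v)
        v = [
            sum(row[j] * s for row, s in zip(matrix, doc_scores))
            for j in range(n_terms)
        ]
        # Rescale to unit length so the values stay bounded
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


weights = dominant_topic(doc_term)
ranked = sorted(zip(terms, weights), key=lambda tw: abs(tw[1]), reverse=True)
top_terms = [term for term, _ in ranked[:3]]
print(top_terms)
```

In this toy collection, the terms that occur together in the first three documents receive the largest weights, which is the sense in which a topic groups co-occurring terms. A full decomposition would yield further singular vectors, each a candidate topic.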

The three approaches covered in this post for generating text topics are Visual Text Analytics, Visual Data Mining and Machine Learning, and Visual Analytics. Although topics can also be generated through code in Visual Text Analytics or Visual Data Mining and Machine Learning, here I’ll use the point-and-click interface, Model Studio, for those two products.

For this application, we analyze a data set named Drug Reports. Its text variable contains free-form feedback from patients who are taking medication to treat depression and anxiety.

Visual Text Analytics: The Topics node

One of the nodes available in Model Studio when using Visual Text Analytics is Topics. The Topics node is the only node that falls into the Feature Extraction group.


In the default pipeline, it is the fourth node after the data node.

The default pipeline can be adjusted by the user to meet specific analysis goals. However, if topics are to be generated, a Text Parsing node must precede the Topics node in the pipeline. Parsing the document collection determines which terms are used in the analysis. These terms are known as the Kept terms, and they are used in forming topics.

There are only a few properties associated with the Topics node.

The user can choose to have the software automatically determine an optimal number of topics or the user can provide a maximum number of topics to be discovered. The default is to automatically determine the number of topics. The user can also change the Term cutoff and the Document cutoff values. Term cutoff controls how many terms are used to define each topic and Document cutoff controls how many documents are assigned to each topic.
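The effect of the two cutoffs can be illustrated with a short Python sketch. It assumes, purely for illustration, that each cutoff acts as a plain threshold on the absolute value of a weight or score. The weights, scores, topic names, and function names below are hypothetical and are not output from the Topics node.

```python
# Hypothetical term weights for a single topic (not product output).
term_weights = {
    "nausea": 0.62,
    "headache": 0.55,
    "dizzy": 0.30,
    "sleep": 0.04,
    "calm": -0.02,
}

# Hypothetical per-topic scores for three documents.
doc_scores = {
    "doc1": {"side_effects": 0.81, "withdrawal": 0.40},
    "doc2": {"side_effects": 0.10, "withdrawal": 0.75},
    "doc3": {"side_effects": 0.05, "withdrawal": 0.08},
}


def terms_in_topic(weights, term_cutoff):
    """Return the terms whose absolute weight reaches the term cutoff."""
    return sorted(t for t, w in weights.items() if abs(w) >= term_cutoff)


def topics_for_document(scores, document_cutoff):
    """Return the topics whose absolute score reaches the document cutoff."""
    return sorted(t for t, s in scores.items() if abs(s) >= document_cutoff)


defining_terms = terms_in_topic(term_weights, term_cutoff=0.25)
assignments = {
    doc: topics_for_document(scores, document_cutoff=0.35)
    for doc, scores in doc_scores.items()
}
print(defining_terms)
print(assignments)
```

Under this assumption, raising the term cutoff leaves fewer terms defining a topic, and raising the document cutoff assigns fewer documents to it. The example also shows that one document can land in several topics and another in none.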

When analyzing the Drug Reports data using the default properties, 10 topics are discovered.

Topics discovered cover themes such as side effects while taking medication, withdrawal symptoms when stopping medication, and recovery results while being on medication. The top five terms associated with each topic provide the name for each topic, but each topic is typically made up of many more than just five terms.

Visual Data Mining and Machine Learning: The Text Mining node

When creating a project in Model Studio for Visual Data Mining and Machine Learning, a variable with a role of Target is required. So, although discovering text topics is an unsupervised task, the project still requires a target variable. This makes sense, because the primary purpose of Visual Data Mining and Machine Learning is to build supervised predictive models. In addition, a role of Text is assigned to the variable containing the text documents. In Visual Data Mining and Machine Learning, the node that generates topics is simply called Text Mining, and it is found in the Data Mining Preprocessing group.

There are more properties in the Text Mining node compared to the Topics node in Visual Text Analytics. These additional properties are related to text parsing.

The text parsing properties determine the Kept and Dropped terms tables. These are tables of terms used in and ignored by the analysis, respectively. The text parsing properties turned on by default are Include Parts of Speech, Extract noun groups, and Stem terms. There is also a default property to ignore terms used in three or fewer documents. The user can also supply a custom start list or stop list, as well as a synonym list. These properties are found in the Custom Lists group, which is collapsed in the screenshot above. Just as with the Topics node in Visual Text Analytics, the user can choose between a system-generated number of topics and a user-specified number. In the Text Mining node in Visual Data Mining and Machine Learning, the user cannot change the term or document cutoff.
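The rule that ignores terms used in three or fewer documents can be sketched in Python as a document-frequency filter. This is only an illustration on made-up sentences. It splits on whitespace and skips stemming, noun groups, parts of speech, and custom lists, all of which the Text Mining node applies. The function name and the `min_docs` parameter are hypothetical.

```python
from collections import Counter

# Hypothetical mini collection standing in for patient feedback.
documents = [
    "nausea every morning",
    "nausea and headache",
    "mild nausea today",
    "nausea with headache",
    "headache only",
]


def split_terms(docs, min_docs=4):
    """Split terms into kept and dropped lists by document frequency.

    A term is kept when it appears in at least `min_docs` documents,
    so min_docs=4 drops terms used in three or fewer documents.
    """
    doc_freq = Counter()
    for doc in docs:
        # Use a set so each term counts at most once per document
        doc_freq.update(set(doc.lower().split()))
    kept = sorted(t for t, n in doc_freq.items() if n >= min_docs)
    dropped = sorted(t for t, n in doc_freq.items() if n < min_docs)
    return kept, dropped


kept_terms, dropped_terms = split_terms(documents)
print(kept_terms)
print(dropped_terms)
```

Here "nausea" appears in four documents and is kept, while "headache" appears in only three and is dropped along with the rarer words.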

For this analysis, a simple pipeline was created using only the data node and a Text Mining node.

The results provide several windows of information.

At the top are the Kept and Dropped terms lists. There is also a plot of the frequency for term roles and of course a table of the derived topics. For the Drug Reports data, 10 topics were discovered.

Discovered topics vary and are mostly based on themes related to side effects and recovery. In Visual Data Mining and Machine Learning, the purpose of the Text Mining node is to generate new inputs that can be used to supplement a supervised predictive model. This is why the node is part of the preprocessing group. Creating new inputs to be used in a predictive model is known as feature engineering or feature creation.
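As a rough illustration of this kind of feature creation, the Python sketch below turns one piece of free text into numeric topic columns and appends them to an existing row of model inputs. It assumes each topic is a set of weighted terms and scores a document as the weighted sum of its term counts. The weights, the topic names, and the `age` and `dose_mg` inputs are hypothetical, and the scoring is a simplification of what the Text Mining node does.

```python
from collections import Counter

# Hypothetical topics, each described by a few weighted terms.
topic_weights = {
    "side_effects": {"nausea": 0.7, "headache": 0.6},
    "recovery": {"better": 0.8, "calm": 0.5},
}


def topic_features(text, topics):
    """Score a document against each topic as a weighted sum of term counts."""
    counts = Counter(text.lower().split())
    return {
        f"topic_{name}": sum(w * counts[term] for term, w in weights.items())
        for name, weights in topics.items()
    }


# Hypothetical structured inputs for one patient record.
row = {"age": 42, "dose_mg": 20}
row.update(topic_features("Nausea and headache at first then better", topic_weights))
print(row)
```

The resulting row carries the original inputs plus one numeric column per topic, which is the form a supervised model can consume.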

Visual Analytics: The Text Topics Object

Visual Analytics is a graphical user interface, but it is designed very differently from Model Studio. Visual Analytics is a tool to assist in exploring data, primarily through visualizations and other interactive tools. It is excellent at generating reports. Unlike Model Studio, when a Visual Analytics report is started, no roles are assigned to variables. Variable roles are assigned as specific visualizations or other tasks are created. The tools used in Visual Analytics to gain insight into data are called Objects. The object in Visual Analytics for topic discovery is called Text Topics. Once a report is started, to find the Text Topics object, expand the Objects short-cut button in the column on the left. The Text Topics object is found under the Analytics group.

When the Text Topics object is selected, variable roles are required.

Required roles (denoted with a red asterisk) are for Document Collection (i.e., the text variable), a Unique identifier, and Language. For the Drug Reports data, the following variable roles were assigned:

Text is the document variable containing patient feedback.

Visual Analytics immediately provides results once roles are assigned. No need to click run; the results are immediate!

The automatically generated topics are shown in the upper left corner of the page. A word cloud of terms based on frequency and a documents pane are also provided. Most topics, again, pertain to side effects, recovery, and withdrawal symptoms. Notice that 6 topics were discovered. This is due to a default property that allows for a maximum of 6 topics to be automatically discovered by the tool. This property can be changed by first clicking the Options short-cut button in the right column and then expanding the Topic Discovery group of properties.

Notice the property Number of terms to use in labels. The default is 4, which means each topic is displayed by showing the top 4 terms used to create it. This differs from Visual Text Analytics where the top 5 terms are used to label each topic.
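The difference between a 4-term and a 5-term label is easy to see in a small Python sketch. It assumes a label is simply the highest-weighted terms joined together. The weights and the function name are hypothetical, and the exact label format in either product may differ.

```python
# Hypothetical term weights for one topic (not product output).
topic_term_weights = {
    "nausea": 0.62,
    "headache": 0.55,
    "dizzy": 0.30,
    "tired": 0.21,
    "sleep": 0.12,
    "calm": 0.05,
}


def topic_label(term_weights, n_terms):
    """Build a label from the n_terms terms with the largest absolute weight."""
    ranked = sorted(term_weights, key=lambda t: abs(term_weights[t]), reverse=True)
    return ", ".join(ranked[:n_terms])


label_four_terms = topic_label(topic_term_weights, 4)
label_five_terms = topic_label(topic_term_weights, 5)
print(label_four_terms)
print(label_five_terms)
```

The same topic therefore gets a slightly different name in each tool, even when the underlying terms are identical.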

Topics are a powerful tool for gaining insights into a document collection. Generating topics may be the final goal of a text analysis, but topics can also be used to supplement other models. SAS Viya provides three different ways to generate text topics. Each tool is specifically tailored to match the purpose for generating topics in each analysis.

For more information on how these three tools are used to generate topics, please see product documentation:

Visual Text Analytics: The Topics node

Visual Data Mining and Machine Learning: The Text Mining node

Visual Analytics: the Text Topics object

Find more articles from SAS Global Enablement and Learning here.
