
Text Topic Modeling

Started 04-09-2024 · Modified 04-09-2024

 

Hello, and welcome to my blog! The purpose of this blog is to encourage you to experiment with some of the options that are available for taking your Text Topic Modeling beyond the basics. If you are new to Visual Text Analytics, see my article on Getting Started with Text Analytics to get an idea of the capabilities.

 

Starting with a collection of documents, which can include customer call reports, patients' comments, product reviews, or movie recommendations, SAS provides the Natural Language Processing (NLP) capabilities that can turn the documents into useful insights.

 

The Topics node is usually inserted near the end of a text processing pipeline. This node provides you with controls that regulate the topic discovery process. Controls include setting cutoff thresholds for terms and documents. A lower threshold captures more items while a higher threshold captures fewer items. The threshold is based on the number of standard deviations above the mean term or document weight.

 

pc_1_saspch_m1.png
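As a rough sketch (plain Python, not SAS code), the cutoff rule described above works like this, assuming a hypothetical set of term weights:

```python
from statistics import mean, stdev

# Hypothetical term weights from a topic model (illustration only)
weights = {"movie": 0.91, "plot": 0.72, "actor": 0.55, "the": 0.12, "and": 0.08}

def cutoff(values, n_std):
    """Cutoff at n_std standard deviations above the mean weight."""
    return mean(values) + n_std * stdev(values)

# A higher n_std keeps fewer terms; a lower n_std keeps more
threshold = cutoff(weights.values(), n_std=1.0)
kept = [term for term, w in weights.items() if w > threshold]
```

Lowering `n_std` from 1.0 to 0.5 admits more terms past the threshold, which is exactly the behavior of the node's cutoff controls.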


 

Note that when running a Topics node, some documents will likely not fit into any of the generated topics; these are accounted for in a group designated "No Matching Topic". Re-running the node with lower thresholds will result in fewer documents falling into the "No Matching Topic" group. These settings give users adequate control for extracting reasonable topics when getting started with a text analysis project. Machine-generated topics can be combined or split, and additional user-created topics can also be added.

 

To illustrate this point, I ran the Topics node with the default settings, and it returned 11 topics. The right-most bar represents documents that did not fit into any discovered topic.

 

pc_2_saspch_m2.png

 

I then reduced the document cutoff setting from 1 to 0.7 standard deviations, and now more documents are accounted for in the generated topics, as shown below. The fifth bar from the right in the chart below represents the reduced number of documents with no matching topic. A good approach is to spot-check the documents in your topics and then adjust the thresholds until you are happy with the results.

 

pc_3_saspch_m3.png

 

The Latent Dirichlet Allocation (LDA) Topic Model

 

The results above were generated with the default Singular Value Decomposition (SVD) algorithm, also known as latent semantic indexing (LSI). If you wish to experiment with generating topics using a different algorithm, the Latent Dirichlet Allocation (LDA) topic model technique is also available. LDA uses Bayesian and expectation-maximization techniques, and running it for topic generation may produce different topics and document assignments. Now, let's discuss how we'd implement it by calling a CAS action in syntax.
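To build some intuition for what LDA's Bayesian inference is doing, here is a toy collapsed Gibbs sampler in pure Python. This is not the SAS implementation: the corpus, the two-topic setup, and the hyperparameter values are all hypothetical, chosen only to illustrate how token-level topic assignments yield document-topic proportions.

```python
import random
from collections import defaultdict

# Toy corpus (illustration only; not the SAS movie data)
docs = [["space", "alien", "ship"], ["love", "heart", "wedding"],
        ["space", "ship", "crew"], ["love", "wedding", "bride"]]
K, ALPHA, BETA = 2, 0.1, 0.1                 # topics and Dirichlet hyperparameters
vocab = sorted({w for d in docs for w in d})

random.seed(0)
# Random initial topic assignment for every token
z = [[random.randrange(K) for _ in d] for d in docs]
ndk = [[0] * K for _ in docs]                # document-topic counts
nkw = [defaultdict(int) for _ in range(K)]   # topic-word counts
nk = [0] * K
for di, d in enumerate(docs):
    for wi, w in enumerate(d):
        t = z[di][wi]
        ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Collapsed Gibbs sampling: resample each token's topic from its posterior
for _ in range(200):
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
            probs = [(ndk[di][k] + ALPHA) * (nkw[k][w] + BETA) /
                     (nk[k] + BETA * len(vocab)) for k in range(K)]
            r = random.random() * sum(probs)
            for k in range(K):
                r -= probs[k]
                if r <= 0:
                    break
            z[di][wi] = k
            ndk[di][k] += 1; nkw[k][w] += 1; nk[k] += 1

# Document-topic proportions, analogous to the action's docDistOut output
theta = [[(ndk[di][k] + ALPHA) / (len(d) + K * ALPHA) for k in range(K)]
         for di, d in enumerate(docs)]
```

Each row of `theta` sums to 1, and on this cleanly separable corpus the sampler concentrates each document on one topic.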

 

Implementing the Latent Dirichlet Allocation algorithm: the ldaTopic.ldaTrain action creates topics from a document collection using, at a minimum, a document ID variable, an input table, and the name of the variable containing the documents. It defaults to creating only 2 topics, but this can be changed using the k= parameter. In my movie description example, I asked for k=5 topics to be created.

 

From the documentation, the minimum syntax needed to run with default settings is shown here. This code would run in SAS Studio.

 

/*  Connect to CAS and load the data table.  */

options casport=5570 cashost="cloud.example.com";     /* point to your CAS server */
cas casauto;
caslib _all_ assign;

proc cas;                                             /* start the CAS procedure */
   session casauto;

   table.loadTable result=r /                         /* load the source data into CAS */
      caslib="Library_name_to_use"
      path="Path_to_my_file"
      casOut={name="CAS_file_name", replace=true};
   run;

/*  Run the ldaTrain action – this section defines the output table names  */

   ldaTopic.ldaTrain /                                /* train the LDA topic model */
      casOut={name="Topics",
              replace=TRUE
             }
      docDistOut={name="DocDist",
                  replace=TRUE
                 }

/*  This section defines the input table name, id variable and text variable  */

      docId="D_ID_variable_name"
      table={name="Table_name"}
      text={{name="T_variable_name"}};
   run;
quit;

After running code based on this outline, the following output table shows, for each document, the proportion of its terms attributed to each topic. Five topics were generated, numbered 0-4, and each document ID has a proportion value for every topic.

 

pc_4_saspch_m4.png
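Here is a sketch, in Python rather than SAS, of how you might post-process such a proportions table to assign each document to its dominant topic; the document IDs and values are hypothetical:

```python
# Hypothetical document-topic proportions, mirroring the output table above:
# one row per document ID, one proportion per topic (0-4)
doc_dist = {
    "doc1": [0.05, 0.10, 0.70, 0.10, 0.05],
    "doc2": [0.60, 0.10, 0.10, 0.10, 0.10],
}

def dominant_topic(proportions):
    """Return the index of the topic with the highest proportion."""
    return max(range(len(proportions)), key=proportions.__getitem__)

assignments = {doc: dominant_topic(p) for doc, p in doc_dist.items()}
# assignments -> {"doc1": 2, "doc2": 0}
```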

 

The following output table shows which topic (0-4) each term was placed into, along with the assigned probability that the term represents that topic.

 

pc_5_saspch_m5.png
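Similarly, a small sketch of ranking the most probable terms within a topic from a term-topic table like this one; the terms and probabilities are invented for illustration:

```python
# Hypothetical (term, topic, probability) rows, mirroring the output table above
term_dist = [
    ("space", 0, 0.12), ("ship", 0, 0.10), ("alien", 0, 0.08),
    ("love", 1, 0.14), ("wedding", 1, 0.11), ("heart", 1, 0.07),
]

def top_terms(rows, topic, n=2):
    """Return the n most probable terms for a given topic."""
    ranked = sorted((r for r in rows if r[1] == topic),
                    key=lambda r: r[2], reverse=True)
    return [r[0] for r in ranked[:n]]
```

Calling `top_terms(term_dist, 1)` returns the two strongest terms for topic 1, which is a quick way to label machine-generated topics.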

 

As I mentioned before, I added the parameter k=5 to return 5 topics rather than the default 2 when I ran against a movie description document collection. The alpha hyperparameter (minimum value 0, default 0.1) specifies the Dirichlet hyperparameter for a document's topic proportions; beta specifies the Dirichlet hyperparameter for the topic distribution.

 

pc_6_saspch_m6.png
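To see what the alpha hyperparameter does, here is a small Python sketch of the standard Dirichlet-smoothed proportion estimate, (n_k + alpha) / (N + K*alpha); the counts are hypothetical. A small alpha keeps a document's proportions peaked on its observed topics, while a large alpha flattens them toward uniform:

```python
def topic_proportions(topic_counts, alpha):
    """Dirichlet-smoothed document-topic proportions: (n_k + alpha) / (N + K*alpha)."""
    total = sum(topic_counts)
    k = len(topic_counts)
    return [(c + alpha) / (total + k * alpha) for c in topic_counts]

counts = [8, 2, 0, 0, 0]   # hypothetical tokens-per-topic counts for one document
low = topic_proportions(counts, alpha=0.1)   # peaked: unseen topics get ~0
high = topic_proportions(counts, alpha=5.0)  # flatter: mass spread across topics
```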

 

Additional optional parameters described in the product documentation let you configure the process further, including adding stop words, changing the number of iterations, and enabling entity identification and stemming.

 

You can run the LDA action set from CASL, Python, R, and Lua, as well as from a SAS Studio task for those of you who might prefer to work with a user interface. Let's take a quick look at SAS Studio for inspiration.

 

pc_7_saspch_m7.png

 

On the left you see the text parsing and topic discovery selection. In the center, the Options tab is selected, as is the LDA radio button. On the right you see the generated code calling the ldaTopic.ldaTrain action. Even if you are a 'coder', the SAS Studio task is a great way to quickly get working code that shows correct syntax and the placement of additional options.

 

The SAS Studio tasks are straightforward, easy to use, and can be an inspiration on your way to becoming a text analytics practitioner!

 

In a subsequent blog we will explore scoring these models and ways to deploy them in a production environment. Until then, I wish you much success with your analysis and may all your topics be true!

 

 
