How to Quickly Identify Main Themes in Text Documents


Imagine you are responsible for monitoring social media news about your company and for quickly addressing customer complaints. Being able to swiftly identify the main themes in that news and in those complaints is key, and automating the process not only increases productivity but also reduces risk and helps prevent drops in customer satisfaction.

It is important to determine whether previous themes are repeating or new ones are emerging, and to identify which themes are present in each document so that each issue can be routed to the correct department.

In this article, I will show how to 1) use the Text Mining Action Set from SAS Studio to identify the main topics in customer complaints extracted from the Consumer Financial Protection Bureau (CFPB) in 2015, 2) build a topics scoring model, and 3) apply this scoring model to new customer complaints from January and February 2018.

Visual Text Analytics in Viya 3.3

Visual Text Analytics (VTA) in Viya 3.3 has two interfaces to choose from: Model Studio 8.2 and SAS Studio.

SAS® Studio is a SAS development environment that runs in a web browser, enabling developers to program and interact with SAS. Every SAS product and solution includes SAS Studio automatically. In this interface, you program Text Mining actions, as I describe in this post.

Model Studio 8.2 on Viya 3.3 is a central, web-based platform that includes a suite of integrated data mining tools. Model Studio can be used to develop models in:

SAS Visual Data Mining and Machine Learning,
SAS Visual Forecasting, or
SAS Visual Text Analytics.

In previous articles, I described the Visual Text Analytics pipeline, and how easy it is to combine text mining, contextual extraction, categorization, sentiment analysis and search.


SAS Viya and Text Mining Action Sets

SAS Viya is the SAS platform that enables customers to develop, deploy, and manage analytics on a single platform throughout the analytics lifecycle. The underlying engine is CAS, which stands for Cloud Analytic Services.

A CAS Action is the smallest unit of functionality in CAS. You can submit a CAS action using PROC CAS in SAS, or through Python, Java, or Lua. When you do, the client sends a request to the server, which parses the arguments of the request, invokes the action function, returns the results, and cleans up resources.

A CAS Action Set is a collection of actions (tasks) that group functionality: for example, session management, table management, Text Mining, etc.

CAS Procedures are executed from a SAS client, such as SAS Studio, and provide a wrapper around a CAS action or action set to perform task(s) in the server.

Action Sets and Actions are important because the same ones are used no matter which client makes the request. In this article the examples are written in CASL, but you could just as easily use Python or Java.

This post shows how to utilize actions implemented in the Text Mining Action Set, which is used:

to discover the main themes and concepts in the document collection
to build a predictive model which uses as input text-variables in the document collection
to compare documents or terms for similarity.

These are the Actions in the Text Mining Action Set:

tmMine

Derives topics (main themes) from a collection of documents.

It calls other actions internally: tpParse and tpAccumulate (from the Text Parsing action set) and tmSvd (from the Text Mining action set).

tmScore

Uses the models built by the tmMine action to score new data.

tmSvd

Applies a matrix factorization to the output parent table of the accumulation action. It uses the occurrence data from the entire collection to produce a best-fit, low-dimensional representation in which each document and each term is represented as a vector of numbers. This representation can be rotated into topics, which provide a more descriptive set of axes for the coordinate representation.
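As a reminder of what this factorization looks like, here is the textbook form of a truncated SVD. The notation is my own assumption for illustration and is not taken from the SAS documentation: $A$ is the weighted term-by-document matrix with $m$ terms and $n$ documents, and $k$ is the number of dimensions kept.

```latex
A \approx U_k \, \Sigma_k \, V_k^{\top}, \qquad
U_k \in \mathbb{R}^{m \times k}, \quad
\Sigma_k = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad
V_k \in \mathbb{R}^{n \times k}, \quad
\sigma_1 \ge \dots \ge \sigma_k \ge 0
```

Each row of $U_k$ holds the coordinates of one term, each row of $V_k$ holds the coordinates of one document, and the singular values $\sigma_i$ indicate how much each dimension contributes. With k=3, every term and every document is described by just three numbers.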

In this article, the Text Mining Action Set is used to

Discover the main themes in the CFPB complaints data set with dates between March 2015 and June 2015, which contains 2,413 complaints.
Score a new data set from the CFPB with complaints collected from January through the first week of February 2018, which contains 2,580 complaints.

The code for this implementation appears in the Appendix at the end of this post. The main parts of that code are:

1. Start a CAS session, make the caslibs visible in SAS Studio, and load the data into the CASUSER library:
- The customer complaints from 2015 (the training data)
- The stop list
- The new customer complaints from 2018 (the data to be scored)
Setting the session option metrics=true prints metrics for each executed action in the log.
2. Build a topics model using the tmMine action. Required parameters are:
- docId: Text Analytics requires a field with unique values that identifies each document in the collection
- text: the field containing the text to be analyzed
- documents: the input CAS table that contains the documents to analyze (loaded in step 1, along with the stop list)
3. Use tmScore to score the new customer complaints from 2018 using the model developed in step 2 by tmMine.
4. Use tmSvd to produce the TopicsSVD table with the topics relevant to the new data using the tables parent, terms, and U, which contain the already parsed collection from the scored file.

Output

The code produces several output tables which identify the topics from the 2015 documents, and how those topics are present in the new document collection. Also, the Singular Value Decomposition (SVD) produces three main matrices: a term-by-topic matrix, a matrix of topic importance values, and a document-by-topics matrix.

Important: remember that in the text mining action tmMine the parameter k was set to 3, so we expect to see three topics. Because numLabels was also set to 3, each topic is labeled with up to three terms.

The TOPICS table shows which topics were identified in the 2015 customer complaints. Zero or more of these topics will be assigned to each of the 2018 customer complaints.

The table SVDU is a term-by-topic matrix, in which each row corresponds to a term and _TermNum_ is the term’s number. The elements of this matrix can be interpreted as relevance weights: they describe the relationship of each term to each topic, and these relationships help you interpret the derived themes. Each theme is a linear combination of terms, so it is customary to label the topics by using the terms that have the highest weights.
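To make the labeling convention concrete, here is a minimal Python sketch of picking the highest-weighted terms for each topic. It is an illustration only, not the logic tmMine runs internally: the `term_weights` values and the `label_topics` helper are hypothetical and are not taken from the SVDU table.

```python
# Hypothetical term-by-topic weights (rows = terms, columns = topics).
# The numbers are invented for illustration; they are not SVDU output.
term_weights = {
    "loan":         [0.62, 0.05, 0.10],
    "modification": [0.48, 0.02, 0.04],
    "mortgage":     [0.55, 0.08, 0.07],
    "credit":       [0.06, 0.70, 0.12],
    "report":       [0.03, 0.58, 0.09],
    "debt":         [0.09, 0.41, 0.66],
    "collection":   [0.04, 0.12, 0.59],
    "call":         [0.02, 0.07, 0.44],
}


def label_topics(weights, num_labels=3):
    """Return one label per topic: its num_labels highest-weighted terms."""
    num_topics = len(next(iter(weights.values())))
    labels = []
    for topic in range(num_topics):
        # Rank every term by its weight on this topic, largest first.
        ranked = sorted(weights, key=lambda term: weights[term][topic], reverse=True)
        labels.append(", ".join(ranked[:num_labels]))
    return labels


topic_labels = label_topics(term_weights)
for number, label in enumerate(topic_labels, start=1):
    print(f"Topic {number}: {label}")
```

Note that a term can appear in more than one label when it carries a high weight on several topics, which is one reason the labels are best read as hints rather than strict definitions.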

The TERMS table lets you relate each term to its specific _Index_, or _termnum_, if you wish.

The DOCPRO table shows the relationship of each document to each topic. It is the document-by-topic matrix, in which the col1 to col3 variables express each document as a linear combination of the topics and so indicate which topics are most relevant to it. For example, the first row indicates that for the complaint with ID 1290183, the first topic (loan, +modification, +mortgage) is the most relevant. The second row indicates that the complaint with ID 1290253 can’t be expressed as a linear combination of the three themes.
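One way to read a document-by-topic table is to compare each weight with a cutoff, so that a document receives zero or more topics. The Python sketch below illustrates that idea under my own assumptions: the document IDs, the weights, the cutoff values, and the `assign_topics` helper are all hypothetical, and the rule SAS applies may differ in detail.

```python
# Hypothetical document-by-topic weights and per-topic cutoffs.
# All numbers are invented for illustration; they are not DOCPRO output.
doc_weights = {
    "doc_a": [0.71, 0.08, 0.05],
    "doc_b": [0.02, 0.03, 0.01],
    "doc_c": [0.10, 0.44, 0.39],
}
topic_cutoffs = [0.30, 0.30, 0.30]


def assign_topics(weights, cutoffs):
    """Return the 1-based topic numbers whose weight reaches its cutoff."""
    return [
        number
        for number, (weight, cutoff) in enumerate(zip(weights, cutoffs), start=1)
        if weight >= cutoff
    ]


assignments = {doc: assign_topics(w, topic_cutoffs) for doc, w in doc_weights.items()}
for doc, topics in assignments.items():
    print(doc, topics if topics else "no topic assigned")
```

In this toy data, one document gets a single topic, one gets two, and one gets none, which mirrors the case of a complaint that none of the three themes describes well.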

The results of applying the topic scoring model to the new customer complaints are shown in the table scoreDocpro, where we can see each document expressed as a linear combination of the topics and which topics are most relevant to each complaint.
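Conceptually, scoring a new document means projecting its term counts onto the topic axes learned from the training data. The Python sketch below shows that idea in its simplest form, and it is an assumption-laden illustration rather than a description of what tmScore does: the vocabulary, the `term_by_topic` values, and the `project` helper are hypothetical, raw counts stand in for the weighted frequencies a real pipeline would use, and there is no stemming or stop list.

```python
import math

# Hypothetical term-by-topic matrix learned from training data.
# Row i belongs to vocabulary[i]; the numbers are invented for illustration.
vocabulary = ["loan", "mortgage", "credit", "report", "debt"]
term_by_topic = [
    [0.6, 0.0, 0.1],
    [0.5, 0.1, 0.0],
    [0.0, 0.7, 0.1],
    [0.1, 0.6, 0.0],
    [0.1, 0.2, 0.8],
]


def project(text, vocab, matrix):
    """Project a document onto the topic axes and scale it to unit length."""
    tokens = text.lower().split()
    # Only terms known from training contribute; unseen words are ignored.
    counts = [tokens.count(term) for term in vocab]
    num_topics = len(matrix[0])
    raw = [
        sum(counts[i] * matrix[i][t] for i in range(len(vocab)))
        for t in range(num_topics)
    ]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw


new_doc = "my mortgage loan servicer lost my loan paperwork"
projection = project(new_doc, vocabulary, term_by_topic)
print([round(value, 3) for value in projection])
```

The point to notice is that the new document is described only through terms that already exist in the training vocabulary, which is why the scoring step reuses the terms and U tables produced during training.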

Conclusion

One can quickly (and easily) identify the main themes in a document collection, and score new documents against them, by using the Text Mining Action Set from SAS Studio.

References

SAS Visual Analytics 8.2: Programming Guide

Ray Wright, Temporal Text Mining: A Thematic Exploration of Don Quixote, SAS Global Forum Paper SAS0523-2017

Albright, R. 2004. “Taming Text with the SVD.” Cary, NC: SAS Institute Inc.

Appendix -- Code

 
/* Score new data set to find its Topics */

/*****************************************************************************/
/*  Start a cas session named mysess using the existing CAS server connection */
/*  while allowing override of caslib, timeout (in seconds), and locale     */
/*  defaults.                                                                */
/*****************************************************************************/

cas mysess sessopts=(caslib=casuser timeout=1800 locale="en_US" metrics=true);

/*****************************************************************************/
/*  Create SAS librefs for existing caslibs */
/*  so that they are visible in the SAS Studio Libraries tree.               */
/*****************************************************************************/

caslib _all_ assign;

/* Training data from 2015 is created using the DATA step */
data casuser.reviews;
set analytic.codingcomplaints;
run;

/* Public stop list from SAS HELP is used */

proc casutil;
load casdata="engstop.csv"
incaslib="ANALYTIC" outcaslib="ANALYTIC" casout="engstop";
run;

data casuser.engstop;
set analytic.engstop;
run;

/* load into cas the data to be scored from 01/01/2018 to 02/07/2018 */
proc casutil;
load casdata="scoreComplaints.csv"
incaslib="ANALYTIC" outcaslib="ANALYTIC" casout="scoreComplaints";
run;

data casuser.scoreComplaints;
set analytic.scoreComplaints;
run;

/* The topics are discovered and document projections */
/* made using the tmMine action */
proc cas;
loadactionset "textMining";
action tmMine;

param
docId="Complaint_ID"
documents={ name="reviews"}
text="Consumer_complaint_narrative"
nounGroups=False
tagging = False
stopList ={ name="engstop"}
parseConfig={name="config", replace=TRUE}
parent ={ name="parent",replace=TRUE}
offset ={name="offset",replace=TRUE}
terms ={ name="terms", replace=TRUE}
reduce=1

k=3

docPro ={ name="docpro", replace=TRUE}
topics ={ name="topics", replace=TRUE}
u ={ name="svdu", replace=TRUE}
numLabels=3
topicDecision=True
;

action table.fetch /table="topics", orderBy="_TopicID_"; run;
action table.fetch /table="docpro", orderBy="Complaint_ID"; run;
action table.fetch /table="svdu", orderBy="_TermNum_"; run;
run;

quit;

/* scoring Document made using tmScore based on training data */
proc cas;
loadactionset "textMining";
action tmScore;

param
docId="Complaint ID"
documents={name="scoreComplaints"}
text="Consumer complaint narrative"
terms={name="terms"}
parseConfig={name="config"}
u={name="svdu"}
docPro ={ name="scoreDocpro", replace=TRUE}
topics={name="topics"}
topicDecision=True
;

action table.fetch /table="scoreDocpro"; run;
run;

quit;

/* If you did not calculate the SVD initially, you can do it using the parent table as input */
proc cas;
loadactionset "textMining";
action tmSvd;

param
parent={ name="parent"}
terms={name="terms"}

k=3

u ={ name="svdu", replace=TRUE}
numLabels=3
topics={name="topicsSVD",replace=TRUE}
;

action table.fetch /table="topicsSVD"; run;
run;

quit;

/*****************************************************************************/
/*  Up to this step the tables are in CASUSER,                               */
/*  if I want to move them to the analytic library                           */
/*  they must be promoted                                                    */
/*****************************************************************************/

proc casutil outcaslib="ANALYTIC";
promote casdata="docpro";
promote casdata="svdu";
promote casdata="scoreDocpro";
promote casdata="topicsSVD";
quit;

/*cas mysess terminate; */

PatriciaNeri · ‎07-24-2018

Blanco13,

As I described in this blog, using Visual Text Analytics (VTA) one can identify the main themes in a document collection. You can use the visual interface or the programming interface. This blog describes how to do what you want to do using the programming interface.

If you prefer to use the visual interface, check this blog:
http://sww.sas.com/blogs/wp/gate/19588/discover-main-topics-on-mlkdayofservice-tweets-using-sas-visu...

If you have VTA, I could help you to quickly identify the main themes and we could even write a blog about it. Let me know,

Patricia