Imagine you are responsible for monitoring news about your company on social media and for quickly addressing customer complaints. Being able to swiftly identify the main themes in that news and in those complaints is key, and automating the process not only increases productivity but also reduces risk and helps protect customer satisfaction.
It is important to determine whether previous themes are repeating or new ones are emerging, and to identify which themes are present in each document so that each one can be resolved by the correct department.
In this article, I will show how to 1) implement several Text Analytics Action Sets in SAS Model Studio to identify the main topics in the customer complaints extracted from the Consumer Financial Protection Bureau (CFPB) in 2015, 2) build a topic scoring model, and 3) apply this scoring model to new customer complaints from January 2018 to February 2018.
Visual Text Analytics (VTA) in Viya 3.3 has two interfaces to choose from: Model Studio 8.2 and SAS Studio.
SAS® Studio is a SAS developer environment that runs in a web browser, enabling developers to program and interact with SAS. Every SAS product and solution includes SAS Studio automatically. This is the interface in which you program Text Mining Actions, as I will describe in this post.
Model Studio 8.2 on Viya 3.3 is a central, web-based platform that includes a suite of integrated data mining tools. Model Studio can be used to develop models in:
In previous articles, I described the Visual Text Analytics pipeline, and how easy it is to combine text mining, contextual extraction, categorization, sentiment analysis and search.
SAS Viya is the SAS platform that enables customers to develop, deploy, and manage analytics on a single platform throughout the analytics life cycle. The underlying engine is called CAS, which stands for Cloud Analytic Services.
A CAS Action is the smallest unit of functionality in CAS. When you submit a CAS action (through PROC CAS in SAS, or through Python, Java, or Lua), the client sends a request to the server, which parses the arguments of the request, invokes the action function, returns the results, and cleans up resources.
A CAS Action Set is a collection of actions (tasks) that group functionality: for example, session management, table management, Text Mining, etc.
CAS Procedures are executed from a SAS client, such as SAS Studio, and provide a wrapper around a CAS action or action set to perform task(s) in the server.
Action Sets and Actions are important because the same Action Sets and Actions are used no matter which client makes the request. In this article the examples are written in CASL, but you could just as easily use Python or Java.
This post shows how to use the actions implemented in the Text Mining Action Set.
These are the Actions in the Text Mining Action Set:
tmMine
Derives topics (main themes) from a collection of documents.
It calls other text analytics actions (tpParse, tpAccumulate, tmSvd).
tmScore
Uses the models built by the tmMine action to score new data.
tmSvd
Applies a matrix factorization to the output parent table of the accumulation action. It uses the occurrence data from the entire collection to produce a best-fit, low-dimensional representation in which each document and each term is a vector of numbers. This representation can be rotated into topics, which provide a more descriptive set of axes for the coordinate representation.
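To make the factorization idea concrete, here is a minimal pure-Python sketch. It is not how tmSvd is implemented: it builds a tiny term-by-document count matrix from made-up complaints and uses power iteration to approximate only the leading singular triplet. The toy documents, the helper name `leading_singular_triplet`, and the iteration count are all assumptions made for illustration.

```python
import math

# Hypothetical toy corpus (not CFPB data).
docs = [
    "mortgage loan modification denied",
    "loan modification mortgage payment",
    "mortgage loan escrow",
    "credit card fee dispute",
    "credit card late fee",
]

# Term-by-document count matrix: one row per term, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
A = [[doc.split().count(term) for doc in docs] for term in vocab]


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its two vectors by power iteration."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0 + 0.1 * j for j in range(n_cols)]  # arbitrary positive start vector
    for _ in range(iterations):
        # u = A v, then v = A^T u, then rescale v to unit length.
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    u = [sum(matrix[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return u, sigma, v


u, sigma, v = leading_singular_triplet(A)

# The terms with the largest weights in u describe the dominant theme.
top_terms = sorted(vocab, key=lambda term: u[vocab.index(term)], reverse=True)[:3]
print(sorted(top_terms), round(sigma, 3))
```

In this toy corpus the mortgage complaints outnumber the credit card complaints, so the leading term vector puts its weight on the mortgage vocabulary. A full SVD with k=3 would return three such term vectors instead of one.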
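In this article, the Text Mining Action Set is used to derive topics from the 2015 complaints (tmMine), score the 2018 complaints against those topics (tmScore), and, if needed, recompute the SVD from the parent table (tmSvd).
The code for this implementation can be seen below in the Code window. The main parts of that code are: starting a CAS session and assigning librefs; loading the 2015 training data, the stop list, and the 2018 data to be scored; discovering the topics with tmMine; scoring the new complaints with tmScore; optionally recomputing the SVD with tmSvd; and promoting the output tables.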
The code produces several output tables which identify the topics from the 2015 documents, and how those topics are present in the new document collection. Also, the Singular Value Decomposition (SVD) produces three main matrices: a term-by-topic matrix, a matrix of topic importance values, and a document-by-topics matrix.
Important: remember that in the tmMine action the parameter k was set to k=3, so we expect to see up to three topics. Because numLabels=3, each topic is labeled with three terms.
In the image below, we can see which topics were identified in the 2015 customer complaints. Zero or more of these topics will be assigned to each of the 2018 customer complaints.
The table SVDU is the term-by-topic matrix, in which each row corresponds to a term and _TermNum_ is the term’s number. The elements of this matrix can be interpreted as relevance weights: they describe the relationship of each term to each topic, and these relationships help you interpret the derived themes. Each theme is a linear combination of terms, so it is customary to label the topics by using the terms that have the highest weights.
The TERMS table is shown below (you can relate each term to its specific _Index_, or _termnum_, if you wish).
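The labeling convention just described can be sketched in a few lines of Python. This is an illustration only: the term numbers, terms, and weights below are invented stand-ins for the TERMS and SVDU tables, and `label_topics` is a hypothetical helper, not part of any SAS API.

```python
# Hypothetical stand-in for the TERMS table: term number -> term.
terms = {
    1: "loan", 2: "modification", 3: "mortgage",
    4: "card", 5: "credit", 6: "report",
    7: "debt", 8: "collection", 9: "call",
}

# Hypothetical stand-in for SVDU: term number -> weight on each of the 3 topics.
svdu = {
    1: [0.61, 0.02, 0.05],
    2: [0.48, 0.01, 0.00],
    3: [0.57, 0.04, 0.03],
    4: [0.02, 0.44, 0.06],
    5: [0.03, 0.66, 0.09],
    6: [0.01, 0.52, 0.02],
    7: [0.04, 0.07, 0.63],
    8: [0.00, 0.03, 0.55],
    9: [0.02, 0.05, 0.41],
}


def label_topics(svdu, terms, num_labels=3):
    """Label each topic with the num_labels terms that carry the highest weights."""
    n_topics = len(next(iter(svdu.values())))
    labels = []
    for topic in range(n_topics):
        ranked = sorted(svdu, key=lambda term_num: svdu[term_num][topic], reverse=True)
        labels.append([terms[term_num] for term_num in ranked[:num_labels]])
    return labels


for topic_id, label in enumerate(label_topics(svdu, terms), start=1):
    print(topic_id, ", ".join(label))
```

Here `num_labels` plays the role that numLabels=3 plays in the tmMine call: it only controls how many terms appear in each label, not how many topics are derived.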
The DOCPRO table shows the relationship of each document to each topic. It is the document-by-topic matrix, where the variables col1 to col3 hold the coefficients of the linear combination of topics that represents the document, and so indicate which topics are most relevant to it. For example, the first row indicates that for the complaint with ID 1290183, the first topic (loan, +modification, +mortgage) is the most relevant. The second row indicates that the complaint with ID 1290253 can’t be expressed as a linear combination of the three themes.
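As a rough sketch of what a document projection means, the Python below multiplies a document's term counts by term-by-topic weights and picks the topic with the largest coefficient. It is a simplification under stated assumptions: the weights are invented, the helper names `project` and `most_relevant_topic` are hypothetical, and the actual action may weight and normalize the values differently.

```python
# Hypothetical term-by-topic weights keyed by term number (invented values).
svdu = {
    1: [0.61, 0.02, 0.05],  # loan
    2: [0.48, 0.01, 0.00],  # modification
    3: [0.57, 0.04, 0.03],  # mortgage
    4: [0.03, 0.66, 0.09],  # credit
    5: [0.04, 0.07, 0.63],  # debt
}


def project(doc_counts, svdu, n_topics=3):
    """Return one coefficient per topic for a document given as {term number: count}."""
    scores = [0.0] * n_topics
    for term_num, count in doc_counts.items():
        weights = svdu.get(term_num)
        if weights is None:
            continue  # term has no row in the weight table
        for k in range(n_topics):
            scores[k] += count * weights[k]
    return scores


def most_relevant_topic(scores):
    """Return the 1-based topic number with the largest coefficient, or None if all are zero."""
    if all(score == 0 for score in scores):
        return None
    return max(range(len(scores)), key=lambda k: scores[k]) + 1


doc_a = {1: 2, 3: 1}  # "loan" twice, "mortgage" once
doc_b = {}            # no term of the document is in the weight table

print(project(doc_a, svdu), most_relevant_topic(project(doc_a, svdu)))
print(project(doc_b, svdu), most_relevant_topic(project(doc_b, svdu)))
```

In this toy setup, a document ends up with all-zero coefficients when none of its terms appear in the weight table, which is one way a row like the second one in DOCPRO can arise.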
The results of applying the topic scoring model to the new customer complaints are shown in the table scoreDocpro, where we can see the linear combination of topics for each document and which topics are most relevant to each complaint.
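Scoring reuses the term table and parsing configuration from training, so a new complaint is described only in terms of the vocabulary learned from the 2015 data. The Python sketch below illustrates that idea in a deliberately simplified form: the vocabulary, stop words, and the helper `to_term_counts` are assumptions for illustration, and real parsing also handles details such as stemming that are ignored here.

```python
import re
from collections import Counter

# Hypothetical training vocabulary: term -> term number (as in a TERMS table).
term_index = {"loan": 1, "modification": 2, "mortgage": 3, "credit": 4, "report": 5}

# Hypothetical stop list.
stop_words = {"the", "my", "a", "on", "was", "and"}


def to_term_counts(text, term_index, stop_words):
    """Count only the tokens that belong to the training vocabulary."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in stop_words:
            continue  # dropped by the stop list
        term_num = term_index.get(token)
        if term_num is not None:
            counts[term_num] += 1  # unseen tokens are simply skipped
    return dict(counts)


new_complaint = "My mortgage loan was sold and the escrow on my loan changed"
print(to_term_counts(new_complaint, term_index, stop_words))
```

Words such as "escrow" that never appeared in the training vocabulary contribute nothing to the score, which is why a scoring model should be refreshed when new themes start to emerge.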
One can quickly (and easily) identify the main themes in a document collection using Text Analytics Action Sets in SAS Model Studio.
/* Score new data set to find its Topics */
/*****************************************************************************/
/* Start a cas session named mysess using the existing CAS server connection */
/* while allowing override of caslib, timeout (in seconds), and locale */
/* defaults. */
/*****************************************************************************/
cas mysess sessopts=(caslib=casuser timeout=1800 locale="en_US" metrics=true);
/*****************************************************************************/
/* Create SAS librefs for existing caslibs */
/* so that they are visible in the SAS Studio Libraries tree. */
/*****************************************************************************/
caslib _all_ assign;
/* Training data from 2015 is created using the DATA step */
data casuser.reviews;
set analytic.codingcomplaints;
run;
/* Public stop list from SAS HELP is used */
proc casutil;
load casdata="engstop.csv"
incaslib="ANALYTIC" outcaslib="ANALYTIC" casout="engstop";
quit;
data casuser.engstop;
set analytic.engstop;
run;
/* load into cas the data to be scored from 01/01/2018 to 02/07/2018 */
proc casutil;
load casdata="scoreComplaints.csv"
incaslib="ANALYTIC" outcaslib="ANALYTIC" casout="scoreComplaints";
run;
data casuser.scoreComplaints;
set analytic.scoreComplaints;
run;
/* The topics are discovered and document projections */
/* made using the tmMine action */
proc cas;
loadactionset "textMining";
action tmMine;
param
docId="Complaint_ID"
documents={ name="reviews"}
text="Consumer_complaint_narrative"
nounGroups=False
tagging = False
stopList ={ name="engstop"}
parseConfig={name="config", replace=TRUE}
parent ={ name="parent",replace=TRUE}
offset ={name="offset",replace=TRUE}
terms ={ name="terms", replace=TRUE}
reduce=1
k=3
docPro ={ name="docpro", replace=TRUE}
topics ={ name="topics", replace=TRUE}
u ={ name="svdu", replace=TRUE}
numLabels=3
topicDecision=True
;
action table.fetch /table="topics", orderBy="_TopicID_"; run;
action table.fetch /table="docpro", orderBy="Complaint_ID"; run;
action table.fetch /table="svdu", orderBy="_TermNum_"; run;
run;
quit;
/* Score the new documents with tmScore, using the tables built from the training data */
proc cas;
loadactionset "textMining";
action tmScore;
param
docId="Complaint ID"
documents={name="scoreComplaints"}
text="Consumer complaint narrative"
terms={name="terms"}
parseConfig={name="config"}
u={name="svdu"}
docPro ={ name="scoreDocpro", replace=TRUE}
topics={name="topics"}
topicDecision=True
;
action table.fetch /table="scoreDocpro"; run;
run;
quit;
/* If you did not calculate the SVD initially, you can do it using the parent table as input */
proc cas;
loadactionset "textMining";
action tmSvd;
param
parent={ name="parent"}
terms={name="terms"}
k=3
u ={ name="svdu", replace=TRUE}
numLabels=3
topics={name="topicsSVD",replace=TRUE}
;
action table.fetch /table="topicsSVD"; run;
run;
quit;
/*****************************************************************************/
/* Up to this step the tables are in CASUSER; */
/* to move them to the analytic library */
/* they must be promoted */
/*****************************************************************************/
proc casutil outcaslib="ANALYTIC";
promote casdata="docpro";
promote casdata="svdu";
promote casdata="scoreDocpro";
promote casdata="topicsSVD";
quit;
/*cas mysess terminate; */