
Text Document Clustering in SAS Viya


Hello, and welcome to my post! The purpose of this post is to introduce a technique that can be used to generate clusters or groupings of documents from a text document collection.

 

While generating text topics provides insight into the themes that occur in documents, clustering organizes your document collection in a different way by grouping whole documents together.

 

In a previous post, I introduced text mining concepts and used SAS Visual Text Analytics in a no-code, point-and-click environment to extract information from a document collection. Now, let's explore a way of clustering a document collection using code and capabilities available in SAS Viya.

 

When you run clustering on a document collection, the objective is to place each document into the cluster segment that best fits its content. Each document is counted only once and appears in only one segment. Most documents contain multiple topics, but clustering attempts to group similar documents together based on the strongest, most commonly occurring theme.

 

I am going to run the tmMine action of the textMining action set to represent the document collection as a matrix of numbers. A term-by-document matrix almost always has more terms than documents, and because most of its cells are empty, it is considered a sparse matrix.

 

This transformation is done by creating singular value decomposition (SVD) projections that describe each document as a vector of numbers. The decomposition effectively reduces the dimension of the assignment problem and makes it easier to classify the documents into like groups. SVD is a technique that finds a best reduced-dimension space to approximate an original matrix, or in our case, the term-by-document matrix from the document collection. If you are interested in more details on singular value decomposition, refer to SAS Help Center: Text Mining Details.
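
To make the idea concrete, here is a small sketch of the decomposition itself, assuming SAS/IML is licensed. The tiny term-by-document matrix and its counts are invented purely for illustration:

/* SVD of a tiny term-by-document matrix (illustrative counts only) */
proc iml;
   /* 5 terms (rows) by 4 documents (columns) */
   tdm = {3 0 1 0,
          0 2 0 1,
          1 0 4 0,
          0 1 0 2,
          2 0 1 1};
   call svd(u, s, v, tdm);   /* tdm = u * diag(s) * v` */
   k = 2;                    /* keep only the first 2 dimensions */
   docPro = v[, 1:k];        /* one short numeric vector per document */
   print docPro[label="Two-dimensional document projections"];
quit;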

 

We’ll use the k-means algorithm to put documents into clusters based on the values generated by the projections. More details on singular value decomposition in text analysis are given in the SAS Help Center link above.

 

I’ll then use SAS Visual Analytics to combine the original documents with their cluster assignments and assess the results. We’ll use the same drug reports data set of patients' comments on reactions to prescription drugs that was introduced and described in the previous post.

 

Note: In Viya, there is a dataSegment action in the Smart Data action set that will cluster numeric or text data. This is an alternate approach you could try on smaller document collections. Since every document collection is different, compare the results of various approaches to see which technique works best for your data.

 

Process the document collection

 

If you have a license for Text Analytics, Viya provides a default stop list of words to exclude from text analysis in the ReferenceData caslib. The following code loads this default stop list and then uses the tmMine action to read the drug_reports data, create 15 topic dimensions (SVD projections), and save the numeric projections of the documents in a table named docPro. The generated topics are stored in a table named topics.

 

/* Load the documents and create the SVD projections */

cas mySession sessopts=(caslib=casuser timeout=1800 locale="en_US");
caslib _all_ assign;

/* load the provided default stoplist */

proc cas;                                            
   loadtable caslib="ReferenceData" path="en_stoplist.sashdat"; 
   run;
quit;

/* preview the documents to verify they are available in memory */

proc cas;
   table.fetch / table={caslib="Public", name="Drug_Reports"};
quit;

/* run the tmMine action to create 15 text topics */
/* and save the results as docPro for clustering  */

proc cas;
   loadactionset "textMining";
   action tmMine;
   param
      docId="id"
      documents={caslib="Public", name="Drug_Reports"}
      text="drugreport"       /* column of the data set that holds the documents */
      nounGroups=False
      tagging=True
      stemming=True
      stopList={name="en_stoplist"}          /* terms to exclude from the analysis */
      parseConfig={name="config", replace=TRUE}
      parent={name="parent", replace=TRUE}
      offset={name="offset", replace=TRUE}
      terms={name="terms", replace=TRUE}     /* term list, used later for the Zipf plot */
      reduce=4      /* exclude terms that appear in fewer than 4 documents */
      k=15          /* ask for 15 topic dimensions to be computed */
      docPro={name="docpro", replace=TRUE}   /* document projections for clustering */
      topics={name="topics", replace=TRUE}   /* table of generated topics */
      u={name="svdu", replace=TRUE}
      numLabels=5   /* ask for 5 descriptive terms to label each topic */
      topicDecision=True
      ;
   run;
quit;
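
Before moving on to clustering, you may want a quick look at the generated topics. A minimal check, assuming the session and output table names used above:

/* preview the generated topics table */
proc cas;
   table.fetch / table={name="topics"} to=15;
quit;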

 

Table showing derived topics

 

[Image: pc_1_saspchtopics.png]

 

 

Document projections from the docPro table to be used to create cluster segments

 

The document ID is shown on the left and some of the 15 projection columns are shown in the resulting document projection table below.

 

[Image: pc_2_saspchDocPro.png]

 


Cluster the documents

 

Using the generated document projections, we will cluster the documents into an arbitrarily chosen 10 segments with the k-means method available in PROC FASTCLUS. K-means is the default clustering method, so we do not have to state it explicitly in the code; other clustering methods are also available. We choose not to standardize the projections before clustering in this case, since they were created using singular value decomposition. If you did want a standardization step, see the sketch below.
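
For reference, a standardization pass could be run ahead of the clustering. This is a minimal sketch, assuming the DOCPRO table created above; the DOCPRO_STD output name is just an example:

/* optional: standardize the projections before clustering */
proc stdize data=CASUSER.DOCPRO out=CASUSER.DOCPRO_STD method=std;
   var _Col1_ _Col2_ _Col3_ _Col4_ _Col5_ _Col6_ _Col7_ _Col8_
       _Col9_ _Col10_ _Col11_ _Col12_ _Col13_ _Col14_ _Col15_;
run;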

 

/* create clusters using K-means from the SVD results */

proc fastclus data=CASUSER.DOCPRO maxclusters=10 out=CASUSER.clusters;
	var _Col1_ _Col2_ _Col3_ _Col4_ _Col5_ _Col6_ _Col7_ _Col8_ _Col9_ _Col10_ 
		_Col11_ _Col12_ _Col13_ _Col14_ _Col15_;
run;

 

Results showing the document ID at left and the assigned cluster at right.

 

[Image: pc_3_saspchclusters.png]

 

 

Merge clusters with original documents

 

Now, merge the text with the cluster number to prepare for easy evaluation of the results. Use the DISTINCT keyword when matching the tables on document ID so that one match is returned per document.

 

/* Use Fedsql to combine the segment numbers with their text documents */
 PROC FEDSQL SESSREF=mySession;
     CREATE TABLE CASUSER."results" AS
     SELECT DISTINCT
        t1."CLUSTER",
        t1."ID",
        t2."DrugReport"
     FROM
        CASUSER."CLUSTERS" t1
           INNER JOIN PUBLIC."DRUG_REPORTS" t2 ON (t1."ID" = t2."ID")
     ;
QUIT;
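
Before building the report, you can sanity-check the segment sizes with a quick frequency table on the merged results. A minimal check against the results table created above:

/* count how many documents landed in each cluster */
proc freq data=CASUSER.results;
   tables cluster;
run;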

 

Examine the results

 

From Visual Analytics, I created a report that shows the frequency of documents in each cluster (the steps to create the report are not shown). It also shows the text for the documents in each cluster from the results table created in the merge. For this report, I combined a text object and a bar chart object to create an interactive report that displays the documents in each cluster. Hovering over a single line of text displays the entire document in the report for convenience (the hovering action is not shown in the example).

 

[Image: pc_4_saspchreportclus.png]

 

 

Selecting useful terms for document analysis (Zipf's law)

 

A plot of terms by frequency is called a Zipf plot, based on Zipf's law, and it can help you decide which terms are useful to include in your analysis. The plot follows a power law: the most frequently occurring term occurs roughly twice as often as the second most frequent term, three times as often as the third, and so on. Terms that occur in every document are not useful for understanding the document collection, and terms that occur in only a few documents are also not useful. The terms that fall between these extremes are the most helpful.
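
If you want to draw your own Zipf plot outside of Visual Analytics, the terms table created by tmMine earlier has what you need. This is a sketch that assumes the table contains _Term_ and _Frequency_ columns; verify the column names in your own output before running it:

/* rank terms by frequency and plot rank versus frequency on log axes */
proc sort data=casuser.terms out=work.termfreq;
   by descending _Frequency_;
run;

data work.termfreq;
   set work.termfreq;
   rank = _n_;   /* rank 1 = most frequent term */
run;

proc sgplot data=work.termfreq;
   scatter x=rank y=_Frequency_;
   xaxis type=log label="Term rank (log scale)";
   yaxis type=log label="Term frequency (log scale)";
run;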

 

This first display shows a Zipf plot and a linked word cloud. The shape of the plot is what we would expect for a typical document collection in the English language. Not all terms appear in the word cloud, because multiple terms can have the same frequency count and only one of them is displayed. The largest words in the word cloud occur the most frequently and do not provide much useful insight into this collection on patients' experiences with prescribed drugs.

 

 

[Image: pc_5_saspchzipf1.png]

 

 

The following report shows sample words that occur in the highlighted frequency range, roughly 60 to 80, and that may provide more intuition for understanding the document collection. The word cloud now shows only the terms within the range selected in the Zipf plot.

 

[Image: pc_6_saspchzipf2.png]

 

 

You can apply the selection in the Zipf plot as a filter and then use the selected terms as a start list for further detailed analysis with specialized collections. If you use a start list, only the terms in that list are included in the analysis. Except for special circumstances, the best practice is to use a stop list, which excludes terms from the analysis. A sketch of the start list approach follows.
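
As a sketch of that workflow, the call below reruns tmMine with a start list in place of the stop list. The my_startlist table name is hypothetical; you would first save your selected terms to a CAS table before running this:

/* hypothetical sketch: rerun tmMine with a start list           */
/* my_startlist is assumed to hold the terms selected from the   */
/* Zipf plot; only those terms are included in the analysis      */
proc cas;
   textMining.tmMine /
      docId="id"
      documents={caslib="Public", name="Drug_Reports"}
      text="drugreport"
      startList={name="my_startlist"}
      k=15
      docPro={name="docpro_start", replace=TRUE}
      topics={name="topics_start", replace=TRUE};
quit;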

 

In the first post, I demonstrated how topic information can be extracted from a collection of documents using the SAS Visual Text Analytics application. In this post, we saw how to first create topics using the textMining action set, then create document clusters, and finally use SAS Visual Analytics to explore the results.

 

In my next post, I plan to show how you can build custom concepts, both in SAS Visual Text Analytics and in code. In the meantime, I hope you enjoyed this post and have some fun experimenting with various techniques to cluster documents. Text analytics can be fun. Keep texting!
