scotthamilton0
Calcite | Level 5

I am trying to perform topic modeling in SAS Enterprise Miner 14.2 on text that I scraped from a website. I want to use the Text Cluster node on the Text Mining tab to look for topics in the data, but I am hung up on having to choose the number of topics in advance, rather than using the more statistically justifiable approach of a scree plot (or eigenvalue equivalent) to show the natural number of topics present in the text.

 

Does anyone have a workaround for this, or a way to let the topics emerge naturally from the data?

 

 

3 REPLIES
RussAlbright
SAS Employee

 

It sounds like you are using the cluster node to find clusters, rather than the Text Topic node?

 

For the Text Cluster node, the SVD dimensions are used to represent each document in k-dimensional space, and the clustering algorithms then cluster the documents represented in that space. So the number of SVD dimensions does not correspond to the number of clusters. Instead, you either choose a number with the exact method, or we use PROC CLUSTER and Ward's method to attempt to determine how many clusters there might be, up to the maximum number setting. See the docs on PROC CLUSTER.
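
As a rough illustration of the Ward's-method idea, here is a minimal sketch of running PROC CLUSTER directly on document coordinates. It is not the node's internal code. The data set work.docs_svd, the id column doc_id, and the svd_ variable prefix are all hypothetical names, so substitute whatever your own exported table uses.

```sas
/* Hypothetical input: one row per document, an id column doc_id, and  */
/* SVD coordinates named svd_1, svd_2, ... (all names are assumptions) */
proc cluster data=work.docs_svd method=ward ccc pseudo print=15
             outtree=work.tree;
   var svd_:;      /* every variable whose name starts with svd_ */
   id doc_id;
run;
```

The CCC and PSEUDO options add the cubic clustering criterion and the pseudo F and t-squared statistics to the cluster history, which are commonly scanned for peaks or jumps when judging a plausible number of clusters.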

 

If you are using the Text Topic node, then you only specify a number of topics. We have found that the scree plot may be helpful on small textbook example problems, but for large data mining problems it is not usually helpful in determining the number of topics. If you're still curious, the Text Topic node doesn't output the singular values, but the Text Cluster node does. Run the Text Cluster node with the number of SVD dimensions set to the desired value, then look for the Textcluster_svd_s data set in your workspace library. That table of singular values is essentially what the Text Topic node would have output. You can plot the values or scan them to see whether they help you pick the number of topics. Once you have a value, go back to the Text Topic node, enter it, and rerun the node.
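
For a quick first look at those singular values, something like the following minimal sketch may be enough. The libref emws and its path are placeholders. The table name assumes the node id is TextCluster, and the column name svalues is taken from the code posted later in this thread, so check both against your own workspace.

```sas
/* Placeholder path: point this at your own                  */
/* <EMProjects>/Workspaces/<Diagram Id> directory            */
libname emws "C:\path\to\EMProject\Workspaces\EMWS1";

/* Assumes node id TextCluster and a singular-value column   */
/* named svalues; dim is just the row position               */
data work.sv;
   set emws.textcluster_svd_s;
   dim = _n_;
run;

/* Simple scree-style plot: look for an elbow in the curve */
proc sgplot data=work.sv;
   series x=dim y=svalues / markers;
   xaxis label="SVD dimension" integer;
   yaxis label="Singular value";
run;
```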

 

So there are a couple of things, but as you are aware the number of clusters or number of topics contained in a collection can be a very subjective thing. 

 

By the way, there is a Text Analytics community on this website so feel free to participate there in the future.

Russ



AnnKuo
SAS Employee

The attached code renders the Scree plot and (Cumulative) Proportion Variance Explained plot using the output table generated by a Text Cluster node.  

 

To use this code, please modify the following two lines:

1. Change the path in the following libname statement.  The libref eigen should point to the <EMProjects>/Workspaces/<Diagram Id> directory:

 

     libname eigen "D:\em_projects\Ann\AnnTM\Workspaces\EMWS6";

 

2.  The data set textcluster_svd_s is a table generated by a Text Cluster node  with Node Id = "TextCluster".  You will replace it with the corresponding node id of your Text Cluster node. 


      set eigen.textcluster_svd_s end=lastobs;

 

/* ------------------------------------------------------------------------------------------
The Code is intended to be used solely as part of a product ("Software") you currently have 
licensed from SAS Institute Inc. or one of its subsidiaries or authorized agents ("SAS"). 
The Code is designed to either correct an error in the Software or to add functionality to
the Software, but has not necessarily been tested.  Accordingly, SAS makes no representation
or warranty that the Code will operate error-free.  SAS is under no obligation to maintain 
or support the Code.

Neither SAS nor its licensors shall be liable to you or any third party for any general, 
special, direct, indirect, consequential, incidental or other damages whatsoever arising out 
of or related to your use or inability to use the Code, even if SAS has been advised of the 
possibility of such damages.

Except as otherwise provided above, the Code is governed by the same agreement that governs 
the Software.  If you do not have an existing agreement with SAS governing the Software, 
you may not use the Code.
-------------------------------------------------------------------------------------------- */
/* Purpose:  Renders the Scree plot and (Cumulative) Proportion Variance Explained plot using 
             the output table generated by a Text Cluster node.

   Note: 	To use this code, please modify the following two lines:

             1. Change the path in the following libname statement.  
			    The libref eigen should point to the <EMProjects>/Workspaces/<Diagram Id> directory:
 
                libname eigen "D:\em_projects\Ann\AnnTM\Workspaces\EMWS6"; 

 
            2.  The data set textcluster_svd_s is a table generated by a Text Cluster node with Node Id = "TextCluster".  
			    You will replace it with the corresponding node id of your Text Cluster node. 

                set eigen.textcluster_svd_s end=lastobs;
----------------------------------------------------------------------------------------------*/		 


/* libref eigen points to the <EMProjects>/Workspaces/<WorkspaceID> directory */
libname eigen "D:\em_projects\Ann\AnnTM\Workspaces\EMWS6";  
                                                                         
ods listing style=HTMLBlue;

%macro Compute(keepflg=1);
data eigenvalues;
   retain sum_eig 0;
   retain ncomp 0;

   /* The data set textcluster_svd_s is a table generated by a Text Cluster node  
       with Node Id = "TextCluster".  You will replace it with the corresponding 
	   node id of your Text Cluster node. */
   set eigen.textcluster_svd_s end=lastobs;
   %if &keepflg %then %do;
      if keep=1 then do;
   %end;
         ncomp + 1;
         sum_eig = sum_eig + svalues;
		 output;
   %if &keepflg %then %do;
      end;
   %end;

   if (lastobs) then do;
      /* symputx converts the numeric values without a log note
         and strips leading/trailing blanks */
      call symputx('totalncomp', ncomp);
      call symputx('toteig', sum_eig);
   end;
run;

proc print data=Eigenvalues;
   title "Eigenvalues";
run;

%put Number of components that are kept= &totalncomp, Sum of the eigenvalues= &toteig;

data varianceComp;
   set eigenvalues;
   Proportion= svalues/&toteig;
   Cumulative = sum_eig / &toteig;
run;

proc print data=varianceComp;
   title "Variance Components";
run;
title ;
%mend Compute;

%Compute(keepflg=1);

proc template;
   define statgraph scree;
      begingraph;
         entrytitle "Scree Plot";
         layout overlay/   	  
			yaxisopts=(label="Eigenvalue" gridDisplay=auto_on)
            xaxisopts=(label="Number of SVD" shortLabel = "SVD" linearopts=(integer=true));
            seriesplot y=svalues x=ncomp/display=ALL;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=eigenvalues template=scree;
run;

proc template;
   define statgraph variancecomp;
      begingraph;
         entrytitle "Variance Explained";
         layout overlay / 
              yaxisopts=(label="Proportion" gridDisplay=auto_on)
              xaxisopts=(label="Number of SVD" shortLabel = "SVD" linearopts=(integer=true));
	     seriesplot y=Proportion x=ncomp / 
              display = ALL
              legendlabel="Proportion" name="Proportion";
         seriesplot y=Cumulative x=ncomp /
              lineattrs=graphdatadefault(pattern=dot)
              display = ALL
              legendlabel="Cumulative" name="Cumulative";
		 DiscreteLegend "Cumulative" "Proportion" /across=1 border=1;
         endlayout;
      endgraph;
   end;
run;

proc sgrender data=varianceComp template=variancecomp;
run;

ods listing;

[Attached images: ScreePlot.png, VarianceExplained.png]

 
