I use the following flow with all default settings for my dataset.
Input Data Source->Text Parsing->Text Filter->Text Cluster->Text Topic->SAS Code
The final output dataset has the following columns:
TextCluster_SVD:1~18 ===> indicating 18 SVD dimensions are created
TextCluster_prob:1~10 ===> indicating 10 clusters are created
TextTopic_raw:1~25 ===> indicating 25 topics are created because the default is 25
TextTopic_:1~25
Since the Text Cluster node precedes the Text Topic node here, are the SVD coordinates created in Text Cluster used by the Text Topic node? Is each TextTopic_raw column actually an SVD coordinate? If so, the Text Topic node just creates its own SVD coordinates and doesn't use the 18 SVD coordinates created by the Text Cluster node.
Register today and join us virtually on June 16!
sasglobalforum.com | #SASGF
View now: on-demand content for SAS users
Each node performs an independent truncated SVD. The cluster node uses the number of dimensions parameter and the heuristic based on the resolution. The topic node computes as many dimensions as there are requested automatic topics: one dimension per topic. The two sets are named differently in the exported data, so both are there.
Russ
I think the Text Topic node properties refer to them as multi-term topics. These are discovered automatically, and the number you specify corresponds to the number of SVD dimensions that will be calculated.
Russ
Russ,
For my dataset, TextCluster_SVD has 18 columns and TextTopic_raw has 25 columns. I then set the number of topics for the Text Topic node to 18, and the output now has 18 TextTopic_raw columns. But when I compare these 18 TextTopic_raw columns to the 18 TextCluster_SVD columns, they are different. Why is that? After obtaining the SVD columns, does the Text Topic node do any further processing to produce the TextTopic_raw columns? It seems that for the Text Topic node, the TextTopic_raw columns are not the same as its SVD columns. If they were, these 18 TextTopic_raw columns should match the 18 TextCluster_SVD columns.
The Text Topic node rotates the dimensions so they can be interpreted more easily. This is what you see in the raw values. If you choose an orthogonal rotation, the position of the points relative to one another stays the same (if oblique is chosen, this relative position is not maintained), but their coordinates change. This is why the columns no longer match, yet they contain essentially the same information.
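To make the orthogonal versus oblique point concrete, here is a minimal sketch in plain Python. It is not what the Text Topic node runs: the three 2-D "documents", the 30-degree angle, and the use of a shear as a stand-in for an oblique transformation are all assumptions chosen for illustration.

```python
import math

def rotate(points, theta):
    """Orthogonal rotation of 2-D points by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def shear(points, k):
    """Non-orthogonal change of axes (stand-in for an oblique rotation)."""
    return [(x + k * y, y) for x, y in points]

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

# Hypothetical document coordinates on two SVD dimensions.
docs = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
rotated = rotate(docs, math.radians(30))
sheared = shear(docs, 0.5)

d_orig = pairwise_distances(docs)
d_rot = pairwise_distances(rotated)
d_obl = pairwise_distances(sheared)

# The coordinates themselves differ after the orthogonal rotation...
coords_changed = any(
    not math.isclose(a, b) for p, q in zip(docs, rotated) for a, b in zip(p, q)
)
# ...but the distances between documents are unchanged.
distances_kept = all(math.isclose(a, b) for a, b in zip(d_orig, d_rot))
# The non-orthogonal transformation does not preserve them.
oblique_kept = all(math.isclose(a, b) for a, b in zip(d_orig, d_obl))

print(coords_changed, distances_kept, oblique_kept)
```

The same reasoning carries over to 18 dimensions: an orthogonal rotation changes every column value while leaving the geometry of the documents intact, which is consistent with the raw topic columns not matching the cluster SVD columns.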
Does the Text Cluster node also rotate the SVD dimensions before clustering?
No. The rotation is for interpreting dimensions. For clustering we don't interpret dimensions; we just pass the coordinates on to a clustering algorithm and then use the clusters of documents that form to decide which terms tend to distinguish the documents within a given cluster.
Russ, Thanks for the reply!
More questions here:
1. In the Text Cluster output dataset, the SVD columns contain document coordinates projected onto the SVD dimensions. Is this different from rotation? I'm also a bit confused by the orthogonal rotation you mentioned.
2. For the Text Cluster node, how are the descriptive terms assigned to each cluster generated? Is there a research paper describing the algorithm?
3. For the Text Topic node, how are the descriptive terms assigned to each topic generated? Is there a research paper describing the algorithm?
4. For the Text Cluster node, if I set the number of clusters to Max., how does SAS TM determine the number of clusters to create?
1. First the basis columns found in the output of the SVD are rotated, then the projection is made. Think of it as rotating the usual x-y-z axes in three-dimensional space so that those axes align better with the data. Uncorrelated topics mean an orthogonal rotation occurred. See the docs on PROC FACTOR for varimax and promax rotations.
2. No paper that I know of. See the docs on the cluster node. It uses a binomial probability model to find terms whose association with a cluster is unlikely to be a random event, along with some heuristics to ensure that terms have occurred with some frequency before they qualify as descriptive terms.
3. There are a couple of blog posts. It uses a factor rotation to interpret the dimensions.
Also see the docs on the topic node.
4. I believe it uses PROC FASTCLUS, which outputs seeds based on a maximum number of clusters. See the doc on it. Those seeds then become input to the expectation-maximization algorithm as initial locations.
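As a rough illustration of the idea in answer 2, the sketch below scores a term with a binomial upper-tail probability and adds a minimum-frequency check. It is not the Text Cluster node's actual algorithm, which the thread does not spell out: the function names, the `alpha` threshold, the `min_docs` cutoff, and the document counts are all hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def is_descriptive(in_cluster, cluster_size, in_corpus, corpus_size,
                   alpha=0.01, min_docs=3):
    """Flag a term as descriptive for a cluster (illustrative rule only).

    in_cluster   : cluster documents containing the term
    cluster_size : documents in the cluster
    in_corpus    : corpus documents containing the term
    corpus_size  : documents in the corpus
    alpha, min_docs are assumed thresholds, not SAS defaults.
    """
    # Rate at which the term would appear if clusters were random samples.
    p = in_corpus / corpus_size
    tail = binomial_upper_tail(in_cluster, cluster_size, p)
    # Require both statistical surprise and a minimum frequency.
    return in_cluster >= min_docs and tail < alpha

# Term in 5% of the corpus but in 12 of 40 cluster documents: surprising.
rare_but_concentrated = is_descriptive(12, 40, 50, 1000)
# Term in half the corpus and about half the cluster: not surprising.
common_everywhere = is_descriptive(21, 40, 500, 1000)
# Term seen only twice overall: surprising, but too infrequent to qualify.
too_infrequent = is_descriptive(2, 40, 2, 1000)

print(rare_but_concentrated, common_everywhere, too_infrequent)
```

The third case shows why a frequency heuristic is useful alongside the probability model: a term that occurs only twice can look statistically unusual while still being a poor label for the cluster.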