aha123
Obsidian | Level 7

I use the following flow with all default settings for my dataset.

 

Input Data Source->Text Parsing->Text Filter->Text Cluster->Text Topic->SAS Code

 

The final output dataset has the following columns:

TextCluster_SVD:1~18  ===> indicating 18 SVD dimensions are created

TextCluster_prob:1~10 ===> indicating 10 clusters are created

TextTopic_raw:1~25 ===> indicating 25  topics are created because the default is 25

TextTopic_:1~25 

 

Since the Text Cluster node precedes the Text Topic node here, are the SVD coordinates created by Text Cluster used by the Text Topic node? Or is each TextTopic_raw column actually an SVD coordinate in its own right? If so, the Text Topic node creates its own SVD coordinates and does not use the 18 SVD coordinates created by the Text Cluster node.

RussAlbright
SAS Employee

Each node performs an independent truncated SVD. The cluster node uses the number-of-dimensions parameter and the heuristic based on the resolution. The topic node computes as many dimensions as there are requested automatic topics: one dimension per topic. The two sets are named differently in the exported data, so both are present.

 

Russ



aha123
Obsidian | Level 7
Does the Topic node use the same heuristic as the Cluster node? What resolution does the Topic node use when generating its SVD?
RussAlbright
SAS Employee

I think the Text Topic node properties refer to them as multi-term topics. These are discovered automatically, and the number you specify corresponds to the number of SVD dimensions that will be calculated.

Russ



aha123
Obsidian | Level 7

Russ,

 

For my dataset, TextCluster_SVD has 18 columns and TextTopic_raw has 25 columns. I then set the number of topics for the Text Topic node to 18, and the output now has 18 TextTopic_raw columns. But when I compare these 18 TextTopic_raw columns to the 18 TextCluster_SVD columns, they are different. Why is that? After obtaining the SVD columns, does the Text Topic node do any further processing to produce the TextTopic_raw columns? It seems that, for the Text Topic node, the TextTopic_raw columns are not the same as its SVD columns. If they were the same, these 18 TextTopic_raw columns should match the 18 TextCluster_SVD columns.

 

 

 

 

RussAlbright
SAS Employee

The Topic node rotates the dimensions so they can be better interpreted. This is what you see in the raw values. If you choose an orthogonal rotation, the position of the points relative to one another stays the same, but their coordinates change (if an oblique rotation is chosen, this relative position is not maintained). This is why they no longer match, yet they contain essentially the same information.
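
To make the point about rotation concrete, here is a minimal sketch in plain Python (not SAS code, and not what the node runs internally). The three 2-D "document coordinates", the 30-degree angle, and the shear matrix standing in for an oblique transform are all made-up assumptions for illustration. It only shows that an orthogonal rotation changes the coordinates while preserving distances between documents, whereas a non-orthogonal transform does not.

```python
import math

def transform(points, matrix):
    """Apply a 2x2 matrix to each 2-D point (row vector times matrix)."""
    (a, b), (c, d) = matrix
    return [(x * a + y * c, x * b + y * d) for x, y in points]

def pairwise_distances(points):
    """Euclidean distance between every pair of points."""
    return [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]

def close(u, v, tol=1e-9):
    """True if two equal-length sequences agree element-wise within tol."""
    return all(abs(s - t) < tol for s, t in zip(u, v))

# Hypothetical document coordinates on two SVD dimensions.
docs = [(1.0, 0.2), (0.8, 0.9), (0.1, 1.1)]

theta = math.radians(30)
orthogonal = [[math.cos(theta), -math.sin(theta)],
              [math.sin(theta), math.cos(theta)]]
# A shear: a simple non-orthogonal transform, used here only as a
# stand-in for an oblique rotation (the axes stop being perpendicular).
oblique = [[1.0, 0.5],
           [0.0, 1.0]]

rotated = transform(docs, orthogonal)
sheared = transform(docs, oblique)

same_coords = all(close(p, q) for p, q in zip(docs, rotated))
same_dist_orth = close(pairwise_distances(docs), pairwise_distances(rotated))
same_dist_obl = close(pairwise_distances(docs), pairwise_distances(sheared))

print(same_coords, same_dist_orth, same_dist_obl)  # False True False
```

So two sets of 18 columns can disagree value by value and still describe the same arrangement of documents, which is the situation with the rotated raw topic values versus unrotated SVD columns.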



aha123
Obsidian | Level 7

Does the Text Cluster node also rotate the SVD dimensions before clustering?

RussAlbright
SAS Employee

No. The rotation is for interpreting dimensions. For clustering we don't interpret dimensions; we just pass the coordinates on to a clustering algorithm and then use the clusters of documents that form to decide which terms tend to distinguish the documents within a given cluster.



aha123
Obsidian | Level 7

Russ, Thanks for the reply!

 

More questions here:

1. In the Text Cluster output dataset, the SVD columns contain document coordinates projected onto the SVD dimensions. Is this different from rotation? I am also a bit confused by the orthogonal rotation you mentioned.

 

2. For the Text Cluster node, how are the descriptive terms assigned to each cluster generated? Is there a research paper describing the algorithm?

 

3. For the Text Topic node, how are the descriptive terms assigned to each topic generated? Is there a research paper describing the algorithm?

 

4. For TextCluster node, if I set number of clusters to Max., how does SAS TM determine the number of clusters to create?

RussAlbright
SAS Employee

1. First the basis columns found in the output of the SVD are rotated, then the projection is made. Think of it as rotating the usual x-y-z axes in three-dimensional space so that those axes align better with the data. Uncorrelated topics mean an orthogonal rotation occurred. See the PROC FACTOR documentation for varimax and promax rotations.

 

2. No paper that I know of. See the docs on the cluster node. It uses a binomial probability model to find terms whose association with a cluster is unlikely to be a random event, along with some heuristics to ensure that a term has occurred with some frequency before it qualifies as a descriptive term.

 

3. There are a couple of blog posts. It is using a factor rotation to interpret the dimensions. 

http://blogs.sas.com/content/text-mining/2010/04/16/the-whats-whys-and-wherefores-of-topic-discovery...

http://blogs.sas.com/content/text-mining/2010/04/20/www-of-topic-management-part-2-what-is-a-topic-a...

http://blogs.sas.com/content/text-mining/2010/05/12/part-3-understanding-topic-discovery-from-an-his...

Also see the docs on the topic node.

 

4. I believe it uses PROC FASTCLUS, which outputs seeds based on a maximum number of clusters. See its documentation. Those seeds then become the initial locations for the expectation-maximization algorithm.
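
The binomial idea in point 2 can be sketched roughly as follows. This is plain Python, not the node's actual implementation: the function names, the 0.01 cutoff, the minimum of 3 documents, and the toy counts are all hypothetical. The assumption is that a term's corpus-wide document rate is used as the "random" baseline, and a term is kept when seeing it this often inside the cluster would be unlikely by chance.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def descriptive_terms(cluster_counts, cluster_size, corpus_counts, corpus_size,
                      alpha=0.01, min_docs=3):
    """Return terms over-represented in a cluster, most significant first.

    cluster_counts / corpus_counts map term -> number of documents containing it.
    alpha and min_docs are illustrative thresholds, not SAS defaults.
    """
    scored = []
    for term, k in cluster_counts.items():
        if k < min_docs:  # frequency heuristic: skip very rare terms
            continue
        p = corpus_counts[term] / corpus_size  # chance rate under "random"
        p_value = binomial_tail(k, cluster_size, p)
        if p_value < alpha:
            scored.append((p_value, term))
    return [term for _, term in sorted(scored)]

# Toy numbers: 1000 documents in the corpus, 50 in the cluster.
corpus_counts = {"engine": 60, "the": 900, "turbo": 2}
cluster_counts = {"engine": 20, "the": 46, "turbo": 2}

result = descriptive_terms(cluster_counts, 50, corpus_counts, 1000)
print(result)  # ['engine']
```

In this toy data "the" appears in most cluster documents but no more than chance would predict, and "turbo" is concentrated in the cluster but too rare to pass the frequency heuristic, so only "engine" qualifies.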

