Solved: Re: SVD for Text Topic & Text Cluster Nodes

aha123 · Posted 05-26-2016 12:34 AM

I use the following flow with all default settings for my dataset.

Input Data Source->Text Parsing->Text Filter->Text Cluster->Text Topic->SAS Code

The final output dataset has the following columns:

TextCluster_SVD:1~18 ===> indicating 18 SVD dimensions are created

TextCluster_prob:1~10 ===> indicating 10 clusters are created

TextTopic_raw:1~25 ===> indicating 25 topics are created because the default is 25

TextTopic_:1~25

Since the Text Cluster node precedes the Text Topic node here, are the SVD coordinates created in Text Cluster used by the Text Topic node? Or is each TextTopic_raw column actually an SVD coordinate of its own? If the latter, the Text Topic node just creates its own SVD coordinates and doesn't use the 18 SVD coordinates created by the Text Cluster node.

RussAlbright · Posted 05-26-2016 02:00 PM

Each node performs an independent truncated SVD. The cluster node uses the number-of-dimensions parameter and the heuristic based on the resolution. The topic node computes as many dimensions as there are requested automatic topics, one dimension per topic. The two sets are named differently on the exported data, so both are there.

Russ
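
A minimal sketch of what "an independent truncated SVD with its own number of dimensions" looks like on a toy document-by-term matrix. This is not the code the SAS nodes run: the matrix, the function names, and the use of plain power iteration are all assumptions made for illustration, and term weighting is ignored.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_right_singular_vectors(A, k, iters=300):
    """Top-k right singular vectors of A, by power iteration on A^T A."""
    n = len(A[0])
    gram = [[sum(row[i] * row[j] for row in A) for j in range(n)]
            for i in range(n)]
    basis = []
    for d in range(k):
        v = [float(i + d + 1) for i in range(n)]  # deterministic start vector
        for _ in range(iters):
            v = [dot(g_row, v) for g_row in gram]
            for b in basis:  # stay orthogonal to the vectors already found
                c = dot(v, b)
                v = [x - c * y for x, y in zip(v, b)]
            norm = math.sqrt(dot(v, v))
            v = [x / norm for x in v]
        basis.append(v)
    return basis

def document_coordinates(A, k):
    """Project each document (row of A) onto the top-k SVD dimensions."""
    basis = top_right_singular_vectors(A, k)
    return [[dot(row, b) for b in basis] for row in A]

# Hypothetical 5-document x 4-term count matrix.
docs = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [1, 0, 0, 1],
]

cluster_like = document_coordinates(docs, 2)  # one run, 2 dimensions
topic_like = document_coordinates(docs, 3)    # a separate run, 3 dimensions
```

Each call starts from the matrix again and keeps only as many dimensions as it was asked for, so two such runs produce two separately sized sets of columns.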

aha123 · Posted 05-26-2016 02:30 PM

Does the Topic node use the same heuristic as the Cluster node? What resolution does the Topic node use for generating the SVD?

RussAlbright · Posted 05-26-2016 03:07 PM

I think the Text Topic node properties refer to them as multi-term topics. These are automatically discovered, and the number you specify is the number of SVD dimensions that will be calculated.

Russ

aha123 · Posted 06-24-2016 10:46 PM

Russ,

For my dataset, TextCluster_SVD has 18 columns and TextTopic_raw has 25 columns. I then set the number of topics for the Text Topic node to 18, and the output now has 18 TextTopic_raw columns. But when I compare these 18 TextTopic_raw columns to the 18 TextCluster_SVD columns, they are different. Why is that? After obtaining the SVD columns, does the Text Topic node do any further processing to produce the TextTopic_raw columns? It seems that, for the Text Topic node, the TextTopic_raw columns are not the same as its SVD columns. If they were, these 18 TextTopic_raw columns should match the 18 TextCluster_SVD columns.

RussAlbright · Posted 06-24-2016 11:17 PM

The Topic node rotates the dimensions so they can be interpreted better. This is what you see in the raw values. If you choose an orthogonal rotation, the positions of the points relative to one another stay the same (if oblique is chosen, these relative positions are not maintained), but their coordinates change. This is why they no longer match, yet they contain essentially the same information.
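
A small sketch of that point in two dimensions. The three points and both matrices are made up for illustration; a 30-degree rotation stands in for an orthogonal rotation and a simple shear stands in for an oblique one. Neither is the varimax or promax computation itself.

```python
import math

def transform(points, matrix):
    """Multiply each 2-D point (as a row vector) by a 2x2 matrix."""
    return [(x * matrix[0][0] + y * matrix[1][0],
             x * matrix[0][1] + y * matrix[1][1]) for x, y in points]

def pairwise_distances(points):
    return [math.dist(p, q)
            for i, p in enumerate(points) for q in points[i + 1:]]

# Hypothetical document coordinates on two SVD dimensions.
docs = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]

theta = math.radians(30)
orthogonal = [[math.cos(theta), -math.sin(theta)],
              [math.sin(theta),  math.cos(theta)]]
oblique = [[1.0, 0.5],
           [0.0, 1.0]]  # a shear: the new axes are not perpendicular

rotated = transform(docs, orthogonal)
sheared = transform(docs, oblique)

def same_distances(a, b):
    return all(math.isclose(x, y)
               for x, y in zip(pairwise_distances(a), pairwise_distances(b)))

kept_by_orthogonal = same_distances(docs, rotated)  # distances survive
kept_by_oblique = same_distances(docs, sheared)     # distances do not
```

The rotated coordinates differ from the originals column by column, yet every distance between documents is unchanged, which is why a column-by-column comparison fails even though the information is the same.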

aha123 · Posted 06-25-2016 02:16 AM

Does the Text Cluster node also rotate the SVD dimensions before clustering?

RussAlbright · Posted 06-27-2016 10:09 PM

No. The rotation is for interpreting dimensions. For clustering we don't interpret dimensions; we just pass the coordinates on to a clustering algorithm and then use the clusters of documents that form to decide which terms tend to distinguish the documents within a given cluster.

aha123 · Posted 06-28-2016 12:54 PM

Russ, Thanks for the reply!

More questions here:

1. In the TextCluster output dataset, the SVD columns contain document coordinates projected onto the SVD dimensions. Is this different from rotation? I am also a bit confused by the orthogonal rotation you mentioned here.

2. For the TextCluster node, how are the descriptive terms assigned to each cluster generated? Is there a research paper describing the algorithm?

3. For the TextTopic node, how are the descriptive terms assigned to each topic generated? Is there a research paper describing the algorithm?

4. For TextCluster node, if I set number of clusters to Max., how does SAS TM determine the number of clusters to create?

RussAlbright · Posted 07-03-2016 10:19 PM

1. First the basis columns found in the output of the SVD are rotated, then the projection is made. Think of it as rotating the usual x-y-z axes in three-dimensional space so that the axes align better with the data. Uncorrelated topics mean that an orthogonal rotation occurred. See the docs on PROC FACTOR for varimax and promax rotations.

2. No paper that I know of. See the docs on the cluster node. It uses a binomial probability model to find terms whose association with a cluster is unlikely to be a random event, along with some heuristics to ensure that a term has occurred with some frequency before it qualifies as a descriptive term.

3. There are a couple of blog posts. It is using a factor rotation to interpret the dimensions.

http://blogs.sas.com/content/text-mining/2010/04/16/the-whats-whys-and-wherefores-of-topic-discovery...

http://blogs.sas.com/content/text-mining/2010/04/20/www-of-topic-management-part-2-what-is-a-topic-a...

http://blogs.sas.com/content/text-mining/2010/05/12/part-3-understanding-topic-discovery-from-an-his...

Also see the docs on the topic node.

4. I believe it uses PROC FASTCLUS, which outputs seeds based on a maximum number of clusters. See the doc on it. Those seeds then become the initial locations for the expectation-maximization algorithm.
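
For answer 2, here is a rough sketch of how a binomial test plus a frequency floor could flag descriptive terms. It is only an illustration of the idea, not the cluster node's actual procedure: the function names, the `alpha` cutoff, and the `min_docs` floor are all hypothetical.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def is_descriptive(in_cluster, cluster_size, in_corpus, corpus_size,
                   alpha=0.01, min_docs=3):
    """Flag a term whose count inside a cluster is unlikely under chance.

    in_cluster: documents in the cluster that contain the term
    in_corpus:  documents in the whole corpus that contain the term
    alpha, min_docs: hypothetical cutoffs, not SAS defaults
    """
    if in_cluster < min_docs:      # frequency heuristic: too rare to trust
        return False
    p = in_corpus / corpus_size    # chance rate if the term ignored clusters
    return binomial_tail(in_cluster, cluster_size, p) < alpha

# Term in 50 of 1000 documents overall; the cluster holds 100 documents.
strong = is_descriptive(20, 100, 50, 1000)  # 20 seen where about 5 expected
weak = is_descriptive(6, 100, 50, 1000)     # 6 seen where about 5 expected
rare = is_descriptive(2, 100, 2, 10000)     # unlikely by chance, but too rare
```

The last case shows why a frequency floor is useful: a term that appears only twice can look statistically surprising and still be a poor label for a cluster.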
