SVD for Text Topic & Text Cluster Nodes

Contributor
Posts: 57

I use the following flow with all default settings for my dataset.

 

Input Data Source->Text Parsing->Text Filter->Text Cluster->Text Topic->SAS Code

 

The final output dataset has the following columns:

TextCluster_SVD:1~18 ===> indicating 18 SVD dimensions are created

TextCluster_prob:1~10 ===> indicating 10 clusters are created

TextTopic_raw:1~25 ===> indicating 25  topics are created because the default is 25

TextTopic_:1~25 

 

Since the Text Cluster node precedes the Text Topic node here, are the SVD coordinates created by the Text Cluster node reused by the Text Topic node? Or is each TextTopic_raw column itself an SVD coordinate? If so, the Text Topic node simply creates its own SVD coordinates and doesn't use the 18 SVD coordinates created by the Text Cluster node.


All Replies
SAS Employee
Posts: 28

Re: SVD for Text Topic & Text Cluster Nodes

Each node performs its own independent truncated SVD. The cluster node sets the number of dimensions from its number-of-dimensions parameter together with a heuristic based on the resolution. The topic node computes as many dimensions as there are requested automatic topics, one dimension per topic. The two sets are named differently in the exported data, so both are present.
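
To illustrate the idea of two independent truncations, here is a minimal sketch in Python with NumPy. It is not what the nodes run internally: the matrix `doc_term`, the helper `truncated_svd_coords`, and the choices of `k` are all made up for the example, and the real nodes apply their own term weighting before the decomposition.

```python
import numpy as np

def truncated_svd_coords(doc_term, k):
    """Return document coordinates on the top-k SVD dimensions."""
    # Singular values come back sorted in descending order
    u, s, vt = np.linalg.svd(doc_term, full_matrices=False)
    # Keep only the first k dimensions; coordinates are U_k scaled by S_k
    return u[:, :k] * s[:k]

rng = np.random.default_rng(0)
# Toy documents-by-terms matrix (hypothetical data, 30 documents, 12 terms)
doc_term = rng.poisson(1.0, size=(30, 12)).astype(float)

# One truncation per node, each with its own number of dimensions
cluster_svd = truncated_svd_coords(doc_term, k=4)  # stands in for the cluster node
topic_svd = truncated_svd_coords(doc_term, k=6)    # stands in for the topic node

print(cluster_svd.shape, topic_svd.shape)
```

Each call keeps a different number of columns, which is why the exported data can carry two sets of coordinates with different widths.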

 

Russ

Contributor
Posts: 57

Re: SVD for Text Topic & Text Cluster Nodes

Does the Topic node use the same heuristic as the Cluster node? What resolution does the Topic node use for generating the SVD?
SAS Employee
Posts: 28

Re: SVD for Text Topic & Text Cluster Nodes

I think the Text Topic node properties refer to them as multi-term topics. These are discovered automatically, and the number you specify is the number of SVD dimensions that will be calculated.

Russ

Contributor
Posts: 57

Re: SVD for Text Topic & Text Cluster Nodes

Russ,

 

For my dataset, TextCluster_SVD has 18 columns and TextTopic_raw has 25 columns. I then set the number of topics for the Text Topic node to 18, and the output now has 18 TextTopic_raw columns. But when I compare these 18 TextTopic_raw columns to the 18 TextCluster_SVD columns, they are different. Why is that? After obtaining the SVD columns, does the Text Topic node do any further processing to produce the TextTopic_raw columns? It seems that for the Text Topic node, the TextTopic_raw columns are not the same as its SVD columns. If they were the same, these 18 TextTopic_raw columns should match the 18 TextCluster_SVD columns.

SAS Employee
Posts: 28

Re: SVD for Text Topic & Text Cluster Nodes

The Topic node rotates the dimensions so they can be interpreted more easily. This is what you see in the raw values. If you choose an orthogonal rotation, the positions of the points relative to one another stay the same, but their coordinates change (if an oblique rotation is chosen, the relative positions are not maintained). This is why the columns no longer match, yet they contain essentially the same information.
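
A small sketch of that point, in plain Python: an orthogonal rotation changes every coordinate but leaves the distances between points alone. The three 2-D points and the 30-degree angle are arbitrary values picked for the example; the node chooses its rotation by a factor-rotation criterion, not a fixed angle.

```python
import math

def rotate(point, angle):
    """Apply a 2-D orthogonal rotation by `angle` radians."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

# Toy document coordinates on two SVD dimensions (hypothetical values)
docs = [(1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
rotated = [rotate(p, math.radians(30)) for p in docs]

# Coordinates differ after rotation, pairwise distances do not
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        before = distance(docs[i], docs[j])
        after = distance(rotated[i], rotated[j])
        print(f"pair ({i},{j}): before={before:.4f} after={after:.4f}")
```

Comparing `docs` with `rotated` column by column would show a mismatch, even though the two describe the same configuration of points.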

Contributor
Posts: 57

Re: SVD for Text Topic & Text Cluster Nodes

Does the Text Cluster node also rotate the SVD dimensions before clustering?

SAS Employee
Posts: 28

Re: SVD for Text Topic & Text Cluster Nodes

No. The rotation is for interpreting dimensions. For clustering we don't interpret dimensions; we just pass the coordinates on to a clustering algorithm and then use the clusters of documents that form to decide which terms tend to distinguish the documents within a given cluster.

Contributor
Posts: 57

Re: SVD for Text Topic & Text Cluster Nodes

Russ, Thanks for the reply!

 

More questions here:

1. In the Text Cluster output dataset, the SVD columns contain document coordinates projected onto the SVD dimensions. Is this different from rotation? I am also a bit confused by the orthogonal rotation you mentioned here.

2. For the Text Cluster node, how are the descriptive terms assigned to each cluster generated? Is there a research paper describing the algorithm?

3. For the Text Topic node, how are the descriptive terms assigned to each topic generated? Is there a research paper describing the algorithm?

4. For the Text Cluster node, if I set the number of clusters to Max., how does SAS TM determine the number of clusters to create?

Solution
‎07-05-2016 02:45 PM
SAS Employee
Posts: 28

Re: SVD for Text Topic & Text Cluster Nodes

1. First the basis columns found in the output of the SVD are rotated, then the projection is made. Think of it as rotating the usual x-y-z axes in three-dimensional space so that those axes align better with the data. Uncorrelated topics mean an orthogonal rotation was used. See the docs on PROC FACTOR for varimax and promax rotations.

2. No paper that I know of. See the docs on the cluster node. It uses a binomial probability model to find terms whose association with a cluster is unlikely to be a random event, along with some heuristics to ensure that terms have occurred with some frequency before they qualify as descriptive terms.

3. There are a couple of blog posts. It uses a factor rotation to interpret the dimensions.

http://blogs.sas.com/content/text-mining/2010/04/16/the-whats-whys-and-wherefores-of-topic-discovery...

http://blogs.sas.com/content/text-mining/2010/04/20/www-of-topic-management-part-2-what-is-a-topic-a...

http://blogs.sas.com/content/text-mining/2010/05/12/part-3-understanding-topic-discovery-from-an-his...

Also see the docs on the topic node.

4. I believe it uses PROC FASTCLUS, which outputs seeds based on a maximum number of clusters. See the doc on it. Those seeds then become the initial locations for the expectation-maximization algorithm.
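
For item 2, the general idea of a binomial test for descriptive terms can be sketched as follows. This is only an illustration of the concept, not the cluster node's actual procedure: the function names, the significance level `alpha`, and the frequency floor `min_docs` are hypothetical, and the thread does not give the node's real thresholds or heuristics.

```python
from math import comb

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def is_descriptive(docs_with_term_in_cluster, cluster_size,
                   docs_with_term_overall, total_docs,
                   alpha=0.01, min_docs=3):
    """Flag a term whose presence in a cluster is unlikely to be chance.

    alpha and min_docs are assumed values chosen for illustration.
    """
    # Frequency heuristic: skip terms seen in too few documents of the cluster
    if docs_with_term_in_cluster < min_docs:
        return False
    # Chance rate: share of all documents that contain the term
    p = docs_with_term_overall / total_docs
    # Small tail probability means the count is hard to explain by chance
    return binomial_tail(docs_with_term_in_cluster, cluster_size, p) < alpha

# Term appears in 50 of 1000 documents overall (hypothetical counts)
print(is_descriptive(12, 20, 50, 1000))  # heavily concentrated in the cluster
print(is_descriptive(1, 20, 50, 1000))   # too rare in the cluster
print(is_descriptive(3, 50, 50, 1000))   # close to the chance rate
```

The frequency floor plays the role of the "occurred with some frequency" heuristic: without it, a term seen once in a tiny cluster could pass the probability test on very thin evidence.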
