For my dataset, Text Cluster Node produces 10 clusters, 16 SVD and 16 Prob columns/variables in the output dataset.
Questions:
1) How are these 10 clusters related to the 16 SVD variables?
2) Do 16 SVD variables represent "concepts" which are different from clusters?
3) How are Prob variables computed?
SAS Text Miner computes a term-by-document matrix A, where the entry in the i-th row and j-th column is the number of times that the i-th term appears in the j-th document. A has N rows, where N is the number of terms in the corpus, and M columns, where M is the number of documents. You can think of the M columns of A, each of which represents a document, as vectors (or points) in an N-dimensional term-frequency space. The SVD rotates the points so that the most variation in your corpus of documents lies in the direction of the first coordinate, SVD_1; the second coordinate, SVD_2, points in an orthogonal direction that gives the next most variation in the corpus, and so on. Each SVD direction is a term-frequency profile (a weighted combination of terms), and together they form a convenient coordinate system for the documents. In fact, the SVD helps reduce the dimensionality of the problem by considering only a limited number of directions. Since you have 16 SVD vectors, SAS has reduced your corpus to a 16-dimensional subspace of the full term-frequency space.
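To make the projection concrete, here is a minimal NumPy sketch on a toy matrix. It is only an illustration of a truncated SVD, not SAS Text Miner's actual implementation; the matrix values, the choice k = 2 (instead of 16), and the variable names are all assumptions made for the example.

```python
import numpy as np

# Toy term-by-document matrix A: N = 5 terms (rows), M = 4 documents (columns).
# The counts are made up for illustration.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 1, 1],
], dtype=float)

k = 2  # number of SVD dimensions to keep (16 in the question above)

# Thin SVD: A = U @ diag(s) @ Vt, with singular values s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The first k columns of U are the leading directions in term-frequency space.
U_k = U[:, :k]

# Coordinates of each document in the k-dimensional subspace: shape (k, M).
# Row 0 plays the role of SVD_1, row 1 the role of SVD_2.
doc_coords = U_k.T @ A

print(doc_coords.shape)
```

Each column of `doc_coords` is one document described by k numbers instead of N term counts, which is the kind of reduced representation the SVD columns in the output dataset provide.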
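Clustering is more easily accomplished in the SVD coordinate system than in the full term-frequency space, so the 10 clusters are groups of documents formed in that 16-dimensional space.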
The probability columns represent the likelihood that each document belongs to each cluster. If you have 10 clusters, there should only be 10 probability columns. Each document is assigned to the cluster with the highest probability. I am not sure about the details of how the probability is determined. I assume that it is something like a linear discriminant analysis.
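The assignment rule can be sketched as follows. How SAS actually computes the probabilities is not known here, so the softmax over negative squared distances to cluster centroids is purely a stand-in assumption, and the document coordinates and centroids are hypothetical values. Only the last step, picking the column with the highest probability, reflects the rule described above.

```python
import numpy as np

# Hypothetical inputs: 4 documents in a 2-dimensional SVD space and 3 centroids.
docs = np.array([[2.0, 0.1], [1.8, -0.2], [0.1, 2.5], [1.0, 1.0]])
centroids = np.array([[2.0, 0.0], [0.0, 2.5], [1.0, 1.2]])

# Squared Euclidean distance from every document to every centroid: shape (4, 3).
d2 = ((docs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)

# ASSUMPTION (not SAS's documented method): convert distances to membership
# probabilities with a softmax, so closer centroids get higher probability.
logits = -d2
logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
prob = np.exp(logits)
prob /= prob.sum(axis=1, keepdims=True)      # each row now sums to 1

# One probability column per cluster; assign each document to the largest one.
cluster = prob.argmax(axis=1)

print(prob.round(3))
print(cluster)
```

Whatever the underlying model is, the output has one probability column per cluster and one row per document, and the cluster ID is the position of the row maximum.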