02-17-2016 02:19 AM
For my dataset, Text Cluster Node produces 10 clusters, 16 SVD and 16 Prob columns/variables in the output dataset.
1) How are these 10 clusters related to the 16 SVD variables?
2) Do the 16 SVD variables represent "concepts" that are different from the clusters?
3) How are Prob variables computed?
03-18-2016 02:00 PM
SAS Text Miner computes a term-by-document matrix A, where the entry in the i-th row and j-th column is the number of times that the i-th term appears in the j-th document. A has N rows, where N is the number of terms in the corpus, and M columns, where M is the number of documents. You can think of the M columns of A, each of which represents a document, as vectors (or points) in an N-dimensional term-frequency space. The SVD rotates the points so that the greatest variation in your corpus of documents lies along the first coordinate, SVD_1; the second coordinate, SVD_2, points in an orthogonal direction that captures the next greatest variation, and so on. Each SVD direction is itself a term-frequency profile, that is, a weighted combination of terms, so in that sense the SVD variables can be read as "concepts". The SVD also reduces the dimensionality of the problem by keeping only a limited number of directions. Since you have 16 SVD variables, SAS has reduced your corpus to a 16-dimensional subspace of the full term-frequency space.
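To make the description above concrete, here is a minimal NumPy sketch of the idea. It is not SAS's implementation: the toy corpus, the choice of k = 2, and the raw (unweighted) counts are all assumptions for illustration, and SAS may weight or scale things differently.

```python
import numpy as np

# Toy corpus (hypothetical); each string is one document.
docs = [
    "cat sat mat",
    "cat cat dog",
    "stock market fell",
    "market stock rose stock",
]

# Vocabulary: one row of A per term.
terms = sorted({w for d in docs for w in d.split()})
row = {t: i for i, t in enumerate(terms)}

# A[i, j] = number of times term i appears in document j.
A = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of SVD dimensions to keep (16 in the question's output)

# Project each document onto the first k left singular vectors.
# Each column of U[:, :k] is a term-frequency profile ("concept").
doc_coords = (U[:, :k].T @ A).T  # shape (M, k): one row per document

print(doc_coords.shape)
```

Each row of `doc_coords` plays the role of the SVD_1 ... SVD_k variables for one document.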
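Clustering is easier to carry out in the SVD coordinate system than in the full term-frequency space. The 10 clusters are groups of documents found in that 16-dimensional space, so the number of clusters and the number of SVD variables are independent of each other and need not match.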
The probability columns represent the likelihood that each document belongs to each cluster, so with 10 clusters there should be only 10 probability columns. Each document is assigned to the cluster with the highest probability. I am not sure of the details of how the probabilities are determined; I assume it is something like a linear discriminant analysis.
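As a rough illustration of that guess, the sketch below computes cluster membership probabilities the way a linear discriminant analysis would under equal priors and a shared spherical covariance, then assigns each document to its most probable cluster. This is only an assumption about how such probabilities could be computed, not SAS's documented method; the function name `cluster_probabilities`, the `sigma` parameter, and the sample coordinates are all hypothetical.

```python
import numpy as np

def cluster_probabilities(X, centroids, sigma=1.0):
    """Posterior P(cluster | document), assuming equal priors and a shared
    spherical Gaussian around each centroid (an LDA-like assumption)."""
    # Squared distance from every document to every centroid: shape (M, K).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Hypothetical SVD coordinates for 4 documents (2 dimensions) and 2 centroids.
X = np.array([[0.0, 0.1], [0.2, 0.0], [3.0, 3.1], [2.9, 3.0]])
centroids = np.array([[0.1, 0.05], [2.95, 3.05]])

probs = cluster_probabilities(X, centroids)  # one column per cluster
assigned = probs.argmax(axis=1)              # most probable cluster per document
print(assigned)
```

Note that `probs` has one column per cluster, not one per SVD dimension, which is why 10 clusters should give 10 probability columns.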