Text mining and content categorization

How is root mean squared standard deviation (RMSSTD) calculated for text document clustering?

New Contributor
Posts: 3

How is the root mean squared standard deviation (RMSSTD) calculated for text document clustering? Neither the SAS documentation nor the Help gives the mathematics behind it.


Accepted Solutions
Solution
12-12-2016 11:48 PM
SAS Employee
Posts: 28


If K is the number of dimensions used in the clustering, m is the number of documents in the cluster, and err is the sum of the m*K squared errors, then it looks like it is calculated as

rmsstd = sqrt(err / ((m - 1) * K)), unless m = 1, in which case the value is 0.
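
Written out in symbols, the formula above might read as follows. The notation x_{ij} (component j of document i) and \bar{x}_j (component j of the cluster mean) is introduced here for illustration and does not come from the SAS documentation; the formula itself is the one inferred in this reply, not a confirmed specification.

```latex
\mathrm{err} = \sum_{i=1}^{m} \sum_{j=1}^{K} \left( x_{ij} - \bar{x}_j \right)^2,
\qquad
\mathrm{RMSSTD} =
\begin{cases}
\sqrt{\dfrac{\mathrm{err}}{(m-1)\,K}} & \text{if } m > 1, \\
0 & \text{if } m = 1.
\end{cases}
```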



All Replies

New Contributor
Posts: 3


Dear Russ,

It is a little confusing to me. I am not able to understand "err is the sum of the m*K squared errors". It would be very helpful if you could explain this.

SAS Employee
Posts: 28


Each document is a K-dimensional vector.

Similarly, the mean of the cluster is a K-dimensional vector in which each component is the average of the corresponding component across the m documents.

A document error is the square root of the sum of the squared differences between each of its K components and the corresponding component of the cluster mean.

The RMSSTD is an error for the entire cluster, so to incorporate all documents in the cluster into the err calculation, err becomes the sum of the squared differences for every component of every document. Equivalently, it is the sum of the squared document errors. There are m*K components to sum over in this case.
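
The calculation described above can be sketched in a few lines of Python. This is only an illustration of the formula as inferred in this thread, not SAS code or a confirmed SAS implementation; the function name `cluster_rmsstd` and the sample documents are hypothetical.

```python
import math


def cluster_rmsstd(docs):
    """RMSSTD of one cluster, following the formula inferred in this thread.

    docs is a list of m documents, each a list of K numeric components.
    """
    m = len(docs)
    K = len(docs[0])
    if m == 1:
        # A single-document cluster is reported as 0.
        return 0.0
    # Cluster mean: component-wise average over the m documents.
    mean = [sum(doc[j] for doc in docs) / m for j in range(K)]
    # err: squared difference from the mean, summed over all m*K components.
    err = sum((doc[j] - mean[j]) ** 2 for doc in docs for j in range(K))
    return math.sqrt(err / ((m - 1) * K))


# Hypothetical cluster of m = 3 documents in K = 2 dimensions.
docs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(cluster_rmsstd(docs))  # prints 2.0
```

Here the cluster mean is (3, 4), err is 16, and (m - 1) * K is 4, so the result is sqrt(16 / 4) = 2.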

 

Russ

☑ This topic is SOLVED.
