Clustering Twitter data and TF-IDF Matrix

hugo_viga · Posted 08-30-2015 08:30 PM

Hello everyone! I'm currently using SAS Enterprise Miner 12.1 and I'm not sure how to proceed.

I have a data set consisting of tweets, and I intend to create clusters from the information I collected. So far, I've cleaned the data and built a diagram like this:

I also understand that the TF-IDF matrix can be found in the "Exported Data" option of the Text Filter node (I found out about it in two other discussion posts).

It looks like this:

Is this it?

So the question is:

Assuming this is the matrix I need to feed to the clustering node as the feature vectors, will the Text Cluster node use the TF-IDF matrix by default when I simply run it, or do I have to change the input somehow? Do I also need to change the configuration of the node itself?

In the Text Filter node I set the Frequency Weighting to Log and the Term Weight to IDF.
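For readers who want to see what these two settings compute, here is a minimal sketch in Python (not SAS code). It assumes the Log frequency weight is log2(f + 1) and the IDF term weight is log2(N / df) + 1, which is my understanding of how SAS Text Miner defines them; check the documentation for your release. The function name `weighted_doc_term_matrix`, the whitespace tokenisation, and the toy tweets are all illustrative assumptions.

```python
import math
from collections import Counter

def weighted_doc_term_matrix(docs):
    """Return (terms, rows) where rows[d][t] = log2(f + 1) * (log2(N / df) + 1).

    f  = raw count of term t in document d
    N  = number of documents
    df = number of documents that contain term t
    The two formulas are assumptions about the Log and IDF settings.
    """
    # Naive tokenisation: lowercase and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in docs]
    terms = sorted(set().union(*counts))
    n_docs = len(docs)

    # Document frequency and the resulting term weight for every term.
    df = {t: sum(1 for c in counts if t in c) for t in terms}
    idf = {t: math.log2(n_docs / df[t]) + 1 for t in terms}

    # One row per document; a missing term has count 0, so its weight is 0.
    rows = [[math.log2(c[t] + 1) * idf[t] for t in terms] for c in counts]
    return terms, rows

tweets = ["sas text miner", "text cluster node", "sas sas node"]
terms, rows = weighted_doc_term_matrix(tweets)
for tweet, row in zip(tweets, rows):
    print(tweet, [round(w, 3) for w in row])
```

Each row is one tweet and each column one term, so a term that appears in few tweets gets a larger weight than one that appears in most of them.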

Thanks in advance!

hugo_viga · Posted 11-17-2015 11:14 AM

hugo_viga,

First, thanks for using SAS. I am Jason Xin, an advanced analytics solution architect at SAS Institute.

In EM, the Text Parsing node does all the heavy-duty initial work, which ends in a frequency matrix. The Text Filter node is essentially where most of the human-machine interaction happens: subsetting, trimming terms, keeping or dropping terms, viewing terms, and so on. Although the content has been massaged along the way, and the exported data sets certainly look different, in essence what comes out is still a frequency matrix / query matrix.

In rare cases one benefits from clustering directly on the count matrix. In most cases, which I suspect includes yours, you would use SVD as the input to text clustering. I cannot find a machine that runs 12.1, but as I recall, SVD in 12.1 was inside the Text Cluster node, the same as in 14.1, which I am running now. So the answer to your question is simply to connect the Text Filter node to the Text Cluster node and configure SVD there.
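To illustrate the idea outside of Enterprise Miner, here is a rough Python sketch of "SVD first, then cluster". It is not what the Text Cluster node runs internally: the helper names `svd_project` and `kmeans`, the toy weighted matrix `X`, and the use of plain k-means as the clustering step are all assumptions made for the example, and the node's own clustering algorithms may differ.

```python
import numpy as np

def svd_project(X, k):
    """Project the rows of X (documents x terms) onto the top-k SVD dimensions."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Scale the left singular vectors by the singular values.
    return U[:, :k] * s[:k]

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's k-means; the first n_clusters points seed the centers."""
    centers = points[:n_clusters].copy()
    for _ in range(n_iter):
        # Distance from every point to every center, then nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of the points assigned to it.
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Toy weighted document-term matrix: documents 0 and 2 share one vocabulary,
# documents 1 and 3 share another.
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 1.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
])

docs_2d = svd_project(X, k=2)           # 4 documents, 2 SVD dimensions each
labels = kmeans(docs_2d, n_clusters=2)  # cluster in the reduced space
print(labels)
```

The point of the SVD step is that clustering runs on a handful of dense dimensions rather than on thousands of sparse term columns, which is why the weighted matrix rarely needs to be passed to the clustering step by hand.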

Hope this helps. Best Regards

Jason Xin
