Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Clustering Twitter data and TF-IDF Matrix

New Contributor
Posts: 2

Hello everyone! I'm currently using SAS Enterprise Miner 12.1 and I'm unsure how to proceed.

I have a data set consisting of tweets, and I intend to create clusters from the information I collected. So far, I've cleaned the data and built a diagram like this:

[Image: diagrama.PNG]

I also understand that the TF-IDF matrix can be found in the "Exported Data" option of the Text Filter node (I found out about it in two other discussion posts).

It looks like this:

[Image: matriz_tfidf.PNG]

Is this it?

So the question is:

Assuming this is the matrix I need to feed to the clustering node as the feature vectors, will the Text Cluster node use the TF-IDF matrix by default if I simply run it, or do I have to change the input somehow? And do I need to change the node configuration itself?


In the Text Filter node, I set the Frequency Weighting to Log and the Term Weight to IDF.
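For readers unfamiliar with these two settings, here is a minimal sketch of a log-frequency weight combined with an IDF term weight. It assumes the forms log2(f + 1) for the frequency weighting and log2(N / df) + 1 for the term weight; the exact definitions used by Enterprise Miner 12.1 should be checked against the Text Miner documentation. The function name `log_idf_weights` and the sample tweets are hypothetical.

```python
import math
from collections import Counter

def log_idf_weights(docs):
    """Return one {term: weight} dict per document.

    Assumed formulas (not confirmed for any specific SAS release):
      frequency weight = log2(f + 1), f = count of the term in the document
      term weight      = log2(N / df) + 1, df = number of documents with the term
    """
    n_docs = len(docs)
    # Naive whitespace tokenization; a real parser would also handle
    # punctuation, stemming, stop words, and so on.
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log2(n_docs / df) + 1 for t, df in doc_freq.items()}
    return [{t: math.log2(f + 1) * idf[t] for t, f in c.items()} for c in counts]

# Hypothetical sample data.
tweets = ["sas text mining", "text text cluster"]
weights = log_idf_weights(tweets)
```

A term that appears in every document gets the minimum term weight of 1, so its final weight depends only on how often it occurs within each document.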

Thanks in advance!

SAS Employee
Posts: 122

Re: Clustering Twitter data and TF-IDF Matrix

hugo_viga,

 

First, thanks for using SAS. My name is Jason Xin, and I am an advanced analytics solution architect at SAS Institute.

 

In EM, the Text Parsing node does all the heavy-duty initial work, ending in a frequency matrix. The Text Filter node is essentially where most of the human-machine interaction happens: subsetting, trimming terms, keeping or dropping terms, viewing terms, and so on. Although the content has been massaged along the way, and the exported data sets certainly look different, in essence it remains a frequency matrix / query matrix.

 

Only in rare cases does one benefit from clustering directly on the count matrix. In most cases, which I suspect include yours, you would use SVD (singular value decomposition) dimensions as the input to text clustering. I cannot find a machine that runs 12.1, but as I recall, SVD in 12.1 was configured inside the Text Cluster node, the same as in 14.1, which I am running now. So the answer to your question is simply to connect the Text Filter node to the Text Cluster node and configure SVD there.
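To illustrate the general idea of clustering on SVD dimensions rather than on the raw weighted matrix, here is a rough pure-Python sketch. It is not what the Text Cluster node does internally: the power-iteration SVD, the plain k-means that stands in for the node's clustering algorithm, the function names, and the toy matrix (documents as rows, terms as columns) are all assumptions made for illustration.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def matvec(M, x):
    return [dot(row, x) for row in M]

def top_right_singular_vectors(A, k, iters=300):
    """Approximate the top-k right singular vectors of A
    by power iteration on A^T A with deflation."""
    At = [list(col) for col in zip(*A)]
    vectors = []
    for c in range(k):
        v = [1.0 / (i + 1 + c) for i in range(len(At))]  # fixed, non-random start
        for _ in range(iters):
            w = matvec(At, matvec(A, v))
            for u in vectors:  # keep w orthogonal to the vectors already found
                proj = dot(w, u)
                w = [wi - proj * ui for wi, ui in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:  # no further direction left (rank of A below k)
                break
            v = [wi / norm for wi in w]
        vectors.append(v)
    return vectors

def svd_project(A, k):
    """Coordinates of each row (document) in the top-k SVD dimensions."""
    vs = top_right_singular_vectors(A, k)
    return [[dot(row, v) for v in vs] for row in A]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical weighted document-by-term matrix: documents 0 and 1 share
# one vocabulary, documents 2 and 3 share another.
weighted = [
    [2.0, 1.5, 0.0, 0.0],
    [1.8, 2.0, 0.0, 0.2],
    [0.0, 0.0, 2.2, 1.7],
    [0.1, 0.0, 1.9, 2.0],
]
labels = kmeans(svd_project(weighted, 2), 2)
```

With real tweet data the matrix has thousands of sparse term columns, which is the main reason to reduce it to a modest number of SVD dimensions before clustering.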

 

Hope this helps. Best Regards

Jason Xin
