Text clustering in SAS


I'm a student in my last year of university, and I'm working on some analysis for my bachelor's thesis.

I'm analyzing a moderately big dataset (20,000 rows for now, which is only around 1% of the full data set) that includes a variable for item descriptions (e.g. "Cotton short-sleeved t-shirts"). So the question is: does SAS University Edition (basically Base SAS and SAS/STAT) have the capability of clustering text in any meaningful way? I'm looking to create a small number of categories for these items without having to work out a categorization scheme myself and then go through each item to assign a category.

If not, is it possible to get my hands on something like SAS Text Miner for free, or should I be looking for a solution elsewhere?

Re: Text clustering in SAS

Hi and welcome to the community!  I think there's a lot you can do with SAS University Edition - I use it all the time (and, as a bit of self-promotion, even blog every Friday about it - click on my Avatar to see a list :-)).  


Check out one of my posts (here) where I use PROC SQL and what are called "regular expressions" to do basic text analytics on Song Titles.  
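
For the item descriptions in the question, a minimal sketch of that idea might look like the following. The table WORK.ITEMS, its columns, the sample rows, and the category patterns are all hypothetical; they only illustrate how PRXMATCH can be used inside PROC SQL to assign a rough category.

```sas
/* Hypothetical sample data; replace with your own table and column names */
data work.items;
    infile datalines truncover;
    input id description & $60.;
    datalines;
1 Cotton short-sleeved t-shirts
2 Mens denim jeans
3 Wool winter socks
4 Long-sleeved cotton shirt
;
run;

proc sql;
    create table work.items_categorized as
    select id, description,
           /* PRXMATCH returns the match position, or 0 if the pattern is not found */
           case
               when prxmatch('/shirt/i', description) > 0 then 'Tops'
               when prxmatch('/jeans|trousers/i', description) > 0 then 'Bottoms'
               when prxmatch('/socks?/i', description) > 0 then 'Socks'
               else 'Uncategorized'
           end as category
    from work.items;
quit;
```

This is rule-based categorization rather than true clustering: you still choose the patterns yourself, but you do not have to label each row by hand.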


Good luck - I love using PROC SQL and SAS University Edition, so please post back any other problems / questions you have!


Re: Text clustering in SAS

Update to my previous reply:


I was at a Local User Group yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality.  I’ve never used COMPGED, so I decided to dig into it and I must admit – I’m a convert!  I wanted to give you updated information so you too can see how cool this is.


I’ve created a dummy data set:




What I want to do is compare the rows in the TEXT column to see how similar the rows are.  To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).


Here’s the code:



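proc sql;
/* Self-join: compare each row's TEXT with every other row's TEXT */
select a.text, b.text,
compged(a.text, b.text) as Compged1, /* generalized edit distance */
soundex(a.text) as Soundex1,
soundex(b.text) as Soundex2
from work.import a, work.import b
where a.id ne b.id; /* skip comparing a row with itself */
quit;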


This is a portion of the results:




The lower the COMPGED score, the more similar the sentences.  What I find most impressive is that where SOUNDEX says two sentences are the same (the first two, for example), COMPGED recognizes the slight differences and assigns a score of 100 (This versus Tis) and 200 (test versus taste).


So depending on what you need to do, COMPGED and / or SOUNDEX may be needed.  I’d be interested in seeing what you end up using and if you try both, how the results differ!
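
If you want to turn the COMPGED scores into candidate groups, one possible approach is to keep only the pairs whose score falls under a cutoff. The sketch below assumes a stand-in WORK.IMPORT table with ID and TEXT columns (the original dummy data is not shown, so these rows are hypothetical), and the cutoff of 300 is an arbitrary assumption you would need to tune for your own data.

```sas
/* Hypothetical stand-in for the dummy data set (the original rows are not shown) */
data work.import;
    infile datalines truncover;
    input id text & $40.;
    datalines;
1 This is a test
2 Tis is a taste
3 Cotton short-sleeved t-shirts
4 Cotton short sleeved tshirt
;
run;

proc sql;
    create table work.similar_pairs as
    select a.id as id_a, b.id as id_b,
           a.text as text_a, b.text as text_b,
           /* UPCASE makes the comparison case-insensitive */
           compged(upcase(a.text), upcase(b.text)) as ged_score
    from work.import as a, work.import as b
    where a.id < b.id                    /* each pair once, no self-pairs */
      and calculated ged_score <= 300    /* assumed cutoff; tune for your data */
    order by ged_score;
quit;
```

Using a.id < b.id instead of "not equal" keeps each pair only once, which halves the output. Keep in mind that a self-join grows with the square of the row count, so on the full data set it may be worth comparing within smaller subsets first.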

Re: Text clustering in SAS

If an academic at your university teaches with SAS, they might have registered one or more courses with SAS OnDemand for Academics, which is hosted in the AWS cloud. Depending on the course, it might include SAS OnDemand Enterprise Miner with the Text Miner add-on. An instructor can upload your data, and as a registered OnDemand student you could use Text Miner.
