Re: Text clustering in SAS

Plikis · Posted 02-26-2016 03:03 AM

Hi,

I'm a student in my last year of university, and I'm working on some analysis for my bachelor's thesis.

I'm analyzing a moderately big dataset (20,000 rows for now, but that is only around 1% of the full data set) that includes a variable for item descriptions (e.g. "Cotton short-sleeved t-shirts"). So the question is: does SAS University Edition (basically Base SAS and SAS/STAT) have the capability of clustering text in any meaningful way? I'm looking to create a small number of categories for these items without having to figure out a categorization scheme myself and then go through each item to assign a category.

If not, is it possible to get my hands on something like SAS Text Miner for free, or should I be looking for a solution elsewhere?

DarthPathos · Posted 03-01-2016 09:00 PM

Hi and welcome to the community! I think there's a lot you can do with SAS University Edition - I use it all the time (and, as a bit of self-promotion, even blog every Friday about it - click on my Avatar to see a list :-)).

Check out one of my posts (here) where I use PROC SQL and what are called "regular expressions" to do basic text analytics on Song Titles.

Good luck - I love using PROC SQL and SAS University Edition, so please post back any other problems / questions you have!

Chris

DarthPathos · Posted 03-05-2016 11:40 AM

Update to my previous reply:

I was at a Local User Group yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality. I’ve never used COMPGED, so I decided to dig into it and I must admit – I’m a convert! I wanted to give you updated information so you too can see how cool this is.

I’ve created a dummy data set:
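
The data set itself isn't shown in this post. A minimal stand-in, assuming only what the query below needs (a numeric ID column and a character TEXT column in WORK.IMPORT), might look like this; the rows are made up for illustration and are not the original data:

```sas
/* Hypothetical stand-in for the dummy data set; rows are illustrative only */
data work.import;
    infile datalines dsd truncover;   /* comma-delimited inline data */
    input id text :$40.;              /* numeric id, character text up to 40 chars */
    datalines;
1,This is a test
2,Tis is a test
3,This is a taste
4,Something completely different
;
run;
```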

What I want to do is compare the rows in the TEXT column to see how similar the rows are. To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).

Here’s the code:

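proc sql;
    /* Self-join: pair every row with every other row */
    select a.text, b.text,
           compged(a.text, b.text) as Compged1,  /* generalized edit distance */
           soundex(a.text) as Soundex1,          /* phonetic code for each text */
           soundex(b.text) as Soundex2
    from work.import a, work.import b
    where a.id <> b.id;                          /* skip a row compared to itself */
quit;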

This is a portion of the results:

The lower the COMPGED score, the more similar the sentences. What I find most impressive is that where SOUNDEX says two sentences are the same (the first two, for example), COMPGED recognizes the slight differences and assigns a score of 100 (This versus Tis) and 200 (test versus taste).

So depending on what you need to do, COMPGED and / or SOUNDEX may be needed. I’d be interested in seeing what you end up using and if you try both, how the results differ!
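
For the original question (grouping item descriptions), one rough way to use COMPGED is to keep only the pairs whose score falls under a cutoff and treat those as candidates for the same category. The sketch below assumes a hypothetical WORK.ITEMS table with made-up descriptions and an arbitrary cutoff of 300; both are assumptions to tune, and this is pairwise matching rather than a true clustering procedure.

```sas
/* Hypothetical item descriptions; not taken from the real data */
data work.items;
    infile datalines dsd truncover;   /* comma-delimited inline data */
    input id descr :$60.;
    datalines;
1,Cotton short-sleeved t-shirts
2,Cotton short sleeved t-shirt
3,Cotton long-sleeved t-shirts
4,Leather ankle boots
;
run;

proc sql;
    create table work.close_pairs as
    select a.id as id_a, b.id as id_b,
           a.descr as descr_a, b.descr as descr_b,
           /* upcase + strip so case and padding do not add to the score */
           compged(upcase(strip(a.descr)), upcase(strip(b.descr))) as ged
    from work.items a, work.items b
    where a.id < b.id              /* each pair once, in one direction only;
                                      COMPGED scores can depend on argument order */
      and calculated ged <= 300    /* assumed cutoff, tune for your data */
    order by ged;
quit;
```

Pairs that survive the cutoff can then be reviewed by hand or used as the starting point for categories; a lower cutoff gives fewer, tighter matches. Keep in mind that a full self-join grows quadratically, so 20,000 rows already means roughly 200 million pairs.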

Damien_Mather · Posted 04-08-2016 09:00 AM

If an academic at your university teaches with SAS, they might have registered one or more courses with SAS OnDemand for Academics, which runs in the AWS cloud. Depending on the course, it might include SAS OnDemand Enterprise Miner with the Text Miner add-on. An instructor can upload your data, and as a registered OnDemand student you could then use Text Miner.
