Text clustering in SAS

Plikis — Fri, 26 Feb 2016 08:03:48 GMT

Hi,

I'm a student in my last year of university, and I'm working on some analysis for my bachelor's thesis.

I'm analyzing a moderately big dataset (20,000 rows for now, but that is only around 1% of the full data set) that includes a variable for item descriptions (e.g. "Cotton short-sleeved t-shirts"). So the question is: does SAS University Edition (basically Base SAS and SAS/STAT) have the capability of clustering text in any meaningful way? I'm looking to create a small number of categories for these items without having to work out a categorization scheme myself and then go through each item to assign a category.

If not, is it possible to get my hands on something like SAS Text Miner for free, or should I be looking for a solution elsewhere?

Re: Text clustering in SAS

DarthPathos — Wed, 02 Mar 2016 02:00:12 GMT

Hi and welcome to the community! I think there's a lot you can do with SAS University Edition - I use it all the time (and, as a bit of self-promotion, even blog every Friday about it - click on my Avatar to see a list :-)).

Check out one of my posts (here) where I use PROC SQL and what are called "regular expressions" to do basic text analytics on Song Titles.

Good luck - I love using PROC SQL and SAS University Edition, so please post back any other problems / questions you have!

Chris

Re: Text clustering in SAS

DarthPathos — Sat, 05 Mar 2016 16:40:27 GMT

Update to my previous reply:

I was at a Local User Group yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality. I’ve never used COMPGED, so I decided to dig into it and I must admit, I’m a convert! I wanted to give you updated information so you too can see how cool this is.

I’ve created a dummy data set:
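(The data set itself isn't shown here. A minimal stand-in, assuming a table WORK.IMPORT with a numeric ID and a character TEXT column, might look like the sketch below. The rows are made up for illustration and are not the original data.)

```sas
/* Hypothetical stand-in for the dummy data set: the rows below are
   invented for illustration and are NOT the original data. */
data work.import;
    infile datalines dlm='|' truncover;
    length id 8 text $ 60;
    input id text $;
    datalines;
1|This is a test
2|Tis is a taste
3|Cotton short-sleeved t-shirts
4|Cotton long-sleeved t-shirt
;
run;
```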

What I want to do is compare the rows in the TEXT column to see how similar the rows are. To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).

Here’s the code:

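proc sql;
    /* Self-join: pair every row with every other row */
    select a.text as text_a,
           b.text as text_b,
           compged(a.text, b.text) as Compged1,   /* generalized edit distance */
           soundex(a.text) as Soundex1,           /* phonetic code of the first text */
           soundex(b.text) as Soundex2            /* phonetic code of the second text */
    from work.import a, work.import b
    where a.id <> b.id;   /* skip a row compared with itself */
quit;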

This is a portion of the results:

The lower the COMPGED score, the more similar the sentences. What I find most impressive is that for sentences SOUNDEX treats as identical (the first two, for example), COMPGED picks up the slight differences and assigns scores of 100 (This versus Tis) and 200 (test versus taste).
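To see the difference on a single pair, a quick DATA step along these lines can help. This is only a sketch: the variable names are made up, and it compares just the two words "This" and "Tis" rather than full sentences. SOUNDEX should give both words the same code, while COMPGED should return a non-zero score.

```sas
/* Sketch: compare one word pair with both functions.
   Variable names (a, b, ged, sx_a, sx_b) are illustrative only. */
data _null_;
    length a b $ 20;
    a = 'This';
    b = 'Tis';
    /* Generalized edit distance: 0 means identical, larger means less alike */
    ged = compged(a, b);
    /* Phonetic codes: equal codes mean SOUNDEX cannot tell the words apart */
    sx_a = soundex(a);
    sx_b = soundex(b);
    /* Write the values to the log */
    put a= b= ged= sx_a= sx_b=;
run;
```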

So depending on what you need to do, you may need COMPGED, SOUNDEX, or both. I’d be interested in seeing what you end up using and, if you try both, how the results differ!
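If the goal is actual clusters rather than pairwise scores, one possible route with only Base SAS and SAS/STAT is to feed the COMPGED scores to PROC CLUSTER as a distance matrix. The following is a rough sketch, not a tested recipe: the table WORK.ITEMS, its rows, the choice of average linkage and the cut at 2 clusters are all assumptions made for illustration. Because COMPGED is not symmetric, the sketch takes the smaller of the two directions for each pair.

```sas
/* Hypothetical sample; replace with your own item descriptions */
data work.items;
    infile datalines dlm='|' truncover;
    length id 8 text $ 60;
    input id text $;
    datalines;
1|Cotton short-sleeved t-shirts
2|Cotton short sleeve t-shirt
3|Cotton long-sleeved t-shirts
4|Leather ankle boots
5|Leather ankle boot
6|Suede ankle boots
;
run;

/* All pairwise distances, including each row with itself (distance 0) */
proc sql;
    create table work.pairs as
    select a.id as id_a, b.id as id_b,
           min(compged(upcase(a.text), upcase(b.text)),
               compged(upcase(b.text), upcase(a.text))) as dist
    from work.items a, work.items b
    order by id_a, id_b;
quit;

/* Reshape to a square matrix: one row and one column (d1, d2, ...) per item */
proc transpose data=work.pairs out=work.wide(drop=_name_) prefix=d;
    by id_a;
    id id_b;
    var dist;
run;

/* Flag the table as a distance matrix so PROC CLUSTER reads it as one */
data work.dist(type=distance);
    set work.wide;
run;

/* Average-linkage hierarchical clustering on the raw (unsquared) distances */
proc cluster data=work.dist method=average nosquare outtree=work.tree noprint;
    var d:;
    id id_a;
run;

/* Cut the tree into a chosen number of clusters (2 is arbitrary here) */
proc tree data=work.tree nclusters=2 out=work.clusters noprint;
run;

proc print data=work.clusters;
run;
```

Keep in mind that the self-join grows with the square of the row count: 20,000 rows means 400 million pairs, so this is best tried on a sample or on the distinct descriptions first.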

Re: Text clustering in SAS

Damien_Mather — Fri, 08 Apr 2016 13:00:15 GMT

If an academic at your university teaches with SAS, they might have registered one or more courses with the AWS cloud-based SAS OnDemand for Academics, and depending on the course it might include SAS OnDemand Enterprise Miner with the Text Miner add-on. An instructor can upload your data, and as a registered OnDemand student you could then use Text Miner.