Text mining and content categorization

Text clustering in SAS

New Contributor
Posts: 3

I'm a student in my last year of university, and I'm working on some analysis for my bachelor's thesis.

I'm analyzing a moderately big dataset (20,000 rows for now, but that is only around 1% of the full data set) that includes a variable for item descriptions (e.g. "Cotton short-sleeved t-shirts"). So the question is: does SAS University Edition (basically Base SAS and SAS/STAT) have the capability to cluster text in any meaningful way? I'm looking to create a small number of categories for these items without having to work out a categorization scheme myself and then go through each item to assign a category.

If not, is it possible to get my hands on something like SAS Text Miner for free, or should I be looking for a solution elsewhere?

Regular Contributor
Posts: 229

Re: Text clustering in SAS

Hi and welcome to the community!  I think there's a lot you can do with SAS University Edition - I use it all the time (and, as a bit of self-promotion, even blog every Friday about it - click on my Avatar to see a list :-)).  


Check out one of my posts (here) where I use PROC SQL and what are called "regular expressions" to do basic text analytics on Song Titles.  
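As a rough illustration of the regular-expression approach mentioned above (a sketch only, not the code from the linked post), PRXMATCH can be called inside PROC SQL to assign a category by keyword. The table WORK.ITEMS, the column DESCRIPTION, and the category names are all hypothetical.

```sas
/* Sketch: keyword-based categories with Perl regular expressions.
   WORK.ITEMS, DESCRIPTION and the category labels are assumptions. */
proc sql;
    select description,
           case
               /* matches "t-shirt", "tshirt", "T-Shirts", ... */
               when prxmatch('/t-?shirt/i', description) > 0 then 'T-shirts'
               /* whole-word match on either keyword */
               when prxmatch('/\b(jeans|trousers)\b/i', description) > 0 then 'Trousers'
               else 'Uncategorized'
           end as category
    from work.items;
quit;
```

This only works for categories you can already name, so it suits a first pass rather than discovering clusters automatically.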


Good luck - I love using PROC SQL and SAS University Edition, so please post back any other problems / questions you have!


Has my article or post helped? Please mark as Solution or Like the article!
Regular Contributor
Posts: 229

Re: Text clustering in SAS

Update to my previous reply:


I was at a local user group meeting yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality. I'd never used COMPGED, so I decided to dig into it and I must admit: I'm a convert! I wanted to give you updated information so you too can see how cool this is.


I’ve created a dummy data set:
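For anyone following along, a minimal stand-in data set could be built like this. Only the table name WORK.IMPORT and the columns ID and TEXT come from the query further down; the sentences are invented for illustration and are not the original rows.

```sas
/* Hypothetical sample data: only the names WORK.IMPORT, ID and TEXT
   are taken from the query; the sentences themselves are made up. */
data work.import;
    infile datalines dsd truncover;   /* comma-delimited input lines */
    input id text :$50.;              /* TEXT holds up to 50 characters */
    datalines;
1,This is a test
2,Tis is a taste
3,Cotton short-sleeved t-shirts
4,Cotton short sleeve tshirt
;
run;
```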




What I want to do is compare the rows in the TEXT column to see how similar the rows are.  To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).


Here’s the code:



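proc sql;
/* Self-join: compare every row with every other row */
select a.text, b.text,
compged(a.text, b.text) as Compged1,
soundex(a.text) as Soundex1,
soundex(b.text) as Soundex2
from work.import a, work.import b
/* Skip each row compared with itself */
where a.id <> b.id;
quit;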


This is a portion of the results:




The lower the COMPGED score, the more similar the sentences.  What I find most impressive is that where SOUNDEX says two sentences are the same (the first two, for example), COMPGED picks up the slight differences and assigns a score of 100 (This versus Tis) and 200 (test versus taste).


So depending on what you need to do, COMPGED and / or SOUNDEX may be needed.  I’d be interested in seeing what you end up using and if you try both, how the results differ!
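To turn the scores into candidate groups, one option is to keep only the pairs below a cutoff. The following is a sketch under assumptions: the output table WORK.SIMILAR_PAIRS is a hypothetical name, and the cutoff of 200 is arbitrary and would need tuning on real item descriptions.

```sas
/* Sketch: keep each pair once (a.id < b.id) and only pairs whose
   COMPGED score is at or below a cutoff. The cutoff of 200 and the
   table name WORK.SIMILAR_PAIRS are assumptions. */
proc sql;
    create table work.similar_pairs as
    select a.id   as id_a,
           b.id   as id_b,
           a.text as text_a,
           b.text as text_b,
           compged(a.text, b.text) as ged
    from work.import as a, work.import as b
    where a.id < b.id
      and calculated ged <= 200
    order by ged;
quit;
```

Using `a.id < b.id` instead of `a.id <> b.id` halves the work, but a self-join on 20,000 rows still evaluates roughly 200 million pairs, so it may be worth trying this on a sample first. COMPGED is also not guaranteed to give the same score when the two arguments are swapped, so keeping one direction per pair is a simplification.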

Frequent Contributor
Posts: 130

Re: Text clustering in SAS

If an academic at your university teaches with SAS, they might have registered one or more courses with the AWS cloud-based SAS OnDemand for Academics. Depending on the course, that might include SAS OnDemand Enterprise Miner with the Text Miner add-on. An instructor can upload your data, and as a registered OnDemand student you could then use Text Miner.