BookmarkSubscribeRSS Feed
Plikis
Calcite | Level 5

Hi,

 

I'm a student in my last year of university, and I'm working on some analysis for my bachelor's thesis.

I'm analyzing a moderately big dataset (20,000 rows for now, but I only have around 1% of the full data set) that includes a variable for item descriptions (e.g. "Cotton short-sleeved t-shirts"), so the question is - does SAS University (basically base SAS and SAS/STAT) have the capability of clustering text in any meaningful way? I'm looking to create a small amount of categories for these items that wouldn't require me figure out how to categorize everything, and then going through each item to assign a category.

If not, is it possible to get my hands on something like SAS Text Miner for free, or should I be looking for a solution elsewhere?

3 REPLIES 3
DarthPathos
Lapis Lazuli | Level 10

Hi and welcome to the community!  I think there's a lot you can do with SAS University Edition - I use it all the time (and, as a bit of self-promotion, even blog every Friday about it - click on my Avatar to see a list :-)).  

 

Check out one of my posts (here) where I use PROC SQL and what are called "regular expressions" to do basic text analytics on Song Titles.  

 

Good luck - I love using PROC SQL and SAS University Edition, so please post back any other problems / questions you have!

Chris

Has my article or post helped? Please mark as Solution or Like the article!
DarthPathos
Lapis Lazuli | Level 10

Update to my previous reply:

 

I was at a Local User Group yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality.  I’ve never used COMPGED, so decided to dig into it and I must admit – I’m a convert!  I wanted to give you updated information so you to can see how cool this is.

 

I’ve created a dummy data set:

 

IMAGE10.png

 

What I want to do is compare the rows in the TEXT column to see how similar the rows are.  To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).

 

Here’s the code:

 

 

proc sql;
 
select a.text, b.text,
compged(a.text, b.text) as Compged1,
soundex(a.text) as Soundex1,
soundex(b.text) as Soundex2
from work.import a, work.import b
where a.id <> b.id;
quit;

 

This is a portion of the results:

 

image11.png

 

The lower the COMPGED score, the more similar the sentences.  What I find most impressive is that sentences that SOUNDEX says are the same (the first two for example) COMPGED knows there are slight differences, so assigns a score of 100 (This versus Tis) and 200 (test versus taste).

 

So depending on what you need to do, COMPGED and / or SOUNDEX may be needed.  I’d be interested in seeing what you end up using and if you try both, how the results differ!

Has my article or post helped? Please mark as Solution or Like the article!
Damien_Mather
Lapis Lazuli | Level 10
if you have an academic at your university who teaches with SAS they might have registered one or more courses using AWS-cloud based SAS OnDemand for Academics, and depending on the course it might include SAS OnDemand Enterprise Miner with the text miner add-on. An instructor can upload your data and as a registered OnDemand student you could use the Text Miner.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to choose a machine learning algorithm

Use this tutorial as a handy guide to weigh the pros and cons of these commonly used machine learning algorithms.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 3 replies
  • 2132 views
  • 1 like
  • 3 in conversation