Hi,
I'm a student in my last year of university, and I'm working on some analysis for my bachelor's thesis.
I'm analyzing a moderately big dataset (20,000 rows for now, but that is only around 1% of the full data set) that includes a variable for item descriptions (e.g. "Cotton short-sleeved t-shirts"). So the question is: does SAS University Edition (basically Base SAS and SAS/STAT) have the capability of clustering text in any meaningful way? I'm looking to create a small number of categories for these items without having to figure out a categorization scheme myself and then go through each item to assign a category.
If not, is it possible to get my hands on something like SAS Text Miner for free, or should I be looking for a solution elsewhere?
Hi and welcome to the community! I think there's a lot you can do with SAS University Edition - I use it all the time (and, as a bit of self-promotion, even blog every Friday about it - click on my Avatar to see a list :-)).
Check out one of my posts (here) where I use PROC SQL and what are called "regular expressions" to do basic text analytics on Song Titles.
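To give a rough idea of what that can look like for item descriptions, here is a minimal sketch using PRXMATCH inside PROC SQL. The table work.items, the column description, the sample rows and the category names are all made up for illustration (they are not from the linked post or from your data), so treat the patterns as a starting point to adapt rather than a finished scheme.

```sas
/* Hypothetical sample data: work.items and its rows are invented for this sketch */
data work.items;
    infile datalines truncover;
    input description $60.;
    datalines;
Cotton short-sleeved t-shirts
Wool crew-neck sweater
Leather ankle boots
Cotton polo shirt
;
run;

proc sql;
    create table work.items_categorized as
    select description,
           case
               /* \b is a word boundary; the trailing i makes the match case-insensitive */
               when prxmatch('/\b(t-)?shirts?\b/i', description) > 0 then 'Shirts'
               when prxmatch('/\b(sweater|jumper|cardigan)s?\b/i', description) > 0 then 'Knitwear'
               when prxmatch('/\b(boot|shoe|sneaker)s?\b/i', description) > 0 then 'Footwear'
               /* anything no pattern catches is left for manual review */
               else 'Uncategorized'
           end as category length=20
    from work.items;
quit;
```

The 'Uncategorized' bucket is the useful part: it shows how many descriptions the keyword patterns miss, which tells you whether a handful of rules is enough or whether you need something closer to real clustering.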
Good luck - I love using PROC SQL and SAS University Edition, so please post back any other problems / questions you have!
Chris
Update to my previous reply:
I was at a Local User Group yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality. I'd never used COMPGED, so I decided to dig into it and I must admit, I'm a convert! I wanted to give you updated information so you too can see how cool this is.
I’ve created a dummy data set:
What I want to do is compare the rows in the TEXT column to see how similar the rows are. To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).
Here’s the code:
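proc sql;
    /* Self-join: pair every row with every other row */
    select a.text, b.text,
           compged(a.text, b.text) as Compged1,  /* generalized edit distance; lower = more similar */
           soundex(a.text) as Soundex1,
           soundex(b.text) as Soundex2
    from work.import a, work.import b
    where a.id <> b.id;  /* skip a row compared with itself */
quit;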
This is a portion of the results:
The lower the COMPGED score, the more similar the sentences. What I find most impressive is that where SOUNDEX says two sentences are the same (the first two, for example), COMPGED picks up the slight differences and assigns a score of 100 (This versus Tis) and 200 (test versus taste).
So depending on what you need to do, COMPGED and / or SOUNDEX may be needed. I’d be interested in seeing what you end up using and if you try both, how the results differ!
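If you want to turn the COMPGED scores into candidate groups, one possible next step is to keep each pair only once and filter on a cutoff. The sketch below is self-contained, but the table work.demo, its four rows and the cutoff of 300 are assumptions for illustration only; they are a stand-in, not my actual dummy data set, and the right cutoff will depend on how long and how noisy your descriptions are.

```sas
/* Hypothetical stand-in data: ids and sentences are invented for this sketch */
data work.demo;
    infile datalines dlm='|' truncover;
    input id text :$40.;
    datalines;
1|This is a test
2|Tis is a test
3|This is a taste
4|Completely different sentence
;
run;

proc sql;
    create table work.close_pairs as
    select a.id as id_a, b.id as id_b,
           a.text as text_a, b.text as text_b,
           /* upcase() on both sides so differences in case do not add to the score */
           compged(upcase(a.text), upcase(b.text)) as ged
    from work.demo a, work.demo b
    where a.id < b.id              /* keep each pair once, and never a row with itself */
      and calculated ged <= 300    /* assumed cutoff; tune it for your data */
    order by ged;
quit;
```

Using a.id < b.id instead of a.id <> b.id halves the output, because (1, 2) and (2, 1) would otherwise both appear with the same texts swapped. Keep in mind that the self-join grows with the square of the row count, so on 20,000 rows or more it is worth trying this on a sample first.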