- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have two datasets and below is simple example where two datasets exist "dataset a" and "dataset b" now my objective is to do lookup using a pattern string from "dataset b" and look for matches in "dataset a" ...Tried with SQL as below however doesn't seems to work
PROC SQL;
CREATE TABLE LOOKUP AS
SELECT A.*
FROM TMS A,
FILT B
WHERE A.DSN LIKE B.DSN;
QUIT;
Dataset A (variable name is DSN)
ZZZZ.TESTRV.CATT.LENGTH.WORMTEST
Dataset B(Variable name is DSN)
zzzz.TEST%%.CATT.LENGTH
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi! Welcome to the community - hopefully you will get a lot out of participating here.
I think the solution to your problem is the SOUNDEX function that can be used in PROC SQL. Check out this great paper - I have this paper as a bookmark on my computer because I use it all the time.
Basically, SOUNDEX takes your string (say McDonald) and converts it to alphanumeric based on the letters / letter combinations. Say for arguement's sake it converts it to M345. If you compared this to MacDonald, you would get a match - the two are interpreted as the same. You'd need to do further comparisons (first name, date of birth, etc) but at least it gets you started.
I've written a blog post (using SAS University Edition, but the concepts are identical) - check it out here and please let me know if you need further help!!
PROC SQL is a great tool and I'm happy to help out when I can!
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Update to my previous reply:
I was at a Local User Group yesterday and we were talking about preliminary text analysis; I mentioned SOUNDEX, and the presenter said that COMPGED has much better functionality. I’ve never used COMPGED, so decided to dig into it and I must admit – I’m a convert! I wanted to give you updated information so you to can see how cool this is.
I’ve created a dummy data set:
What I want to do is compare the rows in the TEXT column to see how similar the rows are. To do this, I have to join the dataset to itself, and then I want to exclude those rows where the IDs are a match (because it would be the same row compared to itself).
Here’s the code:
proc sql;
select a.text, b.text,
compged(a.text, b.text) as Compged1,
soundex(a.text) as Soundex1,
soundex(b.text) as Soundex2
from work.import a, work.import b
where a.id <> b.id;
quit;
This is a portion of the results:
The lower the COMPGED score, the more similar the sentences. What I find most impressive is that sentences that SOUNDEX says are the same (the first two for example) COMPGED knows there are slight differences, so assigns a score of 100 (This versus Tis) and 200 (test versus taste).
So depending on what you need to do, COMPGED and / or SOUNDEX may be needed. I’d be interested in seeing what you end up using and if you try both, how the results differ!