Hi all,
I have a text. "ADITYA SASUSER EG" which could be split over multiple lines or it could be in the same line so I am pulling this using
upper(line1)||upper(line2)||upper(line3)||upper(line4) like '%ADITYA%SASUSER%EG%'
but i would like to improvise it more so if the name is ADTYA SASUSR EG or just SASUSER EG or other spelling mistakes, it still gets pulled. Can you please help me with a logic to do this?
Thanks,
Aditya
Hi Aditya,
Please check my code below where I derived two new flags , flag1 and flag2 , flag1 is populated as 1 when there is a spelling mistake and flag2 is populated when there is no complete text like missing ADITYA, so please check and let me know if it meets your requirement.
data have;
input text &:$200.;
cards;
ADTYA
SASUSR
EG
SASUSER EG
;
data want;
set have;
if prxmatch('m/ADITYA|SASUSER|EG/oi',text)=0 then flag1=1;
if prxmatch('m/ADITYA SASUSER EG/oi',text)=0 then flag2=1;
run;
You will probably want to take a look at the COMPGED function, and perhaps the related routine CALL COMPCOST.
The COMPGED calculates the "spelling distance" between two strings, where different operations (insert, append, delete characters, etc) are assigned different values. Using call compcost, you can assign your own "cost" of various operations, and select values that fit your demands.
So, you may want to try out something like this first:
data test;
set have;
spelldist=compged('ADITA SASUSER EC',catx(' ',of line1-line4),1000,'I');
run;
Explanation: 1000 is a cutoff value (if spelling distance is greater than that, the functions stops and returns 1000, in order to speed things up), the 'I' parameter means that the function should disregard case.
Try something like this, see what seems to be a good maximum spelldist value on your output data, and drop (or whatever) the records that exceed that.
A mentioned, if you are really doing serious data cleansing, you may want to customize your criteria by using CALL COMPCOST before calling COMPGED.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.