BookmarkSubscribeRSS Feed
adityaa9z
Obsidian | Level 7

Hi all,

 

I have a text. "ADITYA SASUSER EG" which could be split over multiple lines or it could be in the same line so I am pulling this using

 

upper(line1)||upper(line2)||upper(line3)||upper(line4) like '%ADITYA%SASUSER%EG%'

 

but i would like to improvise it more so if the name is ADTYA SASUSR EG or just SASUSER EG or other spelling mistakes, it still gets pulled. Can you please help me with a logic to do this?

 

Thanks,

Aditya

2 REPLIES 2
Jagadishkatam
Amethyst | Level 16

Hi Aditya,

 

Please check my code below where I derived two new flags , flag1 and flag2 , flag1 is populated as 1 when there is a spelling mistake and flag2 is populated when there is no complete text like missing ADITYA, so please check and let me know if it meets your requirement.

 

data have;
input text &:$200.;
cards;
ADTYA 
SASUSR 
EG
SASUSER EG
;

data want;
set have;
if prxmatch('m/ADITYA|SASUSER|EG/oi',text)=0 then flag1=1;
if prxmatch('m/ADITYA SASUSER EG/oi',text)=0 then flag2=1;
run;
Thanks,
Jag
s_lassen
Meteorite | Level 14

You will probably want to take a look at the COMPGED function, and perhaps the related routine CALL COMPCOST.

 

The COMPGED calculates the "spelling distance" between two strings, where different operations (insert, append, delete characters, etc) are assigned different values. Using call compcost, you can assign your own "cost" of various operations, and select values that fit your demands.

 

So, you may want to try out something like this first:

data test;
  set have;
  spelldist=compged('ADITA SASUSER EC',catx(' ',of line1-line4),1000,'I');
run;

Explanation: 1000 is a cutoff value (if spelling distance is greater than that, the functions stops and returns 1000, in order to speed things up), the 'I' parameter means that the function should disregard case.

 

Try something like this, see what seems to be a good maximum spelldist value on your output data, and drop (or whatever) the records that exceed that.

 

A mentioned, if you are really doing serious data cleansing, you may want to customize your criteria by using CALL COMPCOST before calling COMPGED.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 2 replies
  • 393 views
  • 0 likes
  • 3 in conversation