Hi,
I have a dataset that includes a string variable with raw, unclean values, eg:
My_string
abc de
defg-hi
jklMNOP
The dirt includes but is not limited to: punctuation, spacing, capitalization, and even typos (like random extra letters inserted into a value).
I also have a lookup table which consists of the set of valid values. In theory, each value in the raw string variable should map to the lookup table.
Lookup
abcde
defghi
jklmnop
What I want to do is convert the dirty values into the clean lookup table values in a way that makes clear what is and is not being matched so I can identify additional cleaning steps that need to happen. Basically, the end product I want is the string variable with the set of valid, clean values that exist in the lookup table.
I know how to do all the string cleanup type parts of this, but am not sure how else to proceed logistically.
Fewer than 200 records, so I assume that means what you propose is a viable solution.
I will play around with it.
Do you have, or expect multiple files to process with the same or similar issues? If not I might be tempted to manually edit if there are only 200 or so values.
Trying to get programming logic to work except for a few case, the punctuation and special characters, might take much more time than the manual edit.
COMPRESS will remove specified characters and has some short cut syntax options to remove digits, punctuation and space characters. Example:
data example; infile datalines missover; informat my_string $10.; input My_string 1-10; length newvar $ 10.; newvar=compress(my_string,,'psd'); datalines; abc de defg-hi jklMNOP abd.?!de pdq345zx ;
If the problem is a matter of uppercase letters that should be lower, or vice versa, then the function Lowcase or Upcase might be in order. Below forces all the letters to be lower case.
data example; infile datalines missover; informat my_string $10.; input My_string 1-10; length newvar $ 10.; newvar=lowcase(compress(my_string,,'psd') ); datalines; abc de defg-hi jklMNOP abd.?!de pdq345zx ;
But unless the random letters are consistent and not used elsewhere then more work is involved.
Likely COMPRESS before attempting any spelling distance approach would help.
Then a SQL cross join and filter out afterwards.
Data lookup;
Input company $30.;
cards;
Vanucover
Reliance
Tata
Tata Motors
;
run;
data rawData;
Input company $30.;
id=_n_;
cards;
Tata
tata
Tataz
TataM Jan2015
Tata Motor
Reliance World
Reliance Ltd
Reliance #Petro
Reliance Global
Vanucoverltd 12 Company
;
run;
proc sql;
create table comparison as select rd.id, rd.company, l.company as matched,
compged(rd.company, l.company, 'ln') as diff_score from rawData rd left join
lookup l on compged(rd.company, l.company, 'ln') < 400 order by id,
diff_score;
quit;
See modifiers (LN) definitions here, may want to add the i modifier.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.