Re: Cleaning up string variable with many values using a lookup table

Walternate · Posted 11-04-2022 04:03 PM

Hi,

I have a dataset that includes a string variable with raw, unclean values, eg:

My_string

abc de

defg-hi

jklMNOP

The dirt includes but is not limited to: punctuation, spacing, capitalization, and even typos (like random extra letters inserted into a value).

I also have a lookup table which consists of the set of valid values. In theory, each value in the raw string variable should map to the lookup table.

Lookup

abcde

defghi

jklmnop

What I want to do is convert the dirty values into the clean lookup table values in a way that makes clear what is and is not being matched so I can identify additional cleaning steps that need to happen. Basically, the end product I want is the string variable with the set of valid, clean values that exist in the lookup table.

I know how to do all the string cleanup type parts of this, but am not sure how else to proceed logistically.

Reeza · Posted 11-04-2022 04:24 PM

How big are your data sets?
You don't really need to clean up because you'll be matching so if you match you can just take the new matched value.

The 'simplest' solution to code is to compare each value to every value in the lookup table and use SOUNDEX or COMPGED to find the closest matches. Then figure out a process to identify which is the correct match. It depends on the results. But this can be computationally intensive so it depends on the size of the datasets. There are some efficient ways, ie loading the lookup table to a temporary array is one as it uses memory and SAS looping to do it efficiently. Hash solutions are another.

Walternate · Posted 11-04-2022 04:26 PM

Fewer than 200 records, so I assume that means what you propose is a viable solution.

I will play around with it.

ballardw · Posted 11-04-2022 05:42 PM

Do you have, or expect multiple files to process with the same or similar issues? If not I might be tempted to manually edit if there are only 200 or so values.

Trying to get programming logic to work except for a few case, the punctuation and special characters, might take much more time than the manual edit.

COMPRESS will remove specified characters and has some short cut syntax options to remove digits, punctuation and space characters. Example:

data example;
   infile datalines missover;
   informat my_string $10.;
   input My_string 1-10;
   length newvar $ 10.;
   newvar=compress(my_string,,'psd');
datalines;
abc de
defg-hi
jklMNOP
abd.?!de
pdq345zx
;

If the problem is a matter of uppercase letters that should be lower, or vice versa, then the function Lowcase or Upcase might be in order. Below forces all the letters to be lower case.

data example;
   infile datalines missover;
   informat my_string $10.;
   input My_string 1-10;
   length newvar $ 10.;
   newvar=lowcase(compress(my_string,,'psd') );
datalines;
abc de
defg-hi
jklMNOP
abd.?!de
pdq345zx
;

But unless the random letters are consistent and not used elsewhere then more work is involved.

Likely COMPRESS before attempting any spelling distance approach would help.

Reeza · Posted 11-04-2022 06:20 PM

Then a SQL cross join and filter out afterwards.

Data lookup;
	Input company $30.;
	cards;
Vanucover
Reliance
Tata
Tata Motors
;
run;

data rawData;
	Input company $30.;
	id=_n_;
	cards;
Tata
tata
Tataz
TataM Jan2015
Tata Motor
Reliance World
Reliance Ltd
Reliance #Petro
Reliance Global
Vanucoverltd 12 Company
;
run;

proc sql;
	create table comparison as select rd.id, rd.company, l.company as matched, 
		compged(rd.company, l.company, 'ln') as diff_score from rawData rd left join 
		lookup l on compged(rd.company, l.company, 'ln') < 400 order by id, 
		diff_score;
quit;

https://www.listendata.com/2016/03/fuzzy-matching-with-sas.html#:~:text=The%20COMPGED%20function%20r....

See modifiers (LN) definitions here, may want to add the i modifier.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p1r4l9jwgatggtn1ko81fyjys4s7.h...

Cleaning up string variable with many values using a lookup table