BookmarkSubscribeRSS Feed
Walternate
Obsidian | Level 7

Hi,

 

I have a dataset that includes a string variable with raw, unclean values, eg:

 

My_string

abc de

defg-hi

jklMNOP

 

The dirt includes but is not limited to: punctuation, spacing, capitalization, and even typos (like random extra letters inserted into a value). 

 

I also have a lookup table which consists of the set of valid values. In theory, each value in the raw string variable should map to the lookup table. 

 

Lookup

abcde

defghi

jklmnop

 

What I want to do is convert the dirty values into the clean lookup table values in a way that makes clear what is and is not being matched so I can identify additional cleaning steps that need to happen. Basically, the end product I want is the string variable with the set of valid, clean values that exist in the lookup table. 

 

I know how to do all the string cleanup type parts of this, but am not sure how else to proceed logistically. 

 

 

4 REPLIES 4
Reeza
Super User
How big are your data sets?
You don't really need to clean up because you'll be matching so if you match you can just take the new matched value.

The 'simplest' solution to code is to compare each value to every value in the lookup table and use SOUNDEX or COMPGED to find the closest matches. Then figure out a process to identify which is the correct match. It depends on the results. But this can be computationally intensive so it depends on the size of the datasets. There are some efficient ways, ie loading the lookup table to a temporary array is one as it uses memory and SAS looping to do it efficiently. Hash solutions are another.
Walternate
Obsidian | Level 7

Fewer than 200 records, so I assume that means what you propose is a viable solution. 

 

I will play around with it. 

ballardw
Super User

Do you have, or expect multiple files to process with the same or similar issues? If not I might be tempted to manually edit if there are only 200 or so values.

Trying to get programming logic to work except for a few case, the punctuation and special characters, might take much more time than the manual edit.

 

COMPRESS will remove specified characters and has some short cut syntax options to remove digits, punctuation and space characters. Example:

data example;
   infile datalines missover;
   informat my_string $10.;
   input My_string 1-10;
   length newvar $ 10.;
   newvar=compress(my_string,,'psd');
datalines;
abc de
defg-hi
jklMNOP
abd.?!de
pdq345zx
;

If the problem is a matter of uppercase letters that should be lower, or vice versa, then the function Lowcase or Upcase might be in order. Below forces all the letters to be lower case.

data example;
   infile datalines missover;
   informat my_string $10.;
   input My_string 1-10;
   length newvar $ 10.;
   newvar=lowcase(compress(my_string,,'psd') );
datalines;
abc de
defg-hi
jklMNOP
abd.?!de
pdq345zx
;

 

But unless the random letters are consistent and not used elsewhere then more work is involved.

Likely COMPRESS before attempting any spelling distance approach would help.

Reeza
Super User

Then a SQL cross join and filter out afterwards.

 

Data lookup;
	Input company $30.;
	cards;
Vanucover
Reliance
Tata
Tata Motors
;
run;

data rawData;
	Input company $30.;
	id=_n_;
	cards;
Tata
tata
Tataz
TataM Jan2015
Tata Motor
Reliance World
Reliance Ltd
Reliance #Petro
Reliance Global
Vanucoverltd 12 Company
;
run;

proc sql;
	create table comparison as select rd.id, rd.company, l.company as matched, 
		compged(rd.company, l.company, 'ln') as diff_score from rawData rd left join 
		lookup l on compged(rd.company, l.company, 'ln') < 400 order by id, 
		diff_score;
quit;

https://www.listendata.com/2016/03/fuzzy-matching-with-sas.html#:~:text=The%20COMPGED%20function%20r....

 

See modifiers (LN) definitions here, may want to add the i modifier.

https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.2/lefunctionsref/p1r4l9jwgatggtn1ko81fyjys4s7.h...

 

 

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 16. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 387 views
  • 0 likes
  • 3 in conversation