I don't know why I have not encountered this before, but...
I need to merge two SAS data sets, both with the same key variable, but in the lookup data set the key may only contain the first few characters of the key in the master data set. For some rows, the key values in both data sets will exactly match. In other rows, the lookup data set may only be partial, but still exact for as many characters as are in the key value. When the lookup data set has only part of the full key value (essentially, a prefix), it will still be unique: that is, no two rows in the lookup data set will have the same key value. So, the merge would need to be similar to performing a WHERE statement using the LIKE operator.
I was thinking of pursuing fuzzy matching, but that process seems too impercise for my needs. I don't risk incorrect matching as long as I can match based on the first few characters of the key in the lookup data set.
Any ideas on the right direction to take?
Can you provide examples of the two data steps?
Art, CEO, AnalystFinder.com
The two data sets might look something like this:
Master data set:
Contract Dollars
ABC12345 1000
BCD98765 5000
XYZ99999 100
Lookup data set:
Contract Customer
ABC12345 Army
BCD Navy
XYZ99999 Air Force
An exact match using contract as the BY variable will merge contracts ABC12345 and XYZ99999, but not contract BCD. but I want to match it with customer "Navy".
Here is one way:
data master; input Contract $ Dollars; cards; ABC12345 1000 BCD98765 5000 XYZ99999 100 ; data lookup; input Contract$ Customer $; cards; ABC12345 Army BCD Navy XYZ99999 Air Force ; proc sql noprint; create table want as SELECT a.*, b.Customer FROM master a left JOIN lookup b ON a.Contract LIKE catt('%', b.Contract, '%') ; quit;
Art, CEO, AnalystFinder.com
Thanks, I had a hunch the answer would invlove sharpening my SQL skills. I will give this a whirl later today.
If you search the Base SAS Programming board for "fuzzy match" you'll find a number of related posts and solutions.
Or Perl Regurlar Expression could you more power.
data master;
input Contract $ Dollars;
cards;
abc 1000
BCD98765 5000
XYZ99999 100
;
data lookup;
input Contract$ Customer $;
cards;
ABC12345 Army
BCD Navy
XYZ99999 Air Force
;
proc sql ;
SELECT a.*, b.Customer,b.Contract as b
FROM master a
inner JOIN lookup b
ON prxmatch(cats('/^',a.Contract,'/i'),b.Contract) or
prxmatch(cats('/^',b.Contract,'/i'),a.Contract)
;
quit;
I used this method on my 12.4 million row master table. After 24 hours I stopped it. It processed about 1/4 of the master table, but does have a match for each row. I'm going to visually inspect the matching to see if it appears to be accurate. Then, I may subset the master table and run the process in pieces, unless anyone has an idea how to make this run more efficiently. I could try indexing both the master table and the lookup table, but I don't know if indexes matter to this matching process. Thanks.
If you have a big table, then it is not suited for SQL.
Hash Table + Data Step might be more efficient.
What size of these two tables ?
Yeah. Try @art297 's SQL + LIKE firstly .
@CurtisSmithDCAA: Did you try the LIKE operator or the regular expression method? I would think that 12.4 Million records should not have posed any problem, in itself, for PROC SQL.
Art, CEO, AnalystFinder.com
Also need consider this scenario .
data master;
input Contract $ Dollars;
cards;
BCD 5000
;
data lookup;
input Contract$ Customer $;
cards;
XXBCDXX Navy
;
The SQL with the LIKE operator did run after several hours but the resulting table had more rows than the master table. I have not had a chance to diagnose why. The method with the Perl Regurlar Expression seems to work without creating addiditonal rows, but at the rate it was running would take about 4 days to complete.
@CurtisSmithDCAA: I think you just have to add DISTINCT to your select statement. Otherwise, you'll get multiple rows if there are both partial and full matches. e.g.:
data master; input Contract $ Dollars; cards; ABC12345 1000 BCD98765 5000 XYZ99999 100 ; data lookup; input Contract$ Customer $; cards; ABC12345 Army BCD Navy BCD98765 Navy XYZ99999 Air Force ; proc sql noprint; create table want as SELECT distinct a.*, b.Customer FROM master a left JOIN lookup b ON a.Contract LIKE catt('%', b.Contract, '%') ; quit;
Art, CEO, AnalystFinder.com
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.