Re: Merging on a Partial/Prefix Key

CurtisSmithDCAA · Posted 05-19-2017 01:12 PM

I don't know why I have not encountered this before, but...

I need to merge two SAS data sets, both with the same key variable, but in the lookup data set the key may only contain the first few characters of the key in the master data set. For some rows, the key values in both data sets will exactly match. In other rows, the lookup data set may only be partial, but still exact for as many characters as are in the key value. When the lookup data set has only part of the full key value (essentially, a prefix), it will still be unique: that is, no two rows in the lookup data set will have the same key value. So, the merge would need to be similar to performing a WHERE statement using the LIKE operator.

I was thinking of pursuing fuzzy matching, but that process seems too impercise for my needs. I don't risk incorrect matching as long as I can match based on the first few characters of the key in the lookup data set.

Any ideas on the right direction to take?

art297 · Posted 05-19-2017 01:36 PM

Can you provide examples of the two data steps?

Art, CEO, AnalystFinder.com

CurtisSmithDCAA · Posted 05-19-2017 01:43 PM

The two data sets might look something like this:

Master data set:

Contract Dollars

ABC12345 1000

BCD98765 5000

XYZ99999 100

Lookup data set:

Contract Customer

ABC12345 Army

BCD Navy

XYZ99999 Air Force

An exact match using contract as the BY variable will merge contracts ABC12345 and XYZ99999, but not contract BCD. but I want to match it with customer "Navy".

art297 · Posted 05-19-2017 01:51 PM

Here is one way:

data master;
  input Contract $ Dollars;
  cards;
ABC12345        1000
BCD98765        5000
XYZ99999        100
;

data lookup;
  input Contract$ Customer $;
  cards;
ABC12345       Army
BCD                 Navy
XYZ99999       Air Force
;

proc sql noprint;
  create table want as
    SELECT a.*, b.Customer
      FROM master a
       left JOIN lookup b
         ON a.Contract LIKE catt('%', b.Contract, '%') 
  ;
quit;

Art, CEO, AnalystFinder.com

CurtisSmithDCAA · Posted 05-19-2017 01:56 PM

Thanks, I had a hunch the answer would invlove sharpening my SQL skills. I will give this a whirl later today.

ballardw · Posted 05-19-2017 06:34 PM

If you search the Base SAS Programming board for "fuzzy match" you'll find a number of related posts and solutions.

Ksharp · Posted 05-19-2017 11:55 PM

Or Perl Regurlar Expression could you more power.

data master;
  input Contract $ Dollars;
  cards;
abc        1000
BCD98765        5000
XYZ99999        100
;

data lookup;
  input Contract$ Customer $;
  cards;
ABC12345       Army
BCD                 Navy
XYZ99999       Air Force
;

proc sql ;
    SELECT a.*, b.Customer,b.Contract as b
      FROM master a
       inner JOIN lookup b
         ON prxmatch(cats('/^',a.Contract,'/i'),b.Contract) or 
            prxmatch(cats('/^',b.Contract,'/i'),a.Contract)
  ;
quit;

CurtisSmithDCAA · Posted 05-23-2017 09:53 AM

I used this method on my 12.4 million row master table. After 24 hours I stopped it. It processed about 1/4 of the master table, but does have a match for each row. I'm going to visually inspect the matching to see if it appears to be accurate. Then, I may subset the master table and run the process in pieces, unless anyone has an idea how to make this run more efficiently. I could try indexing both the master table and the lookup table, but I don't know if indexes matter to this matching process. Thanks.

Ksharp · Posted 05-23-2017 09:57 AM

If you have a big table, then it is not suited for SQL.

Hash Table + Data Step might be more efficient.

What size of these two tables ?

CurtisSmithDCAA · Posted 05-23-2017 10:05 AM

The master table is 12.4 million rows. The lookup table is about 1,000 rows.

##- Please type your reply above this line. Simple formatting, no
attachments. -##

Ksharp · Posted 05-23-2017 10:13 AM

Yeah. Try @art297 's SQL + LIKE firstly .

art297 · Posted 05-23-2017 10:01 AM

@CurtisSmithDCAA: Did you try the LIKE operator or the regular expression method? I would think that 12.4 Million records should not have posed any problem, in itself, for PROC SQL.

Art, CEO, AnalystFinder.com

Ksharp · Posted 05-23-2017 10:16 AM

Also need consider this scenario .

data master;
  input Contract $ Dollars;
  cards;
BCD        5000
;

data lookup;
  input Contract$ Customer $;
  cards;
XXBCDXX                Navy
;

CurtisSmithDCAA · Posted 05-23-2017 10:29 AM

The SQL with the LIKE operator did run after several hours but the resulting table had more rows than the master table. I have not had a chance to diagnose why. The method with the Perl Regurlar Expression seems to work without creating addiditonal rows, but at the rate it was running would take about 4 days to complete.

art297 · Posted 05-23-2017 10:45 AM

@CurtisSmithDCAA: I think you just have to add DISTINCT to your select statement. Otherwise, you'll get multiple rows if there are both partial and full matches. e.g.:

data master;
  input Contract $ Dollars;
  cards;
ABC12345        1000
BCD98765        5000
XYZ99999        100
;

data lookup;
  input Contract$ Customer $;
  cards;
ABC12345       Army
BCD            Navy
BCD98765       Navy
XYZ99999       Air Force
;

proc sql noprint;
  create table want as
    SELECT distinct a.*, b.Customer
      FROM master a
       left JOIN lookup b
         ON a.Contract LIKE catt('%', b.Contract, '%') 
  ;
quit;

Art, CEO, AnalystFinder.com

Registration is open

SAS Training: Just a Click Away