BookmarkSubscribeRSS Feed
CurtisSmithDCAA
Obsidian | Level 7

I don't know why I have not encountered this before, but...

I need to merge two SAS data sets, both with the same key variable, but in the lookup data set the key may only contain the first few characters of the key in the master data set. For some rows, the key values in both data sets will exactly match. In other rows, the lookup data set may only be partial, but still exact for as many characters as are in the key value. When the lookup data set has only part of the full key value (essentially, a prefix), it will still be unique: that is, no two rows in the lookup data set will have the same key value. So, the merge would need to be similar to performing a WHERE statement using the LIKE operator.

 

I was thinking of pursuing fuzzy matching, but that process seems too impercise for my needs. I don't risk incorrect matching as long as I can match based on the first few characters of the key in the lookup data set.

 

Any ideas on the right direction to take?

23 REPLIES 23
art297
Opal | Level 21

Can you provide examples of the two data steps?

 

Art, CEO, AnalystFinder.com

 

CurtisSmithDCAA
Obsidian | Level 7

The two data sets might look something like this:

 

Master data set:

 

Contract            Dollars

ABC12345        1000

BCD98765        5000

XYZ99999        100

 

Lookup data set:

 

Contract           Customer

ABC12345       Army

BCD                 Navy

XYZ99999       Air Force

 

An exact match using contract as the BY variable will merge contracts ABC12345 and XYZ99999, but not contract BCD. but I want to match it with customer "Navy".

art297
Opal | Level 21

Here is one way:

data master;
  input Contract $ Dollars;
  cards;
ABC12345        1000
BCD98765        5000
XYZ99999        100
;

data lookup;
  input Contract$ Customer $;
  cards;
ABC12345       Army
BCD                 Navy
XYZ99999       Air Force
;

proc sql noprint;
  create table want as
    SELECT a.*, b.Customer
      FROM master a
       left JOIN lookup b
         ON a.Contract LIKE catt('%', b.Contract, '%') 
  ;
quit;

Art, CEO, AnalystFinder.com

 

CurtisSmithDCAA
Obsidian | Level 7

Thanks, I had a hunch the answer would invlove sharpening my SQL skills. I will give this a whirl later today.

ballardw
Super User

If you search the Base SAS Programming board for "fuzzy match" you'll find a number of related posts and solutions.

Ksharp
Super User

Or Perl Regurlar Expression could you more power.

 

data master;
  input Contract $ Dollars;
  cards;
abc        1000
BCD98765        5000
XYZ99999        100
;

data lookup;
  input Contract$ Customer $;
  cards;
ABC12345       Army
BCD                 Navy
XYZ99999       Air Force
;

proc sql ;
    SELECT a.*, b.Customer,b.Contract as b
      FROM master a
       inner JOIN lookup b
         ON prxmatch(cats('/^',a.Contract,'/i'),b.Contract) or 
            prxmatch(cats('/^',b.Contract,'/i'),a.Contract)
  ;
quit;
CurtisSmithDCAA
Obsidian | Level 7

I used this method on my 12.4 million row master table. After 24 hours I stopped it. It processed about 1/4 of the master table, but does have a match for each row. I'm going to visually inspect the matching to see if it appears to be accurate. Then, I may subset the master table and run the process in pieces, unless anyone has an idea how to make this run more efficiently. I could try indexing both the master table and the lookup table, but I don't know if indexes matter to this matching process. Thanks.

Ksharp
Super User

If you have a big table, then it is not suited for SQL.

Hash Table + Data Step might be more efficient.

What size of these two tables ?

CurtisSmithDCAA
Obsidian | Level 7
The master table is 12.4 million rows. The lookup table is about 1,000 rows.

##- Please type your reply above this line. Simple formatting, no
attachments. -##
Ksharp
Super User

Yeah. Try @art297 's SQL + LIKE firstly .

art297
Opal | Level 21

@CurtisSmithDCAA: Did you try the LIKE operator or the regular expression method? I would think that 12.4 Million records should not have posed any problem, in itself, for PROC SQL.

 

Art, CEO, AnalystFinder.com

 

Ksharp
Super User

Also need consider this scenario .

 

data master;
  input Contract $ Dollars;
  cards;
BCD        5000
;

data lookup;
  input Contract$ Customer $;
  cards;
XXBCDXX                Navy
;

 

CurtisSmithDCAA
Obsidian | Level 7

The SQL with the LIKE operator did run after several hours but the resulting table had more rows than the master table. I have not had a chance to diagnose why. The method with the Perl Regurlar Expression seems to work without creating addiditonal rows, but at the rate it was running would take about 4 days to complete.

art297
Opal | Level 21

@CurtisSmithDCAA: I think you just have to add DISTINCT to your select statement. Otherwise, you'll get multiple rows if there are both partial and full matches. e.g.:

data master;
  input Contract $ Dollars;
  cards;
ABC12345        1000
BCD98765        5000
XYZ99999        100
;

data lookup;
  input Contract$ Customer $;
  cards;
ABC12345       Army
BCD            Navy
BCD98765       Navy
XYZ99999       Air Force
;

proc sql noprint;
  create table want as
    SELECT distinct a.*, b.Customer
      FROM master a
       left JOIN lookup b
         ON a.Contract LIKE catt('%', b.Contract, '%') 
  ;
quit;

Art, CEO, AnalystFinder.com

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 23 replies
  • 3370 views
  • 0 likes
  • 4 in conversation