BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
lawatkey
Obsidian | Level 7

To start, I have two sets of data that I would like to join as close to 1 to 1 as possible on 3 total criteria. 

1) A.FIRST_NAME = B.FIRST_NAME

2) A.WEIGHT=B.WEIGHT

3) 0<= INTCK('DAY',A.WEIGHT,B.WEIGHT)<12

 

If I were to write this in a program currently, it would look like this:

 

DATA WORK.TABLEA;
INPUT PERSON_ID FIRST_NAME $ WEIGHT DATE DATE9.;
FORMAT DATE DATE9.;
CARDS;
1 PAUL 185 17JAN2020
2 JOHN 205 05JAN2020
3 KEVIN 180 01FEB2020
4 PAUL 185 17JAN2020
5 PAUL 185 17JAN2020
6 JOHN 221 25JAN2020
7 BEN 176 13FEB2020
8 PAUL 185 21JAN2020
;
RUN;

DATA WORK.TABLEB;
INPUT HEIGHT_ID FIRST_NAME $ WEIGHT DATE DATE9.;
FORMAT DATE DATE9.;
CARDS;
1 PAUL 185 18JAN2020
2 PAUL 185 20JAN2020
3 KEVIN 190 02FEB2020
4 KEVIN 180 05FEB2020
5 JOHN 205 06JAN2020
6 PAUL 185 21JAN2020
7 BEN 176 15FEB2020
8 PAUL 185 23JAN2020
9 BEN 180 01MAR2020
10 JOHN 205 10JAN2020
;
RUN;

 

PROC SQL;

CREATE TABLE WORK.MERGED_TABLE AS SELECT 

A.*,

B.HEIGHT_ID,

INTCK('DAY',A.DATE,B.DATE) AS DATEDIF

FROM TABLEA A

LEFT JOIN TABLEB B

ON A.FIRST_NAME = B.FIRST_NAME

AND A.WEIGHT = B.WEIGHT

AND 0<= INTCK('DAY',A.DATE,B.DATE)<12

ORDER BY A.PERSON_ID, DATEDIF;

QUIT;

 

PROC SORT DATA=WORK.MERGED_TABLE NODUPKEY; BY PERSON_ID; RUN;

PROC SORT DATA=WORK.MERGED_TABLE NODUPKEY; BY HEIGHT_ID; RUN;

 

My above program doesn't give me the exact desired output even, since deduping on both IDs just ends up returning 5 values, but it would be something like that.

 

I essentially want SAS to choose the best join (based on lowest date difference with the other 2 criteria matching) and then stop evaluating BOTH lines (from table A and B) in any future joins.  In current state, it is evaluating every line in table A against every line in table B.  Even with an inner join and adding a step to left join back to Table A later, it is still extremely inefficient and sometimes just does not work.  I have not done true SAS merges before, and any example I have seen has not had a sort of "flexible" join criteria.  Thank you for any help!

 

image.png

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

OK, so now we have an algorithm that we can try to cast into code.

See this:

 

data want;
set tablea;
height_id = .;
if _n_ = 1
then do;
  declare hash b (dataset:"tableb (rename=(date=b_date))");
  b.definekey("first_name","weight","b_date");
  b.definedata("height_id");
  b.definedone();
end;
b_date = date;
rc = b.find();
do while (rc ne 0 and b_date < date + 12);
  b_date + 1;
  rc = b.find();
end;
if rc = 0
then do;
  datdif = b_date - date;
  rc = b.remove();
end;
drop rc b_date;
run;

The result is this:

 

 

We can't have height_id 6 in the last observation, as this was already used in observation 5.

View solution in original post

9 REPLIES 9
lawatkey
Obsidian | Level 7
I had a typo in the code I edited earlier, that was meant to be the date, not weight
Kurt_Bremser
Super User

You problem is that you do not have proper identification for observations. For "Paul", you have three absolutely identical observations in tablea. How would you then select anything but the same observation from tableb if you search for the closest one?

 

Or should you first go and deduplicate tablea by name, weight and date?

lawatkey
Obsidian | Level 7
The unique "key" here is the Person ID. I am showing this example specifically because my actual data is transaction based, where you can have someone making the same sort of transaction 5 times in a row (10 dollars at a grocery store, from the same card, 5 times in that day), so while the transactions can LOOK identical, they are unique, based on their special key, which is the Person_ID in this example.
Kurt_Bremser
Super User

Then you would need to join on this unique key, but the person_id and height_id do not match with regard to the names.

 

Basically, you have multiple Paul-185-17jan2020 entries in A, and multiple entries for Paul-185 in B where the date fulfills your condition, so you will invariably get a cartesian join if you cannot specify an additional condition.

lawatkey
Obsidian | Level 7

 Is there no way to do a "best case scenario" join for Person_ID 1 (if one exists), then stop evaluating Person_ID 1 and whatever Height ID it matched to?  In this case Height ID 1.  

 

So have SAS look at Person_ID1 and find a best match in all of table B, if one exists.  In this case it is Height_ID 1.

Next, have SAS look at Person_ID2 and find a best match in all of table B EXCEPT Height_ID1, if one exists.  

Repeat until done.

Kurt_Bremser
Super User

OK, so now we have an algorithm that we can try to cast into code.

See this:

 

data want;
set tablea;
height_id = .;
if _n_ = 1
then do;
  declare hash b (dataset:"tableb (rename=(date=b_date))");
  b.definekey("first_name","weight","b_date");
  b.definedata("height_id");
  b.definedone();
end;
b_date = date;
rc = b.find();
do while (rc ne 0 and b_date < date + 12);
  b_date + 1;
  rc = b.find();
end;
if rc = 0
then do;
  datdif = b_date - date;
  rc = b.remove();
end;
drop rc b_date;
run;

The result is this:

 

 

We can't have height_id 6 in the last observation, as this was already used in observation 5.

lawatkey
Obsidian | Level 7

I still have so much to learn about SAS... I can sort of understand the logic but it's nothing I've ever used before... I applied it to my real code and it worked wonderfully.  Not only did it work, but it is FAST, less than a minute to jam 2-300k rows into 120 million.  @Kurt_Bremser you are one smart guy, thank you so much!

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 911 views
  • 2 likes
  • 2 in conversation