To start, I have two sets of data that I would like to join as close to 1 to 1 as possible on 3 total criteria.
1) A.FIRST_NAME = B.FIRST_NAME
2) A.WEIGHT=B.WEIGHT
3) 0<= INTCK('DAY',A.WEIGHT,B.WEIGHT)<12
If I were to write this in a program currently, it would look like this:
DATA WORK.TABLEA;
INPUT PERSON_ID FIRST_NAME $ WEIGHT DATE DATE9.;
FORMAT DATE DATE9.;
CARDS;
1 PAUL 185 17JAN2020
2 JOHN 205 05JAN2020
3 KEVIN 180 01FEB2020
4 PAUL 185 17JAN2020
5 PAUL 185 17JAN2020
6 JOHN 221 25JAN2020
7 BEN 176 13FEB2020
8 PAUL 185 21JAN2020
;
RUN;
DATA WORK.TABLEB;
INPUT HEIGHT_ID FIRST_NAME $ WEIGHT DATE DATE9.;
FORMAT DATE DATE9.;
CARDS;
1 PAUL 185 18JAN2020
2 PAUL 185 20JAN2020
3 KEVIN 190 02FEB2020
4 KEVIN 180 05FEB2020
5 JOHN 205 06JAN2020
6 PAUL 185 21JAN2020
7 BEN 176 15FEB2020
8 PAUL 185 23JAN2020
9 BEN 180 01MAR2020
10 JOHN 205 10JAN2020
;
RUN;
PROC SQL;
CREATE TABLE WORK.MERGED_TABLE AS SELECT
A.*,
B.HEIGHT_ID,
INTCK('DAY',A.DATE,B.DATE) AS DATEDIF
FROM TABLEA A
LEFT JOIN TABLEB B
ON A.FIRST_NAME = B.FIRST_NAME
AND A.WEIGHT = B.WEIGHT
AND 0<= INTCK('DAY',A.DATE,B.DATE)<12
ORDER BY A.PERSON_ID, DATEDIF;
QUIT;
PROC SORT DATA=WORK.MERGED_TABLE NODUPKEY; BY PERSON_ID; RUN;
PROC SORT DATA=WORK.MERGED_TABLE NODUPKEY; BY HEIGHT_ID; RUN;
My above program doesn't give me the exact desired output even, since deduping on both IDs just ends up returning 5 values, but it would be something like that.
I essentially want SAS to choose the best join (based on lowest date difference with the other 2 criteria matching) and then stop evaluating BOTH lines (from table A and B) in any future joins. In current state, it is evaluating every line in table A against every line in table B. Even with an inner join and adding a step to left join back to Table A later, it is still extremely inefficient and sometimes just does not work. I have not done true SAS merges before, and any example I have seen has not had a sort of "flexible" join criteria. Thank you for any help!
OK, so now we have an algorithm that we can try to cast into code.
See this:
data want;
set tablea;
height_id = .;
if _n_ = 1
then do;
declare hash b (dataset:"tableb (rename=(date=b_date))");
b.definekey("first_name","weight","b_date");
b.definedata("height_id");
b.definedone();
end;
b_date = date;
rc = b.find();
do while (rc ne 0 and b_date < date + 12);
b_date + 1;
rc = b.find();
end;
if rc = 0
then do;
datdif = b_date - date;
rc = b.remove();
end;
drop rc b_date;
run;
The result is this:
1 | 1 | PAUL | 185 | 17JAN2020 | 1 | 1 |
2 | 2 | JOHN | 205 | 05JAN2020 | 5 | 1 |
3 | 3 | KEVIN | 180 | 01FEB2020 | 4 | 4 |
4 | 4 | PAUL | 185 | 17JAN2020 | 2 | 3 |
5 | 5 | PAUL | 185 | 17JAN2020 | 6 | 4 |
6 | 6 | JOHN | 221 | 25JAN2020 | . | . |
7 | 7 | BEN | 176 | 13FEB2020 | 7 | 2 |
8 | 8 | PAUL | 185 | 21JAN2020 | 8 | 2 |
We can't have height_id 6 in the last observation, as this was already used in observation 5.
WEIGHT is not a date, so using intck() on it does not make sense.
You problem is that you do not have proper identification for observations. For "Paul", you have three absolutely identical observations in tablea. How would you then select anything but the same observation from tableb if you search for the closest one?
Or should you first go and deduplicate tablea by name, weight and date?
Then you would need to join on this unique key, but the person_id and height_id do not match with regard to the names.
Basically, you have multiple Paul-185-17jan2020 entries in A, and multiple entries for Paul-185 in B where the date fulfills your condition, so you will invariably get a cartesian join if you cannot specify an additional condition.
Is there no way to do a "best case scenario" join for Person_ID 1 (if one exists), then stop evaluating Person_ID 1 and whatever Height ID it matched to? In this case Height ID 1.
So have SAS look at Person_ID1 and find a best match in all of table B, if one exists. In this case it is Height_ID 1.
Next, have SAS look at Person_ID2 and find a best match in all of table B EXCEPT Height_ID1, if one exists.
Repeat until done.
OK, so now we have an algorithm that we can try to cast into code.
See this:
data want;
set tablea;
height_id = .;
if _n_ = 1
then do;
declare hash b (dataset:"tableb (rename=(date=b_date))");
b.definekey("first_name","weight","b_date");
b.definedata("height_id");
b.definedone();
end;
b_date = date;
rc = b.find();
do while (rc ne 0 and b_date < date + 12);
b_date + 1;
rc = b.find();
end;
if rc = 0
then do;
datdif = b_date - date;
rc = b.remove();
end;
drop rc b_date;
run;
The result is this:
1 | 1 | PAUL | 185 | 17JAN2020 | 1 | 1 |
2 | 2 | JOHN | 205 | 05JAN2020 | 5 | 1 |
3 | 3 | KEVIN | 180 | 01FEB2020 | 4 | 4 |
4 | 4 | PAUL | 185 | 17JAN2020 | 2 | 3 |
5 | 5 | PAUL | 185 | 17JAN2020 | 6 | 4 |
6 | 6 | JOHN | 221 | 25JAN2020 | . | . |
7 | 7 | BEN | 176 | 13FEB2020 | 7 | 2 |
8 | 8 | PAUL | 185 | 21JAN2020 | 8 | 2 |
We can't have height_id 6 in the last observation, as this was already used in observation 5.
I still have so much to learn about SAS... I can sort of understand the logic but it's nothing I've ever used before... I applied it to my real code and it worked wonderfully. Not only did it work, but it is FAST, less than a minute to jam 2-300k rows into 120 million. @Kurt_Bremser you are one smart guy, thank you so much!
The hash object is one of the most useful tools SAS has ever given us; it is worth every minute of learning you put into it.
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.