Solved: Re: Merging Two Tables with complex conditions

lawatkey · Posted 03-13-2020 08:10 AM

To start, I have two sets of data that I would like to join as close to 1 to 1 as possible on 3 total criteria.

1) A.FIRST_NAME = B.FIRST_NAME

2) A.WEIGHT=B.WEIGHT

3) 0<= INTCK('DAY',A.WEIGHT,B.WEIGHT)<12

If I were to write this in a program currently, it would look like this:

DATA WORK.TABLEA;
INPUT PERSON_ID FIRST_NAME $ WEIGHT DATE DATE9.;
FORMAT DATE DATE9.;
CARDS;
1 PAUL 185 17JAN2020
2 JOHN 205 05JAN2020
3 KEVIN 180 01FEB2020
4 PAUL 185 17JAN2020
5 PAUL 185 17JAN2020
6 JOHN 221 25JAN2020
7 BEN 176 13FEB2020
8 PAUL 185 21JAN2020
;
RUN;

DATA WORK.TABLEB;
INPUT HEIGHT_ID FIRST_NAME $ WEIGHT DATE DATE9.;
FORMAT DATE DATE9.;
CARDS;
1 PAUL 185 18JAN2020
2 PAUL 185 20JAN2020
3 KEVIN 190 02FEB2020
4 KEVIN 180 05FEB2020
5 JOHN 205 06JAN2020
6 PAUL 185 21JAN2020
7 BEN 176 15FEB2020
8 PAUL 185 23JAN2020
9 BEN 180 01MAR2020
10 JOHN 205 10JAN2020
;
RUN;

PROC SQL;

CREATE TABLE WORK.MERGED_TABLE AS SELECT

A.*,

B.HEIGHT_ID,

INTCK('DAY',A.DATE,B.DATE) AS DATEDIF

FROM TABLEA A

LEFT JOIN TABLEB B

ON A.FIRST_NAME = B.FIRST_NAME

AND A.WEIGHT = B.WEIGHT

AND 0<= INTCK('DAY',A.DATE,B.DATE)<12

ORDER BY A.PERSON_ID, DATEDIF;

QUIT;

PROC SORT DATA=WORK.MERGED_TABLE NODUPKEY; BY PERSON_ID; RUN;

PROC SORT DATA=WORK.MERGED_TABLE NODUPKEY; BY HEIGHT_ID; RUN;

My above program doesn't give me the exact desired output even, since deduping on both IDs just ends up returning 5 values, but it would be something like that.

I essentially want SAS to choose the best join (based on lowest date difference with the other 2 criteria matching) and then stop evaluating BOTH lines (from table A and B) in any future joins. In current state, it is evaluating every line in table A against every line in table B. Even with an inner join and adding a step to left join back to Table A later, it is still extremely inefficient and sometimes just does not work. I have not done true SAS merges before, and any example I have seen has not had a sort of "flexible" join criteria. Thank you for any help!

Kurt_Bremser · Posted 03-13-2020 10:59 AM

OK, so now we have an algorithm that we can try to cast into code.

See this:

data want;
set tablea;
height_id = .;
if _n_ = 1
then do;
  declare hash b (dataset:"tableb (rename=(date=b_date))");
  b.definekey("first_name","weight","b_date");
  b.definedata("height_id");
  b.definedone();
end;
b_date = date;
rc = b.find();
do while (rc ne 0 and b_date < date + 12);
  b_date + 1;
  rc = b.find();
end;
if rc = 0
then do;
  datdif = b_date - date;
  rc = b.remove();
end;
drop rc b_date;
run;

The result is this:

1

PAUL

185

17JAN2020

1

2

JOHN

205

05JAN2020

5

1

3

KEVIN

180

01FEB2020

4

PAUL

185

17JAN2020

2

3

5

PAUL

185

17JAN2020

6

4

6

JOHN

221

25JAN2020

.

7

BEN

176

13FEB2020

7

2

8

PAUL

185

21JAN2020

8

2

We can't have height_id 6 in the last observation, as this was already used in observation 5.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 03-13-2020 09:12 AM

WEIGHT is not a date, so using intck() on it does not make sense.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

lawatkey · Posted 03-13-2020 09:29 AM

I had a typo in the code I edited earlier, that was meant to be the date, not weight

Kurt_Bremser · Posted 03-13-2020 09:54 AM

You problem is that you do not have proper identification for observations. For "Paul", you have three absolutely identical observations in tablea. How would you then select anything but the same observation from tableb if you search for the closest one?

Or should you first go and deduplicate tablea by name, weight and date?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

lawatkey · Posted 03-13-2020 10:00 AM

The unique "key" here is the Person ID. I am showing this example specifically because my actual data is transaction based, where you can have someone making the same sort of transaction 5 times in a row (10 dollars at a grocery store, from the same card, 5 times in that day), so while the transactions can LOOK identical, they are unique, based on their special key, which is the Person_ID in this example.

Kurt_Bremser · Posted 03-13-2020 10:16 AM

Then you would need to join on this unique key, but the person_id and height_id do not match with regard to the names.

Basically, you have multiple Paul-185-17jan2020 entries in A, and multiple entries for Paul-185 in B where the date fulfills your condition, so you will invariably get a cartesian join if you cannot specify an additional condition.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

lawatkey · Posted 03-13-2020 10:30 AM

Is there no way to do a "best case scenario" join for Person_ID 1 (if one exists), then stop evaluating Person_ID 1 and whatever Height ID it matched to? In this case Height ID 1.

So have SAS look at Person_ID1 and find a best match in all of table B, if one exists. In this case it is Height_ID 1.

Next, have SAS look at Person_ID2 and find a best match in all of table B EXCEPT Height_ID1, if one exists.

Repeat until done.

Kurt_Bremser · Posted 03-13-2020 10:59 AM

OK, so now we have an algorithm that we can try to cast into code.

See this:

data want;
set tablea;
height_id = .;
if _n_ = 1
then do;
  declare hash b (dataset:"tableb (rename=(date=b_date))");
  b.definekey("first_name","weight","b_date");
  b.definedata("height_id");
  b.definedone();
end;
b_date = date;
rc = b.find();
do while (rc ne 0 and b_date < date + 12);
  b_date + 1;
  rc = b.find();
end;
if rc = 0
then do;
  datdif = b_date - date;
  rc = b.remove();
end;
drop rc b_date;
run;

The result is this:

1

PAUL

185

17JAN2020

1

2

JOHN

205

05JAN2020

5

1

3

KEVIN

180

01FEB2020

4

PAUL

185

17JAN2020

2

3

5

PAUL

185

17JAN2020

6

4

6

JOHN

221

25JAN2020

.

7

BEN

176

13FEB2020

7

2

8

PAUL

185

21JAN2020

8

2

We can't have height_id 6 in the last observation, as this was already used in observation 5.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

lawatkey · Posted 03-13-2020 12:24 PM

I still have so much to learn about SAS... I can sort of understand the logic but it's nothing I've ever used before... I applied it to my real code and it worked wonderfully. Not only did it work, but it is FAST, less than a minute to jam 2-300k rows into 120 million. @Kurt_Bremser you are one smart guy, thank you so much!

Kurt_Bremser · Posted 03-13-2020 01:04 PM

The hash object is one of the most useful tools SAS has ever given us; it is worth every minute of learning you put into it.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ready to join fellow brilliant minds for the SAS Hackathon?