DATA Step, Macro, Functions and more

Merge datasets based on nearest date and group

Accepted Solution Solved
Reply
Frequent Contributor
Posts: 98
Accepted Solution

Merge datasets based on nearest date and group

Hi everyone,

I have a very specific match/merging problem that I have been unable to solve on my own, or find a ready solution to searching the web. Maybe I'm just not thinking about the problem the right way, but here are some example datasets to show what I have and what I want.

Here are the two datasets that I have:

DATA have_a;

input id year_1 group1 $;

datalines;

1 2002       a

2 2001       a

3 1997       a

4 1995       a

5 1991       a

6 1989       b

7 1988       b

8 1996       b

9 1989       b

10 2003      b

11 2002      c

12 2000      c

13 1987      c

14 2001      c

15 2002      c

;;;;

run;


DATA have_b;

input year_2 score group2 $;

datalines;

1999 5 a

1999 2 b

1999 7 c

2004 1 a

2004 4 b

2004 3 c

;;;;

run;

My actual datasets are much larger than this, but are in this basic format. So the first dataset (have_a) has a bunch of individual records indexed by an individual ID, a year, and a group assignment. The second dataset (have_b) is group data that needs to be merged into the first dataset; there is a "score" associated with each group and a specific year. The years in this second dataset are not the same as any of the years in the other dataset.

What I need is to merge the information from the second dataset into the first dataset, in such a way that the score associated with a specific group and the year closest to an individual's recorded year is associated with their ID. To put it another way, each individual has a year and a group assignment; I want that individual to also have the score for that group at the closest year a score was recorded for that group. Here is an example dataset of what I want after the merge:

DATA want;

input id year_1 year_2 score group $;

datalines;

1 2002   2004  1  a

2 2001   1999  5  a

3 1997   1999  5  a

4 2005   2004  1  a

5 1991   1999  5  a

6 1989   1999  2  b

7 2002   2004  4  b

8 1996   1999  2  b

9 1989   1999  2  b

10 2003  2004  4  b

11 2002  2004  3  c

12 2000  1999  7  c

13 1987  1999  7  c

14 2001  1999  7  c

15 2002  2004  3  c

;;;;

run;

So you can see that each the dataset maintains the basic format of the first dataset (have_a). But now, each individual has the closest group year and that score (from have_b) associated with its ID. For example, for individual 1, their recorded year was 2002 and their group assignment was A. So, the information from have_b that I need attached to individual 1 is the year 2004 (because it is closer to 2002 than 1999) and the score for group A in 2004 (which you can see is 1 in have_b).  Individual 2 had the recorded year 2001 and assigned to group A, so they get the score for group A in 1999 which is 5. And so on and so forth. Make sense?

I am still undecided what to do in case of ties. For now, I am operating in the case that if there is any tie, the "older" year takes preference. That is, if the two group years are 2000 and 2005, and the individual year is 2003, we would go with 2000 even though both are equidistant.

Does anybody have any idea how to proceed? I am at a loss. I have fooled around with a few different solutions, but none of them have come particularly close to doing what I need.


Accepted Solutions
Solution
‎07-31-2015 05:54 PM
Super User
Posts: 11,343

Re: Merge datasets based on nearest date and group

Posted in reply to RyanSimmons

The ID in your WANT example doesn't match the starting have_a dataset. You have ID 4 in have_a associated with year1 of 1995 but the WANT data has 2005. Is that the correct "want"? Similar change for ID 7,

Ignoring the want ID values this seems to do some of what you want:

proc sql;
   create table temp as
   select a.*, b.year_2, b.score
   from have_a as a join have_b as b
      on a.group1=b.group2
   order by id,group1, year_1,abs(a.year_1 - b.year_2)
   ;
quit;

data want;
   set temp;
   by id group1 year_1;
   if first.year_1;
run;

View solution in original post


All Replies
Super User
Posts: 19,787

Re: Merge datasets based on nearest date and group

Posted in reply to RyanSimmons

There are two ways - one would be to create a value for every year based on some rule so that you could do a year to year merge.

The second is to merge using a SQL join, do you have a window that you want to look in, ie is it okay for a 2009 year to take a 1987 value?

Solution
‎07-31-2015 05:54 PM
Super User
Posts: 11,343

Re: Merge datasets based on nearest date and group

Posted in reply to RyanSimmons

The ID in your WANT example doesn't match the starting have_a dataset. You have ID 4 in have_a associated with year1 of 1995 but the WANT data has 2005. Is that the correct "want"? Similar change for ID 7,

Ignoring the want ID values this seems to do some of what you want:

proc sql;
   create table temp as
   select a.*, b.year_2, b.score
   from have_a as a join have_b as b
      on a.group1=b.group2
   order by id,group1, year_1,abs(a.year_1 - b.year_2)
   ;
quit;

data want;
   set temp;
   by id group1 year_1;
   if first.year_1;
run;

Frequent Contributor
Posts: 98

Re: Merge datasets based on nearest date and group

Yup, the ID differences were typos. I entered in those example datalines manually, because my work computer has Internet Explorer, which for some reason just can't sensibly interact with these forums in terms of copy-pasting SAS code.

Anyway, thank you for all of the different recommendations everyone! I am going to test them all out and see which one works best for my needs and then accept an answer.

Frequent Contributor
Posts: 98

Re: Merge datasets based on nearest date and group

Posted in reply to RyanSimmons

I decided to accept the answer given by ballardw. It does what I need it to do, and I think makes the most sense for my context overall. I marked the answer by xia keshan as a helpful answer; it also accomplishes the same goal, by making use of hash objects instead of SQL. I admit that I know very little about hash objects in general; it is faster (0.10 seconds compared to 0.18 for the PROC SQL/DATA step combination), but I am not sure if it necessarily makes sense to create a hash object for my particular case (my impression is that they are most useful when you want to re-reference that hash object multiple times in both PROCS and DATA steps, though I may be mistaken). Feel free to correct me if I am wrong, here, for those who are more knowledgeable on the topic.

EDIT: As an interesting though trivial sidenote, using the PROC SQL/DATA step combination provided by ballardw results in the new values being added to the right side of the dataset. That is, the columns are added to the end. Using the hash object provided by xia keshan, however, the columns are added to the beginning (on the left) side of the dataset. Ultimately, it makes no difference at all, but it is an interesting curiosity.

The solution provided by Chris@NewZealand did not work for me. Since two working answers were provided, I don't currently feel the need to look into it and debug exactly why; however, his code when applied to my data created a dataset with 45,782 rows, instead of the correct 44,608. It isn't immediately obvious to me why, though I suspect the answer may have to do with some ties so that multiple correct matches are being included for certain observations, when I only want one "correct" match per observation.

In any case, thank you everybody for your help!

PROC Star
Posts: 1,759

Re: Merge datasets based on nearest date and group

Posted in reply to RyanSimmons

Yes, if you don't want the duplicates and don't care which ones you keep, you can run a proc sort nodupkey after proc sql, or you can run a data step like in the solution you chose.
Thank you in any case for the good etiquette of replying, and closing your question.

Super User
Posts: 10,023

Re: Merge datasets based on nearest date and group

Posted in reply to RyanSimmons

You mean the minimize abs(year_1-year_2)  ?

Code: Program

DATA have_a;
input id year_1 group1 $;
datalines;
1 2002 a
2 2001 a
3 1997 a
4 1995 a
5 1991 a
6 1989 b
7 1988 b
8 1996 b
9 1989 b
10 2003 b
11 2002 c
12 2000 c
13 1987 c
14 2001 c
15 2002 c
;;;;
run;


DATA have_b;
input year_2 score group2 $;
datalines;
1999 5 a
1999 2 b
1999 7 c
2004 1 a
2004 4 b
2004 3 c
;;;;
run;
data want;
if _n_ eq 1 then do;
if 0 then set have_b;
  declare hash ha(dataset:'have_b',hashexp:20);
  ha.definekey('year_2','group2');
  ha.definedata('score');
  ha.definedone();
end;
call missing(of _all_);
set have_a;
do i=0 to 100;
year_2=year_1+i;group2=group1;
rc=ha.find();
if rc=0 then leave;

year_2=year_1-i;group2=group1;
rc=ha.find();
if rc=0 then leave;
end;
if rc ne 0 then year_2=.;
drop i rc group2;
run;

PROC Star
Posts: 1,759

Re: Merge datasets based on nearest date and group

Posted in reply to RyanSimmons

The SAS colour parser struggles here.

proc sql;

  create table WANT as

  select  YEAR_2, SCORE, ID, YEAR_1, GROUP1

  from HAVE_A, HAVE_B

  where GROUP2=GROUP1

  group by ID

having min(abs(YEAR_2 - YEAR_1)) =  abs(YEAR_2 - YEAR_1)

  order by ID ;

quit;   


One example where oracle's rank() function would be useful to have in SAS.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 418 views
  • 6 likes
  • 5 in conversation