Merge many to one by zipcode for two zipcode variables

j_b_s · Posted 02-01-2024 12:29 AM

I am trying to merge two datasets where the first data set contains two zip codes for each individual--their home zip code and their work zip code. The second data set relates each possible zip code to two variables (i.e. RUCA1 and RUCA2). I want to end up with a dataset with variables for RUCA1 and RUCA2 matched on home zip code and RUCA1 and RUCA2 matched on work zip code--creating a total of 4 RUCA variables.

dataset1

id ziphome zipwork

1 a a

2 b c

3 c b

4 d

5 e a

dataset2

zip RUCA1 RUCA2

a 10 10

b 9 9.2

c 8 8.6

d 1 1.1

e 10 10

What I want to end up with in dataset3 is two sets of RUCA1 and RUCA2 variables created for each zipcode (home and work)--matched by zip code. If I merge using a BY statement, my zip code variables won't match between data sets (i.e. ziphome in dataset1 and zip in dataset2).

dataset3

ziphome zipwork RUCA1home RUCA2home RUCA1work RUCA2work

a a 10 10 10 10

b c 9 9.2 8 8.6

c b 8 8.6 9 9.2

d 1 1.1

e a 10 10 10 10

In dataset3, ziphome and zipwork would be retained from dataset1 and RUCA1home and RUCA2home would be created by matching ziphome with zip, and RUCA1work and RUCA2work would be created by matching zipwork with zip.

A clumsy solution would be to duplicate the variable ziphome to a variable named zip in dataset1. Do the merge. Rename RUCA1 and RUCA2 to RUCA1home and RUCA2home. Then drop zip. Then duplicate zipwork to zip in dataset3. Do the merge. Then drop zip from dataset4. And rename RUCA1 and RUCA2 to RUCAwork1 and RUCA2work. But I feel like there should be a cleaner process.

Patrick · Posted 02-01-2024 02:19 AM

Below two options. Option 2 should perform better because it doesn't require any sorting.

data have1;
  infile datalines truncover;
  input id (ziphome zipwork) ($);
  datalines;
1 a a
2 b c
3 c b
4 d 
5 e a
;

data have2;
  infile datalines truncover;
  input zip $ RUCA1 RUCA2;
  datalines;
a 10 10
b 9 9.2
c 8 8.6
d 1 1.1
e 10 10
;

/* option 1 */
proc sql;
  create table want as
  select
    l.id,
    l.ziphome,
    l.zipwork,
    rh.ruca1 as ruca1home,
    rh.ruca2 as ruca2home,
    rw.ruca1 as ruca1work,
    rw.ruca2 as ruca2work

  from have1 l 

  left join have2 rh
  on l.ziphome=rh.zip

  left join have2 rw
  on l.zipwork=rw.zip

  order by l.id
  ;
quit;

/* option 2 */
data want;
  set have1;
  if _n_=1 then
    do;
      if 0 then set have2(keep=zip ruca1 ruca2);
      dcl hash h1(dataset:"have2");
      h1.defineKey("zip");
      h1.defineData("ruca1","ruca2");
      h1.defineDone();
    end;
  if h1.find(key:ziphome)=0 then
    do;
      ruca1home=ruca1;
      ruca2home=ruca2;
    end;
  if h1.find(key:zipwork)=0 then
    do;
      ruca1work=ruca1;
      ruca2work=ruca2;
    end;
  drop zip ruca1 ruca2;
run;

Merge many to one by zipcode for two zipcode variables

Re: Merge many to one by zipcode for two zipcode variables

Registration is open