Solved: Merging with Two but Imperfect Identifiers

Junyong · Posted 04-18-2019 03:15 PM

Think of two arbitrary panel data sets as follows.

Panel A Panel B

i j t x i j t y

1 10 1 0.54 1 10 1 0.32

1 10 2 0.55 1 10 2 0.50

1 10 3 0.90 1 10 3 0.71

2 20 1 0.03 2 . 1 0.22

2 20 2 0.52 2 . 2 0.69

2 20 3 0.22 2 . 3 0.81

3 30 1 0.18 3 30 1 0.83

3 30 2 0.14 3 30 2 0.81

3 30 3 0.46 3 30 3 0.51

. 40 1 0.21 4 40 1 0.14

. 40 2 0.19 4 40 2 0.42

. 40 3 0.74 4 40 3 0.95

5 50 1 0.10 5 50 1 0.38

5 50 2 0.14 5 50 2 0.45

5 50 3 0.59 5 50 3 0.39

As shown, two sets have three identifiers each—i and j for individuals, and t for time periods. Though both i and j capture the individuals, some are randomly missing. Fortunately, the second individual could be identified with i=2, and the fourth one could be with j=40, but what is the best way in SAS? For example, (1) DATA MERGE BY i will miss the fourth individual, (2) DATA MERGE BY j will miss the second one, and (3) DATA MERGE BY i j will miss both. As the real data sets are unbalanced and have more intermingled index structures, I cannot simply merge just with MERGE (without BY). Thanks for any help in advance.

Astounding · Posted 04-18-2019 04:19 PM

I would recommend splitting your data first:

data panel_a_both 
     panel_a_missing_i (drop=i)
     panel_a_missing_j (drop=j);
   set panel_a;
   if i=. then output panel_a_missing_i;
   else if j=. then output panel_a_missing_j;
   else output panel_a_both;
run;

If you do this for both PANEL_A and PANEL_B, then you can merge as needed. For example:

PANEL_A_BOTH + PANEL_B_BOTH by i j
PANEL_A_MISSING_I + PANEL_B_BOTH by j

You will need a few more merges to come up with all the necessary matches.

As long as one missing piece is available in the other data set, you should be able to retrieve it in this way. There is a danger, however, that the matches will be wrong. For example, is it possible that J=20 occurs not only for I=2, but also for some other value of I? How do you know what the right match is in that case?

View solution in original post

Reeza · Posted 04-18-2019 03:28 PM

Is there a way to deduce what the values should be before the merge and fix it?

Junyong · Posted 04-18-2019 03:43 PM

Two different sources provide some shared identifiers but neither is complete, so I need to do deduce in SAS. As it is the first time, I only have a blueprint that (1) IF i^=. AND j^=. MERGE BY i AND j, (2) ELSE IF i^=. AND j=. MERGE BY i, and (3) ELSE IF i=. AND j^=. MERGE BY j, leaving the rest with i=. and j=. unidentified, but wonder whether this approach is best.

cosmid · Posted 04-18-2019 03:48 PM

From your sample data, it looks like you can do it with a set instead of a merge.

data Panel_AB;

set PanelA;

set PanelB;

run;

Shmuel · Posted 04-18-2019 04:08 PM

It seems that i and j are synonyms.

Try next code:

data temp;
  merge A (where=(i  ^= .))
             B;
   by i t;
run;

data want;
 merge b (where =( j ^= .))
       temp;
  by j t;
run;

Astounding · Posted 04-18-2019 04:19 PM

I would recommend splitting your data first:

data panel_a_both 
     panel_a_missing_i (drop=i)
     panel_a_missing_j (drop=j);
   set panel_a;
   if i=. then output panel_a_missing_i;
   else if j=. then output panel_a_missing_j;
   else output panel_a_both;
run;

If you do this for both PANEL_A and PANEL_B, then you can merge as needed. For example:

PANEL_A_BOTH + PANEL_B_BOTH by i j
PANEL_A_MISSING_I + PANEL_B_BOTH by j

You will need a few more merges to come up with all the necessary matches.

As long as one missing piece is available in the other data set, you should be able to retrieve it in this way. There is a danger, however, that the matches will be wrong. For example, is it possible that J=20 occurs not only for I=2, but also for some other value of I? How do you know what the right match is in that case?

Junyong · Posted 04-18-2019 06:01 PM

The data provider says the identifiers are unique across individuals anyway, so I am assuming though did not double check.

PGStats · Posted 04-18-2019 04:34 PM

You could do this:

data pa;
input i     j     t     x;
datalines;
1    10     1  0.54         
1    10     2  0.55         
1    10     3  0.90         
2    20     1  0.03         
2    20     2  0.52         
2    20     3  0.22         
3    30     1  0.18         
3    30     2  0.14         
3    30     3  0.46         
.    40     1  0.21         
.    40     2  0.19         
.    40     3  0.74         
5    50     1  0.10         
5    50     2  0.14         
5    50     3  0.59      
;

data pb;
input i     j     t     y;
datalines;
 1    10     1  0.32
 1    10     2  0.50
 1    10     3  0.71
 2     .     1  0.22
 2     .     2  0.69
 2     .     3  0.81
 3    30     1  0.83
 3    30     2  0.81
 3    30     3  0.51
 4    40     1  0.14
 4    40     2  0.42
 4    40     3  0.95
 5    50     1  0.38
 5    50     2  0.45
 5    50     3  0.39
;

proc sql;
select * 
from
(select pa.*, y 
 from pa inner join 
      pb on pa.i=pb.i and pa.j=pb.j and pa.t=pb.t)
union all corr
(select pa.i, pa.t, coalesce(pa.j, pb.j) as j, x, y 
 from pa inner join 
      pb on pa.i=pb.i and pa.t=pb.t
 where cmiss(pa.j, pb.j) = 1)
union all corr
(select coalesce(pa.i, pb.i) as i, pa.j, pa.t, x, y 
 from pa inner join 
      pb on pa.j=pb.j and pa.t=pb.t
 where cmiss(pa.i, pb.i) = 1)
order by i, j, t;
quit;

this solution assumes that at least one of the two datasets contains both i and j, as provided in your test data.

PG

Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Re: Merging with Two but Imperfect Identifiers

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!