BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

Think of two arbitrary panel data sets as follows.

 

Panel A                      Panel B

i     j     t     x          i     j     t     y

1    10     1  0.54          1    10     1  0.32

1    10     2  0.55          1    10     2  0.50

1    10     3  0.90          1    10     3  0.71

2    20     1  0.03          2     .     1  0.22

2    20     2  0.52          2     .     2  0.69

2    20     3  0.22          2     .     3  0.81

3    30     1  0.18          3    30     1  0.83

3    30     2  0.14          3    30     2  0.81

3    30     3  0.46          3    30     3  0.51

.    40     1  0.21          4    40     1  0.14

.    40     2  0.19          4    40     2  0.42

.    40     3  0.74          4    40     3  0.95

5    50     1  0.10          5    50     1  0.38

5    50     2  0.14          5    50     2  0.45

5    50     3  0.59          5    50     3  0.39

 

As shown, two sets have three identifiers each—i and j for individuals, and t for time periods. Though both i and j capture the individuals, some are randomly missing. Fortunately, the second individual could be identified with i=2, and the fourth one could be with j=40, but what is the best way in SAS? For example, (1) DATA MERGE BY i will miss the fourth individual, (2) DATA MERGE BY j will miss the second one, and (3) DATA MERGE BY i j will miss both. As the real data sets are unbalanced and have more intermingled index structures, I cannot simply merge just with MERGE (without BY). Thanks for any help in advance.

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

I would recommend splitting your data first:

 

data panel_a_both 
     panel_a_missing_i (drop=i)
     panel_a_missing_j (drop=j);
   set panel_a;
   if i=. then output panel_a_missing_i;
   else if j=. then output panel_a_missing_j;
   else output panel_a_both;
run;

If you do this for both PANEL_A and PANEL_B, then you can merge as needed.  For example:

 

  • PANEL_A_BOTH + PANEL_B_BOTH by i j
  • PANEL_A_MISSING_I + PANEL_B_BOTH by j

You will need a few more merges to come up with all the necessary matches.

 

As long as one missing piece is available in the other data set, you should be able to retrieve it in this way.  There is a danger, however, that the matches will be wrong.  For example, is it possible that J=20 occurs not only for I=2, but also for some other value of I?  How do you know what the right match is in that case?

View solution in original post

7 REPLIES 7
Reeza
Super User
Is there a way to deduce what the values should be before the merge and fix it?
Junyong
Pyrite | Level 9
Two different sources provide some shared identifiers but neither is complete, so I need to do deduce in SAS. As it is the first time, I only have a blueprint that (1) IF i^=. AND j^=. MERGE BY i AND j, (2) ELSE IF i^=. AND j=. MERGE BY i, and (3) ELSE IF i=. AND j^=. MERGE BY j, leaving the rest with i=. and j=. unidentified, but wonder whether this approach is best.
cosmid
Lapis Lazuli | Level 10

From your sample data, it looks like you can do it with a set instead of a merge.

 

data Panel_AB;

 set PanelA;

 set PanelB;

run;

Shmuel
Garnet | Level 18

It seems that i and j are synonyms.

Try next code:

data temp;
  merge A (where=(i  ^= .))
             B;
   by i t;
run;

data want;
 merge b (where =( j ^= .))
       temp;
  by j t;
run;
Astounding
PROC Star

I would recommend splitting your data first:

 

data panel_a_both 
     panel_a_missing_i (drop=i)
     panel_a_missing_j (drop=j);
   set panel_a;
   if i=. then output panel_a_missing_i;
   else if j=. then output panel_a_missing_j;
   else output panel_a_both;
run;

If you do this for both PANEL_A and PANEL_B, then you can merge as needed.  For example:

 

  • PANEL_A_BOTH + PANEL_B_BOTH by i j
  • PANEL_A_MISSING_I + PANEL_B_BOTH by j

You will need a few more merges to come up with all the necessary matches.

 

As long as one missing piece is available in the other data set, you should be able to retrieve it in this way.  There is a danger, however, that the matches will be wrong.  For example, is it possible that J=20 occurs not only for I=2, but also for some other value of I?  How do you know what the right match is in that case?

Junyong
Pyrite | Level 9
The data provider says the identifiers are unique across individuals anyway, so I am assuming though did not double check.
PGStats
Opal | Level 21

You could do this:

 

data pa;
input i     j     t     x;
datalines;
1    10     1  0.54         
1    10     2  0.55         
1    10     3  0.90         
2    20     1  0.03         
2    20     2  0.52         
2    20     3  0.22         
3    30     1  0.18         
3    30     2  0.14         
3    30     3  0.46         
.    40     1  0.21         
.    40     2  0.19         
.    40     3  0.74         
5    50     1  0.10         
5    50     2  0.14         
5    50     3  0.59      
;

data pb;
input i     j     t     y;
datalines;
 1    10     1  0.32
 1    10     2  0.50
 1    10     3  0.71
 2     .     1  0.22
 2     .     2  0.69
 2     .     3  0.81
 3    30     1  0.83
 3    30     2  0.81
 3    30     3  0.51
 4    40     1  0.14
 4    40     2  0.42
 4    40     3  0.95
 5    50     1  0.38
 5    50     2  0.45
 5    50     3  0.39
;

proc sql;
select * 
from
(select pa.*, y 
 from pa inner join 
      pb on pa.i=pb.i and pa.j=pb.j and pa.t=pb.t)
union all corr
(select pa.i, pa.t, coalesce(pa.j, pb.j) as j, x, y 
 from pa inner join 
      pb on pa.i=pb.i and pa.t=pb.t
 where cmiss(pa.j, pb.j) = 1)
union all corr
(select coalesce(pa.i, pb.i) as i, pa.j, pa.t, x, y 
 from pa inner join 
      pb on pa.j=pb.j and pa.t=pb.t
 where cmiss(pa.i, pb.i) = 1)
order by i, j, t;
quit;

this solution assumes that at least one of the two datasets contains both i and j, as provided in your test data.

PG

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 814 views
  • 1 like
  • 6 in conversation