DATA Step, Macro, Functions and more

Creating a dataset through joins on several ID variables

Reply
Frequent Contributor
Posts: 138

Creating a dataset through joins on several ID variables

Hi,

 

I have two datasets that I am trying to join to one another. The end goal would either be (preferably) a single dataset with all people from Dataset 1 and Dataset 2 with a variable (or variables) telling me which dataset(s) each person was found in, or separate datasets for 1. People found in both Dataset 1 and Dataset 2, 2. People found in just Dataset 1, 3. People found in just Dataset 2.

 

There are two tricky parts: One is that there are several different ID variables on which the datasets could be joined (which all need to be tried as no one, or even combination of two, can cover all the possible links that can be made). The other is that while Dataset 2 is at the person level, Dataset 1 is at the person/group level, and a row from Dataset 1 should only be joined to Dataset 2 if the group in Dataset 2 matches the group in Dataset 1. 

 

For example:

 

Dataset 1:

ID1   ID2    ID3   ID4     ID5     Group.........categ_vars_from_dataset_1

1        .         .       .          .           A                             abc

1        .         .       .          .           B                             def

 .       2         .       .          .           A                             ghi     

 

 

 

 

Dataset 2:

ID1     ID2    ID3   ID4     ID5      Group.........categ_vars_from_dataset_2

1           .        .        .         .             A                         xyz

.           2        .        .         .             A                          jkl

 

 

Desired outcome dataset:

 

ID1   ID2    ID3   ID4     ID5     Group categ_vars_from_dataset_1   categ_vars_from_dataset_2      person_source_datasets

1        .         .       .          .           A                      abc                                          xyz                                       1 and 2

1        .         .       .          .           B                      def                                            .                                               1

.        2         .       .          .           A                      ghi                                           jkl                                          1 and 2

 

The first row of Dataset 1 can be joined to Dataset2 on ID1 and Group, so it has categorical vars from both datasets and the new variable person_source_datasets='1 and 2'. The second row of Dataset1 cannot be joined on any of the ID variables (they have a value for ID1 but their value for Group (B) is not found in the second dataset, so their row in the desired outcome dataset only has the categ vars from Dataset1 and their source dataset variable indicates that they come from Dataset 1 only. The third row of Dataset1 can be joined to Dataset2 on ID2 and Group, so it has categorical vars from both datasets.

 

I do need to try to join on every possible ID, though the order doesn't matter much (except that ID1 should be first as is it is the most populated). As I said, if it's easier to output separate datasets rather than having one summary dataset at the end, that's fine. Most people will have values for more than one ID variable (they can have values for all 5). 

 

Any help is much appreciated. 

Super User
Posts: 10,466

Re: Creating a dataset through joins on several ID variables

There may be a slick way with SQL for this but sometimes a simple process is easier to keep track of. If I understand what you are attempting:

Proc sort data=dataset1; by id group; run;
Proc sort data=dataset2; by id group; run;

data MatchOn1 NoMatch1
   merge
      dataset1 (in=in1)
      dataset2 (in=in2)
   ;
   by id group;
   if In1 and In2 then Match1=1;
   if Match1 then output MatchOn1;
   else output NoMatch1;
run;

data MatchOn2 NoMatch2
   merge
      NoMatch1 (in=in1)
      dataset2 (in=in2)
   ;
   by id group;
   if In1 and In2 then Match2=1;
   if Match2 then output MatchOn2;
   else output NoMatch2;
end;
/* continue for 5 times ...*/
data want;
   set
     MatchOn1 MatchOn2 MatchOn3 MatchOn4 MatchOn5
     NoMatch5 (in=InNoMatch)
   ;
   If InNoMatch then OnlyInSetOne=1;
run;


The MatchOn1 flag would mean the first Id Matches on so on. The OnlyInSetOne=1 would be those that were only in dataset1.

 

You use a single MatchOn variable with different values is so desired.

Trusted Advisor
Posts: 1,115

Re: Creating a dataset through joins on several ID variables

[ Edited ]

Here is a (not-really-slick) PROC SQL approach:

data have1;
input ID1 ID2 ID3 ID4 ID5 Group $ categ_vars_from_dataset_1 $;
cards;
1 . . . . A abc
1 . . . . B def
. 2 . . . A ghi 
;
 
data have2;
input ID1 ID2 ID3 ID4 ID5 Group $ categ_vars_from_dataset_2 $;
cards;
1 . . . . A xyz
. 2 . . . A jkl
. 3 . . . B pqr
;

proc sql;
create table want as
select coalesce(a.id1, b.id1) as id1,
       coalesce(a.id2, b.id2) as id2,
       coalesce(a.id3, b.id3) as id3,
       coalesce(a.id4, b.id4) as id4,
       coalesce(a.id5, b.id5) as id5,
       coalesce(a.group, b.group) as group,
       categ_vars_from_dataset_1, categ_vars_from_dataset_2, 
       catx(' and ', a.source, b.source)
         as person_source_datasets
from (select *, '1' as source from have1) a
     full join
     (select *, '2' as source from have2) b
on   a.group=b.group &
     0 < (cmiss(a.id1, b.id1)=0)*(9-8*(a.id1=b.id1))      /* Each of the five summands is either          */
        +(cmiss(a.id2, b.id2)=0)*(9-8*(a.id2=b.id2))      /* (0 if one of the corresp. IDs is missing) or */
        +(cmiss(a.id3, b.id3)=0)*(9-8*(a.id3=b.id3))      /* (1 if they are non-missing and equal) or     */
        +(cmiss(a.id4, b.id4)=0)*(9-8*(a.id4=b.id4))      /* (9 if they are non-missing and unequal).     */
        +(cmiss(a.id5, b.id5)=0)*(9-8*(a.id5=b.id5)) < 9  /* Hence, the double inequality holds if and    */
order by missing(id1),id1,missing(id2),id2,missing(id3),id3,missing(id4),id4,missing(id5),id5; /* only if */
quit;                                                /* there is at least one summand=1 and no summand=9. */

proc print data=want;
run;

Please note that I've added a third observation to HAVE2 in order to demonstrate what happens to observations from HAVE2 which do not match with any observation from HAVE1. Currently, they are selected (full join). If you don't want to have them in dataset WANT, please change "full join" into "left join".

 

Also, please feel free to specify a different sort order for dataset WANT. My ORDER BY clause puts non-missing values of ID1 first, in ascending order, then missing values of ID1, same with ID2, ..., ID5.

 

Edit: Replaced CMISS by MISSING in the ORDER BY clause to avoid unnecessary warning messages from PROC SQL.

Ask a Question
Discussion stats
  • 2 replies
  • 172 views
  • 0 likes
  • 3 in conversation