02-03-2016 12:28 PM
I have two datasets that I am trying to join to one another. The end goal would either be (preferably) a single dataset with all people from Dataset 1 and Dataset 2 with a variable (or variables) telling me which dataset(s) each person was found in, or separate datasets for 1. People found in both Dataset 1 and Dataset 2, 2. People found in just Dataset 1, 3. People found in just Dataset 2.
There are two tricky parts: One is that there are several different ID variables on which the datasets could be joined (which all need to be tried as no one, or even combination of two, can cover all the possible links that can be made). The other is that while Dataset 2 is at the person level, Dataset 1 is at the person/group level, and a row from Dataset 1 should only be joined to Dataset 2 if the group in Dataset 2 matches the group in Dataset 1.
ID1 ID2 ID3 ID4 ID5 Group.........categ_vars_from_dataset_1
1 . . . . A abc
1 . . . . B def
. 2 . . . A ghi
ID1 ID2 ID3 ID4 ID5 Group.........categ_vars_from_dataset_2
1 . . . . A xyz
. 2 . . . A jkl
Desired outcome dataset:
ID1 ID2 ID3 ID4 ID5 Group categ_vars_from_dataset_1 categ_vars_from_dataset_2 person_source_datasets
1 . . . . A abc xyz 1 and 2
1 . . . . B def . 1
. 2 . . . A ghi jkl 1 and 2
The first row of Dataset 1 can be joined to Dataset2 on ID1 and Group, so it has categorical vars from both datasets and the new variable person_source_datasets='1 and 2'. The second row of Dataset1 cannot be joined on any of the ID variables (they have a value for ID1 but their value for Group (B) is not found in the second dataset, so their row in the desired outcome dataset only has the categ vars from Dataset1 and their source dataset variable indicates that they come from Dataset 1 only. The third row of Dataset1 can be joined to Dataset2 on ID2 and Group, so it has categorical vars from both datasets.
I do need to try to join on every possible ID, though the order doesn't matter much (except that ID1 should be first as is it is the most populated). As I said, if it's easier to output separate datasets rather than having one summary dataset at the end, that's fine. Most people will have values for more than one ID variable (they can have values for all 5).
Any help is much appreciated.
02-03-2016 01:48 PM
There may be a slick way with SQL for this but sometimes a simple process is easier to keep track of. If I understand what you are attempting:
Proc sort data=dataset1; by id group; run; Proc sort data=dataset2; by id group; run; data MatchOn1 NoMatch1 merge dataset1 (in=in1) dataset2 (in=in2) ; by id group; if In1 and In2 then Match1=1; if Match1 then output MatchOn1; else output NoMatch1; run; data MatchOn2 NoMatch2 merge NoMatch1 (in=in1) dataset2 (in=in2) ; by id group; if In1 and In2 then Match2=1; if Match2 then output MatchOn2; else output NoMatch2; end; /* continue for 5 times ...*/ data want; set MatchOn1 MatchOn2 MatchOn3 MatchOn4 MatchOn5 NoMatch5 (in=InNoMatch) ; If InNoMatch then OnlyInSetOne=1; run;
The MatchOn1 flag would mean the first Id Matches on so on. The OnlyInSetOne=1 would be those that were only in dataset1.
You use a single MatchOn variable with different values is so desired.
02-03-2016 04:20 PM - edited 02-03-2016 04:36 PM
Here is a (not-really-slick) PROC SQL approach:
data have1; input ID1 ID2 ID3 ID4 ID5 Group $ categ_vars_from_dataset_1 $; cards; 1 . . . . A abc 1 . . . . B def . 2 . . . A ghi ; data have2; input ID1 ID2 ID3 ID4 ID5 Group $ categ_vars_from_dataset_2 $; cards; 1 . . . . A xyz . 2 . . . A jkl . 3 . . . B pqr ; proc sql; create table want as select coalesce(a.id1, b.id1) as id1, coalesce(a.id2, b.id2) as id2, coalesce(a.id3, b.id3) as id3, coalesce(a.id4, b.id4) as id4, coalesce(a.id5, b.id5) as id5, coalesce(a.group, b.group) as group, categ_vars_from_dataset_1, categ_vars_from_dataset_2, catx(' and ', a.source, b.source) as person_source_datasets from (select *, '1' as source from have1) a full join (select *, '2' as source from have2) b on a.group=b.group & 0 < (cmiss(a.id1, b.id1)=0)*(9-8*(a.id1=b.id1)) /* Each of the five summands is either */ +(cmiss(a.id2, b.id2)=0)*(9-8*(a.id2=b.id2)) /* (0 if one of the corresp. IDs is missing) or */ +(cmiss(a.id3, b.id3)=0)*(9-8*(a.id3=b.id3)) /* (1 if they are non-missing and equal) or */ +(cmiss(a.id4, b.id4)=0)*(9-8*(a.id4=b.id4)) /* (9 if they are non-missing and unequal). */ +(cmiss(a.id5, b.id5)=0)*(9-8*(a.id5=b.id5)) < 9 /* Hence, the double inequality holds if and */ order by missing(id1),id1,missing(id2),id2,missing(id3),id3,missing(id4),id4,missing(id5),id5; /* only if */ quit; /* there is at least one summand=1 and no summand=9. */ proc print data=want; run;
Please note that I've added a third observation to HAVE2 in order to demonstrate what happens to observations from HAVE2 which do not match with any observation from HAVE1. Currently, they are selected (full join). If you don't want to have them in dataset WANT, please change "full join" into "left join".
Also, please feel free to specify a different sort order for dataset WANT. My ORDER BY clause puts non-missing values of ID1 first, in ascending order, then missing values of ID1, same with ID2, ..., ID5.
Edit: Replaced CMISS by MISSING in the ORDER BY clause to avoid unnecessary warning messages from PROC SQL.