Hello everyone and thanks for your help.
I am combining multiple datasets, I've created a flag for each dataset - e.g., d1, d2, d3 etc.
I have multiple observations per person and each observation can come from a different dataset.
In this particular case, there are 3 possible datasets, but I've limited my sample to only those whose data comes from two different datasets.
The data looks something like:
ID dataset
1 d1
1 d1
1 d1
1 d2
1 d1
2 d1
2 d3
2 d1
3 d2
3 d3
I want to find out how many people have data from d1/d2, how many from d1/d3, and how many from d2/d3.
(I am not interested in knowing how many observations come from each dataset - for example, for person 1 I do not want to know they have 4 observations from d1 and 1 observation from d2 - I just want to know their data is from d1/d2).
I hope that makes sense...
Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3. If that's the case then the program below will scale nicely. Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.
Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.
data have;
input id dataset :$2. @@;
datalines;
1 d1 1 d2 1 d3
2 d3 2 d2
3 d3 3 d1
4 d2 4 d1
5 d3
6 d2
7 d1
run;
data want (drop=dataset _:);
merge have (keep=id dataset where=(dataset='d1') in=in1)
have (keep=id dataset where=(dataset='d2') in=in2)
have (keep=id dataset where=(dataset='d3') in=in3) ;
by id;
array in {3} in1-in3;
if first.id and sum(of in{*})>=2;
array dd {3,3} _dum d1d2 d1d3
_dum _dum d2d3
_dum _dum _dum ;
do _row=1 to dim(in)-1;
do _col=_row+1 to dim(in);
if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
end;
end;
run;
proc means data=want n missing;
var d1d2--d2d3;
run;
Try this one:
data work.want(keep= Id d1d2 d1d3 d2d3);
set work.have;
by id;
length inD1 inD2 inD3 d1d2 d1d3 d2d3 i 3;
array in inD:;
retain inD:;
if first.id then do;
call missing(of inD:);
end;
i = input(substr(dataset, 2), 1.);
in{i} = 1;
if last.id then do;
d1d2 = inD1 and inD2;
d1d3 = inD1 and inD3;
d2d3 = inD2 and inD3;
output;
end;
run;
Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3. If that's the case then the program below will scale nicely. Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.
Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.
data have;
input id dataset :$2. @@;
datalines;
1 d1 1 d2 1 d3
2 d3 2 d2
3 d3 3 d1
4 d2 4 d1
5 d3
6 d2
7 d1
run;
data want (drop=dataset _:);
merge have (keep=id dataset where=(dataset='d1') in=in1)
have (keep=id dataset where=(dataset='d2') in=in2)
have (keep=id dataset where=(dataset='d3') in=in3) ;
by id;
array in {3} in1-in3;
if first.id and sum(of in{*})>=2;
array dd {3,3} _dum d1d2 d1d3
_dum _dum d2d3
_dum _dum _dum ;
do _row=1 to dim(in)-1;
do _col=_row+1 to dim(in);
if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
end;
end;
run;
proc means data=want n missing;
var d1d2--d2d3;
run;
It worked beautifully, thank you!
Is this a better mousetrap? For convenience, I will assume you have up to 9 sources (so the maximum value is "d9") and the length of DATASET is $ 2. If the actual data is different, adjustments can be made.
data want;
length all_sources $ 27;
do until (last.ID);
set have;
by id;
start = input(substr(dataset, 2), 1.);
substr(all_sources, start*3-2, 2) = dataset;
end;
drop start;
run;
proc freq data=want;
tables all_sources;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.