Hello everyone and thanks for your help.
I am combining multiple datasets, I've created a flag for each dataset - e.g., d1, d2, d3 etc.
I have multiple observations per person and each observation can come from a different dataset.
In this particular case, there are 3 possible datasets, but I've limited my sample to only those whose data comes from two different datasets.
The data looks something like:
ID dataset
1 d1
1 d1
1 d1
1 d2
1 d1
2 d1
2 d3
2 d1
3 d2
3 d3
I want to find out how many people have data from d1/d2, how many from d1/d3, and how many from d2/d3.
(I am not interested in knowing how many observations come from each dataset - for example, for person 1 I do not want to know they have 4 observations from d1 and 1 observation from d2 - I just want to know their data is from d1/d2).
I hope that makes sense...
Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3. If that's the case then the program below will scale nicely. Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.
Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.
data have;
input id dataset :$2. @@;
datalines;
1 d1 1 d2 1 d3
2 d3 2 d2
3 d3 3 d1
4 d2 4 d1
5 d3
6 d2
7 d1
run;
data want (drop=dataset _:);
merge have (keep=id dataset where=(dataset='d1') in=in1)
have (keep=id dataset where=(dataset='d2') in=in2)
have (keep=id dataset where=(dataset='d3') in=in3) ;
by id;
array in {3} in1-in3;
if first.id and sum(of in{*})>=2;
array dd {3,3} _dum d1d2 d1d3
_dum _dum d2d3
_dum _dum _dum ;
do _row=1 to dim(in)-1;
do _col=_row+1 to dim(in);
if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
end;
end;
run;
proc means data=want n missing;
var d1d2--d2d3;
run;
Try this one:
data work.want(keep= Id d1d2 d1d3 d2d3);
set work.have;
by id;
length inD1 inD2 inD3 d1d2 d1d3 d2d3 i 3;
array in inD:;
retain inD:;
if first.id then do;
call missing(of inD:);
end;
i = input(substr(dataset, 2), 1.);
in{i} = 1;
if last.id then do;
d1d2 = inD1 and inD2;
d1d3 = inD1 and inD3;
d2d3 = inD2 and inD3;
output;
end;
run;
Given you said "multiple" datasets, it seems you might be dealing with several datasets, not just 2 or 3. If that's the case then the program below will scale nicely. Using the "if first.id and sum(of in{*})>=2 keeps only one record per id, and only those id's present in 2 or more datasets.
Using the 2-dimensional array DD provides a convenient way to compactly assign all 2-dataset indicators.
data have;
input id dataset :$2. @@;
datalines;
1 d1 1 d2 1 d3
2 d3 2 d2
3 d3 3 d1
4 d2 4 d1
5 d3
6 d2
7 d1
run;
data want (drop=dataset _:);
merge have (keep=id dataset where=(dataset='d1') in=in1)
have (keep=id dataset where=(dataset='d2') in=in2)
have (keep=id dataset where=(dataset='d3') in=in3) ;
by id;
array in {3} in1-in3;
if first.id and sum(of in{*})>=2;
array dd {3,3} _dum d1d2 d1d3
_dum _dum d2d3
_dum _dum _dum ;
do _row=1 to dim(in)-1;
do _col=_row+1 to dim(in);
if in{_row}=1 and in{_col}=1 then dd{_row,_col}=1;
end;
end;
run;
proc means data=want n missing;
var d1d2--d2d3;
run;
It worked beautifully, thank you!
Is this a better mousetrap? For convenience, I will assume you have up to 9 sources (so the maximum value is "d9") and the length of DATASET is $ 2. If the actual data is different, adjustments can be made.
data want;
length all_sources $ 27;
do until (last.ID);
set have;
by id;
start = input(substr(dataset, 2), 1.);
substr(all_sources, start*3-2, 2) = dataset;
end;
drop start;
run;
proc freq data=want;
tables all_sources;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.