Hello all,
I have a large dataset. Individuals can appear on more than 1 row based on an ID. I need to subset the data to have individuals that are given both treatment A and B.
Example data:
ID treatment
1 A
1 B
1 A
2 A
2 A
3 B
3 B
3 B
3 B
4 A
4 B
5 A
5 A
6 B
Desired Result:
ID treatment
1 A
1 B
1 A
4 A
4 B
Any help is appreciated.
Here's the input version.
DATA have;
INPUT id treatment $;
CARDS;
1 A
1 B
1 A
2 A
2 A
3 B
3 B
3 B
3 B
4 A
4 B
5 A
5 A
6 B
;
RUN;
You could ask PROC SQL to do exactly what you just described.
proc sql;
create table want as
select a.* from have a
where exists (select id from have b where a.id=b.id and b.treatment='A')
and exists (select id from have c where a.id=c.id and c.treatment='B')
;
quit;
Or if there are just two treatments perhaps it is easier to use GROUP BY instead?
proc sql;
create table want2 as
select * from have
group by id
having count(distinct treatment)=2
;
quit;
But if your dataset is really large (and already sorted by ID) then a data step will be faster.
data want3;
merge have(in=in1)
have(in=in2 keep=id treatment rename=(treatment=trtA) where=(trtA='A'))
have(in=in3 keep=id treatment rename=(treatment=trtB) where=(trtB='B'))
;
by id;
if first.id then selected = (in1 and in2 and in3);
retain selected ;
if in1 and selected then output;
drop trtA trtB selected;
run;
Or perhaps a data step with a double DOW loop?
data want4;
do until (last.id);
set have;
by id;
A=sum(A,treatment='A');
B=sum(B,treatment='B');
end;
do until (last.id);
set have;
by id;
if A and B then output;
end;
drop A B ;
run;
You could ask PROC SQL to do exactly what you just described.
proc sql;
create table want as
select a.* from have a
where exists (select id from have b where a.id=b.id and b.treatment='A')
and exists (select id from have c where a.id=c.id and c.treatment='B')
;
quit;
Or if there are just two treatments perhaps it is easier to use GROUP BY instead?
proc sql;
create table want2 as
select * from have
group by id
having count(distinct treatment)=2
;
quit;
But if your dataset is really large (and already sorted by ID) then a data step will be faster.
data want3;
merge have(in=in1)
have(in=in2 keep=id treatment rename=(treatment=trtA) where=(trtA='A'))
have(in=in3 keep=id treatment rename=(treatment=trtB) where=(trtB='B'))
;
by id;
if first.id then selected = (in1 and in2 and in3);
retain selected ;
if in1 and selected then output;
drop trtA trtB selected;
run;
Or perhaps a data step with a double DOW loop?
data want4;
do until (last.id);
set have;
by id;
A=sum(A,treatment='A');
B=sum(B,treatment='B');
end;
do until (last.id);
set have;
by id;
if A and B then output;
end;
drop A B ;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.