Hello all,
I have a large dataset. Individuals can appear on more than 1 row based on an ID. I need to subset the data to have individuals that are given both treatment A and B.
Example data:
ID treatment
1 A
1 B
1 A
2 A
2 A
3 B
3 B
3 B
3 B
4 A
4 B
5 A
5 A
6 B
Desired Result:
ID treatment
1 A
1 B
1 A
4 A
4 B
Any help is appreciated.
Here's the input version.
DATA have;
INPUT id treatment $;
CARDS;
1 A
1 B
1 A
2 A
2 A
3 B
3 B
3 B
3 B
4 A
4 B
5 A
5 A
6 B
;
RUN;
You could ask PROC SQL to do exactly what you just described.
proc sql;
create table want as
select a.* from have a
where exists (select id from have b where a.id=b.id and b.treatment='A')
and exists (select id from have c where a.id=c.id and c.treatment='B')
;
quit;
Or if there are just two treatments perhaps it is easier to use GROUP BY instead?
proc sql;
create table want2 as
select * from have
group by id
having count(distinct treatment)=2
;
quit;
But if your dataset is really large (and already sorted by ID) then a data step will be faster.
data want3;
merge have(in=in1)
have(in=in2 keep=id treatment rename=(treatment=trtA) where=(trtA='A'))
have(in=in3 keep=id treatment rename=(treatment=trtB) where=(trtB='B'))
;
by id;
if first.id then selected = (in1 and in2 and in3);
retain selected ;
if in1 and selected then output;
drop trtA trtB selected;
run;
Or perhaps a data step with a double DOW loop?
data want4;
do until (last.id);
set have;
by id;
A=sum(A,treatment='A');
B=sum(B,treatment='B');
end;
do until (last.id);
set have;
by id;
if A and B then output;
end;
drop A B ;
run;
You could ask PROC SQL to do exactly what you just described.
proc sql;
create table want as
select a.* from have a
where exists (select id from have b where a.id=b.id and b.treatment='A')
and exists (select id from have c where a.id=c.id and c.treatment='B')
;
quit;
Or if there are just two treatments perhaps it is easier to use GROUP BY instead?
proc sql;
create table want2 as
select * from have
group by id
having count(distinct treatment)=2
;
quit;
But if your dataset is really large (and already sorted by ID) then a data step will be faster.
data want3;
merge have(in=in1)
have(in=in2 keep=id treatment rename=(treatment=trtA) where=(trtA='A'))
have(in=in3 keep=id treatment rename=(treatment=trtB) where=(trtB='B'))
;
by id;
if first.id then selected = (in1 and in2 and in3);
retain selected ;
if in1 and selected then output;
drop trtA trtB selected;
run;
Or perhaps a data step with a double DOW loop?
data want4;
do until (last.id);
set have;
by id;
A=sum(A,treatment='A');
B=sum(B,treatment='B');
end;
do until (last.id);
set have;
by id;
if A and B then output;
end;
drop A B ;
run;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.