Hello all,
I have a large dataset. Individuals can appear on more than 1 row based on an ID. I need to subset the data to have individuals that are given both treatment A and B.
Example data:
ID treatment
1 A
1 B
1 A
2 A
2 A
3 B
3 B
3 B
3 B
4 A
4 B
5 A
5 A
6 B
Desired Result:
ID treatment
1 A
1 B
1 A
4 A
4 B
Any help is appreciated.
Here's the input version.
DATA have;
INPUT id treatment $;
CARDS;
1 A
1 B
1 A
2 A
2 A
3 B
3 B
3 B
3 B
4 A
4 B
5 A
5 A
6 B
;
RUN;
You could ask PROC SQL to do exactly what you just described.
proc sql;
create table want as
select a.* from have a
where exists (select id from have b where a.id=b.id and b.treatment='A')
and exists (select id from have c where a.id=c.id and c.treatment='B')
;
quit;
Or if there are just two treatments perhaps it is easier to use GROUP BY instead?
proc sql;
create table want2 as
select * from have
group by id
having count(distinct treatment)=2
;
quit;
But if your dataset is really large (and already sorted by ID) then a data step will be faster.
data want3;
merge have(in=in1)
have(in=in2 keep=id treatment rename=(treatment=trtA) where=(trtA='A'))
have(in=in3 keep=id treatment rename=(treatment=trtB) where=(trtB='B'))
;
by id;
if first.id then selected = (in1 and in2 and in3);
retain selected ;
if in1 and selected then output;
drop trtA trtB selected;
run;
Or perhaps a data step with a double DOW loop?
data want4;
do until (last.id);
set have;
by id;
A=sum(A,treatment='A');
B=sum(B,treatment='B');
end;
do until (last.id);
set have;
by id;
if A and B then output;
end;
drop A B ;
run;
You could ask PROC SQL to do exactly what you just described.
proc sql;
create table want as
select a.* from have a
where exists (select id from have b where a.id=b.id and b.treatment='A')
and exists (select id from have c where a.id=c.id and c.treatment='B')
;
quit;
Or if there are just two treatments perhaps it is easier to use GROUP BY instead?
proc sql;
create table want2 as
select * from have
group by id
having count(distinct treatment)=2
;
quit;
But if your dataset is really large (and already sorted by ID) then a data step will be faster.
data want3;
merge have(in=in1)
have(in=in2 keep=id treatment rename=(treatment=trtA) where=(trtA='A'))
have(in=in3 keep=id treatment rename=(treatment=trtB) where=(trtB='B'))
;
by id;
if first.id then selected = (in1 and in2 and in3);
retain selected ;
if in1 and selected then output;
drop trtA trtB selected;
run;
Or perhaps a data step with a double DOW loop?
data want4;
do until (last.id);
set have;
by id;
A=sum(A,treatment='A');
B=sum(B,treatment='B');
end;
do until (last.id);
set have;
by id;
if A and B then output;
end;
drop A B ;
run;
Available on demand!
Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Select SAS Training centers are offering in-person courses. View upcoming courses for: