I have a question about how to remove duplicate rows based on the value of a certain variable. I'm using the example below to explain what I am effectively trying to do.
In this dataset, each id has many duplicate values for the variable med. In some of those rows, there is also a value for the variable flag. In those cases, I want to preserve only the rows where there is a value for flag. However, there are other rows for the id and med value where there is NO value for the variable flag (eg, where id = 1 and Med = B); in those cases, I want to just keep all of those rows intact.
id med flag seq
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
This is the dataset that I want to have as the end result:
id med flag seq
1 A 1 1
1 B . 1
1 B . 2
2 A 1 1
2 C 0 2
2 D 1 1
Does anyone have any advice?
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
proc sort data=have out=flag (keep=id med) nodupkey;
by id med;
where flag^=.;
run;
data want1;
merge have flag(in=_in);
by id med;
if (_in=1 and flag^=.) or _in=0 then output;
run;
or
data want2;
set have;
if _n_=1 then do;
dcl hash hs( dataset:"have(where=(flag^=.))" );
hs.definekey("id","med");
hs.definedone();
end;
rc = hs.check();
if rc =0 and flag^=. then output;
if rc^=0 then output;
run;
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
proc sort data=have out=flag (keep=id med) nodupkey;
by id med;
where flag^=.;
run;
data want1;
merge have flag(in=_in);
by id med;
if (_in=1 and flag^=.) or _in=0 then output;
run;
or
data want2;
set have;
if _n_=1 then do;
dcl hash hs( dataset:"have(where=(flag^=.))" );
hs.definekey("id","med");
hs.definedone();
end;
rc = hs.check();
if rc =0 and flag^=. then output;
if rc^=0 then output;
run;
Thank you so much, amatsu! This worked! 🙂
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
data want;
do until(last.med);
set have;
by id med notsorted;
if not missing(flag) then yes=1;
end;
do until(last.med);
set have;
by id med notsorted;
if yes then do;if not missing(flag) then output;end;
else output;
end;
drop yes;
run;
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
proc sql;
create table want as
select *
from have
group by id,med
having flag=max(flag)
order by id,med,seq;
quit;
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.