I have a question about how to remove duplicate rows based on the value of a certain variable. I'm using the example below to explain what I am effectively trying to do.
In this dataset, each id has many duplicate values for the variable med. In some of those rows, there is also a value for the variable flag. In those cases, I want to preserve only the rows where there is a value for flag. However, there are other rows for the id and med value where there is NO value for the variable flag (eg, where id = 1 and Med = B); in those cases, I want to just keep all of those rows intact.
id med flag seq
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
This is the dataset that I want to have as the end result:
id med flag seq
1 A 1 1
1 B . 1
1 B . 2
2 A 1 1
2 C 0 2
2 D 1 1
Does anyone have any advice?
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
proc sort data=have out=flag (keep=id med) nodupkey;
by id med;
where flag^=.;
run;
data want1;
merge have flag(in=_in);
by id med;
if (_in=1 and flag^=.) or _in=0 then output;
run;
or
data want2;
set have;
if _n_=1 then do;
dcl hash hs( dataset:"have(where=(flag^=.))" );
hs.definekey("id","med");
hs.definedone();
end;
rc = hs.check();
if rc =0 and flag^=. then output;
if rc^=0 then output;
run;
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
proc sort data=have out=flag (keep=id med) nodupkey;
by id med;
where flag^=.;
run;
data want1;
merge have flag(in=_in);
by id med;
if (_in=1 and flag^=.) or _in=0 then output;
run;
or
data want2;
set have;
if _n_=1 then do;
dcl hash hs( dataset:"have(where=(flag^=.))" );
hs.definekey("id","med");
hs.definedone();
end;
rc = hs.check();
if rc =0 and flag^=. then output;
if rc^=0 then output;
run;
Thank you so much, amatsu! This worked! 🙂
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
data want;
do until(last.med);
set have;
by id med notsorted;
if not missing(flag) then yes=1;
end;
do until(last.med);
set have;
by id med notsorted;
if yes then do;if not missing(flag) then output;end;
else output;
end;
drop yes;
run;
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
proc sql;
create table want as
select *
from have
group by id,med
having flag=max(flag)
order by id,med,seq;
quit;
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.