Solved: question about selectively removing duplicates

SarahW13 · Posted 06-05-2018 01:21 AM

I have a question about how to remove duplicate rows based on the value of a certain variable. I'm using the example below to explain what I am effectively trying to do.

In this dataset, each id has many duplicate values for the variable med. In some of those rows, there is also a value for the variable flag. In those cases, I want to preserve only the rows where there is a value for flag. However, there are other rows for the id and med value where there is NO value for the variable flag (eg, where id = 1 and Med = B); in those cases, I want to just keep all of those rows intact.

id med flag seq

1 A 1 1

1 A . 2

1 A . 3

1 B . 1

1 B . 2

2 A 1 1

2 A . 2

2 C . 1

2 C 0 2

2 C . 3

2 C . 4

2 D 1 1

This is the dataset that I want to have as the end result:

id med flag seq

1 A 1 1

1 B . 1

1 B . 2

2 A 1 1

2 C 0 2

2 D 1 1

Does anyone have any advice?

amatsu · Posted 06-05-2018 02:15 AM

data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;

proc sort data=have out=flag (keep=id med) nodupkey;
  by id med;
  where flag^=.;
run;

data want1;
  merge have flag(in=_in);
  by id med;
  if (_in=1 and flag^=.) or _in=0 then output;
run;


or


data want2;
  set have;

  if _n_=1 then do;
     dcl hash hs( dataset:"have(where=(flag^=.))" );
     hs.definekey("id","med");
     hs.definedone();
  end;

  rc = hs.check();
  if rc =0 and flag^=. then output;
  if rc^=0 then output;
run;

View solution in original post

amatsu · Posted 06-05-2018 02:15 AM

data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;

proc sort data=have out=flag (keep=id med) nodupkey;
  by id med;
  where flag^=.;
run;

data want1;
  merge have flag(in=_in);
  by id med;
  if (_in=1 and flag^=.) or _in=0 then output;
run;


or


data want2;
  set have;

  if _n_=1 then do;
     dcl hash hs( dataset:"have(where=(flag^=.))" );
     hs.definekey("id","med");
     hs.definedone();
  end;

  rc = hs.check();
  if rc =0 and flag^=. then output;
  if rc^=0 then output;
run;

SarahW13 · Posted 06-05-2018 02:29 AM

Thank you so much, amatsu! This worked! 🙂

Ksharp · Posted 06-05-2018 08:52 AM

data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
data want;
 do until(last.med);
   set have;
   by id med notsorted;
   if not missing(flag) then yes=1;
 end;
 do until(last.med);
   set have;
   by id med notsorted;
   if yes then do;if not missing(flag) then output;end;
    else output;
 end;
 drop yes;
 run;

novinosrin · Posted 06-05-2018 10:20 AM

data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;

proc sql;
create table want as
select *
from have
group by id,med
having flag=max(flag)
order by id,med,seq;
quit;

question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

Re: question about selectively removing duplicates

Registration is open

SAS Training: Just a Click Away