DATA Step, Macro, Functions and more

question about selectively removing duplicates

Accepted Solution Solved
Reply
Occasional Contributor
Posts: 19
Accepted Solution

question about selectively removing duplicates

[ Edited ]

I have a question about how to remove duplicate rows based on the value of a certain variable. I'm using the example below to explain what I am effectively trying to do.

 

In this dataset, each id has many duplicate values for the variable med. In some of those rows, there is also a value for the variable flag. In those cases, I want to preserve only the rows where there is a value for flag. However, there are other rows for the id and med value where there is NO value for the variable flag (eg, where id = 1 and Med = B); in those cases, I want to just keep all of those rows intact.

 

id  med  flag   seq

1    A       1       1

1    A       .        2

1    A       .        3

1    B       .        1

1    B       .        2

2    A       1       1

2    A       .        2

2    C       .        1

2    C       0       2

2    C       .        3

2    C       .        4

2    D       1       1

 

 

This is the dataset that I want to have as the end result:

id  med  flag   seq

1    A       1       1

1    B       .        1

1    B       .        2

2    A       1       1

2    C       0       2

2    D       1       1

 

Does anyone have any advice?


Accepted Solutions
Solution
2 weeks ago
Occasional Contributor
Posts: 6

Re: question about selectively removing duplicates

[ Edited ]
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;

proc sort data=have out=flag (keep=id med) nodupkey;
  by id med;
  where flag^=.;
run;

data want1;
  merge have flag(in=_in);
  by id med;
  if (_in=1 and flag^=.) or _in=0 then output;
run;


or


data want2;
  set have;

  if _n_=1 then do;
     dcl hash hs( dataset:"have(where=(flag^=.))" );
     hs.definekey("id","med");
     hs.definedone();
  end;

  rc = hs.check();
  if rc =0 and flag^=. then output;
  if rc^=0 then output;
run;

View solution in original post


All Replies
Solution
2 weeks ago
Occasional Contributor
Posts: 6

Re: question about selectively removing duplicates

[ Edited ]
data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;

proc sort data=have out=flag (keep=id med) nodupkey;
  by id med;
  where flag^=.;
run;

data want1;
  merge have flag(in=_in);
  by id med;
  if (_in=1 and flag^=.) or _in=0 then output;
run;


or


data want2;
  set have;

  if _n_=1 then do;
     dcl hash hs( dataset:"have(where=(flag^=.))" );
     hs.definekey("id","med");
     hs.definedone();
  end;

  rc = hs.check();
  if rc =0 and flag^=. then output;
  if rc^=0 then output;
run;

Occasional Contributor
Posts: 19

Re: question about selectively removing duplicates

Thank you so much, amatsu! This worked! Smiley Happy

Super User
Posts: 10,681

Re: question about selectively removing duplicates

data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;
data want;
 do until(last.med);
   set have;
   by id med notsorted;
   if not missing(flag) then yes=1;
 end;
 do until(last.med);
   set have;
   by id med notsorted;
   if yes then do;if not missing(flag) then output;end;
    else output;
 end;
 drop yes;
 run;
PROC Star
Posts: 1,558

Re: question about selectively removing duplicates

data have;
input id med$ flag seq;
cards;
1 A 1 1
1 A . 2
1 A . 3
1 B . 1
1 B . 2
2 A 1 1
2 A . 2
2 C . 1
2 C 0 2
2 C . 3
2 C . 4
2 D 1 1
;

proc sql;
create table want as
select *
from have
group by id,med
having flag=max(flag)
order by id,med,seq;
quit;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 4 replies
  • 129 views
  • 0 likes
  • 4 in conversation