topic Re: Eliminate duplicates if a condition is satisfied in SAS Procedures

Eliminate duplicates if a condition is satisfied

Aboiron — Tue, 28 Oct 2014 13:58:45 GMT

I want to eliminate duplicates from a database, based on an identifier, an order and a condition.

More precisely, if a condition is met (I can create a variable that equals 1 if it is and 0 otherwise), I would like to select one unique observation per identifier, based by its order (the last one). If this condition is not met then I want to keep all the observations related to this identifier.

Without the condition I can do that (or use a proc sql but that is not point)

proc sort data=have; 
 by identifier descending order; 
run; 

proc sort nudopkey data=have; 
 by identifier; 
run;

But how to incorporate my condition in this ?

For instance, with this dataset

data Test;

input identifier $ order condition;

datalines;

1023 1 0

1023 2 0

1098 1 0

1098 1 1

;

I would like to keep the lines :

-1023 2 0

-1098 1 0

-1098 1 1

Re: Eliminate duplicates if a condition is satisfied

RW9 — Tue, 28 Oct 2014 14:17:46 GMT

Do you mean you have a flag 1 or 0 in your data? If so then you could just add a where flag=1. You will probably have to do two steps, one for those with the condition true -> find max(), and those without.

proc sql;

create table WANT as

select *

from HAVE

where *condition is false*

union all

select *

from HAVE

where *condition is true*

having XYZ=(select max(XYZ) ...);

quit;

You could also do it in datastep, sort by condition max() value, then if first row is true the set output flag to 1 else 0.

Some test data might help.

Re: Eliminate duplicates if a condition is satisfied

Aboiron — Tue, 28 Oct 2014 14:40:25 GMT

added a minimal example.

If I use a where statement in my proc nodupkey, it removes my observations I want to keep

Re: Eliminate duplicates if a condition is satisfied

RW9 — Tue, 28 Oct 2014 15:26:35 GMT

Well simplest I can think of right now is:

data Test;
   input identifier $ order condition;
   datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;
run;

proc sql undo_policy=none;
delete from TEST A
where not exists(select distinct THIS.IDENTIFIER from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER and THIS.CONDITION=1)
and ORDER ne (select max(THIS.ORDER) from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER);
quit;

Re: Eliminate duplicates if a condition is satisfied

Ksharp — Wed, 29 Oct 2014 12:26:26 GMT

data Test; 
   input identifier $ order condition; 
   datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;             
run;
data want(drop=found);
 do until(last.identifier);
  set test;
  by identifier;
  if  condition then found=1;
 end; 
 do until(last.identifier);
  set test;
  by identifier;
  if not found then do; if last.identifier then output;end;
   else output;
 end;
run;

Xia Keshan

Re: Eliminate duplicates if a condition is satisfied

Loko — Wed, 29 Oct 2014 12:39:14 GMT

Hello,

data have (index=(pkey=(identifier condition order)));
   input identifier $ order condition;
   datalines;
2093 1 0
2093 2 0
1098 1 0
1098 1 1
1065 2 0
1065 1 0
1065 1 3
;

data want;
set have;
by identifier condition;
if last.condition;
run;