BookmarkSubscribeRSS Feed
Aboiron
Calcite | Level 5

I want to eliminate duplicates from a database, based on an identifier, an order and a condition.

More precisely, if a condition is met (I can create a variable that equals 1 if it is and 0 otherwise), I would like to select one unique observation per identifier, based by its order (the last one). If this condition is not met then I want to keep all the observations related to this identifier.

Without the condition I can do that (or use a proc sql but that is not point)

proc sort data=have;

by identifier descending order;

run;

proc sort nudopkey data=have;

by identifier;

run;

But how to incorporate my condition in this ?

For instance, with this dataset

data Test;

   input identifier $ order condition;

   datalines;

1023 1 0

1023 2 0

1098 1 0

1098 1 1

;            

I would like to keep the lines :

-1023 2 0

-1098 1 0

-1098 1 1

5 REPLIES 5
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Do you mean you have a flag 1 or 0 in your data?  If so then you could just add a where flag=1.  You will probably have to do two steps, one for those with the condition true -> find max(), and those without.

proc sql;

     create table WANT as

     select     *

     from       HAVE

     where     *condition is false*

     union all

     select     *

     from       HAVE

     where     *condition is true*

     having    XYZ=(select max(XYZ) ...);

quit;  

You could also do it in datastep, sort by condition max() value, then if first row is true the set output flag to 1 else 0.

Some test data might help.

Aboiron
Calcite | Level 5

added a minimal example.

If I use a where statement in my proc nodupkey, it removes my observations I want to keep

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well simplest I can think of right now is:

data Test;
   input identifier $ order condition;
   datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;            
run;

proc sql undo_policy=none;
  delete from TEST A
  where not exists(select distinct THIS.IDENTIFIER from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER and THIS.CONDITION=1)
    and ORDER ne (select max(THIS.ORDER) from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER);
quit;

Ksharp
Super User
data Test; 
   input identifier $ order condition; 
   datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;             
run;
data want(drop=found);
 do until(last.identifier);
  set test;
  by identifier;
  if  condition then found=1;
 end; 
 do until(last.identifier);
  set test;
  by identifier;
  if not found then do; if last.identifier then output;end;
   else output;
 end;
run;

Xia Keshan

Loko
Barite | Level 11

Hello,

data have (index=(pkey=(identifier  condition order)));
   input identifier $ order condition;
   datalines;
2093 1 0
2093 2 0
1098 1 0
1098 1 1
1065 2 0
1065 1 0
1065 1 3
;   

data want;
set have;
by identifier condition;
if last.condition;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 5074 views
  • 0 likes
  • 4 in conversation