BookmarkSubscribeRSS Feed
Aboiron
Calcite | Level 5

I want to eliminate duplicates from a database, based on an identifier, an order and a condition.

More precisely, if a condition is met (I can create a variable that equals 1 if it is and 0 otherwise), I would like to select one unique observation per identifier, based by its order (the last one). If this condition is not met then I want to keep all the observations related to this identifier.

Without the condition I can do that (or use a proc sql but that is not point)

proc sort data=have;

by identifier descending order;

run;

proc sort nudopkey data=have;

by identifier;

run;

But how to incorporate my condition in this ?

For instance, with this dataset

data Test;

   input identifier $ order condition;

   datalines;

1023 1 0

1023 2 0

1098 1 0

1098 1 1

;            

I would like to keep the lines :

-1023 2 0

-1098 1 0

-1098 1 1

5 REPLIES 5
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Do you mean you have a flag 1 or 0 in your data?  If so then you could just add a where flag=1.  You will probably have to do two steps, one for those with the condition true -> find max(), and those without.

proc sql;

     create table WANT as

     select     *

     from       HAVE

     where     *condition is false*

     union all

     select     *

     from       HAVE

     where     *condition is true*

     having    XYZ=(select max(XYZ) ...);

quit;  

You could also do it in datastep, sort by condition max() value, then if first row is true the set output flag to 1 else 0.

Some test data might help.

Aboiron
Calcite | Level 5

added a minimal example.

If I use a where statement in my proc nodupkey, it removes my observations I want to keep

RW9
Diamond | Level 26 RW9
Diamond | Level 26

Well simplest I can think of right now is:

data Test;
   input identifier $ order condition;
   datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;            
run;

proc sql undo_policy=none;
  delete from TEST A
  where not exists(select distinct THIS.IDENTIFIER from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER and THIS.CONDITION=1)
    and ORDER ne (select max(THIS.ORDER) from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER);
quit;

Ksharp
Super User
data Test; 
   input identifier $ order condition; 
   datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;             
run;
data want(drop=found);
 do until(last.identifier);
  set test;
  by identifier;
  if  condition then found=1;
 end; 
 do until(last.identifier);
  set test;
  by identifier;
  if not found then do; if last.identifier then output;end;
   else output;
 end;
run;

Xia Keshan

Loko
Barite | Level 11

Hello,

data have (index=(pkey=(identifier  condition order)));
   input identifier $ order condition;
   datalines;
2093 1 0
2093 2 0
1098 1 0
1098 1 1
1065 2 0
1065 1 0
1065 1 3
;   

data want;
set have;
by identifier condition;
if last.condition;
run;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 4592 views
  • 0 likes
  • 4 in conversation