I want to eliminate duplicates from a database, based on an identifier, an order and a condition.
More precisely, if a condition is met (I can create a variable that equals 1 if it is and 0 otherwise), I would like to select one unique observation per identifier, based by its order (the last one). If this condition is not met then I want to keep all the observations related to this identifier.
Without the condition I can do that (or use a proc sql but that is not point)
proc sort data=have;
by identifier descending order;
run;
proc sort nudopkey data=have;
by identifier;
run;
But how to incorporate my condition in this ?
For instance, with this dataset
data Test;
input identifier $ order condition;
datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;
I would like to keep the lines :
-1023 2 0
-1098 1 0
-1098 1 1
Do you mean you have a flag 1 or 0 in your data? If so then you could just add a where flag=1. You will probably have to do two steps, one for those with the condition true -> find max(), and those without.
proc sql;
create table WANT as
select *
from HAVE
where *condition is false*
union all
select *
from HAVE
where *condition is true*
having XYZ=(select max(XYZ) ...);
quit;
You could also do it in datastep, sort by condition max() value, then if first row is true the set output flag to 1 else 0.
Some test data might help.
added a minimal example.
If I use a where statement in my proc nodupkey, it removes my observations I want to keep
Well simplest I can think of right now is:
data Test;
input identifier $ order condition;
datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;
run;
proc sql undo_policy=none;
delete from TEST A
where not exists(select distinct THIS.IDENTIFIER from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER and THIS.CONDITION=1)
and ORDER ne (select max(THIS.ORDER) from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER);
quit;
data Test; input identifier $ order condition; datalines; 1023 1 0 1023 2 0 1098 1 0 1098 1 1 ; run; data want(drop=found); do until(last.identifier); set test; by identifier; if condition then found=1; end; do until(last.identifier); set test; by identifier; if not found then do; if last.identifier then output;end; else output; end; run;
Xia Keshan
Hello,
data have (index=(pkey=(identifier condition order)));
input identifier $ order condition;
datalines;
2093 1 0
2093 2 0
1098 1 0
1098 1 1
1065 2 0
1065 1 0
1065 1 3
;
data want;
set have;
by identifier condition;
if last.condition;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.