I want to eliminate duplicates from a database, based on an identifier, an order and a condition.
More precisely, if a condition is met (I can create a variable that equals 1 if it is and 0 otherwise), I would like to select one unique observation per identifier, based by its order (the last one). If this condition is not met then I want to keep all the observations related to this identifier.
Without the condition I can do that (or use a proc sql but that is not point)
proc sort data=have;
by identifier descending order;
run;
proc sort nudopkey data=have;
by identifier;
run;
But how to incorporate my condition in this ?
For instance, with this dataset
data Test;
input identifier $ order condition;
datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;
I would like to keep the lines :
-1023 2 0
-1098 1 0
-1098 1 1
Do you mean you have a flag 1 or 0 in your data? If so then you could just add a where flag=1. You will probably have to do two steps, one for those with the condition true -> find max(), and those without.
proc sql;
create table WANT as
select *
from HAVE
where *condition is false*
union all
select *
from HAVE
where *condition is true*
having XYZ=(select max(XYZ) ...);
quit;
You could also do it in datastep, sort by condition max() value, then if first row is true the set output flag to 1 else 0.
Some test data might help.
added a minimal example.
If I use a where statement in my proc nodupkey, it removes my observations I want to keep
Well simplest I can think of right now is:
data Test;
input identifier $ order condition;
datalines;
1023 1 0
1023 2 0
1098 1 0
1098 1 1
;
run;
proc sql undo_policy=none;
delete from TEST A
where not exists(select distinct THIS.IDENTIFIER from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER and THIS.CONDITION=1)
and ORDER ne (select max(THIS.ORDER) from TEST THIS where THIS.IDENTIFIER=A.IDENTIFIER);
quit;
data Test; input identifier $ order condition; datalines; 1023 1 0 1023 2 0 1098 1 0 1098 1 1 ; run; data want(drop=found); do until(last.identifier); set test; by identifier; if condition then found=1; end; do until(last.identifier); set test; by identifier; if not found then do; if last.identifier then output;end; else output; end; run;
Xia Keshan
Hello,
data have (index=(pkey=(identifier condition order)));
input identifier $ order condition;
datalines;
2093 1 0
2093 2 0
1098 1 0
1098 1 1
1065 2 0
1065 1 0
1065 1 3
;
data want;
set have;
by identifier condition;
if last.condition;
run;
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.