How to take out duplicated observations in a data step?
You can use a proc sort function to remove duplicates by a given variable. In a given dataset 'have' we can remove duplicate names by:
Proc sort data=have nodupkey;
by name;
run;
The nodupkey lets you name a variable to remove duplicates named in the by statement (in this case 'name'). This code will not change the original dataset, only the output. If you need a new data set (example: a temporary set called 'want') with the duplicates removed, you can add an out statement:
Proc sort data=have nodupkey out=want;
by name;
run;
See Maxim 7 and 14 in Maxims of Maximally Efficient SAS Programmers
If you have to use a data-step due to hardly comprehensible reasons, you either need to sort the data before processing it, if it is not at least grouped by the variable identifying a duplicate, or you could use a hash-object, if the dataset is not to large - it has to fit into memory available to your sas-session.
proc sort data= have out= sorted;
by by_variables;
run;
data want;
set sorted;
by by_variables;
if first.last_by_variable;
run;
If the data you have is grouped by the by-variables, then you can skip sorting and add "notsorted" to the by-statement in the data step.
You could do this in a PROC SORT step using nodupkey, but if you want to specifically do it in a data step you could run code similar to:
DATA WORK.data; BY var; IF FIRST.last_var; RUN;
that should get rid of the duplicates.
You can use the "nodupkey" function.
Ex:
PROC SORT
DATA = WORK.libname NODUPKEY;
BY variable;
RUN;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.