How to take out duplicated observations in a data step?
You can use a proc sort function to remove duplicates by a given variable. In a given dataset 'have' we can remove duplicate names by:
Proc sort data=have nodupkey;
by name;
run;
The nodupkey lets you name a variable to remove duplicates named in the by statement (in this case 'name'). This code will not change the original dataset, only the output. If you need a new data set (example: a temporary set called 'want') with the duplicates removed, you can add an out statement:
Proc sort data=have nodupkey out=want;
by name;
run;
See Maxim 7 and 14 in Maxims of Maximally Efficient SAS Programmers
If you have to use a data-step due to hardly comprehensible reasons, you either need to sort the data before processing it, if it is not at least grouped by the variable identifying a duplicate, or you could use a hash-object, if the dataset is not to large - it has to fit into memory available to your sas-session.
proc sort data= have out= sorted;
by by_variables;
run;
data want;
set sorted;
by by_variables;
if first.last_by_variable;
run;
If the data you have is grouped by the by-variables, then you can skip sorting and add "notsorted" to the by-statement in the data step.
You could do this in a PROC SORT step using nodupkey, but if you want to specifically do it in a data step you could run code similar to:
DATA WORK.data; BY var; IF FIRST.last_var; RUN;
that should get rid of the duplicates.
You can use the "nodupkey" function.
Ex:
PROC SORT
DATA = WORK.libname NODUPKEY;
BY variable;
RUN;
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.