BookmarkSubscribeRSS Feed
JVargas
Fluorite | Level 6

How to take out duplicated observations in a data step?

5 REPLIES 5
CodingDiSASter
Fluorite | Level 6

You can use a proc sort function to remove duplicates by a given variable. In a given dataset 'have' we can remove duplicate names by:

 

Proc sort data=have nodupkey;

by name;

run;

 

The nodupkey lets you name a variable to remove duplicates named in the by statement (in this case 'name'). This code will not change the original dataset, only the output. If you need a new data set (example: a temporary set called 'want') with the duplicates removed, you can add an out statement:

 

Proc sort data=have nodupkey out=want;

by name;

run;

andreas_lds
Jade | Level 19

See Maxim 7 and 14 in Maxims of Maximally Efficient SAS Programmers 

If you have to use a data-step due to hardly comprehensible reasons, you either need to sort the data before processing it, if it is not at least grouped by the variable identifying a duplicate, or you could use a hash-object, if the dataset is not to large - it has to fit into memory available to your sas-session.

proc sort data= have out= sorted;
  by by_variables;
run;

data want;
  set sorted;
  by by_variables;
  if first.last_by_variable;
run;

If the data you have is grouped by the by-variables, then you can skip sorting and add "notsorted" to the by-statement in the data step.

Ksharp
Super User
proc sort data= have out= sorted;
by by_variables;
run;

data want_duplicated ;
set sorted;
by by_variables;
if not (first.last_by_variable and last.last_by_variable ) ;
run;
hswdl01
Fluorite | Level 6

You could do this in a PROC SORT step using nodupkey, but if you want to specifically do it in a data step you could run code similar to:

DATA WORK.data;
	BY	var;
		IF	FIRST.last_var;
	RUN;

that should get rid of the duplicates.

morganmetzger
Calcite | Level 5

You can use the "nodupkey" function.

Ex:

PROC SORT

DATA = WORK.libname NODUPKEY;

BY variable;

RUN;