Re: Duplicated observations

JVargas · Posted 12-14-2020 10:12 PM

How to take out duplicated observations in a data step?

CodingDiSASter · Posted 12-14-2020 10:28 PM

You can use PROC SORT to remove duplicates by a given variable. For a dataset 'have', duplicate names can be removed with:

Proc sort data=have nodupkey;

by name;

run;

The nodupkey option keeps only one observation (the first one encountered) for each value of the variables named in the by statement (in this case 'name'). Note that without an out= option, PROC SORT replaces the original dataset 'have' with the sorted, de-duplicated result. If you want to leave the original untouched and write the result to a new dataset (example: a temporary dataset called 'want'), add the out= option:

Proc sort data=have nodupkey out=want;

by name;

run;
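
Two related variations, as a hedged sketch rather than a definitive recipe. The dataset name 'removed' is an assumption for illustration. The dupout= option writes the discarded observations to their own dataset, and by _all_ treats an observation as a duplicate only when every variable matches, which may be closer to what "duplicated observations" means in the question:

```sas
/* Keep one row per name in 'want'; the dropped rows go to 'removed' */
/* ('removed' is a hypothetical dataset name).                       */
proc sort data=have out=want dupout=removed nodupkey;
  by name;
run;

/* Drop only rows that are identical in every variable. */
proc sort data=have out=want nodupkey;
  by _all_;
run;
```

Checking 'removed' afterwards is a quick way to confirm that nothing unexpected was thrown away.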

andreas_lds · Posted 12-15-2020 12:47 AM

See Maxims 7 and 14 in Maxims of Maximally Efficient SAS Programmers.

If you have to use a data step for some hardly comprehensible reason, you either need to sort the data before processing it (unless it is already at least grouped by the variables identifying a duplicate), or you could use a hash object, provided the dataset is not too large: it has to fit into the memory available to your SAS session.

proc sort data= have out= sorted;
  by by_variables;
run;

data want;
  set sorted;
  by by_variables;
  if first.last_by_variable;
run;

If the data you have is grouped by the by-variables, then you can skip sorting and add "notsorted" to the by-statement in the data step.
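
The hash-object route mentioned above could look roughly like this. It is a minimal sketch, assuming the duplicate is identified by a single variable 'name' (borrowed from the earlier reply) and that all distinct keys fit in memory; 'seen' is just an illustrative object name. No sorting is needed, and the first occurrence of each key is kept in its original order:

```sas
data want;
  /* Build the lookup table once, on the first iteration. */
  if _n_ = 1 then do;
    declare hash seen();
    seen.defineKey('name');   /* assumed key variable */
    seen.defineDone();
  end;
  set have;
  /* check() returns a non-zero code when the key is not in the table yet. */
  if seen.check() ne 0 then do;
    seen.add();               /* remember this key */
    output;                   /* keep its first occurrence */
  end;
run;
```

For a composite key, list every key variable in defineKey, for example seen.defineKey('name', 'id') with 'id' being a hypothetical second variable.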

Ksharp · Posted 12-15-2020 07:30 AM

proc sort data= have out= sorted;
by by_variables;
run;

/* Keep every observation whose key occurs more than once */
data want_duplicated;
set sorted;
by by_variables;
if not (first.last_by_variable and last.last_by_variable);
run;
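
A possible extension of the same first./last. test, sketched under the assumption that 'sorted' is ordered by a single key variable 'name'. The dataset name 'want_single' is made up for illustration. One pass writes keys that occur exactly once to one dataset and every observation of a repeated key to the other:

```sas
data want_single want_duplicated;
  set sorted;
  by name;
  /* first. and last. are both true only when the key occurs once. */
  if first.name and last.name then output want_single;
  else output want_duplicated;
run;
```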

hswdl01 · Posted 12-15-2020 06:10 PM

You could do this in a PROC SORT step using nodupkey, but if you want to specifically do it in a data step you could run code similar to:

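DATA WORK.want;
	SET	WORK.have;	/* input must already be sorted by var */
	BY	var;
	IF	FIRST.var;	/* keep the first observation for each value of var */
RUN;

That should get rid of the duplicates, provided the input is sorted by var.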

morganmetzger · Posted 12-15-2020 07:25 PM

You can use the NODUPKEY option of PROC SORT.

Ex:

PROC SORT DATA=WORK.have NODUPKEY;

BY variable;

RUN;
