topic Re: Duplicated observations in New SAS User

Duplicated observations

JVargas — Tue, 15 Dec 2020 03:12:44 GMT

How do I remove duplicate observations in a data step?

Re: Duplicated observations

CodingDiSASter — Tue, 15 Dec 2020 03:28:46 GMT

You can use PROC SORT to remove duplicates by a given variable. For a dataset 'have', we can remove duplicate names with:

proc sort data=have nodupkey;
  by name;
run;

The NODUPKEY option keeps only the first observation for each value of the variables named in the BY statement (in this case 'name'). Note that without an OUT= option, PROC SORT replaces the original dataset with the sorted, de-duplicated result. If you want to leave 'have' untouched and write the result to a new dataset (example: a temporary set called 'want'), add the OUT= option:

proc sort data=have nodupkey out=want;
  by name;
run;

Re: Duplicated observations

andreas_lds — Tue, 15 Dec 2020 05:47:50 GMT

See Maxims 7 and 14 in Maxims of Maximally Efficient SAS Programmers.

If you have to use a data step for hardly comprehensible reasons, you either need to sort the data before processing it (unless it is already at least grouped by the variables identifying a duplicate), or you could use a hash object, provided the dataset is not too large: it has to fit into the memory available to your SAS session.

proc sort data= have out= sorted;
  by by_variables;
run;

data want;
  set sorted;
  by by_variables;
  if first.last_by_variable;
run;

If the data you have is grouped by the by-variables, then you can skip sorting and add "notsorted" to the by-statement in the data step.
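The hash-object route mentioned above is not shown in the thread, so here is a minimal sketch of how it might look. It assumes the input dataset is called 'have' and that a single variable 'name' identifies a duplicate (both names are borrowed from the earlier reply, not from the original question). The first observation seen for each key is kept, and no sorting is needed.

```sas
data want;
  /* build the lookup table once, on the first iteration */
  if _n_ = 1 then do;
    declare hash seen();
    seen.defineKey('name');   /* assumed key variable */
    seen.defineDone();
  end;

  set have;

  /* check() returns 0 when the key is already in the hash */
  if seen.check() ne 0 then do;
    seen.add();               /* remember this key */
    output;                   /* keep its first occurrence */
  end;
run;
```

The output keeps the original order of 'have'. All distinct key values are held in memory, so this only works when they fit into the memory available to the session.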

Re: Duplicated observations

Ksharp — Tue, 15 Dec 2020 12:30:20 GMT

proc sort data= have out= sorted;
by by_variables;
run;

/* keep every observation whose by-group has more than one member */
data want_duplicated;
set sorted;
by by_variables;
if not (first.last_by_variable and last.last_by_variable);
run;

Re: Duplicated observations

hswdl01 — Tue, 15 Dec 2020 23:10:25 GMT

You could do this in a PROC SORT step using nodupkey, but if you want to specifically do it in a data step you could run code similar to:

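DATA WORK.want;
	SET	WORK.data;	/* input must already be sorted by var */
	BY	var;
	IF	FIRST.var;	/* keep the first observation of each by-group */
RUN;

That should get rid of the duplicates, provided the dataset is already sorted by var.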

Re: Duplicated observations

morganmetzger — Wed, 16 Dec 2020 00:25:47 GMT

You can use the NODUPKEY option of PROC SORT.

Ex:

PROC SORT DATA=WORK.libname NODUPKEY;
	BY variable;
RUN;