Solved: proc sort with nonduprecs does not work with data = ds(keep=)

skcussas · Posted 01-10-2019 11:40 AM

Hi everyone,

I'm trying to get the distinct rows for a subset of columns in a data set using PROC SORT.

For example, finding the unique combinations of MAKE, TYPE, and ORIGIN in sashelp.cars.

Here is my code:

proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout noduprec;
by MAKE TYPE;
run;

However, the resulting data set still contains duplicated rows.

Can anyone explain why my code doesn't work as expected?

Thanks for your help.

novinosrin · Posted 01-10-2019 11:46 AM

Try adding origin to your by statement

proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout noduprec;
by MAKE TYPE ORIGIN;
run;

novinosrin · Posted 01-10-2019 11:49 AM

Nodupkey works

proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout nodupkey;
by MAKE TYPE ORIGIN;
run;

skcussas · Posted 01-10-2019 11:58 AM

Thanks, NODUPKEY works!

skcussas · Posted 01-10-2019 11:52 AM

I still get duplicated rows with the modification you suggested.

Reeza · Posted 01-10-2019 11:51 AM

Your data has to be sorted so that duplicate records end up next to each other for NODUPREC to work correctly: it only removes consecutive duplicates. It's an annoying feature, and I thought they'd actually removed it. Otherwise, you can use NODUPKEY to remove duplicates.

This is a common gotcha with NODUPRECS.

elolvido · Posted 03-17-2021 03:51 PM

I really hope they fix this.

Reeza · Posted 03-17-2021 05:20 PM

NODUPRECS is not documented and not recommended. Instead, sort BY _ALL_ and use NODUPKEY.
https://documentation.sas.com/?cdcId=pgmsascdc&cdcVersion=9.4_3.5&docsetId=proc&docsetTarget=p02bhn8...
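
For the original question, the BY _ALL_ plus NODUPKEY suggestion might look like the sketch below. It assumes that _ALL_ in the BY statement resolves only to the variables left after the KEEP= data set option; the output name dsout_all is made up for illustration.

```sas
/* Sketch, not a verified run: keep three columns, then de-duplicate   */
/* on all of them. Assumption: _ALL_ expands to MAKE TYPE ORIGIN here, */
/* because KEEP= is applied when the input data set is opened.         */
proc sort data=sashelp.cars(keep=make type origin)
          out=dsout_all      /* hypothetical output name */
          nodupkey;
  by _all_;                  /* every kept variable is part of the key */
run;
```

Because NODUPKEY compares only the BY variables, making every kept variable a BY variable should give the same result as listing MAKE TYPE ORIGIN explicitly.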

elolvido · Posted 03-18-2021 05:46 PM

sweet, thanks!

data_null__ · Posted 01-10-2019 11:52 AM

Looks like PROC SORT does not honor the KEEP the way one might expect. You can use a view to get what you want.

182  proc sort data=sashelp.cars(keep=MAKE TYPE ORIGIN) out=dsout noduprec;
183     by MAKE TYPE;
184     run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DSOUT has 428 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


185
186  data carsV / view=carsV;
187     set sashelp.cars(keep=MAKE TYPE ORIGIN);
188     run;

NOTE: DATA STEP view saved on file WORK.CARSV.
NOTE: A stored DATA STEP view cannot run under a different operating system.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


189  proc sort data=carsV out=dsout noduprec;
190     by MAKE TYPE;
191     run;

NOTE: There were 428 observations read from the data set WORK.CARSV.
NOTE: View WORK.CARSV.VIEW used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 314 duplicate observations were deleted.
NOTE: The data set WORK.DSOUT has 114 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds

novinosrin · Posted 01-10-2019 11:54 AM

Guru, I honestly didn't see your message. Forgive my near-duplicate reply; yours is distinct in that it uses a view. Sorry.

ChrisNZ · Posted 03-18-2021 08:32 PM

I agree that option NODUPREC must be used carefully, as outlined in the documentation that @Reeza shows us.

The issue of option KEEP= not working as expected, as @data_null__ demonstrates, seems at odds with what this option is supposed to mean.

Shouldn't this be seen as a defect?

@skcussas You should start by reading the documentation before asking a question. This behaviour of option NODUPREC is nothing new.

novinosrin · Posted 01-10-2019 11:53 AM

For NODUPRECS to work, I'm afraid you need a DATA step, as it's apparent that the KEEP= data set option on the PROC SORT DATA= argument doesn't behave as expected.

So here's the workaround:


85   data w;
86   set sashelp.cars;
87   keep make type origin;
88   run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.W has 428 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


89   proc sort data=w out=w1 noduprecs;
90   by _all_;
91   run;

NOTE: There were 428 observations read from the data set WORK.W.
NOTE: 314 duplicate observations were deleted.
NOTE: The data set WORK.W1 has 114 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

Reeza · Posted 01-10-2019 11:56 AM

NODUPREC is not included in the SAS 9.4 documentation for PROC SORT. I do believe it has been removed due to this common issue.

It was likely left in for backward compatibility.

https://documentation.sas.com/?docsetId=proc&docsetTarget=p02bhn81rn4u64n1b6l00ftdnxge.htm&docsetVer...

From the 9.2 version of the documentation, this behaviour is explicitly mentioned:

NODUPRECS

checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to the ones for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

Note: See NODUPKEY for information about eliminating observations with duplicate BY values.

Alias:	NODUP
Interaction:	When you are removing consecutive duplicate observations in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.
Interaction:	The action of NODUPRECS is directly related to the setting of the SORTDUP= system option. When SORTDUP= is set to LOGICAL, NODUPRECS removes duplicate observations based on the examination of the variables that remain after a DROP or KEEP operation on the input data set. Setting SORTDUP=LOGICAL increases the number of duplicate observations that are removed, because it eliminates variables before observation comparisons take place. Also, setting SORTDUP=LOGICAL can improve performance, because dropping variables before sorting reduces the amount of memory required to perform the sort. When SORTDUP= is set to PHYSICAL, NODUPRECS examines all variables in the data set, regardless of whether they have been kept or dropped. For more information about SORTDUP=, see the chapter on SAS system options in SAS Language Reference: Dictionary.
Interaction:	In-database processing does not occur when the NODUPRECS option is specified. However, if the NODUPRECS and NODUPKEY options are specified, system option SQLGENERATION= set for in-database processing, and system option SORTPGM=BEST, the NODUPRECS option is ignored and in-database processing does occur.
Tip:	Use the EQUALS option with the NODUPRECS option for consistent results in your output data sets.
Tip:	Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables.
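
The SORTDUP= interaction quoted above suggests another way to make the original KEEP= code behave. The following is a minimal sketch under two assumptions: that the SORTDUP= system option is still available in your SAS release, and that PHYSICAL is your session's current setting. The output name dsout_logical is made up for illustration.

```sas
/* Sketch: ask NODUPRECS to compare only the variables that remain     */
/* after KEEP=, as the quoted 9.2 documentation describes.             */
options sortdup=logical;

proc sort data=sashelp.cars(keep=make type origin)
          out=dsout_logical   /* hypothetical output name */
          noduprecs;
  by make type origin;        /* sort on every kept variable so that   */
                              /* duplicate records end up adjacent     */
run;

/* Assumption: PHYSICAL was the setting before this step. */
options sortdup=physical;
```

If the quoted interaction holds, this should delete the same duplicates as the view and DATA step workarounds posted in this thread.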

novinosrin · Posted 01-10-2019 12:03 PM

Hi @Reeza and @skcussas, yes, NODUPKEY is the way to go, and it's best to avoid NODUPRECS altogether, as Reeza pointed out. I made the same mistake for a long time until @mkeintz corrected me during a discussion comparing NODUPKEY, SELECT DISTINCT, and NODUPRECS. Hmm, rings a bell 🙂 Cheers from me to Mark.

So,

noduprecs = select distinct *, not select distinct make, type, origin

which apparently means the source table should contain only the variables that define distinctness; for NODUPRECS to work, you can't rely on a data set option applied at execution time. Well, well, what a nit!
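
To make the SELECT DISTINCT comparison concrete, here is a small PROC SQL sketch of the column-level version. The table name dsout_sql is made up for illustration.

```sas
/* Sketch: distinct combinations of the three columns via PROC SQL.    */
/* Only the listed columns take part in the comparison, which is what  */
/* the original KEEP= plus NODUPREC code was expected to do.           */
proc sql;
  create table dsout_sql as   /* hypothetical output name */
  select distinct make, type, origin
  from sashelp.cars;
quit;
```

This should return the same distinct combinations as the NODUPKEY solution, although the row order of a PROC SQL result is not guaranteed without an ORDER BY clause.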
