Hi everyone,
I'm trying to find the non duplicated rows of certain columns from a data set using proc sort.
For example, finding the unique combination of MAKE, TYPE, and ORIGIN in sashelp.cars
Here is my code
proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout noduprec;
by MAKE TYPE;
run;
However, the resulting data set still contains duplicated rows.
Can anyone explain to me why my code doesn't work as expect?
Thanks for your help.
Nodupkey works
proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout nodupkey;
by MAKE TYPE ORIGIN;
run;
Try adding origin to your by statement
proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout noduprec;
by MAKE TYPE ORIGIN;
run;
Nodupkey works
proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout nodupkey;
by MAKE TYPE ORIGIN;
run;
I really hope they fix this.
Looks like PROC SORT does not honor the KEEP the way one might expect. You can use a view to get what you want.
182 proc sort data=sashelp.cars(keep=MAKE TYPE ORIGIN) out=dsout noduprec;
183 by MAKE TYPE;
184 run;
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DSOUT has 428 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
185
186 data carsV / view=carsV;
187 set sashelp.cars(keep=MAKE TYPE ORIGIN);
188 run;
NOTE: DATA STEP view saved on file WORK.CARSV.
NOTE: A stored DATA STEP view cannot run under a different operating system.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
189 proc sort data=carsV out=dsout noduprec;
190 by MAKE TYPE;
191 run;
NOTE: There were 428 observations read from the data set WORK.CARSV.
NOTE: View WORK.CARSV.VIEW used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 314 duplicate observations were deleted.
NOTE: The data set WORK.DSOUT has 114 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
Guru , I honestly didn't see your message. Forgive me for the near duplicate message besides your view making it distinct. Sorry
I agree that option NODUPREC must be used carefully, as @Reeza show us is outlined in the documentation.
The issues of option KEEP= not working as expected, as @data_null__ demonstrates, seems at odds with what this option is supposed to mean.
Shouldn't this be seen as a defect?
@skcussas You should start by reading the documentation before asking a question. This behaviour of option NODUPREC is nothing new.
For noduprecs to work, I am afraid you need a datastep as it's apparent keep= dataset option in the proc sort data= doesn;t compile as expected.
So here's the work around
85 data w;
86 set sashelp.cars;
87 keep make type origin;
88 run;
NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.W has 428 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
89 proc sort data=w out=w1 noduprecs;
90 by _all_;
91 run;
NOTE: There were 428 observations read from the data set WORK.W.
NOTE: 314 duplicate observations were deleted.
NOTE: The data set WORK.W1 has 114 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
NODUPREC is not included in the SAS 9.4 documentation for PROC SORT. I do believe it has been removed due to this common issue.
It was likely left in for backward compatibility.
From the 9.2 version of the documentation, this behaviour is explicitly mentioned:
NODUPRECS
checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to the ones for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.
Note: See NODUPKEY for information about eliminating observations with duplicate BY values.
Alias : | NODUP |
Interaction: | When you are removing consecutive duplicate observations in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed. |
Interaction: | The action of NODUPRECS is directly related to the setting of the SORTDUP= system option. When SORTDUP= is set to LOGICAL, NODUPRECS removes duplicate observations based on the examination of the variables that remain after a DROP or KEEP operation on the input data set. Setting SORTDUP=LOGICAL increases the number of duplicate observations that are removed, because it eliminates variables before observation comparisons take place. Also, setting SORTDUP=LOGICAL can improve performance, because dropping variables before sorting reduces the amount of memory required to perform the sort. When SORTDUP= is set to PHYSICAL, NODUPRECS examines all variables in the data set, regardless of whether they have been kept or dropped. For more information about SORTDUP=, see the chapter on SAS system options in SAS Language Reference: Dictionary. |
Interaction: | In-database processing does not occur when the NODUPRECS option is specified. However, if the NODUPRECS and NODUPKEY options are specified, system option SQLGENERATION= set for in-database processing, and system option SORTPGM=BEST, the NODUPRECS option is ignored and in-database processing does occur. |
Tip: | Use the EQUALS option with the NODUPRECS option for consistent results in your output data sets. |
Tip: | Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables. |
HI @Reeza and @skcussas, Yes Nodupkey is the way and best is to avoid noduprecs at all times like Reeza pointed out. I was making that mistake too for a long time until @mkeintz corrected me and made me learn the nit when we were having a discussion comparing nodupkey, select distinct and noduprecs in tandem. Hmm rings the bell 🙂 Cheers from me to Mark
So ,
noduprecs = select distinct * and not select distinct make ,type, origin
which apparently means the from table should only have the vars making it distinct and cannot support dataset option at execution time for noduprecs to work. Well, well what a nit!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.