BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
skcussas
Fluorite | Level 6

Hi everyone,

 

I'm trying to find the non duplicated rows of certain columns from a data set using proc sort.

For example, finding the unique combination of MAKE, TYPE, and ORIGIN in sashelp.cars

 

Here is my code 

 

proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout noduprec;
by MAKE TYPE;
run;

 

However, the resulting data set still contains duplicated rows.

Can anyone explain to me why my code doesn't work as expect?

Thanks for your help.

1 ACCEPTED SOLUTION

Accepted Solutions
novinosrin
Tourmaline | Level 20

Nodupkey works 

 

proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout nodupkey;
by MAKE TYPE ORIGIN;
run;

View solution in original post

15 REPLIES 15
novinosrin
Tourmaline | Level 20

Try adding origin to your by statement

 

proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout noduprec;
by MAKE TYPE ORIGIN;
run;
novinosrin
Tourmaline | Level 20

Nodupkey works 

 

proc sort data=sashelp.cars (keep=MAKE TYPE ORIGIN) out=dsout nodupkey;
by MAKE TYPE ORIGIN;
run;
skcussas
Fluorite | Level 6
Thanks nodupkeys works!
skcussas
Fluorite | Level 6
I still get duplicated rows with the modification you suggest
Reeza
Super User
Your data has to be sorted twice for NODUPREC to work correctly - it only removes duplicates in order. It's an annoying feature, I thought they'd actually removed it. Otherwise, you can use NODUPKEY to remove duplicates.

This is a common gotcha with NODUPRECS.
elolvido
Fluorite | Level 6

I really hope they fix this.

data_null__
Jade | Level 19

Looks like PROC SORT does not honor the KEEP the way one might expect.  You can use a view to get what you want.

 

182  proc sort data=sashelp.cars(keep=MAKE TYPE ORIGIN) out=dsout noduprec;
183     by MAKE TYPE;
184     run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.DSOUT has 428 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


185
186  data carsV / view=carsV;
187     set sashelp.cars(keep=MAKE TYPE ORIGIN);
188     run;

NOTE: DATA STEP view saved on file WORK.CARSV.
NOTE: A stored DATA STEP view cannot run under a different operating system.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


189  proc sort data=carsV out=dsout noduprec;
190     by MAKE TYPE;
191     run;

NOTE: There were 428 observations read from the data set WORK.CARSV.
NOTE: View WORK.CARSV.VIEW used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: 314 duplicate observations were deleted.
NOTE: The data set WORK.DSOUT has 114 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.03 seconds
      cpu time            0.01 seconds
novinosrin
Tourmaline | Level 20

Guru , I honestly didn't see your message. Forgive me for the near duplicate message besides your view making it distinct. Sorry

ChrisNZ
Tourmaline | Level 20

I agree that option NODUPREC must be used carefully, as @Reeza show us is outlined in the documentation. 

The issues of option KEEP= not working as expected, as @data_null__ demonstrates, seems at odds with what this option is supposed to mean.

Shouldn't this be seen as a defect?

 

@skcussas You should start by reading the documentation before asking a question. This behaviour of option NODUPREC is nothing new.

novinosrin
Tourmaline | Level 20

For noduprecs to work, I am afraid you need a datastep as it's apparent keep= dataset option in the proc sort data= doesn;t compile as expected.

 

So here's the work around

 


85   data w;
86   set sashelp.cars;
87   keep make type origin;
88   run;

NOTE: There were 428 observations read from the data set SASHELP.CARS.
NOTE: The data set WORK.W has 428 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


89   proc sort data=w out=w1 noduprecs;
90   by _all_;
91   run;

NOTE: There were 428 observations read from the data set WORK.W.
NOTE: 314 duplicate observations were deleted.
NOTE: The data set WORK.W1 has 114 observations and 3 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds
Reeza
Super User

NODUPREC is not included in the SAS 9.4 documentation for PROC SORT. I do believe it has been removed due to this common issue. 

 

It was likely left in for backward compatibility.

 

https://documentation.sas.com/?docsetId=proc&docsetTarget=p02bhn81rn4u64n1b6l00ftdnxge.htm&docsetVer...

 

From the 9.2 version of the documentation, this behaviour is explicitly mentioned:

 

NODUPRECS

checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to the ones for the previous observation that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

Note:   See NODUPKEY for information about eliminating observations with duplicate BY values.  [cautionend]

Alias : NODUP
Interaction: When you are removing consecutive duplicate observations in the output data set with NODUPRECS, the choice of EQUALS or NOEQUALS can have an effect on which observations are removed.
Interaction: The action of NODUPRECS is directly related to the setting of the SORTDUP= system option. When SORTDUP= is set to LOGICAL, NODUPRECS removes duplicate observations based on the examination of the variables that remain after a DROP or KEEP operation on the input data set. Setting SORTDUP=LOGICAL increases the number of duplicate observations that are removed, because it eliminates variables before observation comparisons take place. Also, setting SORTDUP=LOGICAL can improve performance, because dropping variables before sorting reduces the amount of memory required to perform the sort. When SORTDUP= is set to PHYSICAL, NODUPRECS examines all variables in the data set, regardless of whether they have been kept or dropped. For more information about SORTDUP=, see the chapter on SAS system options in SAS Language Reference: Dictionary.
Interaction: In-database processing does not occur when the NODUPRECS option is specified. However, if the NODUPRECS and NODUPKEY options are specified, system option SQLGENERATION= set for in-database processing, and system option SORTPGM=BEST, the NODUPRECS option is ignored and in-database processing does occur.
Tip: Use the EQUALS option with the NODUPRECS option for consistent results in your output data sets.
Tip: Because NODUPRECS checks only consecutive observations, some nonconsecutive duplicate observations might remain in the output data set. You can remove all duplicates with this option by sorting on all variables.

 

novinosrin
Tourmaline | Level 20

HI @Reeza and @skcussas, Yes Nodupkey is the way and best is to avoid noduprecs at all times like Reeza pointed out. I was making that mistake too for a long time until @mkeintz corrected me and made me learn the nit when we were having a discussion comparing nodupkey, select distinct and noduprecs in tandem. Hmm rings the bell 🙂 Cheers from me  to Mark 

 

 

So ,

 

noduprecs =  select distinct * and not select distinct make ,type, origin

 

which apparently means the from table should only have the vars making it distinct and cannot support dataset option at execution time for noduprecs to work. Well, well what a nit!

 

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 3740 views
  • 13 likes
  • 6 in conversation