Re: proc sort - noduprec vs nodupkey

Julie4435637 · Posted 04-11-2020 11:15 AM

I have a dataset (dr_codes with 9037 records and 9 variables) - I ran the 
following 2 proc sort and the results were different. I don't understand why 
the results are different.  I know that noduprec removes duplicates by all
variables in the dataset but I am only reading in 1 variable and therefore
believe it should behave like a nodupkey. I have attached a log.


1st - output dataset has 9037 records and 1 variable - 0 duplicate observations deleted
 proc sort data=dr_codes(keep=dr_id) out=dr_codes_noduprec noduprec;
   by dr_id;
run;

2nd - output dataset has 8719 records and 1 variable - 318 duplicated key values deleted
proc sort data=dr_codes(keep=dr_id) out=dr_codes_nodupkey nodupkey;
   by dr_id;
run;

ed_sas_member · Posted 04-11-2020 11:31 AM

Hi @Julie4435637

Here is an interesting paper about this topic: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf

With the NODUPKEY option, PROC SORT is comparing all BY variable values while the NODUPREC option compares all the variables in the data set that is being sorted.

An easy way to remember the difference between these options is to keep in mind the word “key” in NODUPKEY.

Kind regards,

ChrisNZ · Posted 04-11-2020 08:53 PM

Also be mindful that NODUPREC compares *consecutive* records. You may well end up with duplicates if the identical records are not consecutive.

In this case, it's better to use

by _ALL_;

to ensure the deduplication is effected.

High-Performance SAS Coding - Third Edition

art297 · Posted 04-11-2020 09:49 PM

@Julie4435637 : Congratulations! You did what I've been trying to do unsuccessfully for over 46 years .. find a bug in SAS that can't be brushed off as being a feature.

@ChrisHemedinger : I can't submit anything to tech support, so I'd appreciate your doing the honors.

@Julie4435637 : I bet your dr_id field is a character (rather than numeric) field. I totally agree with you that noduprec and nodupkey should have given you identical results.

Here is an example, like yours I presume, where they don't behave as expected:

data have;
  input id $ x;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;

proc sort data=have out=nodup noduprec;
  by id;
run;

proc sort data=have out=nodupkey nodupkey;
  by id;
run;

Art, CEO, AnalystFinder.com

ChrisNZ · Posted 04-11-2020 10:30 PM

@art297 I am confused.

1. I don't see how your 2 sorts should output the same result.

2. Option NODUPREC needs consecutive record to work properly.

So you need to use: by ID X; or by ID _ALL_;

3. Lucky you. I seem to find defect after defect, month after month, year after year. Suffering @SimonDawson is often at the receiving end. 🙂

High-Performance SAS Coding - Third Edition

art297 · Posted 04-11-2020 11:00 PM

@Julie4435637 @ChrisNZ @ChrisHemedinger : First, my apologies, my example was bad.

I may be mistaken but, I think that the order of concern (for nodup (noduprec) is dependent upon the order AFTER the sort .. not the one before the sort.

I was also wrong regarding that the difference was between character and numeric by variables.

The problem is simply that if a file has more than one variable, and a keep option is used to only include the by variable, nodup (noduprec) performs differently than nodupkey .. when it shouldn't.

e.g.

data have_w_char_id;
  input id $;
  cards;
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
;
run;

proc sort data=have_w_char_id (keep=id) out=c_id_nodup noduprec;
  by id;
run;

proc sort data=have_w_char_id (keep=id) out=c_id_nodupkey nodupkey;
  by id;
run;

data have_w_num_id;
  input id;
  cards;
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
;
run;

proc sort data=have_w_num_id (keep=id) out=n_id_nodup noduprec;
  by id;
run;

proc sort data=have_w_num_id (keep=id) out=n_id_nodupkey nodupkey;
  by id;
run;

data have_w_char_id2;
  input id $ x;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;

proc sort data=have_w_char_id2 (keep=id) out=c_id_nodup2 noduprec;
  by id;
run;

proc sort data=have_w_char_id2 (keep=id) out=c_id_nodupkey2 nodupkey;
  by id;
run;

data have_w_num_id2;
  input id x;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;

proc sort data=have_w_num_id2 (keep=id) out=n_id_nodup2 noduprec;
  by id;
run;

proc sort data=have_w_num_id2 (keep=id) out=n_id_nodupkey2 nodupkey;
  by id;
run;

Art, CEO, AnalystFinder.com

ChrisNZ · Posted 04-11-2020 11:31 PM

@art297 I am with you now.

This option may be what you are after:

SORTDUP= PHYSICAL | LOGICAL

Syntax Description

PHYSICAL

removes duplicates based on all the variables that are present in the data set.

LOGICAL

removes duplicates based on only the variables remaining after the DROP= and KEEP= data set options are processed.

High-Performance SAS Coding - Third Edition

art297 · Posted 04-12-2020 09:23 AM

@ChrisNZ : While my quest for finding my first bug must continue, that was definitely the key to solving @Julie4435637's problem.

The actual documentation can be found at: https://documentation.sas.com/?docsetId=lesysoptsref&docsetTarget=p0enmy04p8nukmn1cjpdcqku455s.htm&d...

Thus, running: options sortdup=logical;

turned this into a feature rather than a bug. I never confronted it as I never found a reason to use nodup over nodupkey.

Art, CEO, AnalystFinder.com

Julie4435637 · Posted 04-12-2020 11:46 AM

@art297 Thank you - the way SAS has it set up is very counter intutive. I expected SAS to behave the same way as nodupkey since the only variable I am interested in was the dr_id variable. I reviewed the documentation and tested the following code. I expected the 1st proc sort to behave like yesterday's code - but it deleted the duplicate observation. So do you see the difference? I attahed the log.

data tmp;

input avar $1.;

cards;

a

b

c

a

;;;;

run;

data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;

proc sort noduprecs data=tmp(keep=avar) out=_keep;
   by avar;
run;

proc sort noduprecs data=tmp out=_nokeep;
   by avar;
run;

proc sort noduprecs data=tmp(keep=avar) out=_keep;

by avar;

run;

proc sort noduprecs data=tmp out=_nokeep;

by avar;

run;

data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;

proc sort noduprecs data=tmp(keep=avar) out=_keep;
   by avar;
run;

proc sort noduprecs data=tmp out=_nokeep;
   by avar;
run;

data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;

proc sort noduprecs data=tmp(keep=avar) out=_keep;
   by avar;
run;

proc sort noduprecs data=tmp out=_nokeep;
   by avar;
run;

art297 · Posted 04-12-2020 11:59 AM

@Julie4435637 : Since the file only has one variable, logical and physical are the same in that the keep option doesn't do anything in this case. My personal recommendation is simply to ALWAYS use proc sort's nodupkey option when you need to remove duplicates.

Art, CEO, AnalystFinder.com

ChrisNZ · Posted 04-12-2020 09:05 PM

The default should be sortdup=logical.

Not only is sortdup=physical less intuitive, it is also inefficient as it lets PROC SORT load more variables than probably intended.

data t; 
  length X $1024;
  do I=1e6 to 1 by -1;
    output;
  end; 
run;
options sortdup=physical ;
proc sort data=T(keep=I) out=T1 noduprec; by I; run;
options sortdup=logical ;
proc sort data=T(keep=I) out=T1 noduprec; by I; run;

 79         options sortdup=physical ;
 80         proc sort data=T(keep=I) out=t1 noduprec; by I; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: 0 duplicate observations were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 1 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           25.45 seconds
       cpu time            2.56 seconds
       
 
 81         options sortdup=logical ;
 82         proc sort data=T(keep=I) out=t1 noduprec; by I; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: 0 duplicate observations were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 1 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           15.61 seconds
       cpu time            0.85 seconds

High-Performance SAS Coding - Third Edition

Quentin · Posted 04-12-2020 09:40 PM

Personally, I think the NODUPRECS option should be avoided. It does not provide useful functionality beyond what can be done with NODUPKEY and BY _ALL_, and of course it does not do the same thing as NODUPKEY BY _ALL_, which is surprising.

I hadn't known before about this interesting SORTDUP=physical|logical option, but the fact that NODUPRECS would not delete observations that are not consecutive was enough for me to avoid it.

I don't see any mention of the NODUPRECS option in the current docs:

https://documentation.sas.com/?docsetId=proc&docsetTarget=p02bhn81rn4u64n1b6l00ftdnxge.htm&docsetVer...

So I would think of NODUPRECS as deprecated (at best).

Is there a case where NODUPRECS can accomplish something useful which can not be accomplished via NODUPKEY?

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

ChrisNZ · Posted 04-13-2020 04:20 AM

Mmm interesting.

This option is still mentioned elsewhere, but this could indeed indicate that it's on its way to deprecation.

https://documentation.sas.com/?docsetId=proc&docsetVersion=9.4&docsetTarget=p1sy9ca8n2tv03n1savk4p9c...

https://documentation.sas.com/?docsetId=lrcon&docsetTarget=n0flzgu1cpvz16n1xvwchd0ogsut.htm&docsetVe...

https://documentation.sas.com/?docsetId=proc&docsetTarget=p1r64hrox6o1j6n1oxtbeey2xfm0.htm&docsetVer...

I'll tell the documentation people about this apparent inconsistency.

I agree with you, its utility is dubious, and it's got too many gotchas (consecutive records needed, and logical/physical variables).

One positive: It's much faster than running a by _ALL_, if you know that your data is suitable with these limitations.

data T; 
  array V[100];
  do I =1 to 1e6; 
    V[1]=ranuni(0);
    output;
  end;
proc sort data =T out=T1 nodupkey; by V1 _all_; run; 
proc sort data =T out=T1 noduprec; by V1 ;      run;

74           proc sort data =T out=T1 nodupkey; by V1 _all_; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored.
 NOTE: 0 observations with duplicate key values were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 102 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           2:54.55
       cpu time            14.09 seconds
       
 
 75           proc sort data =T out=T1 noduprec; by V1 ; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: 0 duplicate observations were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 102 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           24.78 seconds
       cpu time            3.41 seconds

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-21-2020 07:30 AM

> I'll tell the documentation people about this apparent inconsistency.

Here is the update:

In PROC SORT that option was removed before 9.4. If you are searching in the source code, it appears in comments for historical information only and does not show up in the PROC SORT documentation.

[To the] point that it should have been mentioned as deprecated [] I searched the 9.2 and 9.3 versions of the What’s New and that was missed. Thank you for letting us know that it was still mentioned a few other places. It is definitely an option that we do not want to use.

So it seems that our reservations about this option are shared, and the option is now gone.

High-Performance SAS Coding - Third Edition

Quentin · Posted 04-21-2020 08:33 AM

Thanks for following up with tech support and sharing their response, @ChrisNZ . I'm happy to see that NODUPRECS is "officially" deprecated. I'd wager that 1/2 of the PROC SORT questions on SAS-L over the generations were about the oddity of NODUPRECS not actually removing all duplicated records. Good riddance to NODUPRECS. : )

BASUG is hosting free webinars Next up: Don Henderson presenting on using hash functions (not hash tables!) to segment data on June 12. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

Syntax Description

PHYSICAL

LOGICAL

Classroom Training Available!