I have a dataset (dr_codes with 9037 records and 9 variables) - I ran the
following 2 proc sort and the results were different. I don't understand why
the results are different. I know that noduprec removes duplicates by all
variables in the dataset but I am only reading in 1 variable and therefore
believe it should behave like a nodupkey. I have attached a log.
1st - output dataset has 9037 records and 1 variable - 0 duplicate observations deleted
proc sort data=dr_codes(keep=dr_id) out=dr_codes_noduprec noduprec;
by dr_id;
run;
2nd - output dataset has 8719 records and 1 variable - 318 duplicated key values deleted
proc sort data=dr_codes(keep=dr_id) out=dr_codes_nodupkey nodupkey;
by dr_id;
run;
Here is an interesting paper about this topic: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf
With the NODUPKEY option, PROC SORT is comparing all BY variable values while the NODUPREC option compares all the variables in the data set that is being sorted.
An easy way to remember the difference between these options is to keep in mind the word “key” in NODUPKEY.
Kind regards,
Also be mindful that NODUPREC compares *consecutive* records. You may well end up with duplicates if the identical records are not consecutive.
In this case, it's better to use
by _ALL_;
to ensure the deduplication is effected.
@Julie4435637 : Congratulations! You did what I've been trying to do unsuccessfully for over 46 years .. find a bug in SAS that can't be brushed off as being a feature.
@ChrisHemedinger : I can't submit anything to tech support, so I'd appreciate your doing the honors.
@Julie4435637 : I bet your dr_id field is a character (rather than numeric) field. I totally agree with you that noduprec and nodupkey should have given you identical results.
Here is an example, like yours I presume, where they don't behave as expected:
data have;
  input id $ x;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;
proc sort data=have out=nodup noduprec;
  by id;
run;
proc sort data=have out=nodupkey nodupkey;
  by id;
run;
Art, CEO, AnalystFinder.com
@art297 I am confused.
1. I don't see how your 2 sorts should output the same result.
2. Option NODUPREC needs consecutive record to work properly.
So you need to use: by ID X; or by ID _ALL_;
3. Lucky you. I seem to find defect after defect, month after month, year after year. Suffering @SimonDawson is often at the receiving end. 🙂
@Julie4435637 @ChrisNZ @ChrisHemedinger : First, my apologies, my example was bad.
I may be mistaken but, I think that the order of concern (for nodup (noduprec) is dependent upon the order AFTER the sort .. not the one before the sort.
I was also wrong regarding that the difference was between character and numeric by variables.
The problem is simply that if a file has more than one variable, and a keep option is used to only include the by variable, nodup (noduprec) performs differently than nodupkey .. when it shouldn't.
e.g.
data have_w_char_id;
  input id $;
  cards;
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
;
run;
proc sort data=have_w_char_id (keep=id) out=c_id_nodup noduprec;
  by id;
run;
proc sort data=have_w_char_id (keep=id) out=c_id_nodupkey nodupkey;
  by id;
run;
data have_w_num_id;
  input id;
  cards;
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
;
run;
proc sort data=have_w_num_id (keep=id) out=n_id_nodup noduprec;
  by id;
run;
proc sort data=have_w_num_id (keep=id) out=n_id_nodupkey nodupkey;
  by id;
run;
data have_w_char_id2;
  input id $ x;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;
proc sort data=have_w_char_id2 (keep=id) out=c_id_nodup2 noduprec;
  by id;
run;
proc sort data=have_w_char_id2 (keep=id) out=c_id_nodupkey2 nodupkey;
  by id;
run;
data have_w_num_id2;
  input id x;
  cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;
proc sort data=have_w_num_id2 (keep=id) out=n_id_nodup2 noduprec;
  by id;
run;
proc sort data=have_w_num_id2 (keep=id) out=n_id_nodupkey2 nodupkey;
  by id;
run;
Art, CEO, AnalystFinder.com
@art297 I am with you now.
This option may be what you are after:
removes duplicates based on all the variables that are present in the data set.
removes duplicates based on only the variables remaining after the DROP= and KEEP= data set options are processed.
@ChrisNZ : While my quest for finding my first bug must continue, that was definitely the key to solving @Julie4435637's problem.
The actual documentation can be found at: https://documentation.sas.com/?docsetId=lesysoptsref&docsetTarget=p0enmy04p8nukmn1cjpdcqku455s.htm&d...
Thus, running: options sortdup=logical;
turned this into a feature rather than a bug. I never confronted it as I never found a reason to use nodup over nodupkey.
Art, CEO, AnalystFinder.com
@art297 Thank you - the way SAS has it set up is very counter intutive. I expected SAS to behave the same way as nodupkey since the only variable I am interested in was the dr_id variable. I reviewed the documentation and tested the following code. I expected the 1st proc sort to behave like yesterday's code - but it deleted the duplicate observation. So do you see the difference? I attahed the log.
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
@Julie4435637 : Since the file only has one variable, logical and physical are the same in that the keep option doesn't do anything in this case. My personal recommendation is simply to ALWAYS use proc sort's nodupkey option when you need to remove duplicates.
Art, CEO, AnalystFinder.com
The default should be sortdup=logical.
Not only is sortdup=physical less intuitive, it is also inefficient as it lets PROC SORT load more variables than probably intended.
data t; 
  length X $1024;
  do I=1e6 to 1 by -1;
    output;
  end; 
run;
options sortdup=physical ;
proc sort data=T(keep=I) out=T1 noduprec; by I; run;
options sortdup=logical ;
proc sort data=T(keep=I) out=T1 noduprec; by I; run; 79         options sortdup=physical ;
 80         proc sort data=T(keep=I) out=t1 noduprec; by I; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: 0 duplicate observations were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 1 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           25.45 seconds
       cpu time            2.56 seconds
       
 
 81         options sortdup=logical ;
 82         proc sort data=T(keep=I) out=t1 noduprec; by I; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: 0 duplicate observations were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 1 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           15.61 seconds
       cpu time            0.85 seconds
       
Personally, I think the NODUPRECS option should be avoided. It does not provide useful functionality beyond what can be done with NODUPKEY and BY _ALL_, and of course it does not do the same thing as NODUPKEY BY _ALL_, which is surprising.
I hadn't known before about this interesting SORTDUP=physical|logical option, but the fact that NODUPRECS would not delete observations that are not consecutive was enough for me to avoid it.
I don't see any mention of the NODUPRECS option in the current docs:
So I would think of NODUPRECS as deprecated (at best).
Is there a case where NODUPRECS can accomplish something useful which can not be accomplished via NODUPKEY?
Mmm interesting.
This option is still mentioned elsewhere, but this could indeed indicate that it's on its way to deprecation.
I'll tell the documentation people about this apparent inconsistency.
I agree with you, its utility is dubious, and it's got too many gotchas (consecutive records needed, and logical/physical variables).
One positive: It's much faster than running a by _ALL_, if you know that your data is suitable with these limitations.
data T; 
  array V[100];
  do I =1 to 1e6; 
    V[1]=ranuni(0);
    output;
  end;
proc sort data =T out=T1 nodupkey; by V1 _all_; run; 
proc sort data =T out=T1 noduprec; by V1 ;      run;  74           proc sort data =T out=T1 nodupkey; by V1 _all_; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored.
 NOTE: 0 observations with duplicate key values were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 102 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           2:54.55
       cpu time            14.09 seconds
       
 
 75           proc sort data =T out=T1 noduprec; by V1 ; run;
 
 NOTE: There were 1000000 observations read from the data set WORK.T.
 NOTE: 0 duplicate observations were deleted.
 NOTE: The data set WORK.T1 has 1000000 observations and 102 variables.
 NOTE: PROCEDURE SORT used (Total process time):
       real time           24.78 seconds
       cpu time            3.41 seconds
> I'll tell the documentation people about this apparent inconsistency.
Here is the update:
In PROC SORT that option was removed before 9.4. If you are searching in the source code, it appears in comments for historical information only and does not show up in the PROC SORT documentation.
[To the] point that it should have been mentioned as deprecated [] I searched the 9.2 and 9.3 versions of the What’s New and that was missed. Thank you for letting us know that it was still mentioned a few other places. It is definitely an option that we do not want to use.
So it seems that our reservations about this option are shared, and the option is now gone.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
