- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I have a dataset (dr_codes with 9037 records and 9 variables) - I ran the
following 2 proc sort and the results were different. I don't understand why
the results are different. I know that noduprec removes duplicates by all
variables in the dataset but I am only reading in 1 variable and therefore
believe it should behave like a nodupkey. I have attached a log.
1st - output dataset has 9037 records and 1 variable - 0 duplicate observations deleted
proc sort data=dr_codes(keep=dr_id) out=dr_codes_noduprec noduprec;
by dr_id;
run;
2nd - output dataset has 8719 records and 1 variable - 318 duplicated key values deleted
proc sort data=dr_codes(keep=dr_id) out=dr_codes_nodupkey nodupkey;
by dr_id;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Here is an interesting paper about this topic: https://support.sas.com/resources/papers/proceedings/proceedings/sugi30/037-30.pdf
With the NODUPKEY option, PROC SORT is comparing all BY variable values while the NODUPREC option compares all the variables in the data set that is being sorted.
An easy way to remember the difference between these options is to keep in mind the word “key” in NODUPKEY.
Kind regards,
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Also be mindful that NODUPREC compares *consecutive* records. You may well end up with duplicates if the identical records are not consecutive.
In this case, it's better to use
by _ALL_;
to ensure the deduplication is effected.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Julie4435637 : Congratulations! You did what I've been trying to do unsuccessfully for over 46 years .. find a bug in SAS that can't be brushed off as being a feature.
@ChrisHemedinger : I can't submit anything to tech support, so I'd appreciate your doing the honors.
@Julie4435637 : I bet your dr_id field is a character (rather than numeric) field. I totally agree with you that noduprec and nodupkey should have given you identical results.
Here is an example, like yours I presume, where they don't behave as expected:
data have;
input id $ x;
cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;
proc sort data=have out=nodup noduprec;
by id;
run;
proc sort data=have out=nodupkey nodupkey;
by id;
run;
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@art297 I am confused.
1. I don't see how your 2 sorts should output the same result.
2. Option NODUPREC needs consecutive record to work properly.
So you need to use: by ID X; or by ID _ALL_;
3. Lucky you. I seem to find defect after defect, month after month, year after year. Suffering @SimonDawson is often at the receiving end. 🙂
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Julie4435637 @ChrisNZ @ChrisHemedinger : First, my apologies, my example was bad.
I may be mistaken but, I think that the order of concern (for nodup (noduprec) is dependent upon the order AFTER the sort .. not the one before the sort.
I was also wrong regarding that the difference was between character and numeric by variables.
The problem is simply that if a file has more than one variable, and a keep option is used to only include the by variable, nodup (noduprec) performs differently than nodupkey .. when it shouldn't.
e.g.
data have_w_char_id;
input id $;
cards;
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
;
run;
proc sort data=have_w_char_id (keep=id) out=c_id_nodup noduprec;
by id;
run;
proc sort data=have_w_char_id (keep=id) out=c_id_nodupkey nodupkey;
by id;
run;
data have_w_num_id;
input id;
cards;
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
1
2
3
4
5
6
7
;
run;
proc sort data=have_w_num_id (keep=id) out=n_id_nodup noduprec;
by id;
run;
proc sort data=have_w_num_id (keep=id) out=n_id_nodupkey nodupkey;
by id;
run;
data have_w_char_id2;
input id $ x;
cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;
proc sort data=have_w_char_id2 (keep=id) out=c_id_nodup2 noduprec;
by id;
run;
proc sort data=have_w_char_id2 (keep=id) out=c_id_nodupkey2 nodupkey;
by id;
run;
data have_w_num_id2;
input id x;
cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
1 1
2 2
3 3
4 4
5 5
6 6
7 7
1 7
2 6
3 5
4 4
5 3
6 2
7 1
;
run;
proc sort data=have_w_num_id2 (keep=id) out=n_id_nodup2 noduprec;
by id;
run;
proc sort data=have_w_num_id2 (keep=id) out=n_id_nodupkey2 nodupkey;
by id;
run;
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@art297 I am with you now.
This option may be what you are after:
Syntax Description
PHYSICAL
removes duplicates based on all the variables that are present in the data set.
LOGICAL
removes duplicates based on only the variables remaining after the DROP= and KEEP= data set options are processed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@ChrisNZ : While my quest for finding my first bug must continue, that was definitely the key to solving @Julie4435637's problem.
The actual documentation can be found at: https://documentation.sas.com/?docsetId=lesysoptsref&docsetTarget=p0enmy04p8nukmn1cjpdcqku455s.htm&d...
Thus, running: options sortdup=logical;
turned this into a feature rather than a bug. I never confronted it as I never found a reason to use nodup over nodupkey.
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@art297 Thank you - the way SAS has it set up is very counter intutive. I expected SAS to behave the same way as nodupkey since the only variable I am interested in was the dr_id variable. I reviewed the documentation and tested the following code. I expected the 1st proc sort to behave like yesterday's code - but it deleted the duplicate observation. So do you see the difference? I attahed the log.
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
data tmp;
input avar $1.;
cards;
a
b
c
a
;;;;
run;
proc sort noduprecs data=tmp(keep=avar) out=_keep;
by avar;
run;
proc sort noduprecs data=tmp out=_nokeep;
by avar;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Julie4435637 : Since the file only has one variable, logical and physical are the same in that the keep option doesn't do anything in this case. My personal recommendation is simply to ALWAYS use proc sort's nodupkey option when you need to remove duplicates.
Art, CEO, AnalystFinder.com
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The default should be sortdup=logical.
Not only is sortdup=physical less intuitive, it is also inefficient as it lets PROC SORT load more variables than probably intended.
data t;
length X $1024;
do I=1e6 to 1 by -1;
output;
end;
run;
options sortdup=physical ;
proc sort data=T(keep=I) out=T1 noduprec; by I; run;
options sortdup=logical ;
proc sort data=T(keep=I) out=T1 noduprec; by I; run;
79 options sortdup=physical ; 80 proc sort data=T(keep=I) out=t1 noduprec; by I; run; NOTE: There were 1000000 observations read from the data set WORK.T. NOTE: 0 duplicate observations were deleted. NOTE: The data set WORK.T1 has 1000000 observations and 1 variables. NOTE: PROCEDURE SORT used (Total process time): real time 25.45 seconds cpu time 2.56 seconds 81 options sortdup=logical ; 82 proc sort data=T(keep=I) out=t1 noduprec; by I; run; NOTE: There were 1000000 observations read from the data set WORK.T. NOTE: 0 duplicate observations were deleted. NOTE: The data set WORK.T1 has 1000000 observations and 1 variables. NOTE: PROCEDURE SORT used (Total process time): real time 15.61 seconds cpu time 0.85 seconds
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Personally, I think the NODUPRECS option should be avoided. It does not provide useful functionality beyond what can be done with NODUPKEY and BY _ALL_, and of course it does not do the same thing as NODUPKEY BY _ALL_, which is surprising.
I hadn't known before about this interesting SORTDUP=physical|logical option, but the fact that NODUPRECS would not delete observations that are not consecutive was enough for me to avoid it.
I don't see any mention of the NODUPRECS option in the current docs:
So I would think of NODUPRECS as deprecated (at best).
Is there a case where NODUPRECS can accomplish something useful which can not be accomplished via NODUPKEY?
Next up: SAS Trivia Quiz hosted by SAS on Wednesday May 21.
Register now at https://www.basug.org/events.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Mmm interesting.
This option is still mentioned elsewhere, but this could indeed indicate that it's on its way to deprecation.
I'll tell the documentation people about this apparent inconsistency.
I agree with you, its utility is dubious, and it's got too many gotchas (consecutive records needed, and logical/physical variables).
One positive: It's much faster than running a by _ALL_, if you know that your data is suitable with these limitations.
data T;
array V[100];
do I =1 to 1e6;
V[1]=ranuni(0);
output;
end;
proc sort data =T out=T1 nodupkey; by V1 _all_; run;
proc sort data =T out=T1 noduprec; by V1 ; run;
74 proc sort data =T out=T1 nodupkey; by V1 _all_; run; NOTE: There were 1000000 observations read from the data set WORK.T. NOTE: Duplicate BY variable(s) specified. Duplicates will be ignored. NOTE: 0 observations with duplicate key values were deleted. NOTE: The data set WORK.T1 has 1000000 observations and 102 variables. NOTE: PROCEDURE SORT used (Total process time): real time 2:54.55 cpu time 14.09 seconds 75 proc sort data =T out=T1 noduprec; by V1 ; run; NOTE: There were 1000000 observations read from the data set WORK.T. NOTE: 0 duplicate observations were deleted. NOTE: The data set WORK.T1 has 1000000 observations and 102 variables. NOTE: PROCEDURE SORT used (Total process time): real time 24.78 seconds cpu time 3.41 seconds
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
> I'll tell the documentation people about this apparent inconsistency.
Here is the update:
In PROC SORT that option was removed before 9.4. If you are searching in the source code, it appears in comments for historical information only and does not show up in the PROC SORT documentation.
[To the] point that it should have been mentioned as deprecated [] I searched the 9.2 and 9.3 versions of the What’s New and that was missed. Thank you for letting us know that it was still mentioned a few other places. It is definitely an option that we do not want to use.
So it seems that our reservations about this option are shared, and the option is now gone.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Next up: SAS Trivia Quiz hosted by SAS on Wednesday May 21.
Register now at https://www.basug.org/events.