- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Howdy!
I am trying to use the PRESORTED option of PROC SORT to avoid multiple, very lengthy (~10 minutes) sorts. However, I am finding that SAS sometimes does not accept that a sorted dataset is truly sorted when also using NoDupRec.
My dataset, labs, I have already sorted by all variables. When I sort it with no options, SAS tells me it's already sorted. When I add PRESORTED, it tells me it's already sorted. But when I add PRESORTED and NoDupRec, it tells me the data is not sorted. However, the meta data does show the data sorted (of course). Does anyone know what's going on?
1125 proc sort data = labs;
1126 by _all_;
1127 run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 1456.12k
OS Memory 26644.00k
Timestamp 03/26/2025 10:21:41 PM
Step Count 50 Switch Count 0
1128 proc sort data = labs presorted;
1129 by _all_;
1130 run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.01 seconds
memory 1456.12k
OS Memory 26644.00k
Timestamp 03/26/2025 10:21:41 PM
Step Count 51 Switch Count 0
1131 proc sort data = labs presorted noDupRec;
1132 by _all_;
1133 run;
NOTE: Input data set is not in sorted order.
When I tried a super simple example, it works fine.
1143 proc sort data = sasHelp.class out = class;
1144 by _all_;
1145 run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: SAS sort was used.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: Compressing data set WORK.CLASS increased size by 100.00 percent.
Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.01 seconds
user cpu time 0.00 seconds
system cpu time 0.01 seconds
memory 1213.43k
OS Memory 26636.00k
Timestamp 03/26/2025 10:23:46 PM
Step Count 56 Switch Count 0
1146 proc sort data = class presorted;
1147 by _all_;
1148 run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 328.28k
OS Memory 25860.00k
Timestamp 03/26/2025 10:23:46 PM
Step Count 57 Switch Count 0
1149 proc sort data = class presorted noDupRec;
1150 by _all_;
1151 run;
NOTE: Sort order of input data set has been verified.
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 328.28k
OS Memory 25860.00k
Timestamp 03/26/2025 10:23:46 PM
Step Count 58 Switch Count 0
Thanks,
Michael
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @Kastchei,
I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:
If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.
Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).
The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Sadly, the issue persists with NoDupKey, too.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
imho noduprec should not be used at all - see https://support.sas.com/kb/1/566.html for details.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The same happens with NoDupKey. I guess the deduplicating process must not properly verify the presorted order.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello @Kastchei,
I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:
If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.
Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).
The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I completely overlooked the first. data step route. I bet it is faster. I will try!
(Just FYI, DISTINCT in SQL couldn't complete, because the sort there took too many resources.)