Howdy!
I am trying to use the PRESORTED option of PROC SORT to avoid multiple, very lengthy (~10 minutes) sorts. However, I am finding that SAS sometimes does not accept that a sorted dataset is truly sorted when also using NoDupRec.
My dataset, labs, I have already sorted by all variables. When I sort it with no options, SAS tells me it's already sorted. When I add PRESORTED, it tells me it's already sorted. But when I add PRESORTED and NoDupRec, it tells me the data is not sorted. However, the meta data does show the data sorted (of course). Does anyone know what's going on?
1125  proc sort data = labs;
1126      by _all_;
1127  run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              1456.12k
      OS Memory           26644.00k
      Timestamp           03/26/2025 10:21:41 PM
      Step Count                        50  Switch Count  0
1128  proc sort data = labs presorted;
1129      by _all_;
1130  run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              1456.12k
      OS Memory           26644.00k
      Timestamp           03/26/2025 10:21:41 PM
      Step Count                        51  Switch Count  0
1131  proc sort data = labs presorted noDupRec;
1132      by _all_;
1133  run;
NOTE: Input data set is not in sorted order.
When I tried a super simple example, it works fine.
1143  proc sort data = sasHelp.class out = class;
1144      by _all_;
1145  run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: SAS sort was used.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: Compressing data set WORK.CLASS increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              1213.43k
      OS Memory           26636.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        56  Switch Count  0
1146  proc sort data = class presorted;
1147      by _all_;
1148  run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              328.28k
      OS Memory           25860.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        57  Switch Count  0
1149  proc sort data = class presorted noDupRec;
1150      by _all_;
1151  run;
NOTE: Sort order of input data set has been verified.
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              328.28k
      OS Memory           25860.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        58  Switch Count  0Thanks,
Michael
Hello @Kastchei,
I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:
If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.
Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).
The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.
Sadly, the issue persists with NoDupKey, too.
imho noduprec should not be used at all - see https://support.sas.com/kb/1/566.html for details.
The same happens with NoDupKey. I guess the deduplicating process must not properly verify the presorted order.
Hello @Kastchei,
I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:
If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.
Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).
The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.
It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
