Solved: Re: Presorted not working with NoDupRec

Kastchei · Posted 03-26-2025 10:27 PM

Howdy!

I am trying to use the PRESORTED option of PROC SORT to avoid multiple, very lengthy (~10 minutes) sorts. However, I am finding that SAS sometimes does not accept that a sorted dataset is truly sorted when also using NoDupRec.

My dataset, labs, I have already sorted by all variables. When I sort it with no options, SAS tells me it's already sorted. When I add PRESORTED, it tells me it's already sorted. But when I add PRESORTED and NoDupRec, it tells me the data is not sorted. However, the meta data does show the data sorted (of course). Does anyone know what's going on?

1125  proc sort data = labs;
1126      by _all_;
1127  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              1456.12k
      OS Memory           26644.00k
      Timestamp           03/26/2025 10:21:41 PM
      Step Count                        50  Switch Count  0


1128  proc sort data = labs presorted;
1129      by _all_;
1130  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              1456.12k
      OS Memory           26644.00k
      Timestamp           03/26/2025 10:21:41 PM
      Step Count                        51  Switch Count  0


1131  proc sort data = labs presorted noDupRec;
1132      by _all_;
1133  run;

NOTE: Input data set is not in sorted order.

When I tried a super simple example, it works fine.

1143  proc sort data = sasHelp.class out = class;
1144      by _all_;
1145  run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: SAS sort was used.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: Compressing data set WORK.CLASS increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              1213.43k
      OS Memory           26636.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        56  Switch Count  0


1146  proc sort data = class presorted;
1147      by _all_;
1148  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              328.28k
      OS Memory           25860.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        57  Switch Count  0


1149  proc sort data = class presorted noDupRec;
1150      by _all_;
1151  run;

NOTE: Sort order of input data set has been verified.
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              328.28k
      OS Memory           25860.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        58  Switch Count  0

Thanks,

Michael

FreelanceReinh · Posted 03-27-2025 01:42 PM

Hello @Kastchei,

I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:

If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.

Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).

The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.

View solution in original post

Ksharp · Posted 03-26-2025 10:53 PM

Try to use "noDupKey" ,not "noDupRec" which is objected in reality.

Kastchei · Posted 03-27-2025 09:47 AM

Sadly, the issue persists with NoDupKey, too.

andreas_lds · Posted 03-27-2025 08:57 AM

imho noduprec should not be used at all - see https://support.sas.com/kb/1/566.html for details.

Kastchei · Posted 03-27-2025 09:49 AM

The same happens with NoDupKey. I guess the deduplicating process must not properly verify the presorted order.

FreelanceReinh · Posted 03-27-2025 01:42 PM

Hello @Kastchei,

I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:

If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.

Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).

The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.

Kastchei · Posted 03-27-2025 03:21 PM

Thank you! I had overlooked this in the documentation. That is unfortunate. I figured if told it was sorted, it would just do the deduplication part. As it stands, it takes as long to deduplicate with presorted as it did to sort the dataset in the first place. I guess deduplication is more costly than I figured it would be.

I completely overlooked the first. data step route. I bet it is faster. I will try!

(Just FYI, DISTINCT in SQL couldn't complete, because the sort there took too many resources.)

SAS Training: Just a Click Away