SAS Programming

DATA Step, Macro, Functions and more
BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Kastchei
Pyrite | Level 9

Howdy!

 

I am trying to use the PRESORTED option of PROC SORT to avoid multiple, very lengthy (~10 minutes) sorts.  However, I am finding that SAS sometimes does not accept that a sorted dataset is truly sorted when also using NoDupRec.

 

My dataset, labs, I have already sorted by all variables.  When I sort it with no options, SAS tells me it's already sorted.  When I add PRESORTED, it tells me it's already sorted.  But when I add PRESORTED and NoDupRec, it tells me the data is not sorted.  However, the meta data does show the data sorted (of course).  Does anyone know what's going on?

1125  proc sort data = labs;
1126      by _all_;
1127  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              1456.12k
      OS Memory           26644.00k
      Timestamp           03/26/2025 10:21:41 PM
      Step Count                        50  Switch Count  0


1128  proc sort data = labs presorted;
1129      by _all_;
1130  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              1456.12k
      OS Memory           26644.00k
      Timestamp           03/26/2025 10:21:41 PM
      Step Count                        51  Switch Count  0


1131  proc sort data = labs presorted noDupRec;
1132      by _all_;
1133  run;

NOTE: Input data set is not in sorted order.

When I tried a super simple example, it works fine.

1143  proc sort data = sasHelp.class out = class;
1144      by _all_;
1145  run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: SAS sort was used.
NOTE: The data set WORK.CLASS has 19 observations and 5 variables.
NOTE: Compressing data set WORK.CLASS increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.01 seconds
      user cpu time       0.00 seconds
      system cpu time     0.01 seconds
      memory              1213.43k
      OS Memory           26636.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        56  Switch Count  0


1146  proc sort data = class presorted;
1147      by _all_;
1148  run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              328.28k
      OS Memory           25860.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        57  Switch Count  0


1149  proc sort data = class presorted noDupRec;
1150      by _all_;
1151  run;

NOTE: Sort order of input data set has been verified.
NOTE: There were 19 observations read from the data set WORK.CLASS.
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      user cpu time       0.00 seconds
      system cpu time     0.00 seconds
      memory              328.28k
      OS Memory           25860.00k
      Timestamp           03/26/2025 10:23:46 PM
      Step Count                        58  Switch Count  0

Thanks,

Michael

 

1 ACCEPTED SOLUTION

Accepted Solutions
FreelanceReinh
Jade | Level 19

Hello @Kastchei,

 

I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:

If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.

Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).

 

The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.

View solution in original post

6 REPLIES 6
Ksharp
Super User
Try to use "noDupKey" ,not "noDupRec" which is objected in reality.
Kastchei
Pyrite | Level 9

Sadly, the issue persists with NoDupKey, too.

andreas_lds
Jade | Level 19

imho noduprec should not be used at all - see https://support.sas.com/kb/1/566.html for details.

Kastchei
Pyrite | Level 9

The same happens with NoDupKey.  I guess the deduplicating process must not properly verify the presorted order.

FreelanceReinh
Jade | Level 19

Hello @Kastchei,

 

I think what you've observed is in accordance with the documentation. In the fourth paragraph of section "Presorted Input Data Sets" of the SORT procedure concepts it says about what happens if the NODUPKEY option is used:

If observations with duplicate keys are found, then the data set is considered unsorted and a sort is performed.

Most likely, your LABS dataset contains such duplicates (unlike dataset CLASS in your second example) and this is what the note "Input data set is not in sorted order" in the log indicates (although the wording is not ideal).

 

The question remains whether removing those duplicates from an actually presorted dataset with PROC SORT ... NODUPKEY is much more resource-intensive than doing so with a DATA step using a BY statement and a subsetting IF (if first.<last BY variable in the list>;). If in doubt, I would use the DATA step.

Kastchei
Pyrite | Level 9
Thank you! I had overlooked this in the documentation. That is unfortunate. I figured if told it was sorted, it would just do the deduplication part. As it stands, it takes as long to deduplicate with presorted as it did to sort the dataset in the first place. I guess deduplication is more costly than I figured it would be.

I completely overlooked the first. data step route. I bet it is faster. I will try!

(Just FYI, DISTINCT in SQL couldn't complete, because the sort there took too many resources.)

sas-innovate-white.png

Join us for our biggest event of the year!

Four days of inspiring keynotes, product reveals, hands-on learning opportunities, deep-dive demos, and peer-led breakouts. Don't miss out, May 6-9, in Orlando, Florida.

 

View the full agenda.

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 312 views
  • 4 likes
  • 4 in conversation