You have already had quite a lot of advice. However, there is one other reason checking with a data step may be the faster way - say we have a lot of variables, but only want to check sorting on a few of them, then you can use KEEP= to minimize the data that is actually read.
May save quite a lot of memory, and possibly also some time, e.g.:
%let sorted=NO;
data _null_;
do until(eof);
set sashelp.class(keep=age) end=eof;
by age;
end;
call symputx('SORTED','YES');
run;
%put &=sorted;
I put the DO UNTIL in because it saves the code to reset the variables to missing before reading each observation.
A quick test with a data set of 100,000,000 observations and testing time between @s_lassen's sort check and Proc sort execution:
2194 data work.sorttest; 2195 do id= 1 to 1E7; 2196 do order= 1 to 10; 2197 value= rand('uniform'); 2198 output; 2199 end; 2200 end; 2201 run; NOTE: The data set WORK.SORTTEST has 100000000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 11.65 seconds user cpu time 4.42 seconds system cpu time 1.35 seconds memory 391.90k OS Memory 42016.00k Timestamp 09/11/2024 02:27:34 PM Step Count 122 Switch Count 0 2202 options fullstimer; 2203 %let sorted=NO; 2204 data _null_; 2205 do until(eof); 2206 set work.sorttest (keep=id order) end=eof; 2207 by id order; 2208 end; 2209 call symputx('SORTED','YES'); 2210 run; NOTE: There were 100000000 observations read from the data set WORK.SORTTEST. NOTE: DATA statement used (Total process time): real time 12.31 seconds user cpu time 10.96 seconds system cpu time 0.68 seconds memory 553.53k OS Memory 42016.00k Timestamp 09/11/2024 02:27:47 PM Step Count 123 Switch Count 0 2211 2212 %put &=sorted; SORTED=YES 2213 2214 proc sort data=work.sorttest; 2215 by id order; 2216 run; NOTE: There were 100000000 observations read from the data set WORK.SORTTEST. NOTE: The data set WORK.SORTTEST has 100000000 observations and 3 variables. NOTE: PROCEDURE SORT used (Total process time): real time 41.89 seconds user cpu time 29.75 seconds system cpu time 5.11 seconds memory 1064875.09k OS Memory 1104732.00k Timestamp 09/11/2024 02:28:29 PM Step Count 124 Switch Count 0
Proc sort would likely run slower with lots of variables as more data swapping goes on, BUT in that case TAGSORT would likely reduce the time to something similar to shown.
YMMV if 1) the data is actually in an external database 2) sorting data over a network connection as multiple load and bandwidth factors come into play
The time difference I see here is likely less than getting a cup of coffee in a near-by break room. With data in this order of magnitude I might say just use proc sort and get a cup of coffee (or tea or ...).
So how "huge" is "huge"? As in number of observations and variables?
Using PROC DATASETS
to check the SORTEDBY
attribute:
SAS stores metadata on whether a dataset has been sorted using the SORTEDBY
attribute. You can check this metadata with PROC DATASETS
.
In the output, if the sortedby
variable has a value, it indicates that the dataset is sorted by the corresponding variables. If the dataset is not sorted, the sortedby
variable will be blank.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.