Let's say that I have a big data set and I want to know if this data set is sorted by variable CustID.
What is the way to check it?
I know that I can look at subset of few rows and see if it is sorted but it is still not perfect way because it might happen that in the next rows data set is not sorted
I run proc contents and it is written sorted No
proc contents data=csp.twoyrs_3;Run;
Howver I run code using by statemernt and didnt get any error so probably it is sorted!
How can I find that it is sorted???
Please note that I run this code and no have any error
data want;
pop7=0;
failure=0;
do until(last.LAKOACH);
set csp.twoyrs_3 end=eof;
by LAKOACH;
if (VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7) then POP7=1;
if VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd) and ind_shlili_6=1 then FAILURE=1;
end;
totalN+1; /***Total rows checked----2,493,500****/
totalpop7 + POP7; /****Base pop that is determined by belonging to pop 7 in month Nov 2022---108,552***/
totalpop7failure + (pop7 and failure);/***10,623***/
keep total: ;
if eof;
run;
Do you mean whether the data set is sorted, meaning having been sorted previously by Proc Sort or Proc SQL?
Or do you mean whether the data set is in a particular order, regardless of having been sorted?
For example, this data set is ordered by ID. But is has not been sorted.
data have;
input ID x;
datalines;
1 10
2 20
3 30
;
As a consequence, running Proc Contents against the data like this
proc contents data = have;
run;
will tell you that the data is not sorted. Because it has not been sorted (nor validated) by SAS.
If what you are after is a check of whether the data is in some particular order, then the Presorted Option on the Proc Sort Statement is the way to go.
AFAIK, there isn't a way to know this without scanning the whole data set. So if a sort is important or required you need to sort the data so that there is an explicit sort indicator created on the dataset or so that you can trust it. If the data is already sorted, the sort procedure will be efficient.
@Ronein Ideally you know your data and don't need such a check.
As already mentioned by @PeterClemmensen early on Proc Sort has a presorted option that makes the Proc first check if data is already sorted.
data have;
set sashelp.class;
run;
proc sort data=have presorted;
by name;
run;
You can let SAS generate an error for you.
Examples:
275 %let sorted=NO; 276 data _null_; 277 if eof then call symputx('sorted','YES'); 278 set sashelp.class end=eof; 279 by age; 280 run; ERROR: BY variables are not properly sorted on data set SASHELP.CLASS. eof=0 Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 FIRST.Age=1 LAST.Age=1 _ERROR_=1 _N_=1 NOTE: The SAS System stopped processing this step because of errors. NOTE: There were 2 observations read from the data set SASHELP.CLASS. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.00 seconds 281 %put &=sorted; SORTED=NO 282 %let sorted=NO; 283 data _null_; 284 if eof then call symputx('sorted','YES'); 285 set sashelp.class end=eof; 286 by name; 287 run; NOTE: There were 19 observations read from the data set SASHELP.CLASS. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.01 seconds 288 %put &=sorted; SORTED=YES
Or if you want to test yourself then use LAG() function and avoid the ERROR in the LOG.
317 %let sorted=NO; 318 data _null_; 319 if eof then call symputx('sorted','YES'); 320 set sashelp.class end=eof; 321 if age < lag(age) then stop; 322 run; NOTE: There were 3 observations read from the data set SASHELP.CLASS. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 323 %put &=sorted; SORTED=NO 324 %let sorted=NO; 325 data _null_; 326 if eof then call symputx('sorted','YES'); 327 set sashelp.class end=eof; 328 if name < lag(name) then stop; 329 run; NOTE: There were 19 observations read from the data set SASHELP.CLASS. NOTE: DATA statement used (Total process time): real time 0.00 seconds cpu time 0.00 seconds 330 %put &=sorted; SORTED=YES
Depends on the goal.
The best sort procedures take O(NlogN) time and need at a minimum space to write two copies of the data (and for larger datasets even more for spill files).
Reading the dataset takes just O(N). Plus since it does not need to write any data it should run even faster. And when the data is not sorted it should on average fail before reading half of the data. But if the goal is to actually sort the data then all of that is wasted time except when it is already sorted. So if you expect the data to be sorted a large percent of the time and just want detect then perhaps it is worth doing.
Note also that PROC SORT should automatically detect when the dataset's metadata indicates it is sorted and not actually do anything in that case. So PROC SORT should also benefit when the data is sorted (and SAS knows it is).
Here is an example of LOG running Sort on a data set already sorted:
62 proc sort data=sashelp.class; 63 by name; 64 run; NOTE: Input data set is already sorted, no sorting done. NOTE: PROCEDURE SORT used (Total process time): real time 0.00 seconds cpu time 0.00 seconds
Note that if the set had been sorted by a different variable(s) then it would sort it.
@Ronein wrote:
Thanks. You wrote -The best sort procedures take O(NlogN) time.
Can you show code ?
I know just the regular proc sort procedure but working on 100 million rows might be long time?
Huh? PROC SORT is the code to use to sort a dataset. It has already been optimized by SAS. There are option you can use too optimize it if you know something about the data. For example if you have lot of variables (or really long character variables) and just a few short BY variables then using TAGSORT option might improve performance by not having to copy all of that data as many times. But if you only have a few variables then TAGSORT will probably actually take longer since it will have to do more work.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.