BookmarkSubscribeRSS Feed
Ronein
Meteorite | Level 14

Let's say that I have a big data set and I want to know if this data set is sorted by variable CustID.

What is the way to check it?

I know that I can look at subset of few rows and see if it is sorted but it is still not perfect way because it might happen that in the next rows data set is not sorted

 

 

20 REPLIES 20
Reeza
Super User
You can run proc contents and check the SORTED output or use the SORT* variables in sashelp.vtable for a more dynamic approach if you want to check via code.
Ronein
Meteorite | Level 14

I run proc   contents and it is written  sorted No

proc contents  data=csp.twoyrs_3;Run;

Ronein_0-1725470354898.png

 

 

Howver I run code using by statemernt and didnt get any error so probably it is sorted!

How can I find that it is sorted???

 

Please note that I run this code and no have any error

data want;
pop7=0;
failure=0;
do until(last.LAKOACH);
set csp.twoyrs_3 end=eof;
by LAKOACH;
if (VALID_FROM_DTTM='01NOV2022'd and pop_lomeforat_wppr=7) then POP7=1;
if VALID_FROM_DTTM in ('01OCT2022'd,'01SEP2022'd ,'01AUG2022'd,'01JUL2022'd,'01JUN2022'd) and ind_shlili_6=1 then FAILURE=1;
end;
totalN+1;  /***Total rows checked----2,493,500****/
totalpop7 + POP7;  /****Base pop that is determined by belonging to pop 7 in month Nov 2022---108,552***/
totalpop7failure + (pop7 and failure);/***10,623***/
keep total: ;
if eof;
run;
PeterClemmensen
Tourmaline | Level 20

Do you mean whether the data set is sorted, meaning having been sorted previously by Proc Sort or Proc SQL?

 

Or do you mean whether the data set is in a particular order, regardless of having been sorted?

 

For example, this data set is ordered by ID. But is has not been sorted. 

 

data have;
input ID x;
datalines;
1 10 
2 20 
3 30 
;

 

As a consequence, running Proc Contents against the data like this

 

proc contents data = have;
run;

 

will tell you that the data is not sorted. Because it has not been sorted (nor validated) by SAS. 

 

PeterClemmensen_0-1725470650200.png

 

If what you are after is a check of whether the data is in some particular order, then the Presorted Option on the Proc Sort Statement is the way to go. 

 

Ronein
Meteorite | Level 14
Good question.
When I say sorted I mean that values are sorted by values by ascending order. It doesn't mean that proc sort or sort via proc sql was done before. Maybe values are built in sorted without any further code thst done to sort it. My question was how to check if values in Data set are sorted ? Yes/no
Reeza
Super User

AFAIK, there isn't a way to know this without scanning the whole data set. So if a sort is important or required you need to sort the data so that there is an explicit sort indicator created on the dataset or so that you can trust it. If the data is already sorted, the sort procedure will be efficient.

 

Ronein
Meteorite | Level 14
You kean that If the data set is already sorted then proc sort procedure will not be done again so it is efficient?
Reeza
Super User
I know. If you need to sort after you check if the data is sorted and the answer is no, then you may as well sort in the first place.
Patrick
Opal | Level 21

@Ronein Ideally you know your data and don't need such a check.

As already mentioned by @PeterClemmensen early on Proc Sort has a presorted option that makes the Proc first check if data is already sorted. 

data have;
  set sashelp.class;
run;

proc sort data=have presorted;
  by name;
run;

Patrick_0-1726008277278.png

 

Tom
Super User Tom
Super User

You can let SAS generate an error for you.

Examples:

275  %let sorted=NO;
276  data _null_;
277    if eof then call symputx('sorted','YES');
278    set sashelp.class end=eof;
279    by age;
280  run;

ERROR: BY variables are not properly sorted on data set SASHELP.CLASS.
eof=0 Name=Alfred Sex=M Age=14 Height=69 Weight=112.5 FIRST.Age=1 LAST.Age=1 _ERROR_=1 _N_=1
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds


281  %put &=sorted;
SORTED=NO
282  %let sorted=NO;
283  data _null_;
284    if eof then call symputx('sorted','YES');
285    set sashelp.class end=eof;
286    by name;
287  run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


288  %put &=sorted;
SORTED=YES

Or if you want to test yourself then use LAG() function and avoid the ERROR in the LOG.

317  %let sorted=NO;
318  data _null_;
319    if eof then call symputx('sorted','YES');
320    set sashelp.class end=eof;
321    if age < lag(age) then stop;
322  run;

NOTE: There were 3 observations read from the data set SASHELP.CLASS.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


323  %put &=sorted;
SORTED=NO
324  %let sorted=NO;
325  data _null_;
326    if eof then call symputx('sorted','YES');
327    set sashelp.class end=eof;
328    if name < lag(name) then stop;
329  run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


330  %put &=sorted;
SORTED=YES

 

Reeza
Super User
From an efficiency standpoint, which is faster - sorting ahead of time or checking with the data step. My intuition says sort since the procedures are optimized while a data step has to run through all records (unless the out of order data is early in the dataset)
Tom
Super User Tom
Super User

Depends on the goal. 

The best sort procedures take O(NlogN) time and need at a minimum space to write two copies of the data (and for larger datasets even more for spill files).

 

Reading the dataset takes just O(N).  Plus since it does not need to write any data it should run even faster.  And when the data is not sorted it should on average fail before reading half of the data. But if the goal is to actually sort the data then all of that is wasted time except when it is already sorted.  So if you expect the data to be sorted a large percent of the time and just want detect then perhaps it is worth doing.

 

Note also that PROC SORT should automatically detect when the dataset's metadata indicates it is sorted and not actually do anything in that case.  So PROC SORT should also benefit when the data is sorted (and SAS knows it is).

ballardw
Super User

Here is an example of LOG running Sort on a data set already sorted:

 

62   proc sort data=sashelp.class;
63      by name;
64   run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

Note that if the set had been sorted by a different variable(s) then it would sort it.

Ronein
Meteorite | Level 14
Thanks. You wrote -The best sort procedures take O(NlogN) time.
Can you show code ?
I know just the regular proc sort procedure but working on 100 million rows might be long time?
Tom
Super User Tom
Super User

@Ronein wrote:
Thanks. You wrote -The best sort procedures take O(NlogN) time.
Can you show code ?
I know just the regular proc sort procedure but working on 100 million rows might be long time?

Huh?  PROC SORT is the code to use to sort a dataset.  It has already been optimized by SAS.  There are option you can use too optimize it if you know something about the data.  For example if you have lot of variables (or really long character variables) and just a few short BY variables then using TAGSORT option might improve performance by not having to copy all of that data as many times.  But if you only have a few variables then TAGSORT will probably actually take longer since it will have to do more work.

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 20 replies
  • 1944 views
  • 15 likes
  • 9 in conversation