Hi folks - long time listener, first time caller.
Quick background: I work in a distributed environment where a number of stakeholders take their proprietary data and translate it into a common data model. We have minimal input on how the stakeholders write their ETL code, we are concerned with evaluating compliance to the common data model. One requirement is to verify the datasets are sorted as specified.
1. Since we don't create the datasets to be evaluated, we (so I have read) can't trust that the metadata (proc contents or dictionary tables) would accurately capture sort information.
2. These are huge tables we are evaluating. If I tried a proc sort with the presorted option to verify the sort order, and the data is not already sorted, I run the risk of sorting the data: resource intensive and NOT what I want. I only want to verify whether it is sorted so I can notify the stakeholder that they need to sort the data.
proc sort data = in.ds presorted out = out.ds;
by varA varB varC;
run;
I'm looking for other suggestions outside of these. Can the sortedby= data step option be used to verify sort order, or is it only for creating sort order?
Thanks much!
Are you talking about data in your SAS system or residing in someone else's? Or data extracted from your data sets?
Please provide the source of the information that Proc contents or dictionary tables can't report the sort status correctly. My understanding is that information is in the data set "header" information and should be there regardless of "who" created the data.
I'll refer to the following thread as my very informal source: https://communities.sas.com/t5/SAS-Procedures/Accessing-checking-sort-order-in-a-dataset/m-p/29860#M...
The issue of who creates the data only factors in because I don't know how it is created; there is no standard for the various stakeholders other than the results have pass a compliance audit that I am writing.
If Proc contents not an option, how about a dummy data step to test the desired sort order?
data _null_;
set sashelp.prdsale;
by year product;
run;
ERROR: BY variables are not properly sorted on data set SASHELP.PRDSALE.
ACTUAL=$561.00 PREDICT=$979.00 COUNTRY=CANADA REGION=EAST DIVISION=EDUCATION PRODTYPE=FURNITURE
PRODUCT=SOFA QUARTER=4 YEAR=1994 MONTH=Dec FIRST.YEAR=0 LAST.YEAR=0 FIRST.PRODUCT=0 LAST.PRODUCT=0
_ERROR_=1 _N_=24
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 25 observations read from the data set SASHELP.PRDSALE.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.00 seconds
The step would fail at the first time, and you would know for sure whether or not the data is sorted.
Just a thought,
Ahmed
This is a really good approach... I need to flesh it out a little bit regarding capturing the error information as a variable and applying it to downstream logic, but I think the payoff is that it gives a very instantaneous answer without much overhead.
Thanks!
Agreed that this is a good approach ... here are a couple of tools you might want to add to the program.
After each DATA _NULL_ step, use macro language to capture the value of &SYSERR and write a message.
Once you have the process working for one data set, set the global option OPTIONS NOSYNTAXCHECK;
That way, your program can continue with checking another data set even if an error was encountered because of an unsorted data set.
Always check your logic thoroughly before using this option!.
You will need to check the value of these two system macro variables to check for Error Status & Error Text
Run a data _null_ step on the dataset with the same BY statement. It will fail if the sort order is not correct.
The sortvalidate system option can be used for this purpose. With this option (which is not the default) the sort procedure effectively sorts the data only if a sequence check fails.
You can trust SAS metadata (the descriptor portion of a SAS data set) IF the sort is also validated.
It is possible though that a data set is sorted without metadata stored that it's sorted (table sort2 in below sample code).
I'd be using a check via metadata first as this will perform much better than anything else.
Below some sample code for you to build upon.
/* create sample data
cheat: Not sorted with Metadata for Sorted but not Validated
sort: Sorted with Metadata for Sorted and Validated
sort2: Sorted but no Metadata
*/
data cheat(sortedby=id name) sort;
set sashelp.class;
id=ceil(ranuni(1)*100);
run;
proc sort data=sort out=sort;
by id name;
run;
data sort2;
set sort;
run;
%macro CheckSort(ds,sortvars);
proc contents
data=&ds
out =_check(keep=libname memname name sorted sortedby)
noprint;
run;quit;
%local sortvars_meta;
%let sortvars_meta=;
proc sql ;
select upcase(name) into :sortvars_meta separated by ' '
from _check
where
sorted=1 /* 1 if sort validated */
and not missing(sortedby) /* sequence of sort vars */
order by sortedby
;
quit;
%if &sortvars_meta = %upcase(&sortvars) %then %put DS &DS is sorted;
%else
%do;
%put Verify sort order for DS &DS;
data _null_;
set &ds(keep=&sortvars);
by &sortvars;
run;
%end;
%mend;
%CheckSort(sort,id name);
%CheckSort(sort2,id name);
%CheckSort(cheat,id name);
If time then consider to implement the data _null_ check step in a way that allows you to capture/reset the error if the source table isn't properly sorted so that it doesn't impact on further tests.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.