Hi all -
A function of quality assurance production code I maintain is to verify whether or not large stored tables are sorted. To that end I have been using the PRESORTED option in proc sort in conjunction with proc printto to:
1. Write the procedure results to an external log,
2. Read the results of the log back into a SAS data set, and
3. Parse the string looking for the specific text "Sort order of input data set has been verified"
This method works well enough, though I feel it isn't particularly efficient given the myriad platforms upon which this code is run (these are packages we distribute to remote organizations). I have to imagine that "behind the scenes" there has to be some kind of SAS system variable produced when using the PRESORTED option, though I can't find any documentation of it. It would be much preferred to capture an automatic system variable to perform the check rather than creating a volume of extraneous log files and parsing the text to get my result.
Does anyone know if such functionality exists?
Either the view SASHELP.Vcolumn, or if using Proc SQL, Dictionary.columns has information about which variables are used in sorting in the data set. The variable Sortedby has the order each variable appears on the sort order as numeric. If 0 in the variable then not sorted by that variable and if the set is not sorted then all the values are 0.
The view SASHELP.Vtable, or Dictionary.tables, has a few variables related to if the data is sorted and which sort sequence is used but not the variable(s) involved.
And if you get into information about indexes that is stored in SASHELP.Vindex or Dictionary.Indexes.
Much appreciation for the tip. The unfortunate angle though is that the PRESORTED option seems to work even when the metadata is inaccurate.
An example:
* Create dummy data for test purposes;
* Disclaimer: all values are fake;
data test;
length patid encounterid 4.;
infile datalines;
input patid encounterid;
datalines;
12345 876543
24256 987654
14233 263547
42987 763853
;
run;
* sort the data;
proc sort data = test;
by patid encounterid;
run;
* test the sortedby metadata values;
proc sql noprint;
select name, sortedby
into :names separated by " "
, :sortedby separated by " "
from dictionary.columns
where lowcase(libname) = "work" and
lowcase(memname) = "test";
quit;
%put &names &sortedby;
* Create a new data set from the original;
data test2;
set test;
run;
* Now look at the metadata again;
proc sql noprint;
select name, sortedby
into :names separated by " "
, :sortedby separated by " "
from dictionary.columns
where lowcase(libname) = "work" and
lowcase(memname) = "test2";
quit;
%put &names &sortedby;
* Look at the log output using the presorted option though;
proc sort data = test2 presorted;
by patid encounterid;
run;
The caveat is that one can sort a data set, but if one used that sorted data to make another data set (in the example, I literally did nothing to test), the sortedby metadata values for the resulting table don't carry over and are reset to 0.
Unfortunately (or fortunately, because it DOES keep my job interesting) in this distributed database environment in which I work, I don't have a say in (or even knowledge of) the code the remote locations use to create the data sets my QA program evaluates.
A couple of additional details to make your life even more interesting ...
A DATA step can change the values of BY variables, so that the new data set is no longer in order. Perfectly valid code:
* Create a new data set from the original;
data test2;
set test;
if patid=. then patid=999;
run;
Now the new data set is not in sorted order.
When you create a data set, you can specify a known order, such as:
* Create a new data set from the original;
data test2 (sortedby=patid encounterid);
set test;
if patid=. then patid=999;
run;
Now the SORTEDBY field reports the data set as being sorted, when the data is actually not sorted. So the VALIDATED field indicates that the SORTEDBY field has not been checked.
If you later use the data with a BY statement, you will get an error message. And if you attempt to sort it with:
proc sort data=test2;
by patid encounterid;
run;
SAS will skip the PROC SORT, since it sees it as unnecessary. (You will get a note in the log about PROC SORT being skipped.) To get the sort to run, you would need to add the FORCE option on the PROC statement.
@jvigeant wrote:
Much appreciation for the tip. The unfortunate angle though is that the PRESORTED option seems to work even when the metadata is inaccurate.
An example:
* Create dummy data for test purposes; * Disclaimer: all values are fake; data test; length patid encounterid 4.; infile datalines; input patid encounterid; datalines; 12345 876543 24256 987654 14233 263547 42987 763853 ; run; * sort the data; proc sort data = test; by patid encounterid; run; * test the sortedby metadata values; proc sql noprint; select name, sortedby into :names separated by " " , :sortedby separated by " " from dictionary.columns where lowcase(libname) = "work" and lowcase(memname) = "test"; quit; %put &names &sortedby; * Create a new data set from the original; data test2; set test; run; * Now look at the metadata again; proc sql noprint; select name, sortedby into :names separated by " " , :sortedby separated by " " from dictionary.columns where lowcase(libname) = "work" and lowcase(memname) = "test2"; quit; %put &names &sortedby; * Look at the log output using the presorted option though; proc sort data = test2 presorted; by patid encounterid; run;
The caveat is that one can sort a data set, but if one used that sorted data to make another data set (in the example, I literally did nothing to test), the sortedby metadata values for the resulting table don't carry over and are reset to 0.
Unfortunately (or fortunately, because it DOES keep my job interesting) in this distributed database environment in which I work, I don't have a say in (or even knowledge of) the code the remote locations use to create the data sets my QA program evaluates.
And does your previous Proc Contents approach show test2 as sorted? No it does not. So querying the dictionary table is no less valid than your previous approach.
SORTED is only set when Proc Sort or Proc SQL Order by has been used on the data set (or possibly other procedures producing output data but your example uses data step).
> SORTED is only set when Proc Sort or Proc SQL Order by has been used on the data set (or possibly other procedures
Interesting point, I only ever thought about the first two. I'd expect proc means to validate a sort by (type if more than one type is requested and by) class values too (what about when formats are used?). I wonder about the hash output object. I'll try that tomorrow if I find the time.
@ChrisNZ wrote:
> SORTED is only set when Proc Sort or Proc SQL Order by has been used on the data set (or possibly other procedures
Interesting point, I only ever thought about the first two. I'd expect proc means to validate a sort by (type if more than one type is requested and by) class values too (what about when formats are used?). I wonder about the hash output object. I'll try that tomorrow if I find the time.
I am not sure that other procedures will set the sorted flag but there are so many that I am not assuming all of them wont set a sorted property. Proc Summary for example with a CLASS statement will not set the sorted flag for proc contents to report but the data will be in "sort" order by _type_ and the class variables in Summary's way.
Example this does not show sorted in proc summary but the data is in a very specific order equivalent to sort by Sex Age.
Proc summary data=sashelp.class nway; class sex age; var height; output out=work.summary mean= max=/autoname; run; Proc contents data=work.summary; run;
You could do something like this that youi can expect to be consistent across all systems:
data have; /*Make an unsorted data set*/
do i=1 to 18,1;
set sashelp.class;
output;
end;
run;
%let presorted=no;
%let indsn=have;
%let byvars=i;
data _null_;
set &indsn (keep=&byvars) nobs=nexpected;
by &byvars;
if _n_=nexpected then call symput("presorted","yes");
run;
%put &=presorted;
%macro sort;
%if &presorted=no %then %do;
%put ***** SORTING dataset &INDSN by &byvars *****;
proc sort data=&indsn;
by &byvars;
run;
%end;
%else %put ***** NO SORT REQUIRED *****;
%mend sort;
%sort;
The main point is that the DATA _NULL_ step never gets to the last obs (i.e. _N_ will never equal NEXPECTED) if the dataset is not sorted. Even if is only the last observation that is out of order, the automatic variable _N_ only reaches NEXPECTED-1 (because the BY statement tells SAS to look ahead).
So by defaulting macrovar &PRESORTED to no, and correcting it in the data step when appropriate, this code avoids scanning the log.
Of course, you'll probably want to make the macro coding a bit more complete than above.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.