BookmarkSubscribeRSS Feed
jvigeant
Fluorite | Level 6

 

Hi all - 

A function of quality assurance production code I maintain is to verify whether or not large stored tables are sorted. To that end I have been using the PRESORTED option in proc sort in conjunction with proc printto to:
1. Write the procedure results to an external log,
2. Read the results of the log back into a SAS data set, and
3. Parse the string looking for the specific text "Sort order of input data set has been verified"

 

This method works well enough, though I feel it isn't particularly efficient given the myriad platforms upon which this code is run (these are packages we distribute to remote organizations). I have to imagine that "behind the scenes" there has to be some kind of SAS system variable produced when using the PRESORTED option, though I can't find any documentation of it. It would be much preferred to capture an automatic system variable to perform the check rather than creating a volume of extraneous log files and parsing the text to get my result.

Does anyone know if such functionality exists? 

9 REPLIES 9
ballardw
Super User

Either the view SASHELP.Vcolumn, or if using Proc SQL, Dictionary.columns has information about which variables are used in sorting in the data set. The variable Sortedby has the order each variable appears on the sort order as numeric. If 0 in the variable then not sorted by that variable and if the set is not sorted then all the values are 0.

 

The view SASHELP.Vtable, or Dictionary.tables, has a few variables related to if the data is sorted and which sort sequence is used but not the variable(s) involved.

 

And if you get into information about indexes that is stored in SASHELP.Vindex or Dictionary.Indexes.

jvigeant
Fluorite | Level 6

Much appreciation for the tip. The unfortunate angle though is that the PRESORTED option seems to work even when the metadata is inaccurate. 

An example:

* Create dummy data for test purposes;
* Disclaimer: all values are fake; 
data test;
length patid encounterid 4.;
infile datalines;
input patid encounterid;
datalines;
12345 876543
24256 987654
14233 263547
42987 763853
;
run;

* sort the data; 
proc sort data = test;
by patid encounterid;
run;

* test the sortedby metadata values;
proc sql noprint;
  select name, sortedby
    into :names separated by " "
       , :sortedby separated by " "
  from dictionary.columns
  where lowcase(libname) = "work" and 
        lowcase(memname) = "test";
quit;
%put &names &sortedby;

* Create a new data set from the original;
data test2;
set test; 
run;  

* Now look at the metadata again;
proc sql noprint;
  select name, sortedby
    into :names separated by " "
       , :sortedby separated by " "
  from dictionary.columns
  where lowcase(libname) = "work" and 
        lowcase(memname) = "test2";
quit;
%put &names &sortedby;

* Look at the log output using the presorted option though;
proc sort data = test2 presorted; 
by patid encounterid;
run;

The caveat is that one can sort a data set, but if one used that sorted data to make another data set (in the example, I literally did nothing to test), the sortedby metadata values for the resulting table don't carry over and are reset to 0.
Unfortunately (or fortunately, because it DOES keep my job interesting) in this distributed database environment in which I work, I don't have a say in (or even knowledge of) the code the remote locations use to create the data sets my QA program evaluates. 

ChrisNZ
Tourmaline | Level 20

The validated flag is what matters. See here

 

Astounding
PROC Star

A couple of additional details to make your life even more interesting ...

 

A DATA step can change the values of  BY variables, so that the new data set is no longer in order.  Perfectly valid code:

 

* Create a new data set from the original;
data test2;
set test; 
if patid=. then patid=999;
run;  

Now the new data set is not in sorted order.  

 

When you create a data set, you can specify a known order, such as:

 

* Create a new data set from the original;
data test2 (sortedby=patid encounterid);
set test; 
if patid=. then patid=999;
run;  

Now the SORTEDBY field reports the data set as being sorted, when the data is actually not sorted.  So the VALIDATED field indicates that the SORTEDBY field has not been checked.

 

If you later use the data with a BY statement, you will get an error message.  And if you attempt to sort it with:

 

proc sort data=test2;
   by patid encounterid;
run;

SAS will skip the PROC SORT, since it sees it as unnecessary.  (You will get a note in the log about PROC SORT being skipped.)  To get the sort to run, you would need to add the FORCE option on the PROC statement.

ballardw
Super User

@jvigeant wrote:

Much appreciation for the tip. The unfortunate angle though is that the PRESORTED option seems to work even when the metadata is inaccurate. 

An example:

* Create dummy data for test purposes;
* Disclaimer: all values are fake; 
data test;
length patid encounterid 4.;
infile datalines;
input patid encounterid;
datalines;
12345 876543
24256 987654
14233 263547
42987 763853
;
run;

* sort the data; 
proc sort data = test;
by patid encounterid;
run;

* test the sortedby metadata values;
proc sql noprint;
  select name, sortedby
    into :names separated by " "
       , :sortedby separated by " "
  from dictionary.columns
  where lowcase(libname) = "work" and 
        lowcase(memname) = "test";
quit;
%put &names &sortedby;

* Create a new data set from the original;
data test2;
set test; 
run;  

* Now look at the metadata again;
proc sql noprint;
  select name, sortedby
    into :names separated by " "
       , :sortedby separated by " "
  from dictionary.columns
  where lowcase(libname) = "work" and 
        lowcase(memname) = "test2";
quit;
%put &names &sortedby;

* Look at the log output using the presorted option though;
proc sort data = test2 presorted; 
by patid encounterid;
run;

The caveat is that one can sort a data set, but if one used that sorted data to make another data set (in the example, I literally did nothing to test), the sortedby metadata values for the resulting table don't carry over and are reset to 0.
Unfortunately (or fortunately, because it DOES keep my job interesting) in this distributed database environment in which I work, I don't have a say in (or even knowledge of) the code the remote locations use to create the data sets my QA program evaluates. 


And does your previous Proc Contents approach show test2 as sorted? No it does not. So querying the dictionary table is no less valid than your previous approach.

SORTED is only set when Proc Sort or Proc SQL Order by has been used on the data set (or possibly other procedures producing output data but your example uses data step).

ChrisNZ
Tourmaline | Level 20

> SORTED is only set when Proc Sort or Proc SQL Order by has been used on the data set (or possibly other procedures

Interesting point, I only ever thought about the first two. I'd expect proc means to validate a sort by (type if more than one type is requested and by) class values too (what about when formats are used?). I wonder about the hash output object. I'll try that tomorrow if I find the time.

ballardw
Super User

@ChrisNZ wrote:

> SORTED is only set when Proc Sort or Proc SQL Order by has been used on the data set (or possibly other procedures

Interesting point, I only ever thought about the first two. I'd expect proc means to validate a sort by (type if more than one type is requested and by) class values too (what about when formats are used?). I wonder about the hash output object. I'll try that tomorrow if I find the time.


I am not sure that other procedures will set the sorted flag but there are so many that I am not assuming all of them wont set a sorted property. Proc Summary for example with a CLASS statement will not set the sorted flag for proc contents to report but the data will be in "sort" order by _type_ and the class variables in Summary's way.

Example this does not show sorted in proc summary but the data is in a very specific order equivalent to sort by Sex Age.

Proc summary data=sashelp.class nway;
   class sex age;
   var height;
   output out=work.summary mean= max=/autoname;
run;

Proc contents data=work.summary;
run;
mkeintz
PROC Star

You could do something like this that youi can expect to be consistent across all systems:

 

data have; /*Make an unsorted data set*/
  do i=1 to 18,1;
    set sashelp.class;
    output;
  end;
run;

%let presorted=no;
%let indsn=have;
%let byvars=i;


data _null_;
  set &indsn (keep=&byvars)  nobs=nexpected;
  by &byvars;
  if _n_=nexpected then call symput("presorted","yes");
run;

%put &=presorted;

%macro sort;
  %if &presorted=no %then %do;
    %put ***** SORTING dataset &INDSN  by &byvars *****;
    proc sort data=&indsn;
    by &byvars;
    run;
  %end;
  %else %put ***** NO SORT REQUIRED *****;
%mend sort;
%sort;

 

The main point is that the DATA _NULL_ step never gets to the last obs (i.e. _N_ will never equal NEXPECTED) if the dataset is not sorted.  Even if is only the last observation that is out of order, the automatic variable _N_ only reaches NEXPECTED-1 (because the BY statement tells SAS to look ahead).

 

So by defaulting macrovar &PRESORTED to no, and correcting it in the data step when appropriate, this code avoids scanning the log.

 

Of course, you'll probably want to make the macro coding a bit more complete than above.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 9 replies
  • 1390 views
  • 3 likes
  • 5 in conversation