Accessing/checking sort order in a dataset

shanesheu · Posted 03-13-2012 01:59 PM

Hi SAS communities,

I am trying to find a way to access and check the sort order of datasets, which I believed is stored as "metadata" that contains information on the attributes of datasets (e.g. sort order, number of variables...etc). So far, I have not had much success. The following are two ways that I've come up with so far, but they don't seem to do exactly what I need:

1) 1) There’s an option for Proc Sort called “presorted”, which checks through the dataset to see if any one of the observations are out of the specified sort order in the "By" statement. If it is sorted properly, no sorting will be done. Otherwise sorting will be done. The downside about this is that I don’t think it produces an indicator of some sort, and therefore if I have 60 partitions of one big dataset, every single one has to go through the proc sort presorted process. If one of the partitions is already sorted correctly, and all partitions are generated in the same process, then there's no point putting every partition through the process. Not very efficient.

2) I believe there are “dictionaries” that can only be accessed by proc sql, e.g. dictionary.tables or dictionary.columns, that tells you about the properties of the datasets, including the sort orders. From my research so far, it seems that the dictionary contains info for the active SAS session, and not for particular datasets. I might be wrong about this.

It would be great if someone can point me in the right way. Thanks in advance!

Shane

Haikuo · Posted 03-13-2012 02:30 PM

As far as I know, you can use those metadata tables to get what you want:

proc sql;

select sortedby from dictionary.columns

where libname='YOURLIBNAME' AND MEMNAME='YOURTABLE';

QUIT;

if sortedby is 0 then it is not presorted by proc sort or sql, otherwise, sortedby=1.

Regards,

Haikuo

shanesheu · Posted 03-16-2012 08:11 PM

I did a little investigation and found this (regarding Haikuo's suggestion):

Let's say you are trying to merge two datasets by id and study_entry_date.

So you proc sort those two datasets by those two variables, and check their dictionary.columns: sortedby=1 for id, and sortedby=2 for study_entry_date in each of the two datasets, so ok, great.

Now, if you are to merge those two dataset through a data step by those two variables, when you access the dictionary.columns for the output dataset, the sortedby values for all variables have become 0.

This suggests that, even if the resulting output is sorted correctly in the ordered specified, using dictionary.columns' sortedby to check whether the datasset is sorted properly still seems inadequate. It will wrongly label the correctly sorted output as unsorted.

This is why I wish proc sort can produce an indicator that tells you whether the dataset is sorted properly by the variables specified in the by statement, and then use that indicator as a switch for sorting the remaining partitions.

I'm currently looking into FriedEgg's suggestion, but it looks like sortedby is used as well. Anyways, I'll give it a try and see what happens.

Thanks

art297 · Posted 03-16-2012 08:24 PM

Are you trying to avoid sorting or just make sure that the sort is accomplished? If it is the latter, and you will be using proc sort, just use the FORCE option.

There is NO guarantee that the metadata is correct!

Haikuo · Posted 03-16-2012 08:36 PM

Like I said, sortedby only indicates whether the CURRENT table has been sorted by Proc sort or SQL order by.

Even though both of your input tables are proc sorted, and your output table from Merge is supposed to be sorted as well, however, your output table has NOT been officially sorted by Proc Sort or SQL order by.

So basically you are looking for a indicator that can be flaged by proc sort, and somehow can be transplanted to the new table after the Merge. In your case, if you are positive that your table has been sorted, you can manually flag it by using data step option (sortedby=yourvariablename), while SAS has to sort it before it can know it is sorted.

FriedEgg · Posted 03-16-2012 08:42 PM

data merge(sortedby=prim_id sec_id);

merge tab1(in=b) tab2(in=a);

by prim_id sec_id;

if a and b;

run;

shanesheu · Posted 03-16-2012 09:19 PM

In response to art297, yes I am trying to avoid sorting if the data is already sorted correctly. This brings me to my response for Haikuo: sorry for being unclear, I forgot to state that I am trying to avoid proc sort or SQL order by if the data is already sorted correctly.

I did a quick investigation and indeed, the sortedby option for data step seems to flag the sort order without putting the data through proc sort or SQL order by. I am going to see how this fits into the macro I am writing. Thanks for the example as well, FriedEgg.

I'll return when I get this working or if I need further help. Will remember to award credits for answers once all are well.

Thanks a bunch guys, happy Friday.

Astounding · Posted 03-16-2012 09:36 PM

If you add the sortedby= option, you better be right! SAS will not check the sorted order for you. For example:

proc sort data=mydata;

by name;

run;

data mydata2 (sortedby=amount);

set mydata;

run;

proc sort data=mydata2;

by amount;

run;

The DATA step runs without error, even though the data are not in order. In addition, SAS skips the second PROC SORT, because it believes the data are in order. In most cases where sortedby= appears, the data are sorted. But there is no guarantee, so like I said, you better be right. Even this MERGE is legal:

data combine;

merge a b;

by name;

if amount > 5 then name='Fred';

run;

Even though the data set is created when you MERGE BY NAME, the final data set is not in order by NAME. Of course this doesn't happen very often. But are the chances greater than zero? You decide.

Good luck.

shanesheu · Posted 03-19-2012 01:58 PM

You are definitely right (after I tested out what you described). I guess art297 was right to say that there's no guarantee that the metadata is correct. I'll continue to experiment and do some research on my own for now. Thanks a lot everyone.

Astounding · Posted 03-19-2012 02:27 PM

shanesheu,

Some of the metadata can help. When you run a PROC CONTENTS, there are two related pieces of information: SORTEDBY and VALIDATED. I imagine the metadata contains the same information.

SORTEDBY is the sorted order.

VALIDATED is a flag that indicates whether the sorted order is guaranteed (meaning the sorting was performed by either PROC SORT or PROC SQL).

Just in case it helps ...

FriedEgg · Posted 03-13-2012 02:37 PM

proc sql;

create table class as

select * from sashelp.class

order by sex,name;

quit;

%macro isSorted(libname=WORK ,memname= ,by=);

/* libname=libref where data to check resides, default=work

memname=data set name to check

by=space delimited list of by variables is expected order */

proc format;

value tf

0 = 'not sorted'

1 = 'sorted by:'

2 = 'improperly sorted by:';

quit;

%global isSorted;

proc sql noprint;

select case

when sum(sortedby)>0 then 1

else 0

end,

name

into :isSorted,:columnName1-:columnName100

from sashelp.vcolumn where libname="%upcase(&libname)" and memname="%upcase(&memname)" and sortedby>0

order by sortedby;

quit;

%let cols=;

%if &isSorted>0 %then %do;

%let cols=%upcase(&columnName1);

%do i=2 %to &sqlobs;

%let cols=&cols, %upcase(&&columnName&i);

%end;

%if &by>. %then %do;

%let by=%upcase(%sysfunc(tranwrd(&by,%str( ),%str(, ))));

%if &by ne &cols %then %let isSorted=2;

%end;

%put Dataset &memname in library &libname is %trim(%sysfunc(putn(&isSorted,tf.))) &cols;

%mend;

%isSorted(libname=WORK ,memname=CLASS ,by=name sex)

Macro Output to Log:

Dataset CLASS in library WORK is improperly sorted by: SEX, NAME

Macro Global Variable Output:

isSorted=1 if sorted, =0 if not sorted ,=2 if sorted in order other than requested

Ksharp · Posted 03-14-2012 01:33 AM

Or if you have maken an index for this variable firstly, then you will not pre-sort it any more at next code.

data class(index=(name));
 set sashelp.class;run;
data x;
 merge class(keep=name sex) class(keep=name weight height);
 by name;
run;

Ksharp