BookmarkSubscribeRSS Feed
A_SAS_Man
Pyrite | Level 9

I am on SAS version 7.15 HF2 (7.100.5.6112)(64-bit)

 

Is it possible to keep a data set sorted (where the sorted flag is "Yes" and the Validated flag is "Yes") when doing operations on the data set? I am specifically looking at an instance where I'm deleting a column unrelated to the sort order, and my data set is losing it's sorted properties, which is very annoying as I'm working with an extremely large data set, so having to re-sort* every time is very time consuming.

 

*I am aware of the option to proc sort with the pre sort option, where it just checks to see if the data set is in the proper order, this is still very time consuming on my data even though the data set has retained it's order.

 

Some example code below to demonstrate what I'm running into.

 

data test;
 input c1 c2 c3 c4 $3.;
 datalines;
 1 3 5 YES
 3 5 1 YES
 2 6 8 YES
 1 2 7 YES
 6 1 9 YES
 9 1 3 NO
 4 1 1 NO
 ;

/*sort on first two columns*/
proc sort data=test;
 by c1 c2;
 run;

/* Confirm data is sorted and validated*/
proc contents data=test varnum;

/*Remove unrelated third column*/
data test2(drop=c3);
 set test;
 run;

/*Confirm the data has lost it's sorted flags*/
proc contents data=test2 varnum;

Is there any way to prevent this loss of metadata? It seems extremely strange that SAS loses that flag when it is not altering the order of the rows (confirmed through proc sort with the presorted option) or removing any of the columns in the by statement.

9 REPLIES 9
PeterClemmensen
Tourmaline | Level 20

I don't think there is any way to preserve the sorted/validated metadata. 

 

However, you can avoid sorting if you use the Sortedby= Dataset Option in the data set that drops c3. This sets the Sorted Flag. 

 

Next, use the PROC SORT with the Presorted Option to set the validated flag. PROC SORT does not sort the data when the Sortedby information is available.

 

data test;
 input c1 c2 c3 c4 $3.;
 datalines;
 1 3 5 YES
 3 5 1 YES
 2 6 8 YES
 1 2 7 YES
 6 1 9 YES
 9 1 3 NO
 4 1 1 NO
 ;

/*sort on first two columns*/
proc sort data=test;
   by c1 c2;
run;

/* Confirm data is sorted and validated*/
proc contents data=test varnum;

/*Remove unrelated third column*/
data test2(drop=c3 sortedby=c1 c2);
 set test;
run;

proc sort data=test2 presorted;
   by c1 c2;
run;

/*Confirm the data has lost it's sorted flags*/
proc contents data=test2 varnum;
run;
A_SAS_Man
Pyrite | Level 9

Now I am under the impression that the presorted option will do the same thing whether or not the data set has the sortedby flag set (i.e. it just confirms the sort is accurate if possible). It still has to churn through and verify the sort in both cases though, is this what you meant or do you think there is some efficiency gain by setting the sortedby flag? This runs contrary to the tests I have done.

PeterClemmensen
Tourmaline | Level 20

No you are right. I just consulted the documentation and there are no performance gains there.

 

Unfortunately, I don't think there are any shortcuts here, other than using the Presorted Option in PROC SORT 🙂

A_SAS_Man
Pyrite | Level 9

This seems like a very curious quirk to me, I would be interested in an explanation as to why SAS operates this way in a data step if you (or anyone else reading!) has more knowledge on the internal workings of SAS that explain it.

PeterClemmensen
Tourmaline | Level 20

To some extend I agree with you. It would be nice it such metadata could be preserved if we do not alter the actual observations. Then again, I like that the Validated flag can not be validated by a human, but has to be set by PROC SQL or PROC SORT.

 

Also, it may not help you in you actual situation, but perhaps you can simply use the Drop=c3 Option directly in PROC SORT or drop the variable even before that. That way, PROC SORT does not have to read the c3 variable at all, which will probably save you some time and I/O. 

ballardw
Super User

Maybe not useful in your full use case but if you create test2 this way

proc sort data=test out=test2(drop=c3);
 by c1 c2;
 run;
 proc contents data=test2 varnum;
 run;

If the input set is already sorted then you get a note such as:

NOTE: Input data set is already sorted; it has been copied to the output data set.

I'm not sure but I suspect that might even be faster to execute than the data step approach for large sets.

 

Since you use a data step there is no obligation for the output to resemble the input as far as sort order goes because of thing like selecting which records are output, possible changes to the values of the sort by variables, merging with other data sets and I'm sure there are a few other concerns.

 

What are you doing that the "sorted flag" is more important than the actual data though?

A_SAS_Man
Pyrite | Level 9

I did not think of using proc sort to remove columns, that is a good idea I will have to test out. Thank you.

 

We are managing a SAS data lake/workbench/sandbox (pick your term) where the main means of using the different data sets in conjunction with each other is to merge join them to other data sets which requires the data be sorted, so having validated sorted datasets is very helpful to us.

 

Edit: To clarify, the actual data is also very important, I'm not sure I would characterize it as "less important than the sorted flag". 🙂

ballardw
Super User

As long as your data steps don't do anything to change the values of by variables AND you don't remove values/records in a haphazard manner the steps that would require BY group processing should work fine with out the flag. SAS will tell if the data is out of order.

 

And for some things you may even want to use BY NOTSORTED...

A_SAS_Man
Pyrite | Level 9

Things will work, but they won't work as well as they would if the data sets are sitting there already sorted and validated. If they are not sorted and validated, then every time someone joins to them they need to affirm the sort order is correct rather than just executing the join.

 

Edit: I forgot to ask, what sort of things would you think it would be optimal to have the "BY NOTSORTED" option on?

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 9 replies
  • 956 views
  • 0 likes
  • 3 in conversation