I am going through an exercise compressing all our datasets (around 600). I am doing this with a simple:
---- compare datasets with proc compare and manual look at obs numbers -----
if OK then:
– delete dsname ;
– change dsname_tmp = dsname;
I've put a lot of checks and balances in as well - like does the temp file exist already, or is the file already compressed etc..
I was trying to retain the sort order data, which is returned by contents sorted and sortedby, by adding a by statement into the set statement. These values are not returned by the newly created dataset. Obviously the data order does not change with a simple set statement, but what impact does this have when SAS determines when a SORT is needed. Does SAS look at these values (a), or does it actually read through the data (b) to determine whether a sort is needed.
If it’s (b) then no problem, but if it's (a), sorting this number of files - some are many millions of obs - would be a huge overhead. Or will the sort be quick, considering the data is sorted already?
There is a dataset option called "sortedby" but this can be manually set, and thus will not be a solution if (a) is applicable.
Just answering my own post, in case someone else has to do this:
Using the noclone option on proc copy. I was trying to avoid two passes of the data, but I have to have a solution in place by tonight. So am copying to work then back again with compress=yes. It seems to retain all the pertinent information. Got a lot of work on my plate today....
My impression, and I'm not sure where I got it, is that the SORTEDBY information is used mainly in PROC SQL, where it will affect how joins are done. I think there was an example in a paper at WUSS showing how a changed input data set would lead to either hashing or to a merged join.
If you want to put SORTEDBY information into an output data set, use a data set option:
data x (sortedby=a b c);
by a b c;
That will give you SORTEDBY, but you'll lose constraint and index information.
If I had to do this, I would see whether a combination of SQL's CREATE TABLE x LIKE Y and PROC APPEND would do the job.
(Just tried it; CREATE LIKE doesn't do the job.)
Message was edited by: JackHamilton at Apr 3, 2006 1:27 PM
My suggestion is similar to Jack's. Either use PROC SQL to copy the data sets or PROC DATASETS. Both of these are data set oriented languages, so they don't have as many constraints as the DATA step. When PROC DATASETS copies a data set, it will maintain the sort assertion in the copy along with indexes and integrity constraints.
The DATA statement creates a brand-new data set. With a brand-new data set, there are no special attributes defined on it, unless the programmer specifies them with data set and/or system options.
The main reason is so the DATA statement operates the same for all DATA steps. That is, if the DATA step were to have more than one statement, it becomes unclear which attributes of an input data set apply to an output data set. A program may read from more than one input data set or may modify variables which change their sortedness. In these cases, it is difficult to determine which attributes apply to an output data set. And, the attributes that apply may be specific to the data being processed.
Let's look at JonT's situation. After PROC SORT sorts a data set, a sort assertion is set on the data set. Before PROC SORT sorts a data set, it checks the sort assertion. A sort is not performed if the sort BY variables are the same as the sort assertion. A sort is performed if there is no sort assertion or if the sort assertion is different than the sort BY variables. In JonT's case, after using a DATA step to copy the data set, there would be no sort assertion set, so PROC SORT would resort the sorted data.
There are some related changes coming. In SAS 9.2, PROC SORT will have options to validate the sort order of a data set and set a sort assertion if the data set is sorted. The cost is a sequential pass through the data without the expense of sorting. We are investigating solutions that would avoid a sequential pass through the data. Until then, I would investigate a PROC SQL or PROC DATASETS solution. SAS Technical Support ( http://support.sas.com/techsup/contact/index.html ) can provide assistance with these PROCs, if necessary.
-- Jason Secosky, SAS R&D / DATA step
Message was edited by: JasonS@SAS at Apr 4, 2006 10:16 AM
Especial interest: "In SAS 9.2, PROC SORT will have options to validate the sort order of a data set and set a sort assertion if the data set is sorted. "
Will look forward to more on that.
As it turned out, the exercise didn't take as long as I had feared, and I ran through the entire 600 files in around 2 hours, using the proc copy noclone method. I used the output of proc contents to determine if the file needed compression. Eventual space saving of around (a very significant) 15%.
Is it possible to compress and copy a dataset at the same time using PROC DATASETS? Or, do I have to re-create dataset with COMPRESS=YES option in a temporary location and then copy the compressed dataset back (or overwrite it:
data a (COMPRESS=YES REUSE=YES); set a; run?
one of these copy features I found difficult, was to achieve the completeness of proc copy (all data set attributes including label, sortedby, constraints, indexes, etc.) and keep the output data set in the same library,
on SAS-L someone (JackHamilton perhaps?) showed the simple way
proc datasets nolist library= thislib ;
delete myNewName ;
proc append base= thisLib.myNewName data= thisLib.myOldName ;
all attributes including SORTEDBY are copied (but I don't expect audit trails would be copied)
Also, don't forget that COMPRESS=YES is not the only option.
COMPRESS=BINARY is more CPU intensive, but works better for data sets mainly containing numbers.
And lastly, using both disk compression and sas compression can further reduce the size (50% for both disk and sas compression, so overall compression rate of 75% on many tables in my case, using NTFS compression on 400 to 1500 MB tables).
Oh, and lastly lastly, the SPDE engine seems a bit more space efficient then the V9 engine, both when compressing and not, so you might want to give it a try.