BookmarkSubscribeRSS Feed
deleted_user
Not applicable
I am going through an exercise compressing all our datasets (around 600). I am doing this with a simple:

data libname.dsname_tmp;
set libname.dsname;
run;
---- compare datasets with proc compare and manual look at obs numbers -----
if OK then:
proc datasets
– delete dsname ;
– change dsname_tmp = dsname;

I've put a lot of checks and balances in as well - like does the temp file exist already, or is the file already compressed etc..

I was trying to retain the sort order data, which is returned by contents sorted and sortedby, by adding a by statement into the set statement. These values are not returned by the newly created dataset. Obviously the data order does not change with a simple set statement, but what impact does this have when SAS determines when a SORT is needed. Does SAS look at these values (a), or does it actually read through the data (b) to determine whether a sort is needed.

If it’s (b) then no problem, but if it's (a), sorting this number of files - some are many millions of obs - would be a huge overhead. Or will the sort be quick, considering the data is sorted already?

There is a dataset option called "sortedby" but this can be manually set, and thus will not be a solution if (a) is applicable.

So which is it (a) or (b)?
8 REPLIES 8
deleted_user
Not applicable
Just answering my own post, in case someone else has to do this:

Using the noclone option on proc copy. I was trying to avoid two passes of the data, but I have to have a solution in place by tonight. So am copying to work then back again with compress=yes. It seems to retain all the pertinent information. Got a lot of work on my plate today....
JackHamilton
Lapis Lazuli | Level 10
My impression, and I'm not sure where I got it, is that the SORTEDBY information is used mainly in PROC SQL, where it will affect how joins are done. I think there was an example in a paper at WUSS showing how a changed input data set would lead to either hashing or to a merged join.

If you want to put SORTEDBY information into an output data set, use a data set option:

data x (sortedby=a b c);
set y;
by a b c;
run;

That will give you SORTEDBY, but you'll lose constraint and index information.

If I had to do this, I would see whether a combination of SQL's CREATE TABLE x LIKE Y and PROC APPEND would do the job.

(Just tried it; CREATE LIKE doesn't do the job.)
Message was edited by: JackHamilton at Apr 3, 2006 1:27 PM
JasonS_SAS
SAS Employee
My suggestion is similar to Jack's. Either use PROC SQL to copy the data sets or PROC DATASETS. Both of these are data set oriented languages, so they don't have as many constraints as the DATA step. When PROC DATASETS copies a data set, it will maintain the sort assertion in the copy along with indexes and integrity constraints.

The DATA statement creates a brand-new data set. With a brand-new data set, there are no special attributes defined on it, unless the programmer specifies them with data set and/or system options.

The main reason is so the DATA statement operates the same for all DATA steps. That is, if the DATA step were to have more than one statement, it becomes unclear which attributes of an input data set apply to an output data set. A program may read from more than one input data set or may modify variables which change their sortedness. In these cases, it is difficult to determine which attributes apply to an output data set. And, the attributes that apply may be specific to the data being processed.

Let's look at JonT's situation. After PROC SORT sorts a data set, a sort assertion is set on the data set. Before PROC SORT sorts a data set, it checks the sort assertion. A sort is not performed if the sort BY variables are the same as the sort assertion. A sort is performed if there is no sort assertion or if the sort assertion is different than the sort BY variables. In JonT's case, after using a DATA step to copy the data set, there would be no sort assertion set, so PROC SORT would resort the sorted data.

There are some related changes coming. In SAS 9.2, PROC SORT will have options to validate the sort order of a data set and set a sort assertion if the data set is sorted. The cost is a sequential pass through the data without the expense of sorting. We are investigating solutions that would avoid a sequential pass through the data. Until then, I would investigate a PROC SQL or PROC DATASETS solution. SAS Technical Support ( http://support.sas.com/techsup/contact/index.html ) can provide assistance with these PROCs, if necessary.

-- Jason Secosky, SAS R&D / DATA step
Message was edited by: JasonS@SAS at Apr 4, 2006 10:16 AM
deleted_user
Not applicable
Thanks all for input here.

Especial interest: "In SAS 9.2, PROC SORT will have options to validate the sort order of a data set and set a sort assertion if the data set is sorted. "

Will look forward to more on that.

As it turned out, the exercise didn't take as long as I had feared, and I ran through the entire 600 files in around 2 hours, using the proc copy noclone method. I used the output of proc contents to determine if the file needed compression. Eventual space saving of around (a very significant) 15%.
deleted_user
Not applicable
Is it possible to compress and copy a dataset at the same time using PROC DATASETS? Or, do I have to re-create dataset with COMPRESS=YES option in a temporary location and then copy the compressed dataset back (or overwrite it:
data a (COMPRESS=YES REUSE=YES); set a; run;)?
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
It is possible to use PROC COPY with NOCLONE after setting COMPRESS=YES option in advance of PROC COPY.

Scott Barry
SBBWorks, Inc.
Peter_C
Rhodochrosite | Level 12
one of these copy features I found difficult, was to achieve the completeness of proc copy (all data set attributes including label, sortedby, constraints, indexes, etc.) and keep the output data set in the same library,
until,
on SAS-L someone (JackHamilton perhaps?) showed the simple way
proc datasets nolist library= thislib ;
delete myNewName ;
run ;
quit ;
proc append base= thisLib.myNewName data= thisLib.myOldName ;
run;
all attributes including SORTEDBY are copied (but I don't expect audit trails would be copied)

Hope this helps ;

PeterC
ChrisNZ
Tourmaline | Level 20
Also, don't forget that COMPRESS=YES is not the only option.

COMPRESS=BINARY is more CPU intensive, but works better for data sets mainly containing numbers.

And lastly, using both disk compression and sas compression can further reduce the size (50% for both disk and sas compression, so overall compression rate of 75% on many tables in my case, using NTFS compression on 400 to 1500 MB tables).

Oh, and lastly lastly, the SPDE engine seems a bit more space efficient then the V9 engine, both when compressing and not, so you might want to give it a try.

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1699 views
  • 0 likes
  • 6 in conversation