I have a dataset on our network drive that is 37 GB in size. Doing a simple count of one variable using proc sql can take over an hour, which makes the dataset unusable.
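For context, the count query was along these lines (a sketch; `netlib.big` and `acct_id` are placeholder names, not the real library or variable):

```sas
proc sql;
  select count(acct_id) as n_nonmissing
  from netlib.big;
quit;
```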
I read about compress and gave it a try.
data new (compress=Yes reuse=yes); set old; run;
The dataset is now 1.5 GB and the same count query takes two minutes (still slow but way better than before).
The only negatives I've been able to find regarding compress are that it can make the file slower to access in some cases, and that apparently you can't address observations by observation number.
Is there any other reason why I shouldn't use these settings by default with larger datasets going forward?
Have you added indexes to your data set?
I did but I didn't see a performance improvement. I may need to play around with indexing a different variable or multiple variables.
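In case it helps, here's roughly how I'd build both a simple and a composite index with PROC DATASETS (library, dataset, and variable names are all placeholders for my actual data):

```sas
proc datasets library=mylib nolist;
  modify big;
  index create status;                 /* simple index on one variable */
  index create combo=(region status);  /* composite index on two variables */
quit;
```

Note that an index only helps when the query's WHERE clause is selective enough for SAS to prefer indexed access over a full scan, so a count over the whole table may not benefit.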
My main concern is finding out later that there is a downside to compressing that I'm not aware of now.
An addendum: almost all our production datasets are stored with compress=yes. The cost in CPU cycles for uncompressing an RLE-compressed dataset is negligible compared to the savings in space, and therefore the savings in I/O. Only datasets where the compression rate is too small or non-existent (i.e., an actual increase in physical size) are excluded.
First of all, stop working on network drives, especially when your network is this slow. Any modern local storage reads a GB in under 10 seconds.
Other than access by observation number, compressed datasets are unproblematic. Just keep in mind that compress=yes can sometimes increase the size, so keep an eye on your logs.
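To check whether compression actually paid off, you can look at the NOTE that SAS writes to the log when the compressed dataset is created (it reports the percentage change in size), or inspect the dataset afterwards; a minimal check, with `mylib.new` as a placeholder name:

```sas
/* The output header lists "Compressed: YES" and, when compression
   helped, the percent reduction achieved. */
proc contents data=mylib.new;
run;
```

If the log instead reports that compression increased the size, recreate that dataset with compress=no.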