
SAS Compress and Reuse - Risks?

Occasional Contributor
Posts: 7

SAS Compress and Reuse - Risks?

I have a dataset on our network drive that is 37 GB in size. Doing a simple count of one variable using proc sql can take over an hour, which makes the dataset unusable. 
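For context, the count itself is nothing fancy; it's something along these lines (the library, dataset, and variable names here are just placeholders, not the real ones):

proc sql;
   select count(acct_id) as n_obs
   from mylib.old;
quit;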

 

I read about compress and gave it a try. 

 

data new (compress=yes reuse=yes);
    set old;
run;

The dataset is now 1.5 GB and the same count query takes two minutes (still slow but way better than before). 

 

The only negatives I've been able to find regarding compress are that it can make the file slower to access in some cases, and that you apparently can't address observations by observation number.

 

Is there any other reason why I shouldn't use these settings by default with larger datasets going forward? 



All Replies
Super User
Posts: 23,320

Re: SAS Compress and Reuse - Risks?

Have you added indexes to your data set? 

Occasional Contributor
Posts: 7

Re: SAS Compress and Reuse - Risks?

I did, but I didn't see a performance improvement. I may need to play around with indexing a different variable or multiple variables.

 

My main concern is finding out later that there is a downside to compressing that I'm not aware of now. 

Super User
Posts: 9,919

Re: SAS Compress and Reuse - Risks?

An addendum: almost all our production datasets are stored with compress=yes. The cost in CPU cycles for uncompressing an RLE-compressed dataset is negligible compared to the savings in space, and therefore the savings in I/O. Only datasets where the compression rate is too small or non-existent (an actual increase in physical size) are excluded.
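If you want that behaviour by default, COMPRESS= is also a system option, so you can set it once per session (e.g. in the autoexec) rather than on every data step; a minimal sketch:

options compress=yes;   /* RLE compression for every dataset created from here on, unless overridden */

COMPRESS=BINARY (RDC instead of RLE) can be worth trying on wide, mostly numeric tables.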

Super User
Posts: 9,919

Re: SAS Compress and Reuse - Risks?

First of all, stop working on network drives, especially when your network is this slow. Any modern storage reads a GB in less than 10 seconds.

Other than reading by obs number, compressed datasets are unproblematic. Just keep in mind that sometimes using compress=yes can actually increase the size, so keep an eye on your logs.
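If you'd rather not rely on spotting the NOTE in the log, you can check the attributes of the new dataset directly (dataset name assumed):

proc contents data=mylib.new;
run;

The output tells you whether the dataset is compressed and how many data set pages it occupies, which you can compare against the uncompressed version.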

Solution
11-02-2017 08:17 AM
Super User
Posts: 5,852

Re: SAS Compress and Reuse - Risks?

Indexes must be designed with the most common queries in mind. For Base SAS data sets, the general rule is that an index only improves performance when the query returns less than about 10% of the observations.
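For example, an index on the variable(s) your WHERE clauses actually filter on could be built like this (the names are made up; substitute your own):

proc datasets library=mylib nolist;
   modify new;
   index create acct_id;                        /* simple index on one variable */
   index create acct_dt = (acct_id txn_date);   /* composite index              */
quit;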

Reuse makes sense only when you delete a lot of observations in place.
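That is, a pattern like this one (table and variable names are invented for illustration):

proc sql;
   delete from mylib.new
   where status = 'CLOSED';   /* observations are marked deleted in place */
quit;

proc append base=mylib.new data=mylib.new_month;
run;   /* with reuse=yes the appended observations can fill the freed space */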

The main downside of compress is that it costs extra CPU cycles on both read and write operations. But in your case, the I/O saved by the smaller file clearly outweighs the CPU cost of compression.

options msglevel=i;
gives feedback in the log about both compression and index usage.
Data never sleeps