
Often I get this:

 

NOTE: There were 63 observations read from the data set LIB.TOTALDRUGRX.
NOTE: The data set LIB.TOTALDRUGRX has 63 observations and 5 variables.
NOTE: Compressing data set LIB.TOTALDRUGRX increased size by 100.00 percent.
      Compressed is 2 pages; un-compressed would require 1 pages.

 

 

So the compressed dataset is bigger than the uncompressed version?!?

 

How about not compressing it if the result is bigger?

11 Comments
Reeza
Super User

Depending on the size, you cannot know until the full compression is applied. There are rules that are followed, so it doesn't always happen; this usually occurs on small datasets, where it's not problematic anyway. There was a lengthy discussion of this recently on here that has the correct reference. It would take more time to uncompress once it's been compressed, so your suggestion would actually take more time in the long run.

tomrvincent
Rhodochrosite | Level 12

Except that I sometimes get this message:

 

NOTE: Compression was disabled for data set TEMP.OCID because compression overhead would increase the size of the data set.

 

So, clearly, SAS is smart enough to *sometimes* tell that compression shouldn't be applied. My suggestion is simply to make it do that *every* time.

 

Until that happens, I'm seeing a pattern: compression is bad for small datasets, so I'm going to add a filter to drop them from consideration.
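A minimal sketch of such a filter, assuming the DICTIONARY.TABLES view is available (the library name and page threshold here are purely illustrative):

```sas
/* List data sets in LIB that occupy more than two pages; only these
   would be considered for compression. NPAGE and NOBS come from
   DICTIONARY.TABLES. */
proc sql;
    create table work.compress_candidates as
        select memname, nobs, npage
        from dictionary.tables
        where libname = 'LIB'
          and memtype = 'DATA'
          and npage > 2;   /* skip one- and two-page data sets */
quit;
```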

Reeza
Super User

I highly suspect that any time devoted to determining which ones to drop would be more time- and resource-intensive than just leaving it alone.

 

Relevant section from the docs for others:

When a request is made to compress a data set, SAS attempts to determine whether compression will increase the size of the file. SAS examines the lengths of the variables. If, due to the number and lengths of the variables, it is not possible for the compressed file to be at least 12 bytes (for a 32-bit host) or 24 bytes (for a 64-bit host) per observation smaller than an uncompressed version, compression is disabled and a message is written to the SAS log.
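To make the rule above concrete, here is a hedged sketch (the dataset and variable names are made up): on a 64-bit host, two numeric variables give a 16-byte observation, which can never be 24 bytes smaller when compressed, so per the documentation SAS should disable compression and write the note to the log.

```sas
/* Hypothetical example: on a 64-bit host each observation must be able
   to shrink by at least 24 bytes for compression to stay enabled. Two
   numeric variables give a 16-byte record, so that is impossible. */
data work.tiny(compress=yes);
    length x y 8;   /* 16 bytes per observation */
    x = 1;
    y = 2;
run;
/* Expected, per the documentation quoted above:
   NOTE: Compression was disabled for data set WORK.TINY because
         compression overhead would increase the size of the data set. */
```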

 

http://documentation.sas.com/?docsetId=lrcon&docsetTarget=p0y0x1j67vtqhnn1on1cs7trim16.htm&docsetVer...

tomrvincent
Rhodochrosite | Level 12

Your suspicion failed.  It only took me a minute to figure it out.  Time well spent.

Kurt_Bremser
Super User

In determining whether compression makes sense, SAS should include the overall size of a dataset in its consideration. Any dataset that needs only one or two pages is useless to compress, and that's easy to catch. Large datasets will always remain the responsibility of the programmer. I do have datasets that take half an hour to write, look compressible at first, but then increase in size (or are reduced by 3%, which does not justify the additional CPU cost).

To rewrite such datasets automatically without compression would be, ahem, sub-optimal.

ChrisNZ
Tourmaline | Level 20

In determining whether compression makes sense, SAS should include the overall size of a dataset in its consideration.

 

I suppose (I hope) that's the gist of the idea here. SAS seems to take only observation characteristics into account when deciding whether compression should be applied, which is not the smartest approach.

 

On the other hand, I'd argue that while compressing 1-page data sets because they have many variables is silly, the cost of wrongly compressing such tiny data sets is negligible. So the gain made by being smarter may not be worth the development effort. 

 

tomrvincent
Rhodochrosite | Level 12

I think I'll just write my own macro to compress a dataset to temp space and then compare it to the original...if it's smaller, I'll keep it.
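This is not the macro the author later posted (linked below in the thread); it is just a minimal sketch of the idea, assuming the FILESIZE column in DICTIONARY.TABLES (SAS 9.4 or later). The macro name and temporary dataset name are made up:

```sas
/* Hypothetical macro: write a compressed copy of a data set to WORK,
   compare file sizes via DICTIONARY.TABLES, and replace the original
   only if the compressed copy is actually smaller. */
%macro keep_if_smaller(lib, ds);
    data work.__cmp(compress=yes);
        set &lib..&ds.;
    run;

    proc sql noprint;
        select filesize into :orig_size trimmed from dictionary.tables
            where libname = upcase("&lib.") and memname = upcase("&ds.");
        select filesize into :comp_size trimmed from dictionary.tables
            where libname = 'WORK' and memname = '__CMP';
    quit;

    %if %sysevalf(&comp_size. < &orig_size.) %then %do;
        /* Compressed copy is smaller: swap it in for the original. */
        proc datasets lib=&lib. nolist;
            delete &ds.;
            run;
            copy in=work out=&lib.;
                select __cmp;
            run;
            change __cmp = &ds.;
            run;
        quit;
    %end;
    %else %do;
        /* Compressed copy is not smaller: discard it. */
        proc datasets lib=work nolist;
            delete __cmp;
            run;
        quit;
    %end;
%mend keep_if_smaller;

%keep_if_smaller(lib, totaldrugrx)
```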

Kurt_Bremser
Super User

To bring this into perspective:

During development of new code, I routinely add the compress=yes option to all output datasets. After any test run, I have to inspect the log anyway (see Maxims 25, 30 and some others). During that, I also check the compression rate and remove the option if it's not necessary. Compared with all the other effort (documentation, check-in to the VS, handover to job control), that check is so minuscule that it does not matter at all.
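The workflow described above in miniature (the dataset names here are placeholders):

```sas
/* Request compression during development, then read the log note. */
data lib.report(compress=yes);
    set work.staging;
run;
/* The log will show either the percentage change in size or the
   "Compression was disabled ..." note; if compression doesn't pay,
   remove COMPRESS=YES before handover. */
```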

I guess that is also the reason why there has not yet been a demand for an automatic check by the SAS system.

tomrvincent
Rhodochrosite | Level 12

"Maxims 25, 30 and some others"?!?

 

 

Anyway, I finished my macro...here it is, if anyone's interested.

 

https://communities.sas.com/t5/SAS-Communities-Library/Compress-library-tips-based-on-an-existing-ma...

 

Kurt_Bremser
Super User

Oh, I see that the comments here don't get the same footnotes as elsewhere:

https://communities.sas.com/t5/SAS-Communities-Library/Maxims-of-Maximally-Efficient-SAS-Programmers...