SAS Data Management

SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop, SAS Data Preparation and others
supp
Pyrite | Level 9

Hello SAS community. In my SAS shop we have many users who work with large datasets. We use SAS 9.4 on a Linux OS. Every so often we are reminded to delete datasets that are no longer needed, or to consider compressing datasets using SAS compression.

 

As I am sure many of you already know, SAS currently offers two compression algorithms: character (COMPRESS=CHAR, which uses the RLE algorithm) and binary (COMPRESS=BINARY, which uses the RDC algorithm). My understanding is that character compression generally works better when the data set is mostly character data, while binary compression generally works better when the data set is mostly numeric data.
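(For anyone less familiar with the syntax, here is a minimal sketch of how the two options are requested; SASHELP.CARS is only a stand-in for a real large table, and the space saved, or lost, is reported in a NOTE in the log as each output data set is written.)

/* COMPRESS= is an output data set option (it can also be set as a      */
/* system option). CHAR (alias YES) requests RLE, BINARY requests RDC.  */
data work.cars_char(compress=char)
     work.cars_bin (compress=binary);
   set sashelp.cars;   /* stand-in for a large table */
run;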

 

I would love to hear from the community how you determine whether SAS compression should be used. Is there an official policy, or does each user decide for themselves whether to compress their data sets? If you do use SAS compression, how do you decide which option (CHAR or BINARY) to use? Are there any good "rules of thumb" that can be applied?

 

I understand that the disadvantages of SAS compression can include increased CPU usage and, in some cases, actually increasing the size of the data set. Are there any other disadvantages you have come across when using SAS compression? Any other considerations that should be taken into account before compressing data sets?
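(One pragmatic way to answer the "which option" question is simply to write the same data all three ways and compare; a sketch, assuming the two compressed copies from the sketch above plus an uncompressed baseline exist in WORK. The NPAGE and FILESIZE columns of DICTIONARY.TABLES should be available in 9.4, but check your release.)

/* uncompressed baseline to compare against */
data work.cars_none(compress=no);
   set sashelp.cars;
run;

/* compare page counts and file sizes of the three candidates */
proc sql;
   select memname, compress, nobs, npage, filesize
   from dictionary.tables
   where libname='WORK'
     and memname in ('CARS_NONE','CARS_CHAR','CARS_BIN');
quit;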

 

 

28 REPLIES
tomrvincent
Rhodochrosite | Level 12

I've found that normalizing and compressing the datasets saves more space than compression alone. Getting rid of redundant data really pays off.

supp
Pyrite | Level 9

Thanks @tomrvincent, that makes a lot of sense. Are you using one of the SAS compression options when you compress your datasets? If so, how do you decide which one to use?

tomrvincent
Rhodochrosite | Level 12
What I do is compress each resulting normalized dataset (dimension/fact table, if you will) if doing so actually saves space (sometimes it doesn't). I haven't bothered to try the different options just because I've already saved thru normalization. The rest is gravy. 🙂
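(To make the normalize-then-compress idea concrete, here is a small, entirely invented sketch: a wide table that repeats a long description on every row is split into a lookup/dimension table plus a slim fact table, and only the pieces that still benefit get compressed.)

/* mock-up of a wide table that repeats a long description on every row */
data work.sales_wide;
   length product_desc $200;
   do product_id=1 to 5;
      product_desc=catx(' ','Long repeated description for product',product_id);
      do sale_date='01JAN2024'd to '31DEC2024'd;
         amount=round(ranuni(1)*100,.01);
         output;
      end;
   end;
   format sale_date date9.;
run;

proc sql;
   /* small dimension table: one row per product */
   create table work.dim_product as
   select distinct product_id, product_desc
   from work.sales_wide;

   /* slim fact table; compress only if it still pays off */
   create table work.fact_sales(compress=char) as
   select product_id, sale_date, amount
   from work.sales_wide;
quit;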
TomKari
Onyx | Level 15

Hi, supp

 

After never having used compression in 30 years, I recently came across a use case where it worked perfectly. This is a text analysis project with a huge number of text strings whose lengths vary wildly (most short, but a few very long).

 

I can't remember the exact number, but using the RLE algorithm reduced the size of the datasets by about 90%. I didn't notice any difference in processing time.

 

Tom

supp
Pyrite | Level 9

@TomKari, that is a really good result! Since your data is mostly (or all) character data, I am guessing RLE gave you the best result. Out of curiosity, did you also try binary compression?

TomKari
Onyx | Level 15

No, given that the expected compression benefits would come from the long character fields, I only used the RLE algorithm.

mkeintz
PROC Star
I agree that you are describing a good use case. The downside of SAS compressed datasets is that you are restricted to sequential data processing, i.e. no indexes and no SET ... POINT=. However, normalization without compression might save substantial space and still support direct access.
Patrick
Opal | Level 21

@mkeintz wrote:
I agree that you are describing a good use case. The downside of SAS compressed datasets is that you are restricted to sequential data processing, i.e. no indexes and no SET ... POINT=. However, normalization without compression might save substantial space and still support direct access.

@mkeintz  Are you sure? That's not what the SAS log tells me.

 

28         options ps=max msglevel=i;
29         data have(compress=yes index=(indvar));
30           length charvar1000 $1000;
31           call missing(charvar1000);
32           do i=1 to 100000;
33             if mod(i,10)=1 then indvar+1;
34             output;
35           end;
36           stop;
37         run;

NOTE: The data set WORK.HAVE has 100000 observations and 3 variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index indvar has been defined.
NOTE: Compressing data set WORK.HAVE decreased size by 96.23 percent. 
      Compressed is 59 pages; un-compressed would require 1565 pages.
NOTE: DATA statement used (Total process time):
      real time           0.10 seconds
      cpu time            0.12 seconds
      

38         
39         data want1;
40           do point=1 to nobs by 1000;
41             set have point=point nobs=nobs;
42             output;
43           end;
44           stop;
45         run;

NOTE: The data set WORK.WANT1 has 100 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

46         
47         proc sql;
48           create table want2 as
49           select *
50           from have
51           where indvar=10
52           ;
INFO: Index indvar selected for WHERE clause optimization.
NOTE: Table WORK.WANT2 created, with 10 rows and 3 columns.

53         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

 

 

supp
Pyrite | Level 9

@Patrick, thanks for sharing your findings! That is very interesting. Do you mind sharing what version of SAS and operating system were used for your tests?

Patrick
Opal | Level 21
AUTOMATIC SYSVLONG 9.04.01M5P091317
AUTOMATIC SYSHOSTINFOLONG Linux LIN X64 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 CentOS Linux release 7.5.1804 (Core) 
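(Those values come from automatic macro variables; anyone wanting the same details for their own system can print them, e.g.:)

%put &=SYSVLONG;
%put &=SYSHOSTINFOLONG;
/* or list every automatic macro variable: */
%put _automatic_;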
mkeintz
PROC Star

Well, I'm definitely not sure any more. My understanding was that some compression methods generated irregular observation lengths, making the implementation of direct access exceedingly tricky, so SAS had not implemented it.

 

The SAS index records the index value and each RID (record ID) associated with that value. When every record is the same size, as is the case with uncompressed SAS data sets, knowing the RID tells you exactly which physical page (pages are also of constant size) contains the record(s) of interest, which in turn means you can directly access only the pages needed. I'm not clear on how knowing the RID for a compressed data set can tell you which pages to read … unless the compression keeps the observations uniform in length.

 

Thanks for the example.  I'll have to re-map my understanding of SAS compression.
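(For anyone who wants to convince themselves further, a quick check, reusing Patrick's WORK.HAVE from above, is to read the same observation numbers via POINT= from a compressed and an uncompressed copy and let PROC COMPARE confirm they match.)

/* uncompressed copy of Patrick's compressed WORK.HAVE */
data work.have_unc(compress=no);
   set work.have;
run;

/* pull the same record numbers from both versions via direct access */
data work.pick_comp;
   do point=1 to nobs by 1000;
      set work.have point=point nobs=nobs;
      output;
   end;
   stop;
run;

data work.pick_unc;
   do point=1 to nobs by 1000;
      set work.have_unc point=point nobs=nobs;
      output;
   end;
   stop;
run;

/* expect "No unequal values were found" if POINT= behaves the same */
proc compare base=work.pick_unc compare=work.pick_comp;
run;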

Kurt_Bremser
Super User

You "only" have to switch from observation number to observation start position to address the observation from an index. Given 64-bit processing, this is not so hard to do.

 

PS: it might be that binary compression puts observation boundaries within bytes, and then this would not work anymore. Maybe @Patrick could rerun his experiment with COMPRESS=BINARY to clear this up?

mkeintz
PROC Star

@Kurt_Bremser wrote:

You "only" have to switch from observation number to observation start position to address the observation from an index. Given 64-bit processing, this is not so hard to do.

 


@Kurt_Bremser: I agree with your quoting of "only". All the SAS documentation I have seen on SAS indexing refers to record ID, never record start position. Moreover, if SAS compressed observations actually vary in length, then either a preceding variable-length integer or some sort of record-terminator convention would have to be introduced for each observation in a SAS data set.

Patrick
Opal | Level 21

@Kurt_Bremser wrote:

Maybe @Patrick could rerun his experiment with compression=binary to clear this up?


@Kurt_Bremser Just did it. At the SAS log level there is no difference from character compression.
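(Presumably the rerun looks like the original test with only the data set option changed; a sketch for anyone who wants to repeat it:)

data work.have_bin(compress=binary index=(indvar));
   length charvar1000 $1000;
   call missing(charvar1000);
   do i=1 to 100000;
      if mod(i,10)=1 then indvar+1;
      output;
   end;
   stop;
run;

/* direct access and index use behave as in the COMPRESS=YES test */
data work.want_bin;
   do point=1 to nobs by 1000;
      set work.have_bin point=point nobs=nobs;
      output;
   end;
   stop;
run;

proc sql;
   create table work.want_bin2 as
   select * from work.have_bin where indvar=10;
quit;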

