SAS Data Management

SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop, SAS Data Preparation and others
supp
Pyrite | Level 9

Hello SAS community. In my SAS shop we have many users who work with large datasets. We use SAS 9.4 on a Linux OS. Every so often we are reminded to delete datasets that are no longer needed, or to consider compressing datasets using SAS compression.

 

As I am sure many of you already know, SAS currently offers two compression algorithms: character (COMPRESS=CHAR, which uses the RLE algorithm) and binary (COMPRESS=BINARY, which uses the RDC algorithm). My understanding is that character compression generally works better when the data set is mostly character data, while binary compression generally works better when the data set is mostly numeric data.
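(For anyone less familiar with the syntax, here is a minimal sketch of how the two options are requested; SASHELP.CARS is only a stand-in for a real large table, and the space saved, or lost, is reported in a NOTE in the log as each output data set is written.)

/* COMPRESS= is an output data set option (it can also be set as a      */
/* system option). CHAR (alias YES) requests RLE, BINARY requests RDC.  */
data work.cars_char(compress=char)
     work.cars_bin (compress=binary);
   set sashelp.cars;   /* stand-in for a large table */
run;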

 

I would love to hear from the community how you determine whether SAS compression should be used. Is there an official policy, or does each user decide for themselves whether to compress their data sets? If you do use SAS compression, how do you decide which option (CHAR or BINARY) to use? Are there any good "rules of thumb" that can be applied?

 

I understand that the disadvantages of SAS compression can include increased CPU usage and, in some cases, actually increasing the size of the data set. Are there any other disadvantages you have come across when using SAS compression? Any other considerations that should be taken into account before compressing data sets?
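(One pragmatic way to answer the "which option" question is simply to write the same data all three ways and compare; a sketch, assuming the two compressed copies from the sketch above plus an uncompressed baseline exist in WORK. The NPAGE and FILESIZE columns of DICTIONARY.TABLES should be available in 9.4, but check your release.)

/* uncompressed baseline to compare against */
data work.cars_none(compress=no);
   set sashelp.cars;
run;

/* compare page counts and file sizes of the three candidates */
proc sql;
   select memname, compress, nobs, npage, filesize
   from dictionary.tables
   where libname='WORK'
     and memname in ('CARS_NONE','CARS_CHAR','CARS_BIN');
quit;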

 

 

28 REPLIES
tomrvincent
Rhodochrosite | Level 12

I've found that normalizing and compressing the datasets saves more space than compression alone. Getting rid of redundant data really pays off.

supp
Pyrite | Level 9

Thanks @tomrvincent, that makes a lot of sense. Are you using one of the SAS compression options when you compress your datasets? If so, how do you decide which one to use?

tomrvincent
Rhodochrosite | Level 12
What I do is compress each resulting normalized dataset (dimension/fact table, if you will) if doing so actually saves space (sometimes it doesn't). I haven't bothered to try the different options just because I've already saved thru normalization. The rest is gravy. 🙂
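(To make the normalize-then-compress idea concrete, here is a small, entirely invented sketch: a wide table that repeats a long description on every row is split into a lookup/dimension table plus a slim fact table, and only the pieces that still benefit get compressed.)

/* mock-up of a wide table that repeats a long description on every row */
data work.sales_wide;
   length product_desc $200;
   do product_id=1 to 5;
      product_desc=catx(' ','Long repeated description for product',product_id);
      do sale_date='01JAN2024'd to '31DEC2024'd;
         amount=round(ranuni(1)*100,.01);
         output;
      end;
   end;
   format sale_date date9.;
run;

proc sql;
   /* small dimension table: one row per product */
   create table work.dim_product as
   select distinct product_id, product_desc
   from work.sales_wide;

   /* slim fact table; compress only if it still pays off */
   create table work.fact_sales(compress=char) as
   select product_id, sale_date, amount
   from work.sales_wide;
quit;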
TomKari
Onyx | Level 15

Hi, supp

 

After never having used compression in 30 years, I recently came across a use case where it worked perfectly. This is a text analysis project with a huge number of text strings whose lengths vary wildly (most short, but a few very long).

 

I can't remember the exact number, but using the RLE algorithm reduced the size of the datasets by about 90%. I didn't notice any difference in processing time.

 

Tom

supp
Pyrite | Level 9

@TomKari, that is a really good result! Since your data is mostly (or all) character data, I am guessing RLE gave you the best result. Out of curiosity, did you also try binary compression?

TomKari
Onyx | Level 15

No, given that the expected compression benefits would come from the long character fields, I only used the RLE algorithm.

mkeintz
PROC Star
I agree that you are describing a good use case. The downside of SAS compressed datasets is that you are restricted to sequential data processing, i.e. no indexes and no SET ... POINT=. However, normalization without compression might save substantial space and still support direct access.
Patrick
Opal | Level 21

@mkeintz wrote:
I agree that you are describing a good use case. The downside of SAS compressed datasets is that you are restricted to sequential data processing, i.e. no indexes and no SET ... POINT=. However, normalization without compression might save substantial space and still support direct access.

@mkeintz  Are you sure? That's not what the SAS log tells me.

 

28         options ps=max msglevel=i;
29         data have(compress=yes index=(indvar));
30           length charvar1000 $1000;
31           call missing(charvar1000);
32           do i=1 to 100000;
33             if mod(i,10)=1 then indvar+1;
34             output;
35           end;
36           stop;
37         run;

NOTE: The data set WORK.HAVE has 100000 observations and 3 variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index indvar has been defined.
NOTE: Compressing data set WORK.HAVE decreased size by 96.23 percent. 
      Compressed is 59 pages; un-compressed would require 1565 pages.
NOTE: DATA statement used (Total process time):
      real time           0.10 seconds
      cpu time            0.12 seconds
      

38         
39         data want1;
40           do point=1 to nobs by 1000;
41             set have point=point nobs=nobs;
42             output;
43           end;
44           stop;
45         run;

NOTE: The data set WORK.WANT1 has 100 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
      

46         
47         proc sql;
48           create table want2 as
49           select *
50           from have
51           where indvar=10
52           ;
INFO: Index indvar selected for WHERE clause optimization.
NOTE: Table WORK.WANT2 created, with 10 rows and 3 columns.

53         quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

 

 

supp
Pyrite | Level 9

@Patrick, thanks for sharing your findings! That is very interesting. Do you mind sharing what version of SAS and operating system were used for your tests?

Patrick
Opal | Level 21
AUTOMATIC SYSVLONG 9.04.01M5P091317
AUTOMATIC SYSHOSTINFOLONG Linux LIN X64 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 CentOS Linux release 7.5.1804 (Core) 
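(Those values come from automatic macro variables; anyone wanting the same details for their own system can print them, e.g.:)

%put &=SYSVLONG;
%put &=SYSHOSTINFOLONG;
/* or list every automatic macro variable: */
%put _automatic_;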
mkeintz
PROC Star

Well, I'm definitely not sure any more. My understanding was that some compression methods generated irregular observation lengths, making the implementation of direct access exceedingly tricky, so SAS had not implemented it.

 

The SAS index records the index value and each RID (record ID) associated with that value. When every record is the same size, as is the case with uncompressed SAS data sets, knowing the RID tells you exactly which physical page (pages are also of constant size) contains the record(s) of interest, which in turn means you can directly access only the pages needed. I'm not clear on how knowing the RID for a compressed data set can tell you which pages to read … unless the compression keeps the observations uniform in length.

 

Thanks for the example.  I'll have to re-map my understanding of SAS compression.
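(For anyone who wants to convince themselves further, a quick check, reusing Patrick's WORK.HAVE from above, is to read the same observation numbers via POINT= from a compressed and an uncompressed copy and let PROC COMPARE confirm they match.)

/* uncompressed copy of Patrick's compressed WORK.HAVE */
data work.have_unc(compress=no);
   set work.have;
run;

/* pull the same record numbers from both versions via direct access */
data work.pick_comp;
   do point=1 to nobs by 1000;
      set work.have point=point nobs=nobs;
      output;
   end;
   stop;
run;

data work.pick_unc;
   do point=1 to nobs by 1000;
      set work.have_unc point=point nobs=nobs;
      output;
   end;
   stop;
run;

/* expect "No unequal values were found" if POINT= behaves the same */
proc compare base=work.pick_unc compare=work.pick_comp;
run;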

Kurt_Bremser
Super User

You "only" have to switch from observation number to observation start position to address the observation from an index. Given 64-bit processing, this is not so hard to do.

 

PS: it might be that binary compression puts observation boundaries within bytes, and then this would not work anymore. Maybe @Patrick could rerun his experiment with COMPRESS=BINARY to clear this up?

mkeintz
PROC Star

@Kurt_Bremser wrote:

You "only" have to switch from observation number to observation start position to address the observation from an index. Given 64-bit processing, this is not so hard to do.

 


@Kurt_Bremser: I agree with your quoting of "only". All the SAS documentation I have seen on SAS indexing refers to record ID, never record start position. Moreover, if SAS compressed observations actually vary in length, then either a preceding variable-length integer or some sort of record-terminator convention would have to be introduced for each observation in a SAS data set.

Patrick
Opal | Level 21

@Kurt_Bremser wrote:

Maybe @Patrick could rerun his experiment with compression=binary to clear this up?


@Kurt_Bremser Just did it. At the SAS log level there is no difference from character compression.
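(Presumably the rerun looks like the original test with only the data set option changed; a sketch for anyone who wants to repeat it:)

data work.have_bin(compress=binary index=(indvar));
   length charvar1000 $1000;
   call missing(charvar1000);
   do i=1 to 100000;
      if mod(i,10)=1 then indvar+1;
      output;
   end;
   stop;
run;

/* direct access and index use behave as in the COMPRESS=YES test */
data work.want_bin;
   do point=1 to nobs by 1000;
      set work.have_bin point=point nobs=nobs;
      output;
   end;
   stop;
run;

proc sql;
   create table work.want_bin2 as
   select * from work.have_bin where indvar=10;
quit;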

