Hello SAS community. In my SAS shop we have many users who work with large data sets. We use SAS 9.4 on Linux. Every so often we are reminded to delete data sets that are no longer needed, or to consider compressing them using SAS compression.
As I am sure many of you already know, SAS currently offers two compression algorithms: character (the RLE algorithm) and binary (the RDC algorithm). My understanding is that character compression generally works better when a data set is mostly character data, while binary compression generally works better when it is mostly numeric data.
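For anyone not familiar with the syntax, compression can be set per data set or session-wide; a minimal sketch (the library and data set names are made up, and COMPRESS=YES is an alias for COMPRESS=CHAR):

/* Session-wide default: data sets created from here on are RLE-compressed */
options compress=char;

/* Per data set: CHAR (RLE) for mostly-character data */
data mylib.txt_copy(compress=char);
   set mylib.txt_table;
run;

/* Per data set: BINARY (RDC) for mostly-numeric data */
data mylib.num_copy(compress=binary);
   set mylib.num_table;
run;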
I would love to hear how the community determines whether SAS compression should be used. Is there an official policy, or does each user decide for themselves whether to compress their data sets? If you do use SAS compression, how do you decide which option (CHAR or BINARY) to use? Are there any good "rules of thumb" that can be applied?
I understand the disadvantages of SAS compression can be increased CPU usage and, in some cases, actually increasing the size of the data set. Are there any other disadvantages you have come across using SAS compression? Any other considerations that should be taken into account before compressing data sets?
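For what it's worth, my current thinking is to simply try each setting on a copy of a representative data set and compare the page counts; a rough sketch (library and table names are made up):

/* Write three copies in one pass, one per COMPRESS= setting */
data work.nocomp(compress=no)
     work.charcomp(compress=char)
     work.bincomp(compress=binary);
   set mylib.bigtable;
run;

/* Compare sizes: NPAGE is the number of data set pages */
proc sql;
   select memname, compress, npage
      from dictionary.tables
      where libname='WORK'
        and memname in ('NOCOMP','CHARCOMP','BINCOMP');
quit;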
I've found that normalizing and then compressing the data sets saves more space than compression alone. Getting rid of redundant data really pays off.
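To sketch the idea (table and column names invented): a transaction table that repeats customer details on every row can be split into a small customer lookup plus a slim fact table, and each piece compressed:

/* One row per customer: the repeated descriptive columns live here only once */
proc sort data=mylib.trans(keep=cust_id cust_name cust_address)
          out=mylib.customers(compress=char) nodupkey;
   by cust_id;
run;

/* The big table keeps only the key plus the transaction columns */
data mylib.trans_slim(compress=yes);
   set mylib.trans(drop=cust_name cust_address);
run;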
Thanks @tomrvincent, that makes a lot of sense. Are you using one of the SAS compression options when you compress your data sets? If so, how do you decide which one to use?
Hi, supp
After never having used compression in 30 years, I recently came across a use case where it worked perfectly. It is a text analysis project with a huge number of text strings whose lengths vary wildly (most short, but a few very long).
I can't remember the exact number, but using the RLE algorithm reduced the size of the data sets by about 90%. I didn't notice any difference in processing time.
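The pattern was essentially long fixed-width character variables that are mostly trailing blanks, which is exactly what RLE collapses. A contrived reproduction (sizes and values invented) would be something like:

/* A $2000 variable that usually holds short strings, so most of each
   observation is trailing blanks - ideal for CHAR (RLE) compression */
data work.textdata(compress=char);
   length text $2000;
   do i=1 to 50000;
      if mod(i,100)=0 then text=repeat('long document text ',90);
      else text='short note';
      output;
   end;
run;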
Tom
@TomKari, that is a really good result! Since your data is mostly (or all) character, I am guessing RLE gives you the best result. Out of curiosity, did you also try binary compression?
No, given that the expected compression benefits would come from the long character fields, I only used the RLE algorithm.
I agree that you are describing a good use case. The downside of SAS compressed data sets is that you are restricted to sequential data processing, i.e. no indexes, no SET ... POINT=. However, normalization without compression might save substantial space and still support direct access.
@mkeintz wrote:
I agree that you are describing a good use case. The downside of SAS compressed data sets is that you are restricted to sequential data processing, i.e. no indexes, no SET ... POINT=. However, normalization without compression might save substantial space and still support direct access.
@mkeintz Are you sure? That's not what the SAS log tells me.
28   options ps=max msglevel=i;
29   data have(compress=yes index=(indvar));
30     length charvar1000 $1000;
31     call missing(charvar1000);
32     do i=1 to 100000;
33       if mod(i,10)=1 then indvar+1;
34       output;
35     end;
36     stop;
37   run;

NOTE: The data set WORK.HAVE has 100000 observations and 3 variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Simple index indvar has been defined.
NOTE: Compressing data set WORK.HAVE decreased size by 96.23 percent.
      Compressed is 59 pages; un-compressed would require 1565 pages.
NOTE: DATA statement used (Total process time):
      real time           0.10 seconds
      cpu time            0.12 seconds

38
39   data want1;
40     do point=1 to nobs by 1000;
41       set have point=point nobs=nobs;
42       output;
43     end;
44     stop;
45   run;

NOTE: The data set WORK.WANT1 has 100 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

46
47   proc sql;
48     create table want2 as
49       select *
50       from have
51       where indvar=10
52     ;
INFO: Index indvar selected for WHERE clause optimization.
NOTE: Table WORK.WANT2 created, with 10 rows and 3 columns.
53   quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds
@Patrick, thanks for sharing your findings! That is very interesting. Do you mind sharing which version of SAS and which operating system were used for your tests?
AUTOMATIC SYSVLONG 9.04.01M5P091317
AUTOMATIC SYSHOSTINFOLONG Linux LIN X64 3.10.0-862.14.4.el7.x86_64 #1 SMP Wed Sep 26 15:12:11 UTC 2018 x86_64 CentOS Linux release 7.5.1804 (Core)
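(Those are automatic macro variables, so anyone can pull the same details on their own system, e.g.:)

%put &=sysvlong;
%put &=syshostinfolong;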
Well, I'm definitely not sure any more. My understanding was that compression produced irregular observation lengths, making the implementation of direct access exceedingly tricky, and that SAS had therefore not implemented it.
The SAS index records the index value and each RID (record ID) associated with the index value. When every record is the same size, as is the case with uncompressed SAS data sets, knowing the RID tells you exactly which physical page of data (pages are also of constant size) contains the record(s) of interest, which in turn means you can directly access only the pages needed. I'm not clear on how knowing the RID for compressed data sets can tell you which pages to read, … unless the compression keeps the observations uniform in length.
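To make the arithmetic concrete (the numbers below are invented; "Max Obs per Page" is reported by PROC CONTENTS): with fixed-length observations the mapping from RID to page is simple division:

data _null_;
   obs_per_page=80;                  /* invented: obs that fit on one page */
   rid=12345;                        /* invented: record id from the index */
   page=ceil(rid/obs_per_page);      /* which physical page to read */
   slot=mod(rid-1,obs_per_page)+1;   /* position of the record on that page */
   put page= slot=;
run;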
Thanks for the example. I'll have to re-map my understanding of SAS compression.
You "only" have to switch from observation number to observation start position to address the observation from an index. Given 64-bit processing, this is not so hard to do.
PS: It might be that binary compression puts observation boundaries within bytes, in which case this would not work anymore. Maybe @Patrick could rerun his experiment with COMPRESS=BINARY to clear this up?
@Kurt_Bremser wrote:
You "only" have to switch from observation number to observation start position to address the observation from an index. Given 64-bit processing, this is not so hard to do.
@Kurt_Bremser: I agree with your quoting of "only". All the SAS documentation I have seen on SAS indexing refers to record ID, never record start position. Moreover, if SAS compressed observations actually vary in length, then either a preceding variable-length integer or some sort of record-terminator convention would have to be introduced for each observation in a SAS data set.
@Kurt_Bremser wrote:
Maybe @Patrick could rerun his experiment with COMPRESS=BINARY to clear this up?
@Kurt_Bremser Just did it. At the SAS log level there is no difference from character compression.
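(For anyone wanting to repeat it: the rerun only changes the COMPRESS= value in the original test, i.e. something like

data have(compress=binary index=(indvar));
   length charvar1000 $1000;
   call missing(charvar1000);
   do i=1 to 100000;
      if mod(i,10)=1 then indvar+1;
      output;
   end;
   stop;
run;

and the log again shows the compression note, the index creation, and working POINT= access and WHERE-clause optimization.)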