I just ran it on my UE on a MacBook Pro, and added an additional order of magnitude to the dataset size.
Compression rate with both character and binary was around 96%, but binary took 2.5 seconds of real time, while character finished in 1.5 seconds.
I guess I will stay with character compression for the time being.
I believe that an internal index was built into compressed datasets in the Base engine years ago. That index allows random processing and should support SET POINT= processing.
Almost all our production datasets are compressed with character (RLE). When writing a new batch job, I use compress=yes and look at what the log tells me. If I get less than a 10% compression rate, I omit compression.
Not only does compression save disk space, it also speeds up the ETL processes which are (almost) always I/O bound and benefit from having to move less data from and to storage. Care must be taken when processing datasets with a high compression rate. Sorting can overload the SASUTIL location, so using tagsort may be necessary. And don't forget to also use compression in WORK when dealing with such datasets.
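For illustration, here is a rough sketch of that workflow (the library and dataset names are placeholders I made up, not anything from the post above):

options compress=char;                       /* session default, also applies to WORK          */

data out.big_table(compress=yes);            /* check the log note for the size reduction;     */
    set stage.big_table;                     /* if it is under 10%, drop the COMPRESS= option  */
run;

proc sort data=out.big_table out=out.big_sorted tagsort;   /* TAGSORT keeps SASUTIL usage down */
    by customer_id txn_date;
run;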
Great points, @Kurt_Bremser! Is there ever a scenario where you would use binary compression instead of char? Or why do you prefer char compression?
Where I work we decided to set compression to BINARY as a default option when starting all SAS sessions via SAS AUTOEXEC programs. We have been doing this for both SAS 9.3 and 9.4 over many years. This has worked well for us by conserving disk space and also ensuring we almost never run out of WORK space unless someone does a really silly query.
We consume a lot of data from SQL Server, and the BINARY setting means we don't need to worry about shortening long character variables as they are all compressed. Uncompressed tables can often be around 5 times larger than compressed ones! For the type of datasets we have, BINARY provides better compression than CHARACTER.
I also know of other SAS sites that compress their data by default.
Thanks, @SASKiwi. You stated that for your type of datasets binary is the better option. Is this because you have a lot of numeric data in your datasets? Or how did you determine binary was the better option?
As @mkeintz mentioned, a compressed dataset can't take advantage of an index. Do you ever find this to be problematic?
@supp - I did some testing on some of our typical datasets and found BINARY gave better compression, by around 10 percent. I also found that SAS jobs processing large datasets ran faster because of the reduced I/O. While elapsed time was lower, CPU time increased, but only by a few percent. IMO universal compression is definitely worth considering if you process a lot of medium to large datasets.
@supp - Regarding limitations with compression and indexes: as stated elsewhere, we universally compress with the BINARY option and have never had any problems with indexes. I'm pretty sure we don't use direct access (POINT=), but if we did, you could just add COMPRESS=NO to the DATA steps using it to avoid any problems.
Please note that universal compression covers WORK libraries too, so the benefits of compressing also help control WORK space.
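As a small, hedged sketch of that setup (the names are placeholders): the autoexec sets the default, and a single DATA step can opt out for a dataset that will be read with POINT=.

options compress=binary;                 /* typically set once in the autoexec       */

data work.lookup(compress=no);           /* this one dataset stays uncompressed      */
    set perm.lookup;                     /* so POINT= processing raises no questions */
run;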
Before considering compression, think about whether you have good rules in place for the lengths used within the SAS data set. Many times, I have seen SAS data sets use $200 characters for a field that only needs a few characters ... usually because the database definition for the field was varchar200. Those who set up the field in the data base took advantage of the fact that varchar automatically adjusts to the number of characters needed. But when extracted, SAS uses the full length of 200 every time. If you have processes to examine fields and the length that they actually require, you will be a step ahead of the game whether or not you add compression afterwards.
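As a hedged illustration of that kind of length audit (the table and column names are assumptions), you can measure the longest value actually used and re-declare the variable accordingly:

proc sql;                                       /* how long is the longest value really? */
    select max(length(product_desc)) as max_used_length
    from staging.orders;
quit;

data work.orders_trim;
    length product_desc $32;                    /* assumed result of the audit above;    */
    set staging.orders;                         /* SAS will warn about the length change */
run;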
To try out a few ideas I jiffied up the following program, which anyone is welcome to try:
%let RecordCount = 10000; /* How many records do we want? */
data Strings(drop=_:);
    length TestString $32767;
    _Alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; /* Pool of letters to pick one from */
    call streaminit(1912281316);
    do RecordKey = 1 to &RecordCount.; /* Generate the records */
        call missing(TestString);
        _Len = min(int((rand('chisq', 2) * 10) ** 2), 31993); /* Create a number between 0 and 31993, greatly skewed to the right */
        do _Words = 1 to _Len by 6; /* Generate five character words separated by a space to fill up the length */
            do _LetterPos = _Words to _Words + 4;
                _RandLetter = ceil(rand('uniform') * 26); /* Get a random number from 1 to 26... */
                substr(TestString, _LetterPos, 1) = substr(_Alpha, _RandLetter, 1); /* ...and put that letter into the string */
            end;
        end;
        output;
    end;
run;
It generates some number of records (10,000 by default). Each record holds between 0 and roughly 32K bytes of five random characters followed by a space, and so on. This is a pretty good representation of the data I'm working with. The lengths are massively skewed to the right, with many, many short records and only a few long ones, but because of the long ones my variable has to be $32767. I've used streaminit, so you should get the same results as I do.
I'm copying this dataset to two libraries, one compressed, one not compressed. Here are my results (read off of windows explorer):
Records | Compressed MB | Uncompressed MB | Compression ratio |
1,000 | 1.536 | 32.832 | 95.32% |
10,000 | 9.472 | 326.040 | 97.09% |
100,000 | 87.552 | 3,257.436 | 97.31% |
I assume that because the variables are strings of random data, all of the benefit is coming from compressing out the spaces at the end of the records. Clearly, I'm gaining great benefit from the compression!
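For reference, a minimal sketch of the copy step (the paths are assumptions; the post above doesn't show them):

libname comp   "/tmp/comp_test"   compress=yes;    /* compressed target   */
libname nocomp "/tmp/nocomp_test" compress=no;     /* uncompressed target */

data comp.Strings;   set work.Strings; run;
data nocomp.Strings; set work.Strings; run;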
Tom
I have not had great success predicting which data sets will compress well. I think I know how RLE should work, but there must be factors I'm not aware of. So my practice with large data sets that I will create multiple times is to try all three settings on a data set, and just see which one works best.
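Here's a quick sketch of that trial-and-error approach (the dataset names are made up): write the same data with all three settings and compare the page counts.

data test_no(compress=no)
     test_char(compress=char)
     test_bin(compress=binary);
    set big.source;                        /* placeholder source dataset */
run;

proc sql;
    select memname, compress, npage        /* page count as a size proxy */
    from dictionary.tables
    where libname = "WORK" and memname like "TEST%";
quit;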
It used to be said that compression could make programs run faster because I/O is slow, and the decrease in I/O with compressed data more than made up for the increase in CPU usage. I don't know if that's still true with solid state disks, where I/O is much faster.
Our Unix disks have hardware compression, so using SAS compression might result in redundant compression at the physical level. I don't know what overall effect that would have - faster? slower? Maybe compression doesn't matter as much as it used to.
The POINTOBS= data set option "Specifies whether SAS creates compressed data sets whose observations can be randomly accessed or sequentially accessed." Also, "Specifying POINTOBS=YES does not affect the efficiency of retrieving information from a data set. It does increase CPU usage by approximately 10% when creating a compressed data set and when updating or adding information to it."
REUSE= is a related data set option: "Specifies whether new observations can be written to available space in compressed SAS data sets." The documentation doesn't say what happens in uncompressed data sets.
My reading of this is that SAS keeps an internal table that maps observation number to physical location, and so the order in which observations are returned is not necessarily the same as the order in which they are stored.
I don't know if my reading is correct or not, but that would correspond with how values within an observation are stored - the varnum/logical order of a variable is not necessarily the same as its physical location.
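A minimal sketch of what those options look like in practice (the dataset names are assumptions):

data work.claims(compress=binary pointobs=yes reuse=no);
    set work.claims_raw;                   /* placeholder source           */
run;

data work.one_row;                         /* direct access by obs number  */
    pick = 12345;
    set work.claims point=pick;
    output;
    stop;                                  /* STOP is required with POINT= */
run;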
SAS® Certification Prep Guide - Advanced Programming for SAS®9 Fourth Edition
Page 685: By default, a SAS data file is uncompressed. You can compress your data files in order to conserve disk space, although some files are not good candidates for compression.
Page 686: Remember that in order for SAS to read a compressed file, each observation must be uncompressed. This requires more CPU resources than reading an uncompressed file.
Page 688: A file that has been compressed using the BINARY setting of the COMPRESS= option takes significantly more CPU time to uncompress than a file that was compressed with the YES or CHAR setting. BINARY is more efficient with observations that are several hundred bytes or more in length. BINARY can also be very effective with character data that contains patterns rather than simple repetitions. [binary = RDC, char = RLE]
When you create a compressed data file, SAS compares the size of the compressed file to the size of the uncompressed file of the same page size. Then SAS writes a note to the log indicating the size reduction percent that is obtained by compressing the file.
When you use either of the COMPRESS= options, SAS calculates the size of the overhead that is introduced by compression as well as the maximum size of an observation in the data set that you are attempting to compress. If the maximum size of the observation is smaller than the overhead that is introduced by compression, SAS disables compression, creates an uncompressed data set, and issues a warning message stating that the file was not compressed.
I work in a UK financial institution, where COMPRESS=BINARY is set as a default system option. The previous UK financial institution I worked for had COMPRESS=NO. My previous employer struggled so much with storage that we had to compress datasets with gzip at the Unix level. At 3am our ops team sent an email saying our server had 0 bytes available, with our daily batch starting at 5am. Datasets had to be deleted to create space to gzip other files, which in turn created space for the vital daily jobs.
Only recently have I looked at improving run time on some jobs at my current employer. I've seen the message in the log saying that, uncompressed, the file would take fewer pages. A temp dataset of 2.4bn obs x 64-byte record length (8 x 8-byte numerics) used 200Gb, but 2.4bn x 64 bytes is about 145Gb. The variables are a unique Customer ID, a month-end date (integer), an integer count and 5 floats. COMPRESS=BINARY on this dataset looks like a lose-lose. Sorting it took 9 hours, but then PROC SQL (or a DATA step with BY / IF LAST.) group processing takes only 17 mins to output 23m obs and summary variables.
My current site is not so worried about storage (though large datasets are monitored, and owners are respectfully asked whether they are still needed). Most jobs run in good time, as the available resources have grown. A few jobs take long enough to be a problem. The best way to improve these is to cut down the I/O time by reducing the size/number of pages in the file.
It does make me wonder how many of our 1000s of production datasets (with relatively short record lengths) are compressed and using more space, more I/O and more run time.
I have a project coming up using 1.2bn financial transaction observations, for 20+ users. This includes a long text description. Whether to override COMPRESS=BINARY will be a choice I make. I'm giving serious thought to replacing the description with a 5-byte surrogate key and using a hash object as a non-ordered lookup to reduce record length. Dates will be 5 (or 4) bytes, the CUID will likely fit in 5 bytes, and if we have an account number (char 15), that is another candidate for a surrogate key. I may even accept loss of precision on amounts. With a proper sort and indexes to meet expected use, I have to take a 200Gb raw dataset and get my users' query run times below 5 mins. I will likely create subsets of records to get the run time down to less than 1 minute. If only we were a Viya site and not 9.4, none of this would be needed.
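A rough sketch of the surrogate-key idea (every name, length and dataset here is an assumption, not the actual design): store a small numeric key instead of the long text, shorten integer variables, and resolve the text at query time with a hash lookup.

data work.trans_slim;                              /* smaller stored record            */
    length txn_date 4 desc_key 6;                  /* integers fit safely here         */
    set work.trans_full(drop=description);         /* assumes desc_key already exists  */
run;

data work.resolved;                                /* resolve the text on demand       */
    length description $200;
    if _n_ = 1 then do;
        declare hash d(dataset: "work.desc_lookup");   /* desc_key -> description      */
        d.definekey("desc_key");
        d.definedata("description");
        d.definedone();
        call missing(description);
    end;
    set work.trans_slim;
    if d.find() ne 0 then call missing(description);
run;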
For me, the only reason for using COMPRESS=BINARY (or YES) is to reduce the number of pages used, creating a smaller file and less I/O. If it does not, set COMPRESS=NO for that dataset. Given that my company's data (over 1000s of datasets) is probably more numeric than long text, BINARY seems a sensible default, even if it sometimes shoots you in the foot. But you don't need to pull the trigger.
@PhilGee - Excellent commentary on where you should and should not use compression. We too have BINARY compression switched on by default. The primary driver for that is we read most of our data from SQL Server that tends to create very long character variables with lots of blank space if you don't explicitly shorten them. It is a whole lot easier just to compress rather than use LENGTH statements everywhere, and we are talking thousands of variables.
Compression suits over 90% of our datasets. For the minority where it doesn't, SAS warns you about it and you just add COMPRESS=NO when creating them. Easy really.
Another avenue that hasn't been mentioned yet is using the SPDE engine/format. We had a set of very large files and processing streams that took several days to run. Converting the masters to SPDE datasets resulted in substantial disk savings, often 90%.
While not all applications will benefit (random I/O workloads being one exception), it is certainly something to consider. SPDE also uses available memory and CPU, which is both good and bad. On the largest monthly job I was politely asked if I would run it on the weekend. Oops.
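A minimal sketch of pointing a library at the SPD engine (the paths are assumptions; data paths can be spread over several filesystems):

libname spdedata spde "/data/spde/meta"
    datapath=("/data/spde/d1" "/data/spde/d2")
    compress=yes;                          /* SPDE's own compression      */

data spdedata.master;
    set base.master;                       /* placeholder source library  */
run;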
--Ben