Efficiency Matters: Compressing SAS Datasets

In a data-driven world, the size of datasets continues to grow. As data volumes increase, so does the time required to store, access, and process that data. One common suggestion for improving efficiency and reducing storage requirements is to enable dataset compression. Continuing my "Efficiency Matters" series, this post unpacks how SAS compression works and when it makes sense to use it.

By default, observations in SAS datasets are stored on disk in a fixed-length format. This means that each row takes up the same amount of space, regardless of how much of that space is actually used. In many cases, the byte length assigned to a variable is longer than what is needed to store its values. Compression addresses this inefficiency by storing repeated values and unused space in fewer bytes. Typical examples are trailing blanks in character variables and runs of zeros in the binary representation of numeric variables.

COMPRESS=YES | CHAR

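If your dataset predominantly contains character variables with repeated runs of the same character (including spaces), you can specify COMPRESS=YES or COMPRESS=CHAR as a data set option on the DATA statement. The two values are equivalent: both use run-length encoding (RLE), which stores a run of identical consecutive bytes in a shorter form.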

For example, the following DATA step creates a dataset with three variables and 100,000 observations:

data CharRepeats(compress=char);
     length RepeatingChar $ 200 FName $ 500 LName $ 500;
     do i=1 to 100000;
          RepeatingChar='aaaaaaaaaaaaaaaaaaaaaa';
          FName='Carleigh';
          LName='Crabtree';
          output;
     end;
     drop i;
run;

WORK.CHARREPEATS Partial Output:

Notice that the variable RepeatingChar contains a repeated character pattern and has a length of 200. The FName and LName variables each contain a string of only 8 characters, but each has a length of 500. This means there are hundreds of trailing spaces in each observation that take up unnecessary disk space.

By specifying COMPRESS=CHAR, SAS can represent repeated characters and blank spaces much more efficiently. When this DATA step runs, SAS writes a note to the log reporting how much the dataset size was reduced.

In this example, the dataset size is reduced by nearly 97%. For larger datasets, this kind of reduction can lead to meaningful improvements in storage usage and I/O performance.
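
The compression setting is stored with the dataset, so it can be checked after the fact. As a minimal sketch, assuming the CharRepeats dataset created above is still in the WORK library, PROC CONTENTS lists the setting among the dataset attributes:

```sas
/* List the attributes of the dataset created above.
   The Compressed attribute in the output shows the compression type. */
proc contents data=work.CharRepeats;
run;
```

For a dataset that is not compressed, the same attribute reads NO.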

COMPRESS=BINARY

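If your dataset predominantly contains numeric variables, the COMPRESS=BINARY option is typically more effective. This option uses Ross Data Compression (RDC), which combines run-length encoding with sliding-window compression, so it can shrink repeated byte patterns in the binary representation of numeric values. The SAS documentation describes it as most effective for observations of several hundred bytes or more.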

The following example creates a dataset with 200 numeric variables that store yes/no survey responses, represented as 1 for yes and 2 for no.

data survey_random(compress=binary);
     length customer_id 8;
     array q[200] Q1-Q200;
     do customer_id = 1 to 20000;
          do i = 1 to dim(q);
               /* Random Yes/No response */
               q[i] = ceil(ranuni(12345)*2); /* 1 or 2 */
          end;
          output;
     end;
     drop i;
run;

WORK.SURVEY_RANDOM Partial Output:

When COMPRESS=BINARY is specified, the dataset size is reduced by approximately 77%.

To understand why this works, consider the binary representation of the numbers 1 and 2:

data test1;
     x=1;
     y=2;
     put x=;
     put x=binary64.;
     put y=;
     put y=binary64.;
run;

Partial log generated from the PUT statements:

Both values contain many repeating zeros in their binary form. When a dataset is compressed using COMPRESS=BINARY, those repeating zeros are stored much more efficiently, resulting in significant space savings for numeric-heavy data.
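
One way to see which setting suits a given dataset is to write the same data under each option and compare the results. The following is a rough sketch, not a definitive benchmark: it assumes the survey_random dataset from above exists in the WORK library, and the Cmp_No, Cmp_Char, and Cmp_Binary dataset names are hypothetical names chosen for illustration.

```sas
/* Write the same observations under each COMPRESS= setting.
   With no explicit OUTPUT statement, every row goes to all three datasets.
   The Cmp_* names are hypothetical. */
data Cmp_No(compress=no)
     Cmp_Char(compress=char)
     Cmp_Binary(compress=binary);
     set survey_random;
run;

/* Compare the stored size of each copy using the DICTIONARY.TABLES view */
proc sql;
     select memname,
            compress,                    /* compression setting */
            pcompress,                   /* percent compression */
            npage,                       /* number of data set pages */
            filesize format=comma20.     /* file size in bytes */
     from dictionary.tables
     where libname = 'WORK'
       and memname in ('CMP_NO', 'CMP_CHAR', 'CMP_BINARY')
     order by filesize;
quit;
```

The log also prints a compression note for each compressed copy, so the percentages can be read there as well. Running the same comparison on a representative sample of your own data is likely to be more informative than the synthetic example.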

COMPRESS=NO

It’s important to note that compression is a permanent attribute of a SAS dataset. Once a dataset is created with compression enabled, you cannot toggle compression on or off in place. To remove compression, you must recreate the dataset and explicitly specify COMPRESS=NO.
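
As a minimal sketch of that step, assuming the compressed CharRepeats dataset from earlier still exists, the DATA step below rewrites it without compression. The name CharRepeats_Uncomp is hypothetical; writing to a new name keeps the compressed original until the copy has been checked.

```sas
/* Recreate the dataset with compression turned off.
   CharRepeats_Uncomp is a hypothetical name for the uncompressed copy. */
data CharRepeats_Uncomp(compress=no);
     set CharRepeats;
run;
```

Specifying COMPRESS=NO explicitly matters when a COMPRESS= system option or LIBNAME option is in effect, because the data set option takes precedence over both.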

Conclusion

Dataset compression is a simple but powerful way to improve storage efficiency in SAS. By understanding how COMPRESS=CHAR and COMPRESS=BINARY work, and choosing the option that aligns with your data types, you can significantly reduce disk usage and potentially improve I/O performance with minimal effort. As with any optimization, compression isn’t a one-size-fits-all solution, but when applied thoughtfully, it can be an easy win for more efficient SAS processing.

For more on improving SAS program efficiency, check out the course: Improving SAS Program Efficiency.

Code for this post is downloadable from my GitHub here.

SASKiwi · 02-16-2026

@CarleighJoC - This is a timely article! As a SAS administrator I ensure that COMPRESS = BINARY is set by default in all our SAS installations so users don't need to switch it on. Due to the proliferation of external databases defining many columns as strings or large varchars, reading these in SAS can cause massive space blowouts so you really need settings in place to deal with this.

CarleighJoC · 02-16-2026

@SASKiwi Thank you for sharing your experience! I'm glad to see this has been an effective technique in your work as a SAS administrator.

touwen_k · 02-26-2026

@CarleighJoC thank you for this article. Can this be used in both SAS 9.4 and Viya 4? I am also a SAS admin and we are considering adding it to the restricted options. Could you also advise what the best method is to figure out whether we need compress=char or compress=binary?

SASKiwi · 02-26-2026

@touwen_k - Run some tests on your typical datasets using both CHAR and BINARY and compare the compression percentage. Go with the one that provides the highest compression. Compression works in Viya.

CarleighJoC · 02-27-2026

@touwen_k I'm glad you found this article helpful! I agree with @SASKiwi: compression can be used on both SAS 9.4 and Viya 4. In Viya, it may be especially helpful if you prefer to work with data on the Compute server.

I also agree that the best way to determine which option to use is to benchmark both and see which provides the higher compression (the percentage is printed automatically as a note in the log). In addition, it's helpful to know your data. If you have a dataset with predominantly numeric variables, I'd start testing with compress=binary. If you have a dataset with predominantly character variables, I'd start testing with compress=char.

Keep in mind that compression adds overhead to each observation, so it is not always effective when the data is small. For example, I took the DATA step used to demo compress=binary and reduced the number of question variables to 10 and the number of rows to 200. When I ran it, the dataset size actually increased by 100%! This is why testing is important. Here is the code I ran for that. You can copy and paste it into your own environment, run it, and compare the result to the compress=binary example in the article.

data survey_random(compress=binary);
     length customer_id 8;
     array q[10] Q1-Q10;
     do customer_id = 1 to 200;
          do i = 1 to dim(q);
               /* Random Yes/No response */
               q[i] = ceil(ranuni(12345)*2); /* 1 or 2 */
          end;
          output;
     end;
     drop i;
run;

touwen_k · 03-06-2026

Thank you both for your suggestions, much appreciated!