I have a dataset on our network drive that is 37 GB in size. Doing a simple count of one variable using proc sql can take over an hour, which makes the dataset unusable.
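For context, the count query was along these lines (a sketch; `netlib.big` and `acct_id` are placeholder names, not the real library or variable):

```sas
proc sql;
  select count(acct_id) as n_nonmissing
  from netlib.big;
quit;
```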
I read about compress and gave it a try.
data new (compress=Yes reuse=yes); set old; run;
The dataset is now 1.5 GB and the same count query takes two minutes (still slow but way better than before).
The only negatives I've been able to find regarding compress are that it can make the file slower to access in some cases, and that apparently you can't address observations by observation number.
Is there any other reason why I shouldn't use these settings by default with larger datasets going forward?
Have you added indexes to your data set?
I did but I didn't see a performance improvement. I may need to play around with indexing a different variable or multiple variables.
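In case it helps, here's roughly how I'd build both a simple and a composite index with PROC DATASETS (library, dataset, and variable names are all placeholders for my actual data):

```sas
proc datasets library=mylib nolist;
  modify big;
  index create status;                 /* simple index on one variable */
  index create combo=(region status);  /* composite index on two variables */
quit;
```

Note that an index only helps when the query's WHERE clause is selective enough for SAS to prefer indexed access over a full scan, so a count over the whole table may not benefit.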
My main concern is finding out later that there is a downside to compressing that I'm not aware of now.
An addendum: almost all our production datasets are stored with compress=yes. The cost in CPU cycles for uncompressing an RLE-compressed dataset is negligible compared to the savings in space, and therefore the savings in I/O. Only datasets where the compression rate is too small or non-existent (i.e., an actual increase in physical size) are excluded.
First of all, stop working on network drives, especially when your network is this slow. Any modern local storage reads a GB in under 10 seconds.
Other than access by observation number, compressed datasets are unproblematic. Just keep in mind that compress=yes can sometimes increase the size, so keep an eye on your logs.
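To check whether compression actually paid off, you can look at the NOTE that SAS writes to the log when the compressed dataset is created (it reports the percentage change in size), or inspect the dataset afterwards; a minimal check, with `mylib.new` as a placeholder name:

```sas
/* The output header lists "Compressed: YES" and, when compression
   helped, the percent reduction achieved. */
proc contents data=mylib.new;
run;
```

If the log instead reports that compression increased the size, recreate that dataset with compress=no.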