tedway
Obsidian | Level 7

I have a dataset on our network drive that is 37 GB in size. A simple count of one variable using PROC SQL can take over an hour, which makes the dataset effectively unusable.

 

I read about compress and gave it a try. 

 

data new (compress=yes reuse=yes);
    set old;
run;

The dataset is now 1.5 GB and the same count query takes two minutes (still slow but way better than before). 

 

The only negatives I've been able to find regarding compress are that it can make the file slower to access in some cases, and that you can't address observations by observation number.

 

Is there any other reason why I shouldn't use these settings by default with larger datasets going forward? 
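In case it helps anyone reading later, here is a sketch of applying this as a default rather than per dataset (COMPRESS= and REUSE= also exist as SAS system options; the dataset names are just the ones from the post):

```sas
/* Set compression as a session default so every new
   dataset is compressed without per-dataset options */
options compress=yes reuse=yes;

data new;    /* inherits compress=yes from the system option */
    set old;
run;
```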

1 ACCEPTED SOLUTION

LinusH
Tourmaline | Level 20
Indexes must be designed with the most common queries in mind. For Base SAS data sets, a general rule is that an index only improves performance when the query returns less than 10% of the observations.

REUSE= makes sense only when you delete a lot of observations in place.

The main downside of COMPRESS= is that it costs extra CPU cycles on both read and write operations. But in your case the reduced I/O clearly outweighs the compression overhead.

options msglevel=i;
gives feedback in the log about both compression and index usage.
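To illustrate (a sketch only; the index variable acct_id and the filter value are hypothetical, not from the thread):

```sas
/* Surface compression and index-usage notes in the log */
options msglevel=i;

/* Build a simple index on the variable most queries filter on */
proc datasets library=work nolist;
    modify new;
    index create acct_id;
quit;

/* With msglevel=i, the log reports whether this query used the index */
proc sql;
    select count(*) from new
    where acct_id = 12345;
quit;
```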
Data never sleeps


5 REPLIES
Reeza
Super User

Have you added indexes to your data set? 

tedway
Obsidian | Level 7

I did but I didn't see a performance improvement. I may need to play around with indexing a different variable or multiple variables. 

 

My main concern is finding out later that there is a downside to compressing that I'm not aware of now. 

Kurt_Bremser
Super User

An addendum: almost all of our production datasets are stored with compress=yes. The CPU cost of uncompressing an RLE-compressed dataset is negligible compared to the savings in space, and therefore in I/O. The only exceptions are datasets where the compression rate is too small or negative (an actual increase in physical size).
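As a sketch of what to watch for (dataset names are placeholders): SAS prints a note for every compressed dataset it creates, so the log itself tells you whether compression helped.

```sas
/* COMPRESS=YES (alias CHAR) uses RLE, good for long character fields;
   COMPRESS=BINARY (RDC) often does better on mostly numeric data */
data new (compress=binary);
    set old;
run;

/* The log then reports the outcome, e.g.:
   NOTE: Compressing data set WORK.NEW decreased size by 95.9 percent.
   A negative percentage means the file actually grew, and
   compression should be switched off for that dataset. */
```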

Kurt_Bremser
Super User

First of all, stop working on network drives. Especially when your network is so slow. Any modern storage reads a GB in less than 10 seconds.

Other than reading by observation number, compressed datasets are unproblematic. Just keep in mind that compress=yes can sometimes actually increase the size, so keep an eye on your logs.



Discussion stats
  • 5 replies
  • 1815 views
  • 1 like
  • 4 in conversation