Hi all,
In an effort to reduce file size, I found a macro named %squeeze (code here) and tried to apply it to my dataset. The result is not what I expected.
I have a compressed dataset, ex_non_trading (created with options compress=yes in another DATA step). I ran the %squeeze macro as follows:
options compress=yes reuse=yes;
%squeeze(my.ex_non_trading, squozennn)
proc contents data=my.ex_non_trading;
run;
proc contents data=squozennn;
run;
proc means data=my.ex_non_trading;
title 'ex_non_trading';
run;
proc means data=squozennn;
title 'squozennn';
run;
In the output, the file sizes of the two datasets are not really different.
Looking at the log, I see that the COMPRESS=YES option alone reduces the file size by about 70%:
NOTE: There were 10978714 observations read from the data set MY.EX_NON_TRADING.
NOTE: The data set WORK.SQUOZENNN has 10978714 observations and 15 variables.
NOTE: Compressing data set WORK.SQUOZENNN decreased size by 70.05 percent.
      Compressed is 20550 pages; un-compressed would require 68618 pages.
NOTE: DATA statement used (Total process time):
      real time 33.49 seconds
      cpu time 9.29 seconds
207 proc contents data=my.ex_non_trading;
208 run;
NOTE: PROCEDURE CONTENTS used (Total process time):
      real time 0.05 seconds
      cpu time 0.03 seconds
209 proc contents data=squozennn;
210 run;
When I ran the %squeeze macro without COMPRESS=YES, the output squozennn grew to 4 GB, four times the size of the original ex_non_trading, which surprised me.
I also looked at another document about the COMPRESS= option.
It says:
Compressing a file is a process that reduces the number of bytes required to represent each observation. In a compressed file, each observation is a variable-length record, while in an uncompressed file, each observation is a fixed-length record
So, in this case, do we need the %squeeze macro at all when COMPRESS=YES already does the job? From my understanding, %squeeze finds the largest length actually needed by each variable and redefines the variable to that length, whereas COMPRESS=YES shortens the stored record for each observation.
Warmest regards.
P.S. I also found a macro named %squeeze1. Has anyone here used it, and does it work well?
Accepted Solutions
%SQUEEZE reduces the defined length of variables, e.g. the numeric length of a date to 4.
COMPRESS reduces the used length by compressing sequences of repeated characters (mainly the blanks).
Squeezed datasets may cause problems later if you have to combine datasets whose defined lengths differ because of their content. COMPRESS on its own never poses such a problem; there are some datasets where compressing actually increases the file size, but not by a large margin.
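To make the difference concrete, here is a minimal sketch. It does not call %SQUEEZE itself; it imitates the effect with a LENGTH statement. The dataset names, variable names, and values are all made up for illustration.

```sas
/* Hypothetical data: names and values are illustrative only. */
data work.wide (compress=no);
   length trade_date 8 ticker $200;
   format trade_date date9.;
   do i = 1 to 100000;
      trade_date = '01JAN2020'd + mod(i, 365);
      ticker = 'ABC';
      output;
   end;
   drop i;
run;

/* COMPRESS: defined lengths stay 8 and $200;                */
/* the long runs of trailing blanks are stored compactly.    */
data work.wide_rle (compress=yes);
   set work.wide;
run;

/* Squeeze-style: defined lengths shrunk to fit the current  */
/* content. SAS may log a multiple-lengths warning here.     */
data work.narrow (compress=no);
   length trade_date 4 ticker $3;
   set work.wide;
run;

/* The risk: a later dataset holds a longer value. */
data work.other;
   length ticker $6;
   ticker = 'LONGER';
run;

/* work.narrow is listed first, so ticker stays $3 and       */
/* 'LONGER' is truncated to 'LON' in the combined dataset.   */
data work.combined;
   set work.narrow (obs=1) work.other;
run;

/* Compare defined lengths and page counts. */
proc contents data=work.wide_rle; run;
proc contents data=work.narrow; run;
```

PROC CONTENTS should show the same defined lengths for wide and wide_rle, and shorter ones for narrow. Which of the two ends up smaller on disk depends on the data.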
In my experience, a lot of SAS sites have COMPRESS=YES switched on as a permanent session option because it can both reduce disk storage significantly and reduce I/O. You might also try COMPRESS=BINARY, as that can sometimes do better than YES.
I never bother with %SQUEEZE, as it requires additional processing with unpredictable results.
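A quick way to compare the two settings on your own data is to write one copy with each and read the "Compressing data set ..." notes in the log. This is only a sketch: the output names work.copy_char and work.copy_binary are made up, and it assumes my.ex_non_trading from the question is available.

```sas
/* COMPRESS=CHAR is the same as COMPRESS=YES (run-length encoding). */
data work.copy_char (compress=char);
   set my.ex_non_trading;
run;

/* COMPRESS=BINARY uses a different algorithm that often helps      */
/* with wide observations and many numeric variables.               */
data work.copy_binary (compress=binary);
   set my.ex_non_trading;
run;
```

Whichever step reports the larger "decreased size by" percentage in the log is the better choice for that dataset.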
You are much better off storing your data using the SPDE engine with binary compression than with any other method. If you do that, there is no need to end up with unpredictable variable lengths (which will give you headaches when merging).
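As a rough sketch of what that could look like: the libref spdelib and the directory path below are placeholders, not anything from this thread, and the directory must already exist.

```sas
/* Placeholder path: point this at an existing directory of your own. */
libname spdelib spde '/path/to/spde/storage';

/* Copy the dataset into the SPDE library with binary compression.    */
/* Defined variable lengths are left exactly as they were.            */
data spdelib.ex_non_trading (compress=binary);
   set my.ex_non_trading;
run;

proc contents data=spdelib.ex_non_trading;
run;
```

Because the variable definitions are untouched, later merges with other datasets behave the same as with the original.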