The SAS7BDAT format is straightforward to work with but large because of its metadata. The following HAVE.SAS7BDAT is 192KB.
libname desktop "!userprofile\desktop";
data desktop.have;
  do i=1 to 5000;
    x=rannor(1);
    output;
  end;
run;
CSV is smaller than SAS7BDAT but sacrifices the metadata. The following WANT1.CSV is 91KB.
proc export data=desktop.have outfile="!userprofile\desktop\want1.csv" dbms=csv replace;
run;
Or one can zip the original SAS7BDAT. The following WANT2.ZIP is 85KB.
filename want2 zip "!userprofile\desktop\want2.zip";
data _null_;
  infile "!userprofile\desktop\have.sas7bdat" recfm=f;
  input;
  file want2(have.sas7bdat) recfm=n;
  put _infile_;
run;
If the size is the only issue, then one can zip the WANT1.CSV above. The following WANT3.ZIP is 41KB—about 80% smaller than the original SAS7BDAT.
filename want3 zip "!userprofile\desktop\want3.zip";
data _null_;
  infile "!userprofile\desktop\want1.csv";
  input;
  file want3(want1.csv);
  put _infile_;
run;
Here are my questions.
1. Is this the right approach if the size is the only objective, and the metadata are unnecessary?
2. I think the only tricky part is the RECFM in the second case: the SAS7BDAT inside the ZIP is corrupted without it. Is there anything else I need to consider to avoid subtle problems like this?
3. I am not sure whether the COMPRESS option can deal with this issue. Is there any alternative approach worth considering to minimize dataset size in SAS?
Thanks.
You can compress the dataset and save disk space by putting
options compress=yes;
before the data step that creates it, or you can compress an existing dataset with:
data have (compress=yes);
  set have;
run;
The following COMPRESS.SAS7BDAT is 256KB.
libname desktop "!userprofile\desktop";
data desktop.compress(compress=yes);
  do i=1 to 5000;
    x=rannor(1);
    output;
  end;
run;
This is 33% larger than the original HAVE.SAS7BDAT.
The COMPRESS option is most useful when there is a lot of character data.
In your test case the variables are all numeric and each observation is short. Compression removes trailing blanks and repeated characters, but it adds some per-observation overhead so the data can be uncompressed, and that overhead is why the file grew.
So with purely numeric data, one needs to minimize via CSV or ZIP rather than COMPRESS.
Using CSV you pay one delimiter byte between adjacent variables (n-1 bytes per observation for n variables), but the numeric values usually print in fewer characters than their 8-byte binary storage.
Using ZIP, as far as I know, the data are unavailable until unzipped.
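To see the effect directly, here is a minimal check, assuming the DESKTOP libref and the HAVE and COMPRESS datasets created above: PROC CONTENTS reports whether a member is compressed and how many pages it occupies, so the two versions can be compared side by side.
proc contents data=desktop.have;     /* uncompressed version */
run;
proc contents data=desktop.compress; /* COMPRESS=YES version: note the page count and the Compressed attribute */
run;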
I think that any dataset smaller than 1 GB is too small to be worth spending time optimizing its size, so please post some real-life information about the problem you are trying to solve. Nowadays I use file-system compression on directories containing larger datasets that are seldom used. The COMPRESS option can save a large amount of disk space, but AFAIK you lose some options when processing the data.
Those were minimal working examples; here is one real-life case. The following HAVE.SAS7BDAT is 2.2GB.
data i;
  infile "http://global-q.org/testingportfolios.html" url lrecl=3276700
    column=i length=j;
  do k=1 by 1 until(i>j);
    input l ~:$32767. @;
    if find(l,".csv") then output;
  end;
run;

libname desktop "!userprofile\desktop";
data desktop.have;
  set i;
  where ^find(l,"me") & find(l,"daily") | find(l,"me_daily");
  length infile $80 date 8 rank nstocks 3 ret_vw 8;
  infile=scan(scan(l,2,'"'),-1,"/");
  j=cats("http://global-q.org",scan(l,2,'"'));
  infile l url filevar=j firstobs=2 dsd end=m;
  do until(m);
    input date rank nstocks ret_vw;
    output;
  end;
  keep infile date rank nstocks ret_vw;
run;
It becomes 1.3GB after COMPRESS—40% smaller than the original—but 201MB with ZIP—90% smaller.
Perhaps a small contribution ...
DATE is most likely an integer. You should be able to cut its length from 8 to 4 without losing any precision.
I also thought "What if 3 rather than 4?" for a moment, but then realized that a length of 3 only stores integers exactly up to 8,192, which is too small. Thanks again.
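A minimal sketch of that change, assuming DATE holds SAS date values (day counts since 01JAN1960) rather than raw yyyymmdd numbers; the output dataset name is just a placeholder, and the LENGTH statement must come before SET to take effect:
data desktop.have_short;   /* hypothetical output name */
  length date 4;           /* length 4 stores integers exactly up to 2,097,152 - ample for day counts */
  set desktop.have;
run;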
If the particular example you show is representative, then you can use the LENGTH statement to save space on a variable-by-variable basis. And you can re-organize to avoid repeating the long INFILE character value on every row.
In this case, you start out with one 80-byte character variable and four 8-byte numerics: 112 bytes per observation. But you can do considerably better, as the program below shows.
Using your program unmodified, dataset HAVE on my machine is 2,558,590,076 bytes.
Storing it compressed is 1,468,137,472 bytes - about a 42% reduction.
But using the LENGTH statement and re-organizing as described above produces dataset sizes of 732,889,088 bytes for HAVE and 131,072 bytes for INFILE_LOOKUP, using the program below. That's a 71% reduction.
data need;
  infile "http://global-q.org/testingportfolios.html" url lrecl=3276700
    column=C length=L;
  do k=1 by 1 until(C>L);
    input X ~:$32767. @;
    if find(X,".csv") then output;
  end;
  length K 3;
run;
data infile_lookup (keep=K infile)
     have (keep=K date rank nstocks ret_vw);
  set need;
  where ^find(X,"me") & find(X,"daily") | find(X,"me_daily");
  length infile $80 date 8 rank nstocks 3 ret_vw 8;
  infile=scan(scan(X,2,'"'),-1,"/");
  output infile_lookup;
  j=cats("http://global-q.org",scan(X,2,'"'));
  infile p url filevar=j firstobs=2 dsd end=m;
  do until(m);
    input date :yymmdd8. rank nstocks ret_vw;
    output have;
  end;
  format date date9.;
  length date 4 rank nstocks 3;
run;
And you can always create a data set view to merge these files on demand behind the scenes:
data vhave / view=vhave;
  merge have infile_lookup;
  by K;
run;
VHAVE takes effectively zero bytes and zero seconds to create, but you can treat it like a dataset in all subsequent analysis programs (e.g. proc reg data=vhave; by k; ...).
Thanks for this detailed advice, but I wonder whether simply zipping the SAS7BDAT (or even using CSV if the metadata are unnecessary) can sidestep this micromanagement when there are too many datasets and variables, for example.
If you are willing to unzip a SAS dataset before every use, then zip it. Remember, this means a lot of disk writes before you can access the data.
My suggestion was intended to show how to save a great deal of space when constructing the SAS datasets, such that there would be absolutely no need for further disk activity or data management to analyze the data.
Let's say you want to calculate the annual value-weighted return for each portfolio, using the daily return variable RET_VW. And let's say you only want it for one of the deciles (say rank=1). You would simply use:
proc means data=vhave;
  where rank=1;
  by infile notsorted;
  var ret_vw;
run;
This would have virtually zero writing to disk. It would read in only the 733MB (the sum of the two components of VHAVE) and produce the results.
BUT ... if you zip the SAS7BDAT file, then to use it you would have to unzip it first (writing the full uncompressed file back to disk) every time you want to read it.
Now if you plan to essentially archive these data sets, and then infrequently access them for sustained use, zip is fine.
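For completeness, here is a minimal sketch of that extraction step, mirroring the ZIP copy in the original post in reverse (the fileref and output path are just placeholders); it streams the member out of the archive byte for byte so the restored SAS7BDAT is usable again:
filename inzip zip "!userprofile\desktop\want2.zip";
data _null_;
  /* read the member in fixed-length chunks, keeping the length of the final partial record */
  infile inzip(have.sas7bdat) lrecl=256 recfm=f length=len eof=eof unbuf;
  file "!userprofile\desktop\have_restored.sas7bdat" lrecl=256 recfm=n;
  input;
  put _infile_ $varying256. len;
  return;
eof:
  stop;
run;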
@Junyong wrote:
Thanks for this detailed advice, but I wonder whether simply zipping the SAS7BDAT (or even using CSV if the metadata are unnecessary) can sidestep this micromanagement when there are too many datasets and variables, for example.
As to "too many datasets", I think you can get around that by use of some macro programming, such that your program takes the name of your detail dataset, creates a standard lookup dataset with a related name, and a standard data set view with its own related name. That's a single program - not a program for each dataset.
And "too many variables": if you are naming variables in an INPUT statement, then setting lengths for those with obvious integer values within a given range, should not be a problem (i.e. DATE, RANK, NSTOCKS in the above). If you want to determine what length to assign for these vars, you can find the maximum consecutive integer for each length L from 3 to 8 - just ask SAS to assign values X=constant("exactint",L) and print X.
In short, yes it's an investment up front. But if you want the simplest access to the result, the investment might be worth it.
Thanks again for these helpful details. I agree that unzipping later requires additional computing resources, but what I had in mind was minimizing ready-made data before sharing it on the Internet, like these two sites do. The objective is to minimize the server-side load rather than end users' client-side burden.
The former site offers (1) zipped CSVs while the latter offers (2) plain CSVs. I also see some researchers who provide (3) XLS or XLSX files and others who provide (4) zipped SAS7BDATs (ignoring R, Python, etc.). Though 7Z is another possibility, I am not considering it because SAS cannot handle it directly.
(5) Optimized (as you mentioned, right-sizing variable lengths saves storage) and presorted SAS7BDATs are more efficient and expedite the next steps, but I thought zipping is also beneficial in this context.
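If you go the optimized-SAS7BDAT route, here is a minimal sketch of the presorting step (the BY variables are only an assumption based on the earlier example) so recipients can use BY processing immediately:
proc sort data=desktop.have out=desktop.have_sorted (compress=yes);
  by infile date;   /* assumed keys */
run;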
@Junyong - I'm a strong believer in no micromanagement of SAS dataset sizes as well. I've found from experience that setting COMPRESS = YES or BINARY on all SAS sessions works well for us. We can get 80% plus compression on large datasets with many character variables with the BINARY option. We prefer not to use ZIP because of the extra management involved.
With disk space being comparatively cheap these days it is more cost effective to just allocate more space than to waste expensive SAS professionals' time optimising storage - they have better things to do...
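A minimal sketch of that setup, typically placed in an autoexec or configuration file rather than repeated in every program; the per-dataset override is shown only to illustrate that the session default can still be overridden:
options compress=binary;            /* session default: every dataset created from now on is compressed */
data work.example (compress=yes);   /* per-dataset override, e.g. where character (RLE) compression fits better */
  set sashelp.class;
run;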