The SAS7BDAT format is straightforward to work with but large because of its metadata. The following HAVE.SAS7BDAT is 192KB.
libname desktop "!userprofile\desktop";
data desktop.have;
  do i=1 to 5000;
    x=rannor(1);
    output;
  end;
run;
CSV is smaller than SAS7BDAT but sacrifices the metadata. The following WANT1.CSV is 91KB.
proc export data=desktop.have outfile="!userprofile\desktop\want1.csv" dbms=csv replace;
run;
Or one can zip the original SAS7BDAT. The following WANT2.ZIP is 85KB.
filename want2 zip "!userprofile\desktop\want2.zip";
data _null_;
  infile "!userprofile\desktop\have.sas7bdat" recfm=f;
  input;
  file want2(have.sas7bdat) recfm=n;
  put _infile_;
run;
If the size is the only issue, then one can zip the WANT1.CSV above. The following WANT3.ZIP is 41KB—about 80% smaller than the original SAS7BDAT.
filename want3 zip "!userprofile\desktop\want3.zip";
data _null_;
  infile "!userprofile\desktop\want1.csv";
  input;
  file want3(want1.csv);
  put _infile_;
run;
Here are my questions.
1. Is this the right approach if the size is the only objective, and the metadata are unnecessary?
2. I think the only tricky part is the RECFM in the second case: the SAS7BDAT inside the ZIP is corrupted without it. Is there anything else I need to consider to avoid subtle problems like this?
3. I am not sure whether the COMPRESS option can deal with this issue. Is there any alternative approach worth considering to minimize dataset size in SAS?
Thanks.
You can compress the dataset and save disk space by putting
options compress=yes;
before the data step that creates it, or you can compress an existing dataset with:
data have (compress=yes);
  set have;
run;
The following COMPRESS.SAS7BDAT is 256KB.
libname desktop "!userprofile\desktop";
data desktop.compress(compress=yes);
  do i=1 to 5000;
    x=rannor(1);
    output;
  end;
run;
This is 33% larger than the original HAVE.SAS7BDAT.
The COMPRESS option is most useful when there is a lot of character data.
In your test case the variables are all numeric and each observation is short. Compression removes trailing blanks and repeated characters, but it adds some per-observation overhead so the data can be uncompressed, and that overhead is why the file grew.
So with purely numeric data, one needs to minimize via CSV or ZIP rather than COMPRESS.
Using CSV you pay one delimiter byte between adjacent variables (n-1 bytes per observation for n variables), but the numeric values usually print in fewer characters than their 8-byte binary storage.
Using ZIP, as far as I know, the data are unavailable until unzipped.
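To see the effect directly, here is a minimal check, assuming the DESKTOP libref and the HAVE and COMPRESS datasets created above: PROC CONTENTS reports whether a member is compressed and how many pages it occupies, so the two versions can be compared side by side.
proc contents data=desktop.have;     /* uncompressed version */
run;
proc contents data=desktop.compress; /* COMPRESS=YES version: note the page count and the Compressed attribute */
run;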
I think that any dataset smaller than 1 GB is too small to be worth spending time optimizing its size, so please post some real-life information about the problem you are trying to solve. Nowadays I use file-system compression on directories containing larger datasets that are seldom used. The COMPRESS option can save a large amount of disk space, but AFAIK you lose some options when processing the data.
Those were minimal working examples; here is one real-life case. The following HAVE.SAS7BDAT is 2.2GB.
data i;
  infile "http://global-q.org/testingportfolios.html" url lrecl=3276700
    column=i length=j;
  do k=1 by 1 until(i>j);
    input l ~:$32767. @;
    if find(l,".csv") then output;
  end;
run;

libname desktop "!userprofile\desktop";
data desktop.have;
  set i;
  where ^find(l,"me") & find(l,"daily") | find(l,"me_daily");
  length infile $80 date 8 rank nstocks 3 ret_vw 8;
  infile=scan(scan(l,2,'"'),-1,"/");
  j=cats("http://global-q.org",scan(l,2,'"'));
  infile l url filevar=j firstobs=2 dsd end=m;
  do until(m);
    input date rank nstocks ret_vw;
    output;
  end;
  keep infile date rank nstocks ret_vw;
run;
It becomes 1.3GB after COMPRESS—40% smaller than the original—but 201MB with ZIP—90% smaller.
Perhaps a small contribution ...
DATE is most likely an integer. You should be able to cut its length from 8 to 4 without losing any precision.
I also thought "What if 3 rather than 4?" for a moment, but then realized that a length of 3 only stores integers exactly up to 8,192, which is too small. Thanks again.
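A minimal sketch of that change, assuming DATE holds SAS date values (day counts since 01JAN1960) rather than raw yyyymmdd numbers; the output dataset name is just a placeholder, and the LENGTH statement must come before SET to take effect:
data desktop.have_short;   /* hypothetical output name */
  length date 4;           /* length 4 stores integers exactly up to 2,097,152 - ample for day counts */
  set desktop.have;
run;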
If the particular example you show is representative, then you can use the LENGTH statement to save space on a variable-by-variable basis. And you can re-organize to avoid repeating the long INFILE character value on every row.
In this case, you start out with one 80-byte character variable and four 8-byte numerics: 112 bytes per observation. But you can do considerably better, as the program below shows.
Using your program unmodified, dataset HAVE on my machine is 2,558,590,076 bytes.
Storing it compressed is 1,468,137,472 bytes - about a 42% reduction.
But using the LENGTH statement and re-organizing as described above produces dataset sizes of 732,889,088 bytes for HAVE and 131,072 bytes for INFILE_LOOKUP, using the program below. That's a 71% reduction.
data need;
  infile "http://global-q.org/testingportfolios.html" url lrecl=3276700
    column=C length=L;
  do k=1 by 1 until(C>L);
    input X ~:$32767. @;
    if find(X,".csv") then output;
  end;
  length K 3;
run;
data infile_lookup (keep=K infile)
     have (keep=K date rank nstocks ret_vw);
  set need;
  where ^find(X,"me") & find(X,"daily") | find(X,"me_daily");
  length infile $80 date 8 rank nstocks 3 ret_vw 8;
  infile=scan(scan(X,2,'"'),-1,"/");
  output infile_lookup;
  j=cats("http://global-q.org",scan(X,2,'"'));
  infile p url filevar=j firstobs=2 dsd end=m;
  do until(m);
    input date :yymmdd8. rank nstocks ret_vw;
    output have;
  end;
  format date date9.;
  length date 4 rank nstocks 3;
run;
And you can always create a data set view to merge these files on demand behind the scenes:
data vhave / view=vhave;
  merge have infile_lookup;
  by K;
run;
VHAVE takes effectively zero bytes and zero seconds to create, but you can treat it like a dataset in all subsequent analysis programs (e.g. proc reg data=vhave; by k; ...).
Thanks for this detailed advice, but I wonder whether simply zipping the SAS7BDAT (or even using CSV if the metadata are unnecessary) can sidestep this micromanagement when there are too many datasets and variables, for example.
If you are willing to unzip a SAS dataset before every use, then zip it. Remember, this means a lot of disk writes before you can access the data.
My suggestion was intended to show how to save a great deal of space when constructing the SAS datasets, such that there would be absolutely no need for further disk activity or data management to analyze the data.
Let's say you want to calculate the annual value-weighted return for each portfolio, using the daily return variable RET_VW. And let's say you only want it for one of the deciles (say rank=1). You would simply use:
proc means data=vhave;
  where rank=1;
  by infile notsorted;
  var ret_vw;
run;
This would have virtually zero writing to disk. It would read in only the 733MB (the sum of the two components of VHAVE) and produce the results.
BUT ... if you zip the SAS7BDAT file, then to use it you would have to unzip it first (writing the full uncompressed file back to disk) every time you want to read it.
Now if you plan to essentially archive these data sets, and then infrequently access them for sustained use, zip is fine.
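For completeness, here is a minimal sketch of that extraction step, mirroring the ZIP copy in the original post in reverse (the fileref and output path are just placeholders); it streams the member out of the archive byte for byte so the restored SAS7BDAT is usable again:
filename inzip zip "!userprofile\desktop\want2.zip";
data _null_;
  /* read the member in fixed-length chunks, keeping the length of the final partial record */
  infile inzip(have.sas7bdat) lrecl=256 recfm=f length=len eof=eof unbuf;
  file "!userprofile\desktop\have_restored.sas7bdat" lrecl=256 recfm=n;
  input;
  put _infile_ $varying256. len;
  return;
eof:
  stop;
run;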
@Junyong wrote:
Thanks for this detailed advice, but I wonder whether simply zipping the SAS7BDAT (or even using CSV if the metadata are unnecessary) can sidestep this micromanagement when there are too many datasets and variables, for example.
As to "too many datasets", I think you can get around that by use of some macro programming, such that your program takes the name of your detail dataset, creates a standard lookup dataset with a related name, and a standard data set view with its own related name. That's a single program - not a program for each dataset.
And "too many variables": if you are naming variables in an INPUT statement, then setting lengths for those with obvious integer values within a given range, should not be a problem (i.e. DATE, RANK, NSTOCKS in the above). If you want to determine what length to assign for these vars, you can find the maximum consecutive integer for each length L from 3 to 8 - just ask SAS to assign values X=constant("exactint",L) and print X.
In short, yes it's an investment up front. But if you want the simplest access to the result, the investment might be worth it.
Thanks again for these helpful details. I agree that unzipping later requires additional computing resources, but what I had in mind was minimizing ready-made data before sharing it on the Internet, like these two sites do. The objective is to minimize the server-side load rather than end users' client-side burden.
The former site offers (1) zipped CSVs while the latter offers (2) plain CSVs. I also see some researchers who provide (3) XLS or XLSX files and others who provide (4) zipped SAS7BDATs (ignoring R, Python, etc.). Though 7Z is another possibility, I am not considering it because SAS cannot handle it directly.
(5) Optimized (as you mentioned, right-sizing variable lengths saves storage) and presorted SAS7BDATs are more efficient and expedite the next steps, but I thought zipping is also beneficial in this context.
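If you go the optimized-SAS7BDAT route, here is a minimal sketch of the presorting step (the BY variables are only an assumption based on the earlier example) so recipients can use BY processing immediately:
proc sort data=desktop.have out=desktop.have_sorted (compress=yes);
  by infile date;   /* assumed keys */
run;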
@Junyong - I'm a strong believer in no micromanagement of SAS dataset sizes as well. I've found from experience that setting COMPRESS = YES or BINARY on all SAS sessions works well for us. We can get 80% plus compression on large datasets with many character variables with the BINARY option. We prefer not to use ZIP because of the extra management involved.
With disk space being comparatively cheap these days it is more cost effective to just allocate more space than to waste expensive SAS professionals' time optimising storage - they have better things to do...
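A minimal sketch of that setup, typically placed in an autoexec or configuration file rather than repeated in every program; the per-dataset override is shown only to illustrate that the session default can still be overridden:
options compress=binary;            /* session default: every dataset created from now on is compressed */
data work.example (compress=yes);   /* per-dataset override, e.g. where character (RLE) compression fits better */
  set sashelp.class;
run;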