<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to Minimize Data? in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708644#M217810</link>
    <description>&lt;P&gt;If you are willing to unzip a SAS dataset every time before you use it, then zip it.&amp;nbsp; &amp;nbsp;Remember, this means you will have to perform a lot of disk writes prior to accessing the data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My suggestion was intended to show how to save a great deal of space when constructing the SAS data sets, such that there would be absolutely no need for further disk activity or data management to analyze the data.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let's say you want to calculate the annual value-weighted return for each portfolio, using the daily return variable RET_VW.&amp;nbsp; And let's say you only want it for one of the deciles (say rank=1).&amp;nbsp; You would simply use:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc means data=vhave;
  where rank=1;
  by infile notsorted;
  var ret_vw;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This would have virtually zero writing to disk.&amp;nbsp; &amp;nbsp;It would read in only the&amp;nbsp;&lt;SPAN&gt;733MB (the sum of the two components of VHAVE) and produce the results.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;BUT ... if you zip the SAS7BDAT file, to use it you would have to&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;SPAN&gt;UNZIP it, writing&amp;nbsp;2.55GB (or 1.47GB if "compress=yes") to disk.&amp;nbsp; Remember, writing a GB of data to disk takes an order of magnitude more disk activity, and more time, than reading the same amount.&lt;BR /&gt;And unfortunately you can't use the "WHERE" filter to subset the unzipping process - so it writes 10 times more data to disk than you actually need.&amp;nbsp; &amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Then have SAS read that unzipped data with a PROC MEANS like the above.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;So at the minimum, that's 1.47GB writing and 1.47GB reading.&amp;nbsp; (compared to 0 writing and 733MB reading above).&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;&lt;SPAN&gt;Now if you plan to essentially archive these data sets, and then infrequently access them for sustained use, zip is fine.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/173881"&gt;@Junyong&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Thanks for this detailed advice, but I wonder whether simply zipping SAS7BDAT (or even using CSV if metadata are unnecessary) can walk around this micromanagement if there are too many datasets and variables, for example.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN&gt;As to "too many datasets", I think you can get around that by use of some macro programming, such that your program takes the name of your detail dataset, creates a standard lookup dataset with a related name, and a standard data set view with its own related name.&amp;nbsp; That's a single program - not a program for each dataset.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;And "too many variables":&amp;nbsp; if you are naming variables in an INPUT statement, then setting lengths for those &lt;EM&gt;&lt;STRONG&gt;with obvious integer values within a given range&lt;/STRONG&gt;&lt;/EM&gt;, should not be a problem (i.e. DATE, RANK, NSTOCKS in the above).&amp;nbsp; If you want to determine what length to assign for these vars, you can find the maximum consecutive integer for each length L from 3 to 8 - just ask SAS to assign values&amp;nbsp; X=constant("exactint",L) and print X.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;In short, yes it's an investment up front.&amp;nbsp; But if you want the simplest access to the result, the investment might be worth it.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 29 Dec 2020 21:22:09 GMT</pubDate>
    <dc:creator>mkeintz</dc:creator>
    <dc:date>2020-12-29T21:22:09Z</dc:date>
    <item>
      <title>How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708520#M217754</link>
      <description>&lt;P&gt;SAS7BDAT is straightforward but large because of metadata. The following HAVE.SAS7BDAT is 192KB.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname desktop "!userprofile\desktop";
data desktop.have;
do i=1 to 5000;
x=rannor(1);
output;
end;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;CSV is smaller than SAS7BDAT but sacrifices the metadata. The following WANT1.CSV is 91KB.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc export data=desktop.have outfile="!userprofile\desktop\want1.csv" dbms=csv replace;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Or one can zip the original SAS7BDAT. The following WANT2.ZIP is 85KB.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;filename want2 zip "!userprofile\desktop\want2.zip";
data _null_;
infile "!userprofile\desktop\have.sas7bdat" recfm=f;
input;
file want2(have.sas7bdat) recfm=n;
put _infile_;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;If the size is the only issue, then one can zip the WANT1.CSV above. The following WANT3.ZIP is 41KB—about 80% smaller than the original SAS7BDAT.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;filename want3 zip "!userprofile\desktop\want3.zip";
data _null_;
infile "!userprofile\desktop\want1.csv";
input;
file want3(want1.csv);
put _infile_;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Here are my questions.&lt;/P&gt;&lt;P&gt;1. Is this the right approach if size is the only objective and the metadata are unnecessary?&lt;/P&gt;&lt;P&gt;2. I think the only tricky part is the RECFM in the second case: the SAS7BDAT inside the ZIP is damaged without it. Is there anything else I need to consider to avoid fragile behavior like this?&lt;/P&gt;&lt;P&gt;3. I am not sure whether the COMPRESS options can deal with this issue. Is there any other approach worth considering to minimize data in SAS?&lt;/P&gt;&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 07:33:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708520#M217754</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T07:33:29Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708521#M217755</link>
      <description>&lt;P&gt;You can compress the dataset and save disk space by using&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;options compress=yes;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;preceding the data creation or you can compress it by:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have (compress=yes);
 set have;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 07:45:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708521#M217755</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-12-29T07:45:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708522#M217756</link>
      <description>&lt;P&gt;I think that any dataset smaller than 1 GB is too small to be worth the time spent optimizing its size. So please post some real-life information about the problem you are trying to solve. Nowadays I use file system compression on directories containing larger datasets that are seldom used. The COMPRESS option can save a large amount of disk space, but as far as I know you lose some options when processing the data.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 07:51:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708522#M217756</guid>
      <dc:creator>andreas_lds</dc:creator>
      <dc:date>2020-12-29T07:51:59Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708524#M217757</link>
      <description>&lt;P&gt;The following COMPRESS.SAS7BDAT is 256KB.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname desktop "!userprofile\desktop";
data desktop.compress(compress=yes);
do i=1 to 5000;
x=rannor(1);
output;
end;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;This is 33% larger than the original HAVE.SAS7BDAT.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 07:58:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708524#M217757</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T07:58:59Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708525#M217758</link>
      <description>&lt;P&gt;The compress option is very useful when there are a lot of alphanumeric data.&lt;/P&gt;
&lt;P&gt;In your test case the variables are all numeric, and the observation is short. Compression removes nulls and unnecessary spaces but adds some metadata so the data can be uncompressed. That is why the size increased.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 08:14:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708525#M217758</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-12-29T08:14:34Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708527#M217760</link>
      <description>&lt;P&gt;Those were minimal working examples. Here is one real-life example.&amp;nbsp;The following HAVE.SAS7BDAT is 2.2GB.&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data i;
	infile "http://global-q.org/testingportfolios.html" url lrecl=3276700
		column=i length=j;
	do k=1 by 1 until(i&amp;gt;j);
		input l ~:$32767. @;
		if find(l,".csv") then output;
	end;
run;

libname desktop "!userprofile\desktop";

data desktop.have;
	set i;
	where ^find(l,"me") &amp;amp; find(l,"daily") | find(l,"me_daily");
	length infile $80 date 8 rank nstocks 3 ret_vw 8;
	infile=scan(scan(l,2,'"'),-1,"/");
	j=cats("http://global-q.org",scan(l,2,'"'));
	infile l url filevar=j firstobs=2 dsd end=m;
	do until(m);
		input date rank nstocks ret_vw;
		output;
	end;
	keep infile date rank nstocks ret_vw;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;It becomes 1.3GB after COMPRESS—40% smaller than the original—but 201MB with ZIP—90% smaller.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 08:26:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708527#M217760</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T08:26:15Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708528#M217761</link>
      <description>&lt;P&gt;So if the data are all numeric, one needs to minimize via either CSV or ZIP rather than COMPRESS.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 08:25:33 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708528#M217761</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T08:25:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708533#M217765</link>
      <description>&lt;P&gt;Using CSV, you pay one byte (the delimiter) for each of n-1 out of the n variables.&lt;/P&gt;
&lt;P&gt;Using ZIP, the dataset might be unavailable, as far as I know, until it is unzipped.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 08:50:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708533#M217765</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-12-29T08:50:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708563#M217771</link>
      <description>&lt;P&gt;Perhaps a small contribution ...&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;DATE is most likely an integer.&amp;nbsp; You should be able to cut its length from 8 to 4 without losing any precision.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 13:39:25 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708563#M217771</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2020-12-29T13:39:25Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708587#M217784</link>
      <description>&lt;P&gt;If the particular example you show is representative, then you can use the LENGTH statement to save space on a variable-by-variable basis. And you can re-organize to remove the burden of repeated rows of the INFILE character variable.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In this case, you start out with 1 80-byte character variable and 4 8-byte numerics: 112 bytes/obs.&amp;nbsp; But you can&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Use the LENGTH statement to store DATE as a 4 byte var (default is 8), which will precisely store all integers with absolute value &amp;lt;=2,097,152 on a windows machine.&amp;nbsp; But you are reading date as an 8-digit integer (e.g. 19700103) which would be too large.&amp;nbsp; &amp;nbsp;So I also recommend you read date using the YYMMDD8. informat, which will convert the value into the number of days after 01jan1960.&amp;nbsp; In turn that makes every date storable in 4 bytes up to 23OCT7701&amp;nbsp; (3 bytes would only go up to 06JUN1982).&lt;BR /&gt;Savings: 4 bytes/obs.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;Rank is apparently nothing more than a decile index, i.e. it's an integer from 1 through 10.&amp;nbsp; Store it as length 3 (all integers up to 8,192).&lt;BR /&gt;Savings: 5 bytes/obs.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;It looks like NSTOCKS will also never be more than 8,192.&amp;nbsp; Store as length 3.&lt;BR /&gt;Another 5 bytes/obs.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;MOST important:&amp;nbsp; You have repeated values of the character variable INFILE, stored as 80 bytes.&amp;nbsp; Make a separate dataset lookup table (with 1 obs per INFILE) with INFILE value and an integer code, (use the K in your first data step) .&amp;nbsp; Then keep K in your HAVE dataset.&lt;BR /&gt;Savings&amp;nbsp; &amp;nbsp;72 bytes/obs.&amp;nbsp; (or more if you shorten the storage length of K).&lt;/LI&gt;
&lt;/OL&gt;
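&lt;P&gt;To see why item 1 recommends the YYMMDD8. informat before shortening DATE, here is a minimal sketch. The dataset name DEMO and the variable names AS_INTEGER and AS_SASDATE are made up for illustration; the value 19700103 is the example date from item 1.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data demo;
  length as_integer 4 as_sasdate 4 rank 3;
  /* yyyymmdd digits kept as a plain integer: larger than the 4-byte exact limit */
  as_integer = 19700103;
  /* same digits read as a SAS date (days since 01jan1960): well within the limit */
  as_sasdate = input('19700103', yymmdd8.);
  /* decile index: tiny integer, safe in 3 bytes */
  rank = 10;
  format as_sasdate date9.;
run;

/* Values are truncated when written to disk, so print the stored dataset */
proc print data=demo;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Comparing the printed AS_INTEGER with the assigned value should show whether precision was lost on your platform.&lt;/P&gt;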
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Running your program unmodified, dataset HAVE on my machine is 2,558,590,076 bytes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Stored compressed, it is&amp;nbsp;1,468,137,472 bytes, about a 42% reduction.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;But using the LENGTH statement and re-organizing as described in #4 above produces dataset sizes of&amp;nbsp;732,889,088 bytes for HAVE and&amp;nbsp;131,072 bytes for INFILE_LOOKUP, using the program below.&amp;nbsp; That's a 71% reduction.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data need;
	infile "http://global-q.org/testingportfolios.html" url lrecl=3276700
		column=C length=L;
	do k=1 by 1 until(C&amp;gt;L);
		input X ~:$32767. @;
		if find(X,".csv") then output;
	end;
	length K 3;
run;

data infile_lookup (keep=K infile)
    have (keep=K date rank nstocks ret_vw);
	set need;
	where ^find(X,"me") &amp;amp; find(X,"daily") | find(X,"me_daily");
	length infile $80 date 8 rank nstocks 3 ret_vw 8;
	infile=scan(scan(X,2,'"'),-1,"/");
	output infile_lookup;
	j=cats("http://global-q.org",scan(X,2,'"'));
	infile p url filevar=j firstobs=2 dsd end=m;
	do until(m);
		input date :yymmdd8. rank nstocks ret_vw;
		output have;
	end;
	format date date9.;
	length date 4 rank nstocks 3;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;And you can always create a data set view to merge these files on demand behind the scenes:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data vhave /view=vhave;
  merge have  infile_lookup;
  by K;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;VHAVE takes effectively zero bytes and zero seconds to create.&amp;nbsp; But you can treat it as a dataset file for all subsequent analysis programs&amp;nbsp; (e.g. proc reg data=vhave;&amp;nbsp; &amp;nbsp;by k; .... ).&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 16:32:28 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708587#M217784</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2020-12-29T16:32:28Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708613#M217798</link>
      <description>&lt;P&gt;Thanks for this detailed advice, but I wonder whether simply zipping SAS7BDAT (or even using CSV if metadata are unnecessary) can walk around this micromanagement if there are too many datasets and variables, for example.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 18:05:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708613#M217798</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T18:05:09Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708644#M217810</link>
      <description>&lt;P&gt;If you are willing to unzip a SAS dataset every time before you use it, then zip it.&amp;nbsp; &amp;nbsp;Remember, this means you will have to perform a lot of disk writes prior to accessing the data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My suggestion was intended to show how to save a great deal of space when constructing the SAS data sets, such that there would be absolutely no need for further disk activity or data management to analyze the data.&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Let's say you want to calculate the annual value-weighted return for each portfolio, using the daily return variable RET_VW.&amp;nbsp; And let's say you only want it for one of the deciles (say rank=1).&amp;nbsp; You would simply use:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc means data=vhave;
  where rank=1;
  by infile notsorted;
  var ret_vw;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This would have virtually zero writing to disk.&amp;nbsp; &amp;nbsp;It would read in only the&amp;nbsp;&lt;SPAN&gt;733MB (the sum of the two components of VHAVE) and produce the results.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;BUT ... if you zip the SAS7BDAT file, to use it you would have to&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;SPAN&gt;UNZIP it, writing&amp;nbsp;2.55GB (or 1.47GB if "compress=yes") to disk.&amp;nbsp; Remember, writing a GB of data to disk takes an order of magnitude more disk activity, and more time, than reading the same amount.&lt;BR /&gt;And unfortunately you can't use the "WHERE" filter to subset the unzipping process - so it writes 10 times more data to disk than you actually need.&amp;nbsp; &amp;nbsp;&lt;BR /&gt;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;Then have SAS read that unzipped data with a PROC MEANS like the above.&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN&gt;So at the minimum, that's 1.47GB writing and 1.47GB reading.&amp;nbsp; (compared to 0 writing and 733MB reading above).&lt;/SPAN&gt;&lt;/LI&gt;
&lt;/OL&gt;
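&lt;P&gt;For illustration, steps 1 and 2 might look roughly like the sketch below. It assumes the small WANT2.ZIP and its member HAVE.SAS7BDAT from the original question (with variable X), and it unzips into the WORK folder; the fileref names INZIP and UNZIPPED are made up for this example.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Assumed archive: the WANT2.ZIP built in the original question */
filename inzip zip "!userprofile\desktop\want2.zip";
/* Target file inside the WORK library folder */
filename unzipped "%sysfunc(getoption(work))/have.sas7bdat";

/* Step 1: copy the member out of the archive byte for byte (a full write to disk) */
data _null_;
  infile inzip(have.sas7bdat) lrecl=256 recfm=f length=len eof=done unbuf;
  file unzipped lrecl=256 recfm=n;
  input;
  put _infile_ $varying256. len;
  return;
done:
  stop;
run;

/* Step 2: only now can SAS read the file as a dataset */
proc means data=work.have;
  var x;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Note that the copy in step 1 has no way to apply a WHERE filter; the whole member is written before any analysis can start.&lt;/P&gt;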
&lt;P&gt;&lt;SPAN&gt;Now if you plan to essentially archive these data sets, and then infrequently access them for sustained use, zip is fine.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/173881"&gt;@Junyong&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;
&lt;P&gt;Thanks for this detailed advice, but I wonder whether simply zipping SAS7BDAT (or even using CSV if metadata are unnecessary) can walk around this micromanagement if there are too many datasets and variables, for example.&lt;/P&gt;
&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;&lt;SPAN&gt;As to "too many datasets", I think you can get around that by use of some macro programming, such that your program takes the name of your detail dataset, creates a standard lookup dataset with a related name, and a standard data set view with its own related name.&amp;nbsp; That's a single program - not a program for each dataset.&lt;/SPAN&gt;&lt;/P&gt;
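&lt;P&gt;&lt;SPAN&gt;A rough sketch of such a macro follows. The macro name COMPACT, its parameters, the _DETAIL and _LOOKUP suffixes, and the dataset MYDETAIL are all hypothetical, and it assumes a one-level dataset name whose rows are already grouped by the repeated character variable.&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%macro compact(ds=, charvar=);
  /* One pass: give each contiguous group of the character variable an integer code K */
  data &amp;amp;ds._detail(drop=&amp;amp;charvar) &amp;amp;ds._lookup(keep=K &amp;amp;charvar);
    set &amp;amp;ds;
    by &amp;amp;charvar notsorted;
    if first.&amp;amp;charvar then do;
      K+1;                        /* sum statement: K is retained and starts at 0 */
      output &amp;amp;ds._lookup;    /* one lookup row per group */
    end;
    output &amp;amp;ds._detail;      /* every row, without the long character variable */
  run;

  /* View that re-attaches the character variable on demand */
  data v&amp;amp;ds / view=v&amp;amp;ds;
    merge &amp;amp;ds._detail &amp;amp;ds._lookup;
    by K;
  run;
%mend compact;

/* Hypothetical call: creates MYDETAIL_DETAIL, MYDETAIL_LOOKUP and the view VMYDETAIL */
%compact(ds=mydetail, charvar=infile)&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN&gt;Because K increases in the order the groups appear, the detail dataset is already in K order and the MERGE in the view needs no extra sort.&lt;/SPAN&gt;&lt;/P&gt;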
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;And "too many variables":&amp;nbsp; if you are naming variables in an INPUT statement, then setting lengths for those &lt;EM&gt;&lt;STRONG&gt;with obvious integer values within a given range&lt;/STRONG&gt;&lt;/EM&gt;, should not be a problem (i.e. DATE, RANK, NSTOCKS in the above).&amp;nbsp; If you want to determine what length to assign for these vars, you can find the maximum consecutive integer for each length L from 3 to 8 - just ask SAS to assign values&amp;nbsp; X=constant("exactint",L) and print X.&lt;/SPAN&gt;&lt;/P&gt;
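&lt;P&gt;&lt;SPAN&gt;For example, a short DATA step along these lines writes the limits to the log (the variable names L and X follow the sentence above; the COMMA26. format is just a readability choice):&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  do L=3 to 8;
    /* largest integer up to which every integer is stored exactly in L bytes */
    X=constant('exactint',L);
    put L= X= comma26.;
  end;
run;&lt;/CODE&gt;&lt;/PRE&gt;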
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;In short, yes it's an investment up front.&amp;nbsp; But if you want the simplest access to the result, the investment might be worth it.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 21:22:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708644#M217810</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2020-12-29T21:22:09Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708649#M217813</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/173881"&gt;@Junyong&lt;/a&gt;&amp;nbsp; - I'm a strong believer in no micromanagement of SAS dataset sizes as well. I've found from experience that setting COMPRESS = YES or BINARY on all SAS sessions works well for us. We can get 80% plus compression on large datasets with many character variables with the BINARY option. We prefer not to use ZIP because of the extra management involved.&lt;/P&gt;
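&lt;P&gt;As a minimal sketch of that setup (the dataset names WANT and HAVE are placeholders):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Session-wide default for every dataset created afterwards */
options compress=binary;

/* Or per dataset, overriding the session default */
data want(compress=binary);
  set have;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The SAS log reports by what percentage compression decreased or increased the size, so it is easy to check whether a given dataset benefits.&lt;/P&gt;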
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With disk space being comparatively cheap these days it is more cost effective to just allocate more space than to waste expensive SAS professionals' time optimising storage - they have better things to do...&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 21:44:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708649#M217813</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2020-12-29T21:44:24Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708655#M217817</link>
      <description>&lt;P&gt;Thanks again for these helpful details. I agree that unzipping later requires additional computing resources, but what I had in mind was minimizing ready-made data before sharing it on the Internet, as &lt;A href="http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html" target="_blank"&gt;these&lt;/A&gt; &lt;A href="http://global-q.org/testingportfolios.html" target="_blank"&gt;two&lt;/A&gt; sites do. The objective is to minimize the server-side load rather than the end users' client-side burden.&lt;/P&gt;&lt;P&gt;The former site serves (1) zipped CSVs while the latter serves (2) plain CSVs. I also see some researchers who provide (3) XLS or XLSX files, and others who provide (4) zipped SAS7BDATs (ignoring R, Python, etc.). Though 7Z archives are another possibility, I am not considering them because SAS cannot handle them directly.&lt;/P&gt;&lt;P&gt;(5) Optimized (as you mentioned, fitting the variable lengths saves storage) and presorted SAS7BDATs are more efficient and expedite the next steps, but I thought zipping would also be beneficial in this context.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 22:35:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708655#M217817</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T22:35:52Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708657#M217819</link>
      <description>&lt;P&gt;Yes, HDDs are cheaper than before. As mentioned above, I was also thinking about possible traffic on the Internet. It seems that numeric data, rather than character data, benefit more from zipping than from compressing.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 22:44:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708657#M217819</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T22:44:17Z</dc:date>
    </item>
    <item>
      <title>Re: How to Minimize Data?</title>
      <link>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708658#M217820</link>
      <description>&lt;P&gt;I also thought "What if 3 rather than 4?" just for a moment and realized 8,192 is inappropriate, but thanks again.&lt;/P&gt;</description>
      <pubDate>Tue, 29 Dec 2020 22:47:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/How-to-Minimize-Data/m-p/708658#M217820</guid>
      <dc:creator>Junyong</dc:creator>
      <dc:date>2020-12-29T22:47:30Z</dc:date>
    </item>
  </channel>
</rss>

