<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Unexpected Size of Dataset in SAS Data Management</title>
    <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561630#M17203</link>
    <description>&lt;P&gt;This is the data step version:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data project.ds3 (reuse=YES) ;

   merge project.ds1 (in = inX )
         project.ds2  
         ;
   by field1
      field2
      ;
   if inX ;   * optional ;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Compression is done automatically based on the setup of my SAS-EG&lt;/P&gt;</description>
    <pubDate>Sun, 26 May 2019 19:44:29 GMT</pubDate>
    <dc:creator>agoldma</dc:creator>
    <dc:date>2019-05-26T19:44:29Z</dc:date>
    <item>
      <title>Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561625#M17201</link>
      <description>&lt;P&gt;I'm merging 2 datasets&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;count = 6,437,567 variables = &amp;nbsp;22&amp;nbsp;&amp;nbsp; file size = 1.2 GB&amp;nbsp;&amp;nbsp; page size =&amp;nbsp; 65,536&amp;nbsp;&amp;nbsp; # pages = 18,566&lt;/LI&gt;
&lt;LI&gt;count = 2,276,587 variables = 917&amp;nbsp; file size = 2.4 GB&amp;nbsp;&amp;nbsp; page size = 131,072&amp;nbsp; # pages = 19,726&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The resulting dataset has:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;count = 6,437,567 &amp;nbsp;&amp;nbsp;variables = 924&amp;nbsp; file size = &lt;STRONG&gt;21.8 GB&lt;/STRONG&gt;&amp;nbsp; page size = 131,072&amp;nbsp; # pages = 170,379&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I didn't expect to get 22 GB when merging 1GB with 2GB&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This is a straight merge without any calculations (dropping/merging a few fields)&lt;/P&gt;
&lt;P&gt;I get the same results with a data step merge as with SQL (left join)&lt;/P&gt;
&lt;P&gt;None of the character fields are excessively long (the longest is 40 and only a few are that long)&lt;/P&gt;
&lt;P&gt;All 3 datasets use CHAR compression;&amp;nbsp; no indexes yet&lt;/P&gt;
&lt;P&gt;Dataset option REUSE = YES reduced the output from 23 GB to 22 GB&lt;/P&gt;
&lt;P&gt;I'm using SAS-EG 9.0401M5 in Linux&lt;/P&gt;
&lt;P&gt;Does it really make sense for the output dataset to be 22 GB ?&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 17:48:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561625#M17201</guid>
      <dc:creator>agoldma</dc:creator>
      <dc:date>2019-05-26T17:48:05Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561628#M17202</link>
      <description>&lt;P&gt;Can you post your code please?&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 19:24:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561628#M17202</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2019-05-26T19:24:17Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561630#M17203</link>
      <description>&lt;P&gt;This is the data step version:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data project.ds3 (reuse=YES) ;

   merge project.ds1 (in = inX )
         project.ds2  
         ;
   by field1
      field2
      ;
   if inX ;   * optional ;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Compression is done automatically based on the setup of my SAS-EG&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 19:44:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561630#M17203</guid>
      <dc:creator>agoldma</dc:creator>
      <dc:date>2019-05-26T19:44:29Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561632#M17204</link>
      <description>&lt;P&gt;Are you sure that the data set is compressed? What happens if you add the Compress=Char Option like this?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data project.ds3 (reuse=YES compress=char) ;

   merge project.ds1 (in = inX )
         project.ds2  
         ;
   by field1
      field2
      ;
   if inX ;   * optional ;
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Sun, 26 May 2019 20:27:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561632#M17204</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2019-05-26T20:27:57Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561633#M17205</link>
      <description>&lt;P&gt;Which compression rate is reported in the log?&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 21:17:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561633#M17205</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2019-05-26T21:17:09Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561634#M17206</link>
      <description>&lt;P&gt;I know that the dataset ds3 is compressed because PROC CONTENTS says so and the log says that compression reduced the size by 54%&lt;/P&gt;
&lt;P&gt;I still tried adding COMPRESS=CHAR (at the top) but it didn't change anything&lt;/P&gt;
&lt;P&gt;I tried a few other things:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Data step copy -- produces the same size&lt;/LI&gt;
&lt;LI&gt;PROC COPY -- produces the same size&lt;/LI&gt;
&lt;LI&gt;PROC CPORT -- produces a smaller but still large transfer file of 14 GB&lt;BR /&gt;&amp;nbsp;... PROC CIMPORT&amp;nbsp; -- produces the same size (22 GB)&lt;/LI&gt;
&lt;LI&gt;COMPRESS=BINARY -- reduced size to 10 GB (sill very large, coming from 1 + 2 GB)&lt;/LI&gt;
&lt;LI&gt;PROC EXPORT to CSV = 9 GB (record length between 1045 and 19602)&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Maybe 10 GB is the right answer (just hard to believe that 1 + 2 = 10)&lt;/P&gt;
&lt;P&gt;Maybe this is why we made relational databases 50 years ago (and I should keep these 2 datasets separate)&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 21:56:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561634#M17206</guid>
      <dc:creator>agoldma</dc:creator>
      <dc:date>2019-05-26T21:56:40Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561638#M17207</link>
      <description>&lt;P&gt;10 sounds right.&amp;nbsp; You are adding over 900 variables to your 6 million observations.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2.4*3 = 7.2 + 2.1 = 9.3&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 22:12:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561638#M17207</guid>
      <dc:creator>Tom</dc:creator>
      <dc:date>2019-05-26T22:12:19Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561640#M17208</link>
      <description>&lt;P&gt;When I imported that CSV file, I got 2 GB -- like I expected ... and it had bufsize=64k with CHAR compression&lt;/P&gt;
&lt;P&gt;I tried making a dataset with bufsize=64k and BINARY compression... but it turned out to be 10GB&lt;/P&gt;
&lt;P&gt;&amp;nbsp;...even though BINARY compression reduced space by 79%&lt;/P&gt;
&lt;P&gt;That CSV file must be missing something... or what it's missing is useless&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 22:21:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561640#M17208</guid>
      <dc:creator>agoldma</dc:creator>
      <dc:date>2019-05-26T22:21:34Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561642#M17209</link>
      <description>&lt;P&gt;For each dataset we can roughly calculate the average amount of memory (bytes) occupied by each value:&lt;/P&gt;
&lt;P&gt;(page size * # pages)/(# obs. * # variables). The results are about 8.6 for ds1, 1.2 for ds2 and 3.8 for ds3.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The fact that the average value in ds2 (probably including numeric values of 8 bytes and character values of varying lengths, possibly up to 40) occupies only 1.2 bytes suggests that the average observation of ds2 is highly compressed. It is quite possible, though, that the compression rate varies considerably between records:&amp;nbsp;Observations with many missing character values will have much higher compression rates than observations with long non-missing values.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Observations of the latter type (in ds2) might have field1-field2 combinations whose BY groups in ds1 consist of several observations so that multiple copies of those long values are created. Thus, the distribution of compression rates can easily shift towards lower rates. (After all, it's plausible that you tend to match more non-missing values than missing values from ds2 to ds1.) But this means an increased average memory consumption in ds3 per value coming from ds2 (i.e. &amp;gt;1.2 bytes). The much larger space requirements per value from ds1 (8.6 bytes) increase the average in ds3 further -- why not to 3.8 bytes?&lt;/P&gt;</description>
      <pubDate>Sun, 26 May 2019 22:37:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561642#M17209</guid>
      <dc:creator>FreelanceReinh</dc:creator>
      <dc:date>2019-05-26T22:37:39Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561646#M17210</link>
      <description>You have variables overwriting each other as well. That doesn't concern you?&lt;BR /&gt;&lt;BR /&gt;917+22 - 2 (linkage) = 937 but you only have 924.</description>
      <pubDate>Mon, 27 May 2019 00:17:18 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561646#M17210</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2019-05-27T00:17:18Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561649#M17211</link>
      <description>&lt;P&gt;Hi Reeza,&lt;/P&gt;
&lt;P&gt;Thank you for looking at this&lt;/P&gt;
&lt;P&gt;I have a few variables overwriting eachother, but I'm not concerned about it because they have the same or nearly the same values.&lt;/P&gt;
&lt;P&gt;When I imported the CSV file, I got 2 GB like I wanted... but... most of those 900 variables changed from numeric to character, so instead of 8 bytes they were using 1 byte... which explains the reduction from 10 GB to 2 GB.&lt;/P&gt;
&lt;P&gt;I guess Tom and FreelanceReinhard are right -- that I shouldn't expect the same compression in the resulting dataset that I had in the source datasets.&lt;/P&gt;</description>
      <pubDate>Mon, 27 May 2019 00:45:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561649#M17211</guid>
      <dc:creator>agoldma</dc:creator>
      <dc:date>2019-05-27T00:45:44Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561667#M17212</link>
      <description>Did you import your data with PROC IMPORT? If it’s 1, I suspect you have a lot of missing, or truncated values.</description>
      <pubDate>Mon, 27 May 2019 04:19:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561667#M17212</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2019-05-27T04:19:54Z</dc:date>
    </item>
    <item>
      <title>Re: Unexpected Size of Dataset</title>
      <link>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561744#M17216</link>
      <description>&lt;P&gt;Yes, the import was bad or at least very different&lt;/P&gt;
&lt;P&gt;One big difference was that most of the 900 variables changed from numeric to character of length 1&lt;/P&gt;
&lt;P&gt;Length 1 is not sufficient for most of those variables&lt;/P&gt;
&lt;P&gt;They have many missing values, so PROC IMPORT probably didn't work right&lt;/P&gt;
&lt;P&gt;Expecting 2GB for the merged dataset is probably not realistic&lt;/P&gt;</description>
      <pubDate>Mon, 27 May 2019 13:06:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Data-Management/Unexpected-Size-of-Dataset/m-p/561744#M17216</guid>
      <dc:creator>agoldma</dc:creator>
      <dc:date>2019-05-27T13:06:22Z</dc:date>
    </item>
  </channel>
</rss>

