<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic The divided file takes up too much space in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688119#M208986</link>
    <description>&lt;P&gt;Hi guys,&lt;BR /&gt;I have a little problem.&lt;BR /&gt;In my program I read a file of about 50 GB, so to speed up the calculations I divided it into smaller files according to one of the variables. This shortened the calculations significantly. But another problem arose: the divided files take up much more than 100 GB of disk space (many small files of 128kb each). I use SAS 9.4. Can you do something about it?&lt;/P&gt;
&lt;P&gt;Thank you for your help.&lt;BR /&gt;Best regards&lt;/P&gt;</description>
    <pubDate>Thu, 01 Oct 2020 07:19:40 GMT</pubDate>
    <dc:creator>makset</dc:creator>
    <dc:date>2020-10-01T07:19:40Z</dc:date>
    <item>
      <title>The divided file takes up too much space</title>
      <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688119#M208986</link>
      <description>&lt;P&gt;Hi guys,&lt;BR /&gt;I have a little problem.&lt;BR /&gt;In my program I read a file of about 50 GB, so to speed up the calculations I divided it into smaller files according to one of the variables. This shortened the calculations significantly. But another problem arose: the divided files take up much more than 100 GB of disk space (many small files of 128kb each). I use SAS 9.4. Can you do something about it?&lt;/P&gt;
&lt;P&gt;Thank you for your help.&lt;BR /&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Thu, 01 Oct 2020 07:19:40 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688119#M208986</guid>
      <dc:creator>makset</dc:creator>
      <dc:date>2020-10-01T07:19:40Z</dc:date>
    </item>
    <item>
      <title>Re: The divided file takes up too much space</title>
      <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688123#M208987</link>
      <description>&lt;P&gt;&lt;EM&gt;We&lt;/EM&gt; can't do anything about it, that's up to&amp;nbsp;&lt;EM&gt;you&lt;/EM&gt;. We do not have access to your SAS server &lt;span class="lia-unicode-emoji" title=":winking_face:"&gt;😉&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;But we can give you hints.&lt;/P&gt;
&lt;P&gt;My first suspicion is that your original dataset is compressed, and your subset datasets are not.&lt;/P&gt;
&lt;P&gt;Make sure to use the COMPRESS=YES dataset option when creating the subsets.&lt;/P&gt;
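&lt;P&gt;A minimal sketch of that option, not taken from the original program: SASHELP.CARS stands in for the large dataset, ORIGIN for the split variable, and WORK.CARS_ASIA is a hypothetical subset name. The PROC CONTENTS output lists a "Compressed" attribute for the new dataset.&lt;/P&gt;

```sas
/* Assumption: SASHELP.CARS plays the role of the 50 GB source dataset */
data work.cars_asia (compress=yes);   /* COMPRESS=YES is set on the OUTPUT dataset */
    set sashelp.cars;
    where origin = 'Asia';            /* hypothetical split condition */
run;

/* Check the "Compressed" attribute of the subset that was just written */
proc contents data=work.cars_asia;
run;
```

&lt;P&gt;On a table this small, SAS may report in the log that compression brings no benefit; the option tends to pay off on larger subsets, for example ones with long, blank-padded character variables.&lt;/P&gt;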
&lt;P&gt;Run a PROC CONTENTS on your original dataset to see if and how it is compressed.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Oct 2020 07:34:44 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688123#M208987</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-01T07:34:44Z</dc:date>
    </item>
    <item>
      <title>Re: The divided file takes up too much space</title>
      <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688137#M208993</link>
      <description>&lt;P&gt;1. Splitting 50 GB into 128 kB files seems to produce far too many small files. Can you make larger chunks? They will compress better. If you use a binary-compressed SPDE library, the compression ratio will be much higher still, but not on such small files.&lt;/P&gt;
&lt;P&gt;2. Another way is to store the files in a compressed folder. Larger files are also better here.&lt;/P&gt;
&lt;P&gt;3. Another way is to process the original large file in chunks by using a BY statement, or by using successive WHERE clauses.&lt;/P&gt;
&lt;P&gt;4. 50 GB in 128 kB files is about 400,000 files. Are you sure you want this?&lt;/P&gt;
&lt;P&gt;5. With 128 kB files, you waste a good part of the disk space, depending on the file system's cluster size.&amp;nbsp;&lt;/P&gt;
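&lt;P&gt;Point 3 above, processing the original file in place instead of writing out pieces, could look roughly like this. It is only a sketch: SASHELP.CARS stands in for the large dataset, ORIGIN for the split variable, MSRP for an analysis variable, and WORK.CARS_SORTED is a hypothetical name.&lt;/P&gt;

```sas
/* BY-group processing needs the data sorted (or indexed) by the BY variable */
proc sort data=sashelp.cars out=work.cars_sorted;
    by origin;
run;

/* One pass over the sorted data, one set of statistics per BY group */
proc means data=work.cars_sorted n mean;
    by origin;
    var msrp;
run;

/* Alternative: pick one group at a time with a WHERE statement, no sort needed */
proc means data=sashelp.cars n mean;
    where origin = 'Europe';
    var msrp;
run;
```

&lt;P&gt;Sorting a 50 GB table has its own cost, so the BY variant assumes the data is already in order or that the sort is done once and reused.&lt;/P&gt;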
&lt;P&gt;In summary: the method you describe seems sub-optimal.&lt;/P&gt;
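&lt;P&gt;The file count in point 4 can be checked with a quick calculation. The variable names are only illustrative, and the two lines differ in whether the sizes are read as binary or decimal units.&lt;/P&gt;

```sas
data _null_;
    n_binary  = (50 * 1024**3) / (128 * 1024);  /* 50 GiB split into 128 KiB files */
    n_decimal = 50e9 / 128e3;                   /* 50 GB split into 128 kB files   */
    put n_binary= comma12. n_decimal= comma12.;
run;
/* Log: n_binary=409,600 n_decimal=390,625 */
```

&lt;P&gt;Either reading gives roughly 400,000 files.&lt;/P&gt;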
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Oct 2020 09:01:54 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688137#M208993</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-01T09:01:54Z</dc:date>
    </item>
    <item>
      <title>Re: The divided file takes up too much space</title>
      <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688139#M208995</link>
      <description>&lt;P&gt;I totally missed that:&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;many small files of 128kb each&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;That is WAY too small. That's basically a single SAS dataset page for each, so you create LOTS of overhead, no matter which options you use.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Try your calculation on a subset of about 5 GB in size (a tenth of the original dataset).&lt;/SPAN&gt;&lt;/P&gt;
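&lt;P&gt;One simple way to cut such a test subset, sketched with SASHELP.CARS standing in for the large dataset and WORK.SAMPLE_TENTH as a hypothetical output name, is to keep every tenth observation:&lt;/P&gt;

```sas
/* Assumption: a systematic 1-in-10 sample is good enough for a timing test */
data work.sample_tenth;
    set sashelp.cars;
    if mod(_n_, 10) = 1;   /* subsetting IF: keeps observations 1, 11, 21, ... */
run;
```

&lt;P&gt;This keeps about a tenth of the rows spread across the whole file. If the calculation depends on complete groups, selecting whole groups with a WHERE condition would be the safer choice.&lt;/P&gt;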
&lt;P&gt;&lt;SPAN&gt;And also identify which part of your calculations takes up too much time when using the whole dataset. There may be more efficient methods that allow you to process the whole dataset at once.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 01 Oct 2020 09:17:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688139#M208995</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-01T09:17:59Z</dc:date>
    </item>
    <item>
      <title>Re: The divided file takes up too much space</title>
      <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688150#M209001</link>
      <description>&lt;P&gt;I split the entire dataset by the values of three variables.&lt;BR /&gt;The small files are not my choice; they result from the distribution of the variables.&lt;/P&gt;
&lt;P&gt;This is not optimal in terms of disk space, but in terms of computing speed, yes&lt;/P&gt;</description>
      <pubDate>Thu, 01 Oct 2020 10:16:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688150#M209001</guid>
      <dc:creator>makset</dc:creator>
      <dc:date>2020-10-01T10:16:48Z</dc:date>
    </item>
    <item>
      <title>Re: The divided file takes up too much space</title>
      <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688154#M209002</link>
      <description>&lt;P&gt;If you need to process &lt;EM&gt;all&lt;/EM&gt; data anyway, splitting will increase the overall time. And some analyses will only be valid if run on all data at once.&lt;/P&gt;
&lt;P&gt;Splitting makes sense if only a subset is needed &lt;EM&gt;repeatedly&lt;/EM&gt;&amp;nbsp;(otherwise a WHERE condition in the first step will be sufficient), resulting in LESS disk space, not MORE as in your case, or if you just need an arbitrary subset for testing your code before running it on the whole dataset.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Oct 2020 10:42:23 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688154#M209002</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-10-01T10:42:23Z</dc:date>
    </item>
    <item>
      <title>Re: The divided file takes up too much space</title>
      <link>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688368#M209094</link>
      <description>&lt;P&gt;&lt;EM&gt;&amp;gt;&amp;nbsp;This is not optimal in terms of disk space, but in terms of computing speed, yes&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;I doubt it, and as you can see I am not the only one.&lt;/P&gt;
&lt;P&gt;And we haven't mentioned the time needed to create and delete these files.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Anyway, it seems you have two solutions:&lt;/P&gt;
&lt;P&gt;- Implement our suggestions and have larger files. Maybe use 2 variables instead of 3? Or even 1.&lt;/P&gt;
&lt;P&gt;- Keep your method. In this case, you create 200,000 files, process them, delete them, and then do the same for the other half.&lt;/P&gt;</description>
      <pubDate>Thu, 01 Oct 2020 21:05:24 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/The-divided-file-takes-up-too-much-space/m-p/688368#M209094</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2020-10-01T21:05:24Z</dc:date>
    </item>
  </channel>
</rss>

