<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Size of long vs wide dataset in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675369#M203475</link>
    <description>&lt;P&gt;It sems that secid, name and probably constituent_ticker and detail_holding_type are redundant to fund_ticker. If that is the case, you can move them out to a lookup table and/or create fomats for them.&lt;/P&gt;</description>
    <pubDate>Sat, 08 Aug 2020 06:45:27 GMT</pubDate>
    <dc:creator>Kurt_Bremser</dc:creator>
    <dc:date>2020-08-08T06:45:27Z</dc:date>
    <item>
      <title>Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675358#M203465</link>
      <description>&lt;P&gt;I have a very large dataset (70 gb) that has 6 columns and 200 million rows. I tested a subset of this data with about 3 mil rows. this takes about 1gb. I transposed this long dataset to a wide dataset that has about 5000 rows and 2000 columns (these columns are dates, except 5 first columns are other vars). This reduces the size significantly to 65mb.&lt;/P&gt;
&lt;P&gt;So my questions are: Why did the size change so much when I transpose from long to wide? how does SAS store data? what memory/size does it assign to row vs column? Are there tricks when dealing with such large datasets?&lt;/P&gt;
&lt;P&gt;Thanks&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 05:45:19 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675358#M203465</guid>
      <dc:creator>somebody</dc:creator>
      <dc:date>2020-08-08T05:45:19Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675359#M203466</link>
      <description>&lt;P&gt;Which other variables (aside from the dates and the "by" variables) does your long dataset contain?&lt;/P&gt;
&lt;P&gt;Is your original dataset uncompressed, while your transposed dataset is compressed?&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 06:13:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675359#M203466</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-08-08T06:13:45Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675361#M203467</link>
      <description>&lt;P&gt;Please see attached photo for a sample of my LONG dataset.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="somebody_0-1596867415027.png" style="width: 400px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/48070i3334E759656F26A1/image-size/medium?v=v2&amp;amp;px=400" role="button" title="somebody_0-1596867415027.png" alt="somebody_0-1596867415027.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;I use the following code to transpose the data.:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc transpose data=test out=test2;
	by fund_ticker secid name constituent_ticker detail_holding_type;
	id date; var holding;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;To my knowledge, they are not compressed. How do I compress a SAS dataset? and does compress means making like a zip or rar file?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 06:19:06 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675361#M203467</guid>
      <dc:creator>somebody</dc:creator>
      <dc:date>2020-08-08T06:19:06Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675363#M203469</link>
      <description>&lt;P&gt;You compare two datasets:&lt;/P&gt;
&lt;P&gt;1) LONG = 3*10^6 rows *&amp;nbsp; (5 columns + 1 date);&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; each date holds 6 to 8 bytes depending on compression.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;So the date part holds between 6*3*10^6 to 8*3*10^6 bytes&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;i.e. 18 to 24 MB only out of 1 GB. &lt;STRONG&gt;Most of size holds the first 5 variables.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;2) WIDE = 5000 rows * (5 columns + 1995 dates);&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;The dates hold probably (compressed)&amp;nbsp; 5000*1995*6 bytes =~ 57 MB only.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;Thus the saved size is that part of 5 columns * (3*10^6 rows - 5000 rows).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Check the length of the 5 first columns. I suppose that some those 5 can be codded to a number and use format to display the full value. You may save a lot of space.&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 06:31:39 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675363#M203469</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-08-08T06:31:39Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675365#M203471</link>
      <description>for compression check proc option option=compress.&lt;BR /&gt;If need add option compress=yes.</description>
      <pubDate>Sat, 08 Aug 2020 06:29:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675365#M203471</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-08-08T06:29:57Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675366#M203472</link>
      <description>&lt;P&gt;The simple COMPRESS=YES dataset or system option uses a run-length compression that reduces sequences of identical characters/strings.&lt;/P&gt;
&lt;P&gt;You can see the state of a dataset in the output of PROC CONTENTS.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 06:30:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675366#M203472</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-08-08T06:30:58Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675367#M203473</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/98381"&gt;@somebody&lt;/a&gt;&amp;nbsp; - If you want to work through a dataset size calculation yourself:&amp;nbsp; &lt;A href="https://v8doc.sas.com/sashtml/cms/zas-size.htm" target="_blank"&gt;https://v8doc.sas.com/sashtml/cms/zas-size.htm&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Current SAS datasets use the V8 calculation.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 06:33:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675367#M203473</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2020-08-08T06:33:05Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675369#M203475</link>
      <description>&lt;P&gt;It sems that secid, name and probably constituent_ticker and detail_holding_type are redundant to fund_ticker. If that is the case, you can move them out to a lookup table and/or create fomats for them.&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 06:45:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675369#M203475</guid>
      <dc:creator>Kurt_Bremser</dc:creator>
      <dc:date>2020-08-08T06:45:27Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675371#M203477</link>
      <description>&lt;P&gt;I see there are many repetitives of FUND_FICKER and of DETAIL_HOLDING_TYPE.&lt;/P&gt;
&lt;P&gt;At least those two can be replaced by codes, having next format&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;proc format lib=library;
  value  fticker
    1 = 'AADR'
   .... add existing/possible values ...
   ;
  value $iticker
   'ADDR' = '1'
   ...
   ;
  value htype
    1 = 'EQUITY'
    2 = 'CASH'
   .... add existing/possible values ...
   ; 
  value $itype
    'EQUITY' = '1'
    'CASH' = '2'
    ...
   ;	
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN style="background-color: #f5f5f5; font-family: Menlo, Monaco, Consolas, 'Courier New', monospace; font-size: 13px;"&gt;then replace those variables in your date, converting from char type to numeric:&lt;BR /&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data new;
 set have(rename = (fund_ticker=v1 Detail_Holding_type=v2));
     fund_ticker = input(put(v1,$iticker.),best3.);
	 Detail_Holding_type = input(put(v2,$itype.),best3.);
	 drop v1 v2;
run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;SPAN style="background-color: #f5f5f5; font-family: Menlo, Monaco, Consolas, 'Courier New', monospace; font-size: 13px;"&gt;Beyond, if name is one to one as SecID you can drop NAME from your dataset&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="background-color: #f5f5f5; font-family: Menlo, Monaco, Consolas, 'Courier New', monospace; font-size: 13px;"&gt;and display it by an appropriate format.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 07:27:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675371#M203477</guid>
      <dc:creator>Shmuel</dc:creator>
      <dc:date>2020-08-08T07:27:11Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675386#M203488</link>
      <description>&lt;P&gt;A long (I find&amp;nbsp;&lt;EM&gt;tall&lt;/EM&gt; a better description) data set can be considered a large monolithic rectangle containing one or more categorical columns (&lt;EM&gt;identity'ish roles&lt;/EM&gt;) having repeated values.&amp;nbsp; A combination of columns having the same values in different rows form a group.&amp;nbsp; The categorical columns model either the concept of a 'row' identifier, or a hierarchical organization suitable for BY group processing.&amp;nbsp; The repeated values are what can make a tall table large.&amp;nbsp; Suppose you had a categorical variable that is char(40) and an average group size of 100.&amp;nbsp; That would be 4,000 characters per group.&amp;nbsp; Pivoting (or transposing) the tall group into a wide row with 100 variables (presuming a model of one wide variable per tall row) there will be only ONE instance of the categorical value, reducing the repetition and resulting in nominally a 100:1 reduction in storage space required for the groups identity variables.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Roughly, the larger the storage needed for group/hierarchy identity the larger the reduction in space will be.&amp;nbsp; However, switching data form from tall to wide may increase complexity and harden code in downstream process flows.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Time is money spent with no refunds.&amp;nbsp; Disk space is cheap, coder time and maintenance is not.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 08 Aug 2020 13:42:12 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675386#M203488</guid>
      <dc:creator>RichardDeVen</dc:creator>
      <dc:date>2020-08-08T13:42:12Z</dc:date>
    </item>
    <item>
      <title>Re: Size of long vs wide dataset</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675459#M203534</link>
      <description>&lt;P&gt;Thank you, i thought so too. but being a phd student, disk space is also not cheap &lt;span class="lia-unicode-emoji" title=":grinning_face_with_smiling_eyes:"&gt;😄&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 09 Aug 2020 03:48:45 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Size-of-long-vs-wide-dataset/m-p/675459#M203534</guid>
      <dc:creator>somebody</dc:creator>
      <dc:date>2020-08-09T03:48:45Z</dc:date>
    </item>
  </channel>
</rss>

