<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Topic: Hash component object experiments in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76126#M16435</link>
    <description>Topic thread from the SAS Programming board: sorting a large dataset with the hash component object fails with insufficient memory; replies cover hash memory limits and parallel-sort alternatives.</description>
    <pubDate>Mon, 19 Oct 2009 10:48:30 GMT</pubDate>
    <dc:creator>DanielSantos</dc:creator>
    <dc:date>2009-10-19T10:48:30Z</dc:date>
    <item>
      <title>Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76124#M16433</link>
      <description>Hello!&lt;BR /&gt;
&lt;BR /&gt;
I am trying to sort a large dataset using a hash object, but I am getting a fatal error:&lt;BR /&gt;
&lt;BR /&gt;
170   data dssxx; set dss0906 (obs=20000000); run;&lt;BR /&gt;
&lt;BR /&gt;
NOTE: There were 20000000 observations read from the data set WORK.DSS0906.&lt;BR /&gt;
NOTE: The data set WORK.DSSXX has 20000000 observations and 8 variables.&lt;BR /&gt;
NOTE: Compressing data set WORK.DSSXX decreased size by 27.93 percent.&lt;BR /&gt;
      Compressed is 228803 pages; un-compressed would require 317461 pages.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           3:04.45&lt;BR /&gt;
      cpu time            1:19.01&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
171   data dss;&lt;BR /&gt;
172    if 0 then set dssxx;&lt;BR /&gt;
173   dcl hash hh (dataset: 'work.dssxx', hashexp: 0, ordered: 'd');&lt;BR /&gt;
174   dcl hiter hi ('hh');&lt;BR /&gt;
175   hh.definekey ('ss_kod', 'site', 'sbal_kod', 'data' );&lt;BR /&gt;
176   hh.definedata ('site', 'sbal_kod', 'ss_kod', 'data', 'ss_ostd', 'ss_ostc');&lt;BR /&gt;
177   hh.definedone();&lt;BR /&gt;
178   do rc=hi.first() by 0 while(rc=0);&lt;BR /&gt;
179   ost=ss_ostd-ss_ostc;&lt;BR /&gt;
180   output;&lt;BR /&gt;
181   rc=hi.next();&lt;BR /&gt;
182   end;&lt;BR /&gt;
183   drop rc ss_ostd ss_ostc ss_obd ss_obc; rename data=date;&lt;BR /&gt;
184   stop;&lt;BR /&gt;
185   run;&lt;BR /&gt;
&lt;BR /&gt;
FATAL: Insufficient memory to execute data step program. Aborted during the EXECUTION phase.&lt;BR /&gt;
NOTE: The SAS System stopped processing this step because of insufficient memory.&lt;BR /&gt;
WARNING: The data set WORK.DSS may be incomplete.  When this step was stopped there were 0 observations and 5 variables.&lt;BR /&gt;
WARNING: Data set WORK.DSS was not replaced because this step was stopped.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           41.31 seconds&lt;BR /&gt;
      cpu time            40.59 seconds&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
A smaller subset sorts fine with the same code:&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
186   data dssxx; set dss0906 ; where site ne 'MSK'; run;&lt;BR /&gt;
&lt;BR /&gt;
NOTE: There were 4118570 observations read from the data set WORK.DSS0906.&lt;BR /&gt;
      WHERE site not = 'MSK';&lt;BR /&gt;
NOTE: The data set WORK.DSSXX has 4118570 observations and 8 variables.&lt;BR /&gt;
NOTE: Compressing data set WORK.DSSXX decreased size by 29.04 percent.&lt;BR /&gt;
      Compressed is 46390 pages; un-compressed would require 65375 pages.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           6:48.00&lt;BR /&gt;
      cpu time            1:08.98&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
187   data dss;&lt;BR /&gt;
188    if 0 then set dssxx;&lt;BR /&gt;
189   dcl hash hh (dataset: 'work.dssxx', hashexp: 0, ordered: 'd');&lt;BR /&gt;
190   dcl hiter hi ('hh');&lt;BR /&gt;
191   hh.definekey ('ss_kod', 'site', 'sbal_kod', 'data' );&lt;BR /&gt;
192   hh.definedata ('site', 'sbal_kod', 'ss_kod', 'data', 'ss_ostd', 'ss_ostc');&lt;BR /&gt;
193   hh.definedone();&lt;BR /&gt;
194   do rc=hi.first() by 0 while(rc=0);&lt;BR /&gt;
195   ost=ss_ostd-ss_ostc;&lt;BR /&gt;
196   output;&lt;BR /&gt;
197   rc=hi.next();&lt;BR /&gt;
198   end;&lt;BR /&gt;
199   drop rc ss_ostd ss_ostc ss_obd ss_obc; rename data=date;&lt;BR /&gt;
200   stop;&lt;BR /&gt;
201   run;&lt;BR /&gt;
&lt;BR /&gt;
NOTE: There were 4118570 observations read from the data set WORK.DSSXX.&lt;BR /&gt;
NOTE: The data set WORK.DSS has 4118570 observations and 5 variables.&lt;BR /&gt;
NOTE: Compressing data set WORK.DSS increased size by 7.72 percent.&lt;BR /&gt;
      Compressed is 43927 pages; un-compressed would require 40779 pages.&lt;BR /&gt;
NOTE: DATA statement used (Total process time):&lt;BR /&gt;
      real time           27.81 seconds&lt;BR /&gt;
      cpu time            18.78 seconds&lt;BR /&gt;
&lt;BR /&gt;
&lt;BR /&gt;
Does this mean that hashing is not good for large datasets? Or is something wrong with my code?&lt;BR /&gt;
The machine runs Windows XP with 2.5 GB of RAM and has enough free disk space.&lt;BR /&gt;
&lt;BR /&gt;
Thanks for any thoughts.</description>
      <pubDate>Mon, 19 Oct 2009 05:17:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76124#M16433</guid>
      <dc:creator>Oleg_L</dc:creator>
      <dc:date>2009-10-19T05:17:29Z</dc:date>
    </item>
    <item>
      <title>Re: Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76125#M16434</link>
      <description>The hash object is, I believe, stored in RAM, so a large disk does not help.&lt;BR /&gt;
We have the same experience as you: look-up tables that are too large will exhaust memory, so we have to use SQL joins etc. for the largest look-ups.&lt;BR /&gt;
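&lt;BR /&gt;
A minimal sketch of that kind of join (the table and column names here are invented for illustration):&lt;BR /&gt;
&lt;BR /&gt;
proc sql;&lt;BR /&gt;
   /* left join keeps every row of the large table, */&lt;BR /&gt;
   /* picking up the look-up value where a key matches */&lt;BR /&gt;
   create table want as&lt;BR /&gt;
   select a.*, b.lookup_value&lt;BR /&gt;
   from big_table a left join lookup_table b&lt;BR /&gt;
   on a.key = b.key;&lt;BR /&gt;
quit;&lt;BR /&gt;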
&lt;BR /&gt;
See this note; there might be a workaround if you would like to stick with hashing:&lt;BR /&gt;
&lt;A href="http://support.sas.com/kb/16/920.html" target="_blank"&gt;http://support.sas.com/kb/16/920.html&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
/Linus</description>
      <pubDate>Mon, 19 Oct 2009 10:04:53 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76125#M16434</guid>
      <dc:creator>LinusH</dc:creator>
      <dc:date>2009-10-19T10:04:53Z</dc:date>
    </item>
    <item>
      <title>Re: Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76126#M16435</link>
      <description>Yep. Besides that, the maximum number of hash buckets you get is 2^16 = 65536 (hashexp: 16), so 20,000,000 items would perform pretty poorly. I would say that, depending on the row length you are trying to store, 2 to 3 million is the maximum reasonable number of items to load into the hash. Ideally, 1.5 million.&lt;BR /&gt;
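&lt;BR /&gt;
For reference, the step posted above declared the hash with hashexp: 0; the largest table mentioned above would be requested like this:&lt;BR /&gt;
&lt;BR /&gt;
/* 2**16 = 65536 buckets, the ceiling noted above */&lt;BR /&gt;
dcl hash hh (dataset: 'work.dssxx', hashexp: 16, ordered: 'd');&lt;BR /&gt;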
&lt;BR /&gt;
Check the following exquisite paper by Paul Dorfman about hashing:&lt;BR /&gt;
&lt;A href="http://support.sas.com/resources/papers/proceedings09/153-2009.pdf" target="_blank"&gt;http://support.sas.com/resources/papers/proceedings09/153-2009.pdf&lt;/A&gt;&lt;BR /&gt;
&lt;BR /&gt;
If you are looking for a high-performance sort, check the parallel features of the SAS engine. A well-known solution is to break your large dataset up into smaller ones, sort them using parallel processing, and then reunite the results by interleaving them in a DATA step.&lt;BR /&gt;
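&lt;BR /&gt;
A rough sketch of that idea, reusing the variable names from the step above (ideally the two sorts would run in separate, parallel SAS sessions):&lt;BR /&gt;
&lt;BR /&gt;
/* 1. split the large dataset into smaller chunks */&lt;BR /&gt;
data part1 part2;&lt;BR /&gt;
   set work.dssxx;&lt;BR /&gt;
   if site = 'MSK' then output part1;&lt;BR /&gt;
   else output part2;&lt;BR /&gt;
run;&lt;BR /&gt;
&lt;BR /&gt;
/* 2. sort each chunk by the same keys */&lt;BR /&gt;
proc sort data=part1; by ss_kod site sbal_kod data; run;&lt;BR /&gt;
proc sort data=part2; by ss_kod site sbal_kod data; run;&lt;BR /&gt;
&lt;BR /&gt;
/* 3. interleave the sorted chunks into one sorted dataset */&lt;BR /&gt;
data sorted_all;&lt;BR /&gt;
   set part1 part2;&lt;BR /&gt;
   by ss_kod site sbal_kod data;&lt;BR /&gt;
run;&lt;BR /&gt;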
&lt;BR /&gt;
Cheers from Portugal.&lt;BR /&gt;
&lt;BR /&gt;
Daniel Santos @ &lt;A href="http://www.cgd.pt" target="_blank"&gt;www.cgd.pt&lt;/A&gt;</description>
      <pubDate>Mon, 19 Oct 2009 10:48:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76126#M16435</guid>
      <dc:creator>DanielSantos</dc:creator>
      <dc:date>2009-10-19T10:48:30Z</dc:date>
    </item>
    <item>
      <title>Re: Hash component object experiments</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76127#M16436</link>
      <description>Thanks a lot for the information. It's clear now.</description>
      <pubDate>Mon, 19 Oct 2009 11:13:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Hash-component-object-experiments/m-p/76127#M16436</guid>
      <dc:creator>Oleg_L</dc:creator>
      <dc:date>2009-10-19T11:13:09Z</dc:date>
    </item>
  </channel>
</rss>