<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Split a large dataset efficiently in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/583559#M166126</link>
    <description>&lt;P&gt;Thank you Paul (Hashman). Thanks a lot for a spontaneous and perfect reply.&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Fri, 23 Aug 2019 17:30:17 GMT</pubDate>
    <dc:creator>prad001</dc:creator>
    <dc:date>2019-08-23T17:30:17Z</dc:date>
    <item>
      <title>Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582614#M165742</link>
      <description>&lt;P&gt;Hi All,&lt;/P&gt;&lt;P&gt;Please help me split the dataset efficiently (preferably with a hash object). Since the data set TEST has duplicates, I am getting an error while splitting using the hash. All I want is to split not by record count but by NEW, which is not working with the code below.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;LIBNAME one 'H:\';&lt;BR /&gt;DATA TEST;&lt;BR /&gt;DO new = 1 TO 10000000; OUTPUT;&lt;BR /&gt;OUTPUT;END;&lt;BR /&gt;proc sort; by new;&lt;BR /&gt;RUN;&lt;/P&gt;&lt;P&gt;data _null_;&lt;BR /&gt;if 0 then&lt;BR /&gt;set test;&lt;BR /&gt;declare hash h_out();&lt;BR /&gt;h_out.definekey('new');&lt;BR /&gt;h_out.definedata('new');&lt;BR /&gt;h_out.definedone();&lt;/P&gt;&lt;P&gt;do filenum=1 by 100 until(eof);&lt;BR /&gt;do new=1 to 100 until(eof);&lt;BR /&gt;set test end=eof;&lt;BR /&gt;h_out.add();&lt;BR /&gt;by new;&lt;BR /&gt;end;&lt;/P&gt;&lt;P&gt;h_out.output(dataset:cats('one.out_',filenum));&lt;BR /&gt;h_out.clear();&lt;BR /&gt;end;&lt;/P&gt;&lt;P&gt;stop;&lt;BR /&gt;run;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;%macro srt;&lt;BR /&gt;%DO i=1 %to 10000000 %BY 100;&lt;BR /&gt;proc sort data=one.out_&amp;amp;i;&lt;BR /&gt;by new;&lt;BR /&gt;run;&lt;BR /&gt;%end;&lt;BR /&gt;%MEND;&lt;BR /&gt;%srt;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 21:59:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582614#M165742</guid>
      <dc:creator>prad001</dc:creator>
      <dc:date>2019-08-20T21:59:32Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582624#M165748</link>
      <description>&lt;P&gt;Have you tried using the option multidata to allow for duplicate keys?&lt;/P&gt;
&lt;P&gt;See &lt;A href="https://support.sas.com/resources/papers/proceedings16/10200-2016.pdf" target="_self"&gt;here&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 22:54:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582624#M165748</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-08-20T22:54:16Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582629#M165753</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/237405"&gt;@prad001&lt;/a&gt;,&amp;nbsp;are you asking how to split 10 million records by NEW, which comes in sets of 2? So is 10 million / 2 = 5 million datasets your requirement?&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 23:12:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582629#M165753</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2019-08-20T23:12:10Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582630#M165754</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/138205"&gt;@novinosrin&lt;/a&gt;&amp;nbsp;+1&lt;/P&gt;
&lt;P&gt;I didn't look at the code!&amp;nbsp;&lt;/P&gt;
&lt;P&gt;And how many times is proc sort called?&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 23:23:14 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582630#M165754</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2019-08-20T23:23:14Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582631#M165755</link>
      <description>&lt;P&gt;Sir, I didn't particularly look into the code beyond&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;All i wanted is not to split by record count but by NEW, which is not working with the below code. 

LIBNAME one 'H:\';
DATA TEST;
DO new = 1 TO 10000000; OUTPUT;
OUTPUT;END;
proc sort; by new;
RUN;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;So for every iteration there are two OUTPUT statements, making it 20 million records rather than 10 million: 1,1, 2,2, 3,3 and so on, up to 20e6 records. And the OP wants to split each BY group into its own dataset, i.e. 10 million datasets.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Of course, for you, me and the other regulars, splitting logic is all over the internet, so it's not the coding part that is concerning; it's that the objective doesn't make sense at all. Well, perhaps it's learning practice? Even so, why would anybody practice with such large splits, if I understand correctly?&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 23:30:51 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582631#M165755</guid>
      <dc:creator>novinosrin</dc:creator>
      <dc:date>2019-08-20T23:30:51Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582632#M165756</link>
      <description>&lt;P&gt;Why do you think you need to use hash processing? There are lots of easier ways to split data.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;It would help if you could explain your requirements in words rather than us trying to confirm what you are trying to do from your code.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 23:31:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582632#M165756</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2019-08-20T23:31:50Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582633#M165757</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/237405"&gt;@prad001&lt;/a&gt;&amp;nbsp;:&lt;/P&gt;
&lt;P&gt;Try this, for example:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data have ;                                          
  do new = 1 to 1234 ;                               
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;              
    output ;                                         
  end ;                                              
run ;                                                
                                                     
%let new_incr = 100 ;                                
                                                     
data _null_ ;                                        
  if _n_ = 1 then do ;                               
    dcl hash h (dataset:"have(obs=0)", ordered:"a") ;
    h.definekey ("new") ;                            
    h.definedata (all:"y") ;                         
    h.definedone () ;                                
  end ;                                              
  do until (mod (new, &amp;amp;new_incr) = 0 or lr) ;        
    set have end = lr ;                              
    h.add() ;                                        
  end ;                                              
  h.output (dataset: catx ("_", "work.out", _n_)) ;  
  h.clear () ;                                       
run ;                                                
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 23:33:32 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582633#M165757</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-20T23:33:32Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582637#M165761</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13976"&gt;@SASKiwi&lt;/a&gt;;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I'd rather question the very idea of the need to split a data set instead of creating a BY variable (for example, using MOD).&lt;/P&gt;
&lt;P&gt;But as long as the need is justified, the hash approach looks fairly easy to me ... as long as the largest split group fits in memory comfortably.&lt;/P&gt;
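&lt;P&gt;To illustrate the BY-variable idea, here is a minimal sketch, assuming a data set HAVE sorted by an integer NEW that starts at 1. The names GROUPED and _GRP and the range width of 100 are hypothetical, and CEIL is used in place of MOD simply because it yields the range number directly:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%let new_incr = 100 ;  /* assumed range width */

/* View adding a range number: 1 for NEW 1-100, 2 for 101-200, and so on */
data grouped / view = grouped ;
  set have ;
  _grp = ceil (new / &amp;amp;new_incr) ;
run ;

/* Any BY-aware step can then handle each range without a physical split */
proc means data = grouped n min max ;
  by _grp ;
  var new ;
run ;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Because the range number lives in a view, nothing extra is written to disk, and downstream steps only need a BY statement.&lt;/P&gt;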
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 23:43:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582637#M165761</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-20T23:43:36Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582640#M165764</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/21262"&gt;@hashman&lt;/a&gt;&amp;nbsp; - agreed, why do you need to split your data in the first place? Yes, hash techniques are really useful, but I'd argue there are more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers.&lt;/P&gt;</description>
      <pubDate>Tue, 20 Aug 2019 23:58:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582640#M165764</guid>
      <dc:creator>SASKiwi</dc:creator>
      <dc:date>2019-08-20T23:58:29Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582657#M165771</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/13976"&gt;@SASKiwi&lt;/a&gt;&amp;nbsp;:&lt;/P&gt;
&lt;P&gt;All right, I'm game!&amp;nbsp;You've seen my hash approach for the OP's task. To simplify, HAVE has a variable NEW ranging in order from 1 to some integer N:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%let N = 10011 ;    
                    
data have ;         
  do new = 1 to &amp;amp;n ;
    output ;        
  end ;             
run ;               
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;HAVE needs to be split into an &lt;EM&gt;a priori unknown&lt;/EM&gt; number of data sets with sequentially numbered names, N_INCR records each. If N isn't divisible by N_INCR, the last split file will contain fewer than N_INCR output records. Here's the hash approach:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%let N_incr = 100 ;                                          
                                                             
data _null_ ;                                                
  if _n_ = 1 then do ;                                       
    dcl hash h (ordered:"A") ;                               
    h.definekey ("new") ;                                    
    h.definedone () ;                                        
  end ;                                                      
  do until (mod (new, &amp;amp;n_incr) = 0 or lr) ;                  
    set have end = lr ;                                      
    h.add() ;                                                
  end ;                                                      
  h.output (dataset: catx ("_", "work.out", put (_n_,z3.))) ;
  h.clear() ;                                                
run ;                                                        
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Could you please show one of the "&lt;SPAN&gt;more common techniques that would work equally well here and be easier for others to support when you are in a team of SAS developers"? I am asking because, though I do know a few, none can touch the simplicity of the code above in terms of logic, ease of coding, self-automation, and - as a corollary - ease of support.&amp;nbsp;&lt;/SPAN&gt;&amp;nbsp;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;On a different note, the hash object has been around for 16 years. The part of its overall functionality used in the DATA step above is marginal at best. The rest relies on two statements (DATA and SET), one function (MOD), one DO loop, and yes, a pretty firm understanding of what _N_ really is and how the DATA step really works. Being a "SAS developer" should more than cover this territory, shouldn't it?&lt;/P&gt;
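&lt;P&gt;As a small illustration of the point about _N_, here is a sketch only; the data set SAMPLE, its 250 records, and the counter N_READ are made up for this example. _N_ counts iterations of the DATA step's implied loop, not records read, which is why it can number the split files when each iteration reads a whole range inside a DO loop:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Hypothetical test file: NEW = 1, 2, ..., 250 */
data sample ;
  do new = 1 to 250 ;
    output ;
  end ;
run ;

data _null_ ;
  /* Each pass through the step reads one whole range of NEW */
  do n_read = 1 by 1 until (mod (new, 100) = 0 or lr) ;
    set sample end = lr ;
  end ;
  /* _N_ is the step iteration number; N_READ counts the records read in it */
  put _n_= n_read= ;
run ;
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The log should show one line per range, with _N_ running 1, 2, 3, even though 250 records were read.&lt;/P&gt;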
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 21 Aug 2019 03:55:43 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/582657#M165771</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-21T03:55:43Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/583342#M166058</link>
      <description>&lt;P&gt;Thank you for helping, Paul (Hashman).&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Your code works fine when the "NEW" variable is unique. What if it has duplicates?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;What I am trying is:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;data have ;&lt;BR /&gt;do new = 1 to &amp;amp;n ;&lt;BR /&gt;output ; Output; ************************************************* 2 OUTPUTS here to create duplicates;&lt;BR /&gt;end ;&lt;BR /&gt;run ;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;All I want is to split the dataset into groups of 100 values of NEW. If the first 100 values contain duplicates, then the first dataset will have more than 100 records.&lt;/P&gt;&lt;P&gt;In the above example, the first dataset will contain:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;NEW&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;1&lt;/P&gt;&lt;P&gt;1&lt;/P&gt;&lt;P&gt;2&lt;/P&gt;&lt;P&gt;2&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;100&lt;/P&gt;&lt;P&gt;100&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;And the second dataset should contain:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;NEW&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;101&lt;/P&gt;&lt;P&gt;101&lt;/P&gt;&lt;P&gt;102&lt;/P&gt;&lt;P&gt;102&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;.&lt;/P&gt;&lt;P&gt;200&lt;/P&gt;&lt;P&gt;200&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks a lot in advance.&lt;/P&gt;&lt;P&gt;Pradeep&lt;/P&gt;</description>
      <pubDate>Thu, 22 Aug 2019 21:59:09 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/583342#M166058</guid>
      <dc:creator>prad001</dc:creator>
      <dc:date>2019-08-22T21:59:09Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/583371#M166064</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/237405"&gt;@prad001&lt;/a&gt;&amp;nbsp;:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Just a few subtle alterations:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Create a view into HAVE with a dummy key unique for each range. Thus, below, _KNEW=1 for&amp;nbsp; 1-100, 2 for 101-200, and so on. This simplifies the splitting task, as it reduces it to simple BY processing.&lt;/LI&gt;
&lt;LI&gt;Code the argument tag MULTIDATA:"Y" to allow hash items with identical key-values.&lt;/LI&gt;
&lt;LI&gt;Code for dropping _KNEW from the output data sets.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;In sum:&lt;/P&gt;
&lt;PRE&gt;data have ;                                                         
  do new = 1 to 1234 ;                                              
    retain N1 1 C1 "C1" N2 2 C2 "C22" ;                             
    &lt;FONT color="#000080"&gt;&lt;STRONG&gt;do _n_ = 1 to ceil (ranuni(1) * 4) ; /*create dupes*/&lt;/STRONG&gt;&lt;/FONT&gt;                            
      output ;                                                      
    &lt;FONT color="#000080"&gt;&lt;STRONG&gt;end ;&lt;/STRONG&gt; &lt;/FONT&gt;                                                          
  end ;                                                             
run ;                                                               
                                                                    
%let new_incr = 100 ;                                               
                                                                    
&lt;FONT color="#000080"&gt;&lt;STRONG&gt;data knew / view = knew ;                                           
  set have ;                                                        
  _knew = ceil (new / &amp;amp;new_incr) ;                                  
run ;&lt;/STRONG&gt; &lt;/FONT&gt;                                                              
                                                                    
data _null_ ;                                                       
  if _n_ = 1 then do ;                                              
    dcl hash h (dataset:&lt;FONT color="#000080"&gt;&lt;STRONG&gt;"knew(obs=0)", multidata:"Y"&lt;/STRONG&gt;&lt;/FONT&gt;, ordered:"A") ;
    &lt;FONT color="#000080"&gt;&lt;STRONG&gt;h.definekey ("_knew") ;&lt;/STRONG&gt;&lt;/FONT&gt;                                         
    h.definedata (all:"Y") ;                                        
    h.definedone () ;                                               
  end ;                                                             
  do until &lt;FONT color="#000080"&gt;&lt;STRONG&gt;(last._knew)&lt;/STRONG&gt;&lt;/FONT&gt; ;                                           
    set &lt;FONT color="#000080"&gt;&lt;STRONG&gt;knew&lt;/STRONG&gt; &lt;/FONT&gt;;                                                      
    &lt;FONT color="#000080"&gt;&lt;STRONG&gt;by _knew ;&lt;/STRONG&gt; &lt;/FONT&gt;                                                     
    h.add() ;                                                       
  end ;                                                             
  h.output (dataset: cats ("work.out_", _n_, &lt;FONT color="#000080"&gt;&lt;STRONG&gt;"(drop=_knew)"&lt;/STRONG&gt;&lt;/FONT&gt;)) ;     
  h.clear () ;                                                      
run ;                                                               
&lt;/PRE&gt;
&lt;P&gt;Kind regards&lt;/P&gt;
&lt;P&gt;Paul D.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2019 00:38:02 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/583371#M166064</guid>
      <dc:creator>hashman</dc:creator>
      <dc:date>2019-08-23T00:38:02Z</dc:date>
    </item>
    <item>
      <title>Re: Split a large dataset efficiently</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/583559#M166126</link>
      <description>&lt;P&gt;Thank you Paul (Hashman). Thanks a lot for a spontaneous and perfect reply.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 23 Aug 2019 17:30:17 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Split-a-large-dataset-efficiently/m-p/583559#M166126</guid>
      <dc:creator>prad001</dc:creator>
      <dc:date>2019-08-23T17:30:17Z</dc:date>
    </item>
  </channel>
</rss>

