<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: split large dataset by group in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840791#M41668</link>
    <description>&lt;P&gt;&lt;EM&gt;"Planning to split into several datasets by every 100,000 of cus_number".&amp;nbsp;&lt;/EM&gt;I don't understand this. Can you clarify?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So, in your posted data, do you simply want to split the data by&amp;nbsp;&lt;SPAN&gt;Cus_Number? Or do you want the first 100,000 Cus_Numbers encountered to be in one data set, the next&amp;nbsp;100,000 in another, and so on?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 26 Oct 2022 10:30:35 GMT</pubDate>
    <dc:creator>PeterClemmensen</dc:creator>
    <dc:date>2022-10-26T10:30:35Z</dc:date>
    <item>
      <title>split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840788#M41667</link>
      <description>Hi. I have a large dataset that needs to be split by cus_number. Planning to split into several datasets by every 100,000 of cus_number. The same cus_number can appear more than once because the dataset contains transaction data too.&lt;BR /&gt;&lt;BR /&gt;Any idea how I can split the data based on the above scenario?&lt;BR /&gt;&lt;BR /&gt;Example:&lt;BR /&gt;Cus_Number Trnx&lt;BR /&gt;1234 Trnx 1 (consider as 1st cus_number)&lt;BR /&gt;1234 Trnx 2 (consider as 1st cus_number)&lt;BR /&gt;2345 Trnx 1 (consider as 2nd cus_number)&lt;BR /&gt;2345 Trnx 2 (consider as 2nd cus_number)&lt;BR /&gt;2345 Trnx 3 (consider as 2nd cus_number)&lt;BR /&gt;3456 Trnx 1 (consider as 3rd cus_number)&lt;BR /&gt;&lt;BR /&gt;Many thanks for your help.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 26 Oct 2022 10:17:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840788#M41667</guid>
      <dc:creator>abx</dc:creator>
      <dc:date>2022-10-26T10:17:34Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840791#M41668</link>
      <description>&lt;P&gt;&lt;EM&gt;"Planning to split into several datasets by every 100,000 of cus_number".&amp;nbsp;&lt;/EM&gt;I don't understand this. Can you clarify?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So, in your posted data, do you simply want to split the data by&amp;nbsp;&lt;SPAN&gt;Cus_Number? Or do you want the first 100,000 Cus_Numbers encountered to be in one data set, the next&amp;nbsp;100,000 in another, and so on?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 26 Oct 2022 10:30:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840791#M41668</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2022-10-26T10:30:35Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840808#M41669</link>
      <description>The first 100,000 cus_number values encountered should go into dataset 1, the next 100,000 into dataset 2, and so on.</description>
      <pubDate>Wed, 26 Oct 2022 11:14:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840808#M41669</guid>
      <dc:creator>abx</dc:creator>
      <dc:date>2022-10-26T11:14:57Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840810#M41670</link>
      <description>&lt;P&gt;OK. Is the data sorted as in the posted sample data?&lt;/P&gt;</description>
      <pubDate>Wed, 26 Oct 2022 11:19:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840810#M41670</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2022-10-26T11:19:34Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840811#M41671</link>
      <description>Yes, it is sorted by cus_number.</description>
      <pubDate>Wed, 26 Oct 2022 11:21:55 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840811#M41671</guid>
      <dc:creator>abx</dc:creator>
      <dc:date>2022-10-26T11:21:55Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840825#M41672</link>
      <description>&lt;P&gt;OK. See if you can use this as a template. I just made up some data to resemble your problem.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The input data set has 55 unique&amp;nbsp;&lt;SPAN&gt;cus_numbers. Here the first 10 distinct values go to want_1, the next 10 distinct&amp;nbsp;values to want_2, and so on, with the remaining 5 in want_6.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;Should be reasonably fast in your case as well. For your data, replace 10 with 100000 in the c = 10 test; note that the hash object holds each batch in memory until it is written out.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Made-up sample data: 55 customers with 1 to 3 rows each */
data have;
   call streaminit(123);
   do cus_number = 1 to 55;
      do _N_ = 1 to rand('integer', 1, 3);
         v1 = _N_ * 2;
         v2 = _N_ * 3;
         output;
      end;
   end;
run;

data _null_;

   /* Hash object that buffers the rows of the current batch */
   if _N_ = 1 then do;
      dcl hash h(dataset : 'have(obs = 0)', multidata : 'Y', ordered : 'Y');
      h.definekey('cus_number');
      h.definedata(all : 'Y');
      h.definedone();
   end;

   set have end = z;
   by cus_number;

   /* Count each customer once, on its last row */
   if last.cus_number then c + 1;

   h.add();

   /* After 10 customers, or at the end of the input, write the batch out */
   if c = 10 | z then do;
      n + 1;

      h.output(dataset : cats('want_', n));
      h.clear();

      c = 0;
   end;

run;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;U&gt;&lt;STRONG&gt;Result (Want_1):&lt;/STRONG&gt;&lt;/U&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;cus_number v1 v2
1          2  3
1          4  6
2          2  3
3          2  3
4          2  3
4          4  6
5          2  3
6          2  3
6          4  6
7          2  3
7          4  6
8          2  3
9          2  3
10         2  3&lt;/PRE&gt;</description>
      <pubDate>Wed, 26 Oct 2022 11:40:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840825#M41672</guid>
      <dc:creator>PeterClemmensen</dc:creator>
      <dc:date>2022-10-26T11:40:36Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840830#M41673</link>
      <description>&lt;P&gt;What is the benefit of doing such a split?&lt;/P&gt;</description>
      <pubDate>Wed, 26 Oct 2022 11:51:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840830#M41673</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2022-10-26T11:51:04Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840841#M41674</link>
      <description>&lt;P&gt;If you don't care about running time, you could try this simple way.&lt;/P&gt;
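&lt;P&gt;For orientation, here is a sketch of what the CALL EXECUTE step below generates for a single value, assuming the sample data contains Cus_Number 11 (ages in sashelp.class start at 11). One such step is generated per distinct value, and each one reads all of have, which is why running time suffers:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Same sample data as in the program below */
data have;
   set sashelp.class;
   rename age=Cus_Number;
run;

/* Illustration only: the step generated for Cus_Number = 11 */
data Cus_11;
   set have;
   if Cus_Number = 11;
run;&lt;/CODE&gt;&lt;/PRE&gt;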
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
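&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Sample data: use AGE from sashelp.class as a stand-in for Cus_Number */
data have;
 set sashelp.class;
 rename age=Cus_Number ;
run;

/* One row per distinct Cus_Number */
proc freq data=have noprint;
table Cus_Number /out=temp;
run;

/* Generate and run one data step per distinct Cus_Number */
data _null_;
 set temp;
 call execute(catt('data Cus_',Cus_Number,';set have;if Cus_Number=',Cus_Number,';run;' ));
run;&lt;/CODE&gt;&lt;/PRE&gt;</description>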
      <pubDate>Wed, 26 Oct 2022 12:21:16 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/840841#M41674</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2022-10-26T12:21:16Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841134#M41698</link>
      <description>Sorry, I am still at the beginner stage of coding.&lt;BR /&gt;I can't really adapt this to my current code. I would like to clarify:&lt;BR /&gt;1. do cus_number = 1 to 55; &amp;lt; is this based on the sequence of cus_number (i.e. 1, 2, 3 and not 1234, 2345)?&lt;BR /&gt;2. do _N_ = 1 to rand('integer', 1, 3);&lt;BR /&gt;v1 = _N_ * 2; &amp;lt; what do v1 and v2 mean?&lt;BR /&gt;v2 = _N_ * 3;&lt;BR /&gt;3. I actually have these variables in the dataset to split by customer name. List of variables: cus_number, name, ID_Number, Account_Number, Remitter_Name.&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Thu, 27 Oct 2022 13:36:35 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841134#M41698</guid>
      <dc:creator>abx</dc:creator>
      <dc:date>2022-10-27T13:36:35Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841135#M41699</link>
      <description>Planning to split the large data to do some fuzzy logic on the remitter_name based on the cus_number.</description>
      <pubDate>Thu, 27 Oct 2022 13:37:58 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841135#M41699</guid>
      <dc:creator>abx</dc:creator>
      <dc:date>2022-10-27T13:37:58Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841137#M41700</link>
      <description>&lt;P&gt;So I asked "what is the benefit"? And your reply was:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/391547"&gt;@abx&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Planning to split the large data to do some fuzzy logic on the remitter_name based on the cus_number.&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;I guess I could read between the lines and make some guesses, but I'd rather have you tell me directly ... how does splitting this data set up help?&lt;/P&gt;</description>
      <pubDate>Thu, 27 Oct 2022 13:43:15 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841137#M41700</guid>
      <dc:creator>PaigeMiller</dc:creator>
      <dc:date>2022-10-27T13:43:15Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841139#M41701</link>
      <description>Thank you! It works! But I have a really large dataset. Splitting by each cus_number is quite a lot for me.</description>
      <pubDate>Thu, 27 Oct 2022 13:49:34 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841139#M41701</guid>
      <dc:creator>abx</dc:creator>
      <dc:date>2022-10-27T13:49:34Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841142#M41702</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;HR /&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/391547"&gt;@abx&lt;/a&gt;&amp;nbsp;wrote:&lt;BR /&gt;Thank you! It works! But I have a really large dataset. Splitting by each cus_number is quite a lot for me.&lt;HR /&gt;&lt;/BLOCKQUOTE&gt;
&lt;P&gt;You still have not answered why the split is so important to the process.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;In a very large number of cases, it is beneficial to add a variable that identifies the "group" of interest and then use BY-group processing.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Oct 2022 14:13:27 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841142#M41702</guid>
      <dc:creator>ballardw</dc:creator>
      <dc:date>2022-10-27T14:13:27Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841293#M41711</link>
      <description>&lt;P&gt;If it is a big table,&lt;/P&gt;
&lt;P&gt;you could write ONE data step that creates all the data sets.&lt;/P&gt;
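&lt;P&gt;As a rough sketch of where this is heading: for the sample data used below (ages 11 to 16 in sashelp.class standing in for Cus_Number), the generated step should look approximately like this, apart from spacing and line breaks. It reads have only once and routes each row to its data set:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Same sample data as in the program below */
data have;
   set sashelp.class;
   rename age=Cus_Number;
run;

/* Illustration only: one step, one pass, six output data sets */
data Cus_11 Cus_12 Cus_13 Cus_14 Cus_15 Cus_16;
   set have;
   select (Cus_Number);
      when (11) output Cus_11;
      when (12) output Cus_12;
      when (13) output Cus_13;
      when (14) output Cus_14;
      when (15) output Cus_15;
      when (16) output Cus_16;
      otherwise;
   end;
run;&lt;/CODE&gt;&lt;/PRE&gt;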
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
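&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Sample data: use AGE from sashelp.class as a stand-in for Cus_Number */
data have;
 set sashelp.class;
 rename age=Cus_Number ;
run;

/* One row per distinct Cus_Number */
proc freq data=have noprint;
table Cus_Number /out=temp;
run;

/* Write the text of a single data step to a temporary file */
filename x temp;
data _null_;
file x;
/* DATA statement listing every output data set */
put 'data ';
 do until(last1);
  set temp end=last1;
  put 'Cus_'  Cus_Number;
 end;
/* SELECT block that routes each row to its data set */
put ';set have; select (Cus_Number);';
 do until(last2);
  set temp end=last2;
  put 'when(' Cus_Number ') output Cus_' Cus_Number ';';
 end;
put 'otherwise;end;run;';
stop;
run;

/* Run the generated step; SOURCE2 echoes it to the log */
%include x / source2;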
&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Fri, 28 Oct 2022 11:53:29 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841293#M41711</guid>
      <dc:creator>Ksharp</dc:creator>
      <dc:date>2022-10-28T11:53:29Z</dc:date>
    </item>
    <item>
      <title>Re: split large dataset by group</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841309#M41713</link>
      <description>&lt;P&gt;If I understand correctly, I would agree that you should just produce one data set.&amp;nbsp; Add a variable that identifies which batch the current customer belongs to (1 for the first 100,000 customers, 2 for the next 100,000 customers, etc.).&amp;nbsp; That's actually fairly easy to do:&lt;/P&gt;
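&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data want;
   set have;
   by cus_number;
   /* group = batch number; customer_count = customers seen in this batch */
   retain group 1 customer_count 0;
   if first.cus_number then customer_count + 1;
   output;
   /* After the last row of the 100,000th customer, start a new batch */
   if last.cus_number and customer_count = 100000 then do;
      customer_count = 0;
      group + 1;
   end;
run;&lt;/CODE&gt;&lt;/PRE&gt;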
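&lt;P&gt;To see the batching at work on something small, here is a sketch under stated assumptions: the toy input, the trnx variable, and the batch size of 2 are made up for illustration. It applies the same logic and then counts the distinct customers in each group:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Hypothetical toy input: 5 customers, already sorted by cus_number */
data have;
   input cus_number trnx $;
   datalines;
1234 T1
1234 T2
2345 T1
2345 T2
2345 T3
3456 T1
4567 T1
5678 T1
;
run;

/* Same logic as above, with a batch size of 2 instead of 100000 */
data want;
   set have;
   by cus_number;
   retain group 1 customer_count 0;
   if first.cus_number then customer_count + 1;
   output;
   if last.cus_number and customer_count = 2 then do;
      customer_count = 0;
      group + 1;
   end;
run;

/* Distinct customers per group */
proc sql;
   select group, count(distinct cus_number) as n_customers
   from want
   group by group;
quit;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;With this toy input, groups 1 and 2 should each hold two customers and group 3 the remaining one. Later steps can then use BY group instead of reading separate data sets.&lt;/P&gt;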
&lt;P&gt;(Maybe not that easy ... it took me 3 tries to simplify the logic down to this final state.)&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 28 Oct 2022 14:44:01 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/spilt-large-dataset-by-group/m-p/841309#M41713</guid>
      <dc:creator>Astounding</dc:creator>
      <dc:date>2022-10-28T14:44:01Z</dc:date>
    </item>
  </channel>
</rss>

