topic Re: spilt large dataset by group in Statistical Procedures

spilt large dataset by group

abx — Wed, 26 Oct 2022 10:17:34 GMT

Hi. I have a large dataset that requires to spilt by cus_number. Planning to spilt into several dataset by every 100,000 of cus_number. there will be same cus_number in the dataset as the dataset contains transaction data too.

any idea how i can spilt the data based on the above scenario?

example:
Cus_Number Trnx
1234 Trnx 1 (consider as 1st cus_number)
1234 Trnx 2 (consider as 1st cus_number)
2345 Trnx 1 (consider as 2nd cus_number)
2345 Trnx 2 (consider as 2nd cus_number)
2345 Trnx 3 (consider as 2nd cus_number)
3456 Trnx 1 (consider as 3rd cus_number)

Many thanks for your help

Re: spilt large dataset by group

PeterClemmensen — Wed, 26 Oct 2022 10:30:35 GMT

"Planning to spilt into several dataset by every 100,000 of cus_number". I don't understand this. Can you clarify?

So, in your posted data, do you simply want to split the data by Cus_Number? Or do you want the first 100.000 encountered Cus_Numbers to be in one data set, the next 100.000 encountered Cus_Numbers in another and so on?

Re: spilt large dataset by group

abx — Wed, 26 Oct 2022 11:14:57 GMT

First 100,000 encountered cus_number to be in dataset 1, the subsequent 100,000 encountered cus_number in dataset 2 and so on.

Re: spilt large dataset by group

PeterClemmensen — Wed, 26 Oct 2022 11:19:34 GMT

Ok. Is the data sorted like in the posted sample data?

Re: spilt large dataset by group

abx — Wed, 26 Oct 2022 11:21:55 GMT

yes. sorted by cus_number

Re: spilt large dataset by group

PeterClemmensen — Wed, 26 Oct 2022 11:40:36 GMT

Ok. See if you can use this as a template. I just made up some data to resemble your problem.

The input data set has 55 unique cus_numbers and here I split by the first 10 distinct values in want_1, the next 10 distinct values in want_2 and so on.

Should be reasonably fast in your case as well.

data have;
   call streaminit(123);
   do cus_number = 1 to 55;
      do _N_ = 1 to rand('integer', 1, 3);
         v1 = _N_ * 2;
         v2 = _N_ * 3;
         output;
      end;
   end;
run;

data _null_;

   if _N_ = 1 then do;
      dcl hash h(dataset : 'have(obs = 0)', multidata : 'Y', ordered : 'Y');
      h.definekey('cus_number');
      h.definedata(all : 'Y');
      h.definedone();
   end;

   set have end = z;
   by cus_number;

   if last.cus_number then c + 1;

   h.add();

   if c = 10 | z then do;
      n + 1;

      h.output(dataset : cats('want_', n));
      h.clear();

      c = 0;
   end;

run;

Result (Want_1):

cus_number v1 v2
1          2  3
1          4  6
2          2  3
3          2  3
4          2  3
4          4  6
5          2  3
6          2  3
6          4  6
7          2  3
7          4  6
8          2  3
9          2  3
10         2  3

Re: spilt large dataset by group

PaigeMiller — Wed, 26 Oct 2022 11:51:04 GMT

What is the benefit of doing such a split?

Re: spilt large dataset by group

Ksharp — Wed, 26 Oct 2022 12:21:16 GMT

If you don't care about running time, could try this simple way.

data have;
 set sashelp.class;
 rename age=Cus_Number ;
run;



proc freq data=have noprint;
table Cus_Number /out=temp;
run;
data _null_;
 set temp;
 call execute(catt('data Cus_',Cus_Number,';set have;if Cus_Number=',Cus_Number,';run;' ));
run;

Re: spilt large dataset by group

abx — Thu, 27 Oct 2022 13:36:35 GMT

Sorry. I am still quite at the beginner stage on coding.
Can't really replicate into my current code. Would like to clarify:-
1. do cus_number = 1 to 55; < this is based on the sequence of cus_number (i.e. 1, 2, 3 and not 1234, 2345)?
2. do _N_ = 1 to rand('integer', 1, 3);
v1 = _N_ * 2; < what does v1 and v2 mean?
v2 = _N_ * 3;
3. I actually have these variables in the dataset to split by customer name. List of variable: cus_number, name, ID_Number, Account_Number, Remitter_Name.

Re: spilt large dataset by group

abx — Thu, 27 Oct 2022 13:37:58 GMT

Planning to split the large data to do some fuzzy logic on the remitter_name based on the cus_number.

Re: spilt large dataset by group

PaigeMiller — Thu, 27 Oct 2022 13:43:15 GMT

So I asked "what is the benefit"? And your reply was:

@abx wrote:
Planning to split the large data to do some fuzzy logic on the remitter_name based on the cus_number.

I guess I could read between the lines and make some guesses, but I'd rather have you tell me directly ... how does splitting this data set up help?

Re: spilt large dataset by group

abx — Thu, 27 Oct 2022 13:49:34 GMT

Thank you! It works! but I am have really large dataset. Splitting by each cus_number is quite a lot for me.

Re: spilt large dataset by group

ballardw — Thu, 27 Oct 2022 14:13:27 GMT

@abx wrote:
Thank you! It works! but I am have really large dataset. Splitting by each cus_number is quite a lot for me.

Still have not answered why the split is so important to the process.

In a very large number of cases it is beneficial to add a variable that describes the "group" of interest and process using BY processing.

Re: spilt large dataset by group

Ksharp — Fri, 28 Oct 2022 11:53:29 GMT

If it was big table,

you could write ONE data step for all.

data have;
 set sashelp.class;
 rename age=Cus_Number ;
run;



proc freq data=have noprint;
table Cus_Number /out=temp;
run;

filename x temp;
data _null_;
file x;
put 'data ';
 do until(last1);
  set temp end=last1;
  put 'Cus_'  Cus_Number;
 end;
put ';set have; select (Cus_Number);';
 do until(last2);
  set temp end=last2;
  put 'when(' Cus_Number ') output Cus_' Cus_Number ';';
 end;
put 'otherwise;end;run;';
stop;
run;

%include x/source;

Re: spilt large dataset by group

Astounding — Fri, 28 Oct 2022 14:44:01 GMT

If I understand correctly, I would agree that you should just produce one data set. Add a variable that identifies which batch the current customer belongs to (1 for the first 100,000 customers, 2 for the next 100,000 customers, etc.). That's actually fairly easy to do:

data want;
   set have;
   by cus_number;
   retain group 1 customer_count;
   if first.cus_number then customer_count + 1;
   output;
   if last.cus_number and customer_count = 100000 then do;
      customer_count = 0;
      group + 1;
   end;
run;

(Maybe not that easy ... it took me 3 tries to simplify the logic down to this final state.)