BookmarkSubscribeRSS Feed
abx
Fluorite | Level 6 abx
Fluorite | Level 6
Hi. I have a large dataset that requires to spilt by cus_number. Planning to spilt into several dataset by every 100,000 of cus_number. there will be same cus_number in the dataset as the dataset contains transaction data too.

any idea how i can spilt the data based on the above scenario?

example:
Cus_Number Trnx
1234 Trnx 1 (consider as 1st cus_number)
1234 Trnx 2 (consider as 1st cus_number)
2345 Trnx 1 (consider as 2nd cus_number)
2345 Trnx 2 (consider as 2nd cus_number)
2345 Trnx 3 (consider as 2nd cus_number)
3456 Trnx 1 (consider as 3rd cus_number)

Many thanks for your help

14 REPLIES 14
PeterClemmensen
Tourmaline | Level 20

"Planning to spilt into several dataset by every 100,000 of cus_number". I don't understand this. Can you clarify?

 

So, in your posted data, do you simply want to split the data by Cus_Number? Or do you want the first 100.000 encountered Cus_Numbers to be in one data set, the next 100.000 encountered Cus_Numbers in another and so on?

abx
Fluorite | Level 6 abx
Fluorite | Level 6
First 100,000 encountered cus_number to be in dataset 1, the subsequent 100,000 encountered cus_number in dataset 2 and so on.
PeterClemmensen
Tourmaline | Level 20

Ok. Is the data sorted like in the posted sample data?

abx
Fluorite | Level 6 abx
Fluorite | Level 6
yes. sorted by cus_number
PeterClemmensen
Tourmaline | Level 20

Ok. See if you can use this as a template. I just made up some data to resemble your problem. 

 

The input data set has 55 unique cus_numbers and here I split by the first 10 distinct values in want_1, the next 10 distinct values in want_2 and so on.

 

Should be reasonably fast in your case as well.

 

data have;
   call streaminit(123);
   do cus_number = 1 to 55;
      do _N_ = 1 to rand('integer', 1, 3);
         v1 = _N_ * 2;
         v2 = _N_ * 3;
         output;
      end;
   end;
run;

data _null_;

   if _N_ = 1 then do;
      dcl hash h(dataset : 'have(obs = 0)', multidata : 'Y', ordered : 'Y');
      h.definekey('cus_number');
      h.definedata(all : 'Y');
      h.definedone();
   end;

   set have end = z;
   by cus_number;

   if last.cus_number then c + 1;

   h.add();

   if c = 10 | z then do;
      n + 1;

      h.output(dataset : cats('want_', n));
      h.clear();

      c = 0;
   end;

run;

 

Result (Want_1):

 

cus_number v1 v2
1          2  3
1          4  6
2          2  3
3          2  3
4          2  3
4          4  6
5          2  3
6          2  3
6          4  6
7          2  3
7          4  6
8          2  3
9          2  3
10         2  3
abx
Fluorite | Level 6 abx
Fluorite | Level 6
Sorry. I am still quite at the beginner stage on coding.
Can't really replicate into my current code. Would like to clarify:-
1. do cus_number = 1 to 55; < this is based on the sequence of cus_number (i.e. 1, 2, 3 and not 1234, 2345)?
2. do _N_ = 1 to rand('integer', 1, 3);
v1 = _N_ * 2; < what does v1 and v2 mean?
v2 = _N_ * 3;
3. I actually have these variables in the dataset to split by customer name. List of variable: cus_number, name, ID_Number, Account_Number, Remitter_Name.

PaigeMiller
Diamond | Level 26

What is the benefit of doing such a split?

--
Paige Miller
abx
Fluorite | Level 6 abx
Fluorite | Level 6
Planning to split the large data to do some fuzzy logic on the remitter_name based on the cus_number.
PaigeMiller
Diamond | Level 26

So I asked "what is the benefit"? And your reply was:

 


@abx wrote:
Planning to split the large data to do some fuzzy logic on the remitter_name based on the cus_number.

I guess I could read between the lines and make some guesses, but I'd rather have you tell me directly ... how does splitting this data set up help?

--
Paige Miller
Ksharp
Super User

If you don't care about running time, could try this simple way.

 

data have;
 set sashelp.class;
 rename age=Cus_Number ;
run;



proc freq data=have noprint;
table Cus_Number /out=temp;
run;
data _null_;
 set temp;
 call execute(catt('data Cus_',Cus_Number,';set have;if Cus_Number=',Cus_Number,';run;' ));
run;
abx
Fluorite | Level 6 abx
Fluorite | Level 6
Thank you! It works! but I am have really large dataset. Splitting by each cus_number is quite a lot for me.
ballardw
Super User

@abx wrote:
Thank you! It works! but I am have really large dataset. Splitting by each cus_number is quite a lot for me.

Still have not answered why the split is so important to the process.

 

In a very large number of cases it is beneficial to add a variable that describes the "group" of interest and process using BY processing.

Ksharp
Super User

If it was big table,

you could write ONE data step for all.

 

data have;
 set sashelp.class;
 rename age=Cus_Number ;
run;



proc freq data=have noprint;
table Cus_Number /out=temp;
run;

filename x temp;
data _null_;
file x;
put 'data ';
 do until(last1);
  set temp end=last1;
  put 'Cus_'  Cus_Number;
 end;
put ';set have; select (Cus_Number);';
 do until(last2);
  set temp end=last2;
  put 'when(' Cus_Number ') output Cus_' Cus_Number ';';
 end;
put 'otherwise;end;run;';
stop;
run;

%include x/source;
Astounding
PROC Star

If I understand correctly, I would agree that you should just produce one data set.  Add a variable that identifies which batch the current customer belongs to (1 for the first 100,000 customers, 2 for the next 100,000 customers, etc.).  That's actually fairly easy to do:

data want;
   set have;
   by cus_number;
   retain group 1 customer_count;
   if first.cus_number then customer_count + 1;
   output;
   if last.cus_number and customer_count = 100000 then do;
      customer_count = 0;
      group + 1;
   end;
run;

(Maybe not that easy ... it took me 3 tries to simplify the logic down to this final state.) 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 812 views
  • 1 like
  • 6 in conversation