BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Eana
Calcite | Level 5

I've got some data that's just a long list of account numbers.

I need it to be split up into batches of 100 each. 

 

Now, I can get it working normally, but I cannot get it working in a macro.

The normal data steps look like this:

 

data Batch_1;
set Accounts (firstobs=1. obs=100);
run;

data _null_;
if n=<100 then stop;
else call execute('data Batch_2; set Accounts (firstobs=101 obs=200); run;');
stop;
set Accounts nobs=n;
run;

data _null_;
if n=<200 then stop;
else call execute('data Batch_3; set Accounts (firstobs=201 obs=300); run;');
stop;
set Accounts nobs=n;
run;

That works fine. It'll only create the amount of data sets I actually need depending on the amount of account numbers I have that day.

 

But I wanted to put it into a macro and made it look like this:

%Macro Batches;

%do i=1 %to 3;
	
	data _null_;
		%if n=<((&i.*100)-100) %then %do; 
			stop;
		%end;
		%else %do; 
		
			call execute('data Batch_&i.; set Accounts (firstobs=((&i.*100)-99) obs=(&i.*100)); run;');
			stop;
		%end;
	set Accounts nobs=n;
	run; 
%END;

%mend Batches;

%Batches()

But I get the following error:

 

1 + data Batch_1; set Accounts (firstobs=((1*100)-99) obs=(1*100)); run;
_
23
ERROR 23-7: Invalid value for the FIRSTOBS option.

1 !+data Batch_1; set Accounts (firstobs=((1*100)-99) obs=(1*100)); run;
_
23
ERROR: Invalid number conversion on (.
ERROR: Invalid number conversion on (.

ERROR 23-7: Invalid value for the OBS option.

 

So is it because I can't us a calculated value for firstobs and obs= or is it something else?

Is there any way to make this work or another way to make a macro do the same thing?

 

I obviously need to make far more than just 3 batches normally, which is why I'd prefer a macro. 

 

Thanks 🙂

 

1 ACCEPTED SOLUTION

Accepted Solutions
Reeza
Super User
/*This macro splits a data set into data sets of size N. 
The parameters requried are:
1. DSN = input data set name, such as sashelp.cars. 
   The libname should be included unless the data set
   is in the work library.
2. Size = Number of records to be included in each data 
   set. Note that the last data set will be truncated, 
   ie if only 28 records are available only 28 will be 
   written to the output data set.
3. outDsnPrefix = Name of output data sets, will be indexed as 
      outDSNPrefix1
      outDSNPrefix2
      outDSNPrefix3
*/

%macro split (dsn=, size=, outDsnPrefix=Split);

    %*Get number of records and calculate the number of files needed;
    data _null_;
        set &dsn. nobs=_nobs;
        call symputx('nrecs', _nobs);
        n_files=ceil(_nobs/ &size.);
        call symputx('nfiles', n_files);
        stop;
    run;

    %*Set the start and end of data set to get first data set;
    %let first=1;
    %let last=&size.;
    
    %*Loop to split files;
    %do i=1 %to &nfiles;
    
        %*Split file by number of records;
        data &outDsnPrefix.&i.;
            set &dsn. (firstobs=&first obs=&last);
        run;

        %*Increment counters to have correct first/last;
        %let first = %eval(&last+1);
        %let last = %eval((&i. + 1)*&size.);
    %end;
%mend split;

*Example call;
*After running this, you should find 9 data sets named Split1-Split9;
%split(dsn=sashelp.cars, size=50, outDsnPrefix=Split);

https://gist.github.com/statgeek/abc3c6ce1dbeedb84fe7f11da0603cda

View solution in original post

5 REPLIES 5
Reeza
Super User
/*This macro splits a data set into data sets of size N. 
The parameters requried are:
1. DSN = input data set name, such as sashelp.cars. 
   The libname should be included unless the data set
   is in the work library.
2. Size = Number of records to be included in each data 
   set. Note that the last data set will be truncated, 
   ie if only 28 records are available only 28 will be 
   written to the output data set.
3. outDsnPrefix = Name of output data sets, will be indexed as 
      outDSNPrefix1
      outDSNPrefix2
      outDSNPrefix3
*/

%macro split (dsn=, size=, outDsnPrefix=Split);

    %*Get number of records and calculate the number of files needed;
    data _null_;
        set &dsn. nobs=_nobs;
        call symputx('nrecs', _nobs);
        n_files=ceil(_nobs/ &size.);
        call symputx('nfiles', n_files);
        stop;
    run;

    %*Set the start and end of data set to get first data set;
    %let first=1;
    %let last=&size.;
    
    %*Loop to split files;
    %do i=1 %to &nfiles;
    
        %*Split file by number of records;
        data &outDsnPrefix.&i.;
            set &dsn. (firstobs=&first obs=&last);
        run;

        %*Increment counters to have correct first/last;
        %let first = %eval(&last+1);
        %let last = %eval((&i. + 1)*&size.);
    %end;
%mend split;

*Example call;
*After running this, you should find 9 data sets named Split1-Split9;
%split(dsn=sashelp.cars, size=50, outDsnPrefix=Split);

https://gist.github.com/statgeek/abc3c6ce1dbeedb84fe7f11da0603cda

Tom
Super User Tom
Super User

There is a lot of confusion going on in that macro.

 

First let's separate the indentation of the SAS code from the indentation of the MACRO statements to make it easier to follow the logic flow.

%macro Batches;
  %do i=1 %to 3;
data _null_;
    %if n=<((&i.*100)-100) %then %do; 
  stop;
    %end;
    %else %do; 
  call execute('data Batch_&i.; set Accounts (firstobs=((&i.*100)-99) obs=(&i.*100)); run;');
  stop;
    %end;
  set Accounts nobs=n;
run; 
  %end;
%mend Batches;

First why are trying to generate three data _null_ steps in a %DO loop?  What significance does the number three have to this problem?

 

The condition in the first %IF statement makes no sense.  Why are you trying to compare the lowercase letter N to some number?  The macro processor cannot access the value of a data step variable.  And your macro is not creating any macro variable named N.  And even if it did you would need to add an & to reference its value instead of the string n.

 

I suspect you wanted to use the LESS THAN OR EQUAL TO operator.  But instead of using the <= symbol or the LE neumonic that the macro processor recognizes you used the two operators = and <.  So the result was to test if (n<) is less then values like 0 or 100 or 200, depending on the value of &I . Since FALSE evaluates to 0 the overall condition is FALSE when &I=1 and TRUE for larger values of &I.

 

So the first time through the loop the macro generates this data step:

MPRINT(BATCHES):   data _null_;
MPRINT(BATCHES):   call execute('data Batch_&i.; set Accounts (firstobs=((&i.*100)-99) obs=(&i.*100)); run;');
MPRINT(BATCHES):   stop;
MPRINT(BATCHES):   set Accounts nobs=n;
MPRINT(BATCHES):   run;

Not sure why it has the SET statement then since it cannot be executed and the variable N that it creates is not referenced anywhere in the code.

 

Since the string being passed to CALL EXECUTE is inside single quotes that exact string is what gets passed.

data Batch_&i.; set Accounts (firstobs=((&i.*100)-99) obs=(&i.*100)); run;

One thing to notice is that now we didn't need the data _null_ or the CALL EXECUTE() call to do that.  The macro could have just skipped the data _null_ and generated that code itself.

 

The macro variable references will be replaced while CALL EXECUTE() is pushing that string onto the stack to run after the data _null_ step finishes.  Since it only runs when &I is 1 that means it tries to run this data step:

data Batch_1;
  set Accounts (firstobs=((1*100)-99) obs=(1*100)); 
run;

Now the first and the last statements look fine, but that middle one has a problem.  The FIRSTOBS= and OBS= dataset options want numbers, not strings (or strings that look to humans like equations).

 

The main mistake was using the macro language where you wanted to use actual SAS code.

data accounts;
 do acct=1 to 101; output; end;
run;

data _null_;
  i=0;
  do while(i*100 <= n);
    firstobs=i*100+1;
    i+1;
    call execute(catx(' '
      ,'data',cats('Batch_',i),';'
      ,'set Accounts (firstobs=',firstobs,'obs=',min(n,firstobs+99),');'
      ,'run;'
    ));
  end;
  stop;
  set Accounts nobs=n;
run; 

Results when N=101.

1003  data _null_;
1004    i=0;
1005    do while(i*100 <= n);
1006      firstobs=i*100+1;
1007      i+1;
1008      call execute(catx(' '
1009        ,'data',cats('Batch_',i),';'
1010        ,'set Accounts (firstobs=',firstobs,'obs=',min(n,firstobs+99),');'
1011        ,'run;'
1012      ));
1013    end;
1014    stop;
1015    set Accounts nobs=n;
1016  run;

NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


NOTE: CALL EXECUTE generated line.
1    + data Batch_1 ; set Accounts (firstobs= 1 obs= 100 ); run;

NOTE: There were 100 observations read from the data set WORK.ACCOUNTS.
NOTE: The data set WORK.BATCH_1 has 100 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


2    + data Batch_2 ; set Accounts (firstobs= 101 obs= 101 ); run;

NOTE: There were 1 observations read from the data set WORK.ACCOUNTS.
NOTE: The data set WORK.BATCH_2 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

If you did want to make that into a macro then there might be some value if you made a macro that took some parameters so you could reuse it for other input datasets and other blocking factors.

%macro batches(in,out,size);
data _null_;
  i=0;
  do while(i*&size. <= n);
    firstobs=i*&size.+1;
    i+1;
    call execute(catx(' '
      ,'data',cats("&out.",i),';'
      ,"set &in.(firstobs=",firstobs,'obs=',min(n,firstobs+(&size.-1)),');'
      ,'run;'
    ));
  end;
  stop;
  set &in.(drop=_all_) nobs=n;
run; 
%mend batches;
%batches(in=Accounts,out=Batch_,size=100)

Results

1037  %batches(in=Accounts,out=Batch_,size=100)
MPRINT(BATCHES):   data _null_;
MPRINT(BATCHES):   i=0;
MPRINT(BATCHES):   do while(i*100 <= n);
MPRINT(BATCHES):   firstobs=i*100+1;
MPRINT(BATCHES):   i+1;
MPRINT(BATCHES):   call execute(catx(' ' ,'data',cats("Batch_",i),';' ,"set
Accounts(firstobs=",firstobs,'obs=',min(n,firstobs+(100-1)),');' ,'run;' ));
MPRINT(BATCHES):   end;
MPRINT(BATCHES):   stop;
MPRINT(BATCHES):   set Accounts(drop=_all_) nobs=n;
MPRINT(BATCHES):   run;

MPRINT(BATCHES):   data Batch_1 ;
MPRINT(BATCHES):   set Accounts(firstobs= 1 obs= 100 );
MPRINT(BATCHES):   run;
MPRINT(BATCHES):   data Batch_2 ;
MPRINT(BATCHES):   set Accounts(firstobs= 101 obs= 101 );
MPRINT(BATCHES):   run;
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


NOTE: CALL EXECUTE generated line.
1    + data Batch_1 ; set Accounts(firstobs= 1 obs= 100 ); run;

NOTE: There were 100 observations read from the data set WORK.ACCOUNTS.
NOTE: The data set WORK.BATCH_1 has 100 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds


2    + data Batch_2 ; set Accounts(firstobs= 101 obs= 101 ); run;

NOTE: There were 1 observations read from the data set WORK.ACCOUNTS.
NOTE: The data set WORK.BATCH_2 has 1 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
Eana
Calcite | Level 5

Thanks a lot for the detailed reply. It has been helpful for understanding the finer points!

Patrick
Opal | Level 21

@Eana Not really sure what you really want to do and very often splitting a SAS table into many small tables is sub-optimal.

Below two options how to create such tables with a 100 obs each using SAS Base language only.

data accounts;
  do account_id=1 to 512;
    output;
  end;
run;

/* option 1 */
data _null_;
  if _n_=1 then
    do;
      dcl hash h1(dataset:'accounts(obs=0)',multidata:'y');
      h1.defineKey('account_id');
      h1.defineData(all:'y');
      h1.defineDone();
    end;
  set accounts end=last;
  _rc=h1.add();
  if mod(_n_,100)=0 or last then
    do;
      _rc=h1.output(dataset:cats('work.want1_', put(ceil(_n_/100),z10.)));
      _rc=h1.clear();
    end;
run;

/* option 2 */
data _null_;
  do i=1 to nobs by 100;
    cmd=cats('data work.want2_', put(i,z10.),';');
    cmd=catx(' ',cmd,'set accounts(firstobs=',i,'obs=',i+99,'); run;');
    call execute(cmd);
  end;
  stop;
  set accounts nobs=nobs;
run;
Patrick
Opal | Level 21

@Eana And just to elaborate on my statement "very often splitting a SAS table into many small tables is sub-optimal".

It's very often much easier to code for and more efficient to run using group processing. Many SAS procedures support BY and/or CLASS statements that allow to process data by group.

Below how you could create such a group variable.

data accounts;
  do account_id=1 to 512;
    output;
  end;
run;

data grouped;
  set accounts;
  group_id=ceil(_n_/100);
run;

SAS Macro programming is mostly for creation of dynamic code and only something you should be using if you can't do it using "normal" SAS code. 

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 939 views
  • 3 likes
  • 4 in conversation