BookmarkSubscribeRSS Feed
texasmfp
Lapis Lazuli | Level 10

I crafted a program that takes a large file, splits it into pieces, and then runs through all the obs in the 1st piece, then all the obs in the second piece, etc, until it goes through all the pieces.  Its crude, but it works except for one issue.  The original large dataset has a running list_number.  In my small sample dataset, it goes from 1 to 129 obs.  When I split it into 6 datasets of 25 or less, that list_number remains.  So piece 1 has a list number 1-25.  piece 2 is 26-50, etc.  Unfortunately, my loop resets the list_number to 1 as it starts with  piece 2, and so on and so on.  Total iterations of the loop within a loop is correct, but because the list_number in piece 2 starts with 26, and the loop is looking for list_number 1, it fails to read anything from piece 2.

 

So, I either need to keep the list_number counter from resetting back to 1 or, when I create the splits, somehow reset all of the list_numbers.

 

Attached is my code.  The US13 macro will eventually be filled with a regression analysis.  It just beeps for now while I am debugging.

 

DM 'CLEAR LOG; CLEAR OUTPUT'; RESETLINE;
options ExtendObsCounter=no;
options notes source source2 mprint mlogic symbolgen error=20;
LIBNAME COMPANY 'E:\SAS Data\Regression';

/*This macro splits a data set into data sets of size N. 
The parameters required are:

1. DSN = input data set name, such as sashelp.cars. 
   The libname should be included unless the data set
   is in the work library.
2. Size = Number of records to be included in each data 
   set. Note that the last data set will be truncated, 
   ie if only 28 records are available only 28 will be 
   written to the output data set.
3. outDsnPrefix = Name of output data sets, will be indexed as 
      outDSNPrefix1
      outDSNPrefix2
      outDSNPrefix3
*/

%macro split (dsn=, size=, outDsnPrefix=);

    %*Get number of records and calculate the number of files needed;
    data _null_;
        set &dsn. nobs=_nobs;
        call symputx('nrecs', _nobs);
        n_files=ceil(_nobs/ &size.);
        call symputx('nfiles', n_files);
        stop;
    run;

    %*Set the start and end of data set to get first data set;
    %let first=1;
    %let last=&size.;
    
    %*Loop to split files;
    %do i=1 %to &nfiles;
    
        %*Split file by number of records;
        data &outDsnPrefix.&i.;
            set &dsn. (firstobs=&first obs=&last);
        run;

        %*Increment counters to have correct first/last;
        %let first = %eval(&last+1);
        %let last = %eval((&i. + 1)*&size.);
    %end;
%mend split;

*Example call;
*After running this, you should find 9 data sets named Split1-Split9;
%split(dsn=company.testcomb130, size=25, outDsnPrefix=TESTCOMB);


%MACRO US13;

data _null_;
   call sound(440,200);
run;

%MEND US13;


%macro test;                                                                                                                            

proc sql noprint;
 select count(*) into :n
  from testcomb&padded_number
 ;
quit;

%do i=1 %to &n;
 data _null_;
  set testcomb&padded_number;
  where list_number = &i ;
  call symputx('changelist',combos2);
  run;  

%US13;	
                                                                                                                                        
%end;                                                                                                                                   

%mend;


%macro join;                                                                                                                           

data testcomb;
set company.testcomb130;
run;

proc sql noprint;
 select count(*) into :n
  from testcomB
 ;
quit;

data _null_;
        set company.testcomb130 nobs=_nobs;
        call symputx('nrecs', _nobs);
        n_files=ceil(_nobs/ 25);
        call symputx('nfiles', n_files);
        stop;
    run;


%let num_files = &nfiles;


%do z=1 %to &num_files;
    %let padded_number = %sysfunc(putn(&z, 4.0));

%test

%end;

%mend join;                                                                                                                                  
                                                                                                                               
%join
14 REPLIES 14
FreelanceReinh
Jade | Level 19

Hello @texasmfp,

 

As a quick fix you may want to introduce a parameter in macro TEST

%macro test(x);

and change the WHERE statement to

where list_number = &i+(&x-1)*25;

(ideally using a macro variable instead of the hardcoded "25", though).

 

Then you can pass the value of macro variable z to macro TEST in macro JOIN:

%test(&z)

 

gamotte
Rhodochrosite | Level 12

Hello,

 

You should tell us exactly what you want to do because

- SAS data step and procs have been designed to allow by group processing so that splitting datasets is seldom required.

- As such your macro does not seem to make much sense. For instance, in your test macro, you have a data null step in a loop, but this data step only contains a call symput with a constant first argument ("changelist") => &changelist. will have the value generated by the last call symput.

 

/* Test dataset */
data cars;
    set sashelp.cars;
run;

/* Splitting datasets */
%let DS_SIZE=5;

data _NULL_;
    set cars nobs=_nobs;
    call symputx('NB_DS',ceil(_nobs/&DS_SIZE.));
run;

data _NULL_;
    call execute('data ');

    do i=1 to &NB_DS.; call execute(cats('cars',i)); end;

    call execute('; set cars;');

    do i=1 to &NB_DS.;
        call execute(catx(' ','if', i-1,'< _N_/&DS_SIZE. <=', i, 'then output', cats('cars',i,';')));
    end;

    call execute('run;');
run;

/* Joining datasets */
data carsb;
    set cars1-cars&NB_DS.;
run;

proc compare base=cars comp=carsb;
run;
texasmfp
Lapis Lazuli | Level 10

gamotte:

 

I experienced ballooning iteration times as I run larger and larger datasets (i.e., as the number of changelist values I was testing increased, so did the run time/iteration.  Particularly in datasets from 250,00 up to 12 million obs.  After much troubleshooting, I narrowed it down to the size of the testcomb2 dataset.  I ran a test of 125k, 250k, 500k, and 1M obs and, sure enough, there was a direct correlation between the size and the iteration cycle run time.  That is, as the dataset gets larger, it takes longer to read through it to find the next changelist value to run through the US13 loop.  My program uses a few obs and beeps rather than uses the changelist value to speed things up.

 

So what I am trying to do is split the testcomb2 dataset into X number of pieces no larger than say 100k and then run each of those 100k pieces consecutively.  That will greatly reduce total run time compared to running a single file with 1M obs.

Tom
Super User Tom
Super User

@texasmfp wrote:

gamotte:

 

I experienced ballooning iteration times as I run larger and larger datasets (i.e., as the number of changelist values I was testing increased, so did the run time/iteration.  Particularly in datasets from 250,00 up to 12 million obs.  After much troubleshooting, I narrowed it down to the size of the testcomb2 dataset.  I ran a test of 125k, 250k, 500k, and 1M obs and, sure enough, there was a direct correlation between the size and the iteration cycle run time.  That is, as the dataset gets larger, it takes longer to read through it to find the next changelist value to run through the US13 loop.  My program uses a few obs and beeps rather than uses the changelist value to speed things up.

 

So what I am trying to do is split the testcomb2 dataset into X number of pieces no larger than say 100k and then run each of those 100k pieces consecutively.  That will greatly reduce total run time compared to running a single file with 1M obs.


Since you haven't explained what you are replacing the beeps with it is hard to evaluate whether it could be done against the full dataset using BY group processing.  If what you tested was code that first picked the middle 100K observations from a big dataset and then did something with it then of course it will run slower as the has to skip more and more observations to find that middle part to use.

 

If you just want to split into approximately equal size pieces then code @gamotte looks much simpler and clearer than the complicated code you posted. Once you have the pieces then run something to call your beeping program on each of the pieces.

Tom
Super User Tom
Super User

Your SPLIT macro looks ok, inefficient but workable.

So I am not sure I understand what you question is "list_number"?

 

What is that variable? how is it used?

 

Are you just trying to count where you were originally in the big dataset?  Couldn't you just store that into a variable in the split datasets.

 

Or are you trying calculate where you left off before? 

Does each step always use the same number, so that you could just use arithmetic to figure out that the Nth run starts at (N-1)*size+1? 

If not then why not just store the number at the end of one loop/run and read it back in at the start of the next.  Either into a global macro variable or into a dataset.

texasmfp
Lapis Lazuli | Level 10
Tom, the list_number is just a consecutive numbering of the observations (1-end, be it 100k, 500k, or 10M). When the smaller test file testcomb130 is split, it retains the same numbers. So split2 starts with a list_number of 26 in the dataset. However, the loop is searching for a list_number of 1, doesn't find it, and so it does not read in a changelist value. Two ways to skin the cat: renumber the list_numbers in the splits to restart at 1 or, to tell the loop to look for 26, 51, etc. I know what I want to do but, as is evident from my inefficient code, lack the programming depth to know how to effectuate one of those options. Hence the post.
Tom
Super User Tom
Super User

@texasmfp wrote:
Tom, the list_number is just a consecutive numbering of the observations (1-end, be it 100k, 500k, or 10M). When the smaller test file testcomb130 is split, it retains the same numbers. So split2 starts with a list_number of 26 in the dataset. However, the loop is searching for a list_number of 1, doesn't find it, and so it does not read in a changelist value. Two ways to skin the cat: renumber the list_numbers in the splits to restart at 1 or, to tell the loop to look for 26, 51, etc. I know what I want to do but, as is evident from my inefficient code, lack the programming depth to know how to effectuate one of those options. Hence the post.

Why is the inner loop using the observation number from the larger dataset instead of the observation number from the dataset it is currently using?

 

Any way it sounds like you just need to do the arithmetic.  If you know that each file has 100 observations and you are processing file number 2 then you are processing 101,102,.... 200. To get from 2 to 101 you multiple 100 times (2-1) and add 1.

 

What macro variable in you code has the size (number of observations) of each file?  Let's call in FILE_SIZE.

What macro variable has the file number you are currently processing? Lets call in FILE_NUMBER.

What macro variable is it that you want to count from 101 instead of from 1?  Let's call it  START_NUMBER.

%let start_number=%eval((&file_number-1)*&file_size+1);
texasmfp
Lapis Lazuli | Level 10

Tom:

 

your equation was very close, but I had to adjust the logic since, when file_number is 1, your equation resolved to zero.

 

This is what worked.  I created counters for the start, but also the end number, as otherwise if the last dataset had 1 obs, but the file size was set to 10,000, then it ran an additional 9,999 iterations.

 

Its not the prettiest, but it works.  If anybody has improvements to this working solution, I welcome the insight.  Thanks

 

%macro test;                                                                                                                            

proc sql noprint;
 select count(*) into :n
  from testcomb&padded_number
 ;
quit;

%let start_number=%eval(((&padded_number*&file_size)-&file_size)+1);
%let end_number=%eval(&start_number+&n);

%do i=&start_number %to &end_number;
 data _null_;
  set testcomb&padded_number;
  where list_number = &i ;
  call symputx('changelist',combos2);
  run;  

%US13;	
                                                                                                                                        
%end;                                                                                                                                   

%mend;


%macro join;                                                                                                                           

proc sql noprint;
 select count(*) into :n
  from &test_file
 ;
quit;

data _null_;
        set &test_file nobs=_nobs;
        call symputx('nrecs', _nobs);
        n_files=ceil(_nobs/&file_size);
        call symputx('nfiles', n_files);
        stop;
    run;


%let num_files = &nfiles;


%do z=1 %to &num_files;
    %let padded_number = %sysfunc(putn(&z, 4.0));

%test

%end;

%mend join;                                                                                                                                  
                                                                                                                               
%join
Tom
Super User Tom
Super User

Why are you counting the number of observations using an SQL query and then counting them again by using the NOBS= option the SET statement?  You will save a lot of time by just using the count from the NOBS option.  If you don't trust the NOBS option then just use the count you get from the query instead.

 

This statement makes no sense at all. What are you trying to do here?

 

 %let padded_number = %sysfunc(putn(&z, 4.0));

This loop seems silly. 

%do i=&start_number %to &end_number;
 data _null_;
  set testcomb&padded_number;
  where list_number = &i ;
  call symputx('changelist',combos2);
  run;  

%US13;	
                                                                                                                                        
%end;    

It looks like you just want to do:

data _null_;
  set testcomb&padded_number;
  call execute(cats
('%nrstr(%let) changelist=',combos2,';'
,'%nrstr(%us13;)'
));
run;

In which case you probably do not need to worry about the LIST_NUMBER variable and what its value was.  But if you did need to put its value into the macro variable I you can add that to the string that you pass to call execute.

Personally I prefer writing the code to a file instead as I find it much easier to debug than call execute.

file code temp;
data _null_;
  set testcomb&padded_number;
  file code;
  put '%let changelist=' combos2 ';'
    / '%us13;'
  ;
run;
%include code / source2;

Also I cannot figure out what TESTCOM&padded_number is.  Didn't you use the SPLIT macro to convert some big dataset into a series of smaller datasets? Didn't you just number the dataset as TESTCOMB1,TESTCOMB2, ...?  Why are you looking for datasets named with numbers like TESTCOMB1, TESTCOM101,TESTCOMB201 ?

 

 

texasmfp
Lapis Lazuli | Level 10

Tom:  Thanks for the suggestion.  I have no idea, I just borrowed code I found online.  I assume you are referring to the 2nd SQL, within the JOIN macro.  So I can delete the SQL code and just use the remaining?

 

proc sql noprint;
 select count(*) into :n
  from &test_file
 ;
quit;

data _null_;
        set &test_file nobs=_nobs;
        call symputx('nrecs', _nobs);
        n_files=ceil(_nobs/&file_size);
        call symputx('nfiles', n_files);
        stop;
    run;

I am also not sure what this line does.  I think I can delete it as well.

 

        call symputx('nrecs', _nobs);
Tom
Super User Tom
Super User

That line is setting NRECS macro variable to the number of observations. So NRECS and N should have the same value. Although the value in NRECS will be easier to work with since it will not have leading spaces (or switch to scientific notation if the value is so large it needs more than 12 digits).

texasmfp
Lapis Lazuli | Level 10
OK, so I can just delete this:

proc sql noprint;
select count(*) into :n
from &test_file
;
quit;

Every millisecond counts!
texasmfp
Lapis Lazuli | Level 10

Tom:  i did not see your additional comments.  I will try this.

data _null_;
  set testcomb&padded_number;
  call execute(cats
('%nrstr(%let) changelist=',combos2,';'
,'%nrstr(%us13;)'
));
run;

As for question on the the padded_number and the testcomb1 etc files, the two questions are related.

 

The padded_number creates a number to add to the split files.  I didn't just create TESTCOMB1, TESTCOMB2, I need to call those sequentially.  So I create the padded_number which then attaches to the TESTCOMB name to then call TESTCOMB1, then TESTCOMB2, etc.  I can't hardcode TESTCOMB1, 2, 3 etc, because this has to be scalable.  I run 14 series of changelist values per program.  The size of each changelist series which grow in size exponentially with each series.  I also run 36 programs simultaneously, using batch submit.  Each of those programs has, in turn, 14 series of changelist values and every series has a different number than the other 35 programs.

 

 

texasmfp
Lapis Lazuli | Level 10

Tom:  thanks for your suggestions.  I did test runs with that SQL statement, without it, and then again with it.  I also ran a control program concurrently, which is identical other than it does not try to split the file into pieces.  Both programs ran in batch mode.

 

I can tell you that dropping that SQL code increased the run time compared to with.  Running it with it in again corroborated the results.  I am not sure how deleting code can increase the run time, but it did.

 

I also ran your suggested code. 

 

data _null_;
  set testcomb&padded_number;
  call execute(cats
('%nrstr(%let) changelist=',combos2,';'
,'%nrstr(%us13;)'
));
run;

It failed.  I am going to stick with a running program for now.  Though I will probably look for efficiency improvements in the near future.  Thanks for your suggestions.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 2295 views
  • 0 likes
  • 4 in conversation