mfp
Calcite | Level 5

Thanks to the SAS Community, I was able to create a DO loop that calls a macro and substitutes a text string, one at a time.  Long story short, while it works great, when the # of iterations starts to climb, the time per iteration balloons.  The root cause is that at the end of each iteration of the called macro, the resulting values are appended as a row to a results dataset, and reading/writing that dataset slows as its size increases.

 

I think I can resolve this by splitting the input dataset (testcomb2), which contains the text strings, into multiple smaller datasets.  So, instead of running 1.6 million rows in one pass, I could create 16 smaller datasets of 100,000 rows each; each set of 100,000 would be a discrete completion of the DO loop.  That way, the results file never gets big enough to bog things down.  I found some code that should accomplish splitting into distinct datasets.

 

I suppose I could simply repeat the code for the iteration loop 16 times, each time naming a different one of the smaller input sets and creating a results dataset with a slightly different name, but I am sure smarter people than me can suggest a more elegant method, especially since it needs to be scalable (1.6 million this run; 3.2 million (32 sets of 100,000) the next; 500,000 (5 sets of 100,000) the run after that; etc.).

 

Perhaps I don't need to split into smaller datasets at all, and can simply tell the loop to process list numbers 1 through 100,000, end the loop, then repeat the loop for 100,001 through 200,000, and so on.

 

Or, perhaps I don't need to split either the input dataset or the # of iterations in the loop.  Rather, when the results dataset reaches 100,000 obs, the program spits it out as finished and starts a new one (with a slightly different name counter).

 

Here is the loop.

 

**************************************************************************************;
** G-6: LOOPING MACRO                                                               **;
**************************************************************************************;
%ODSOff
%macro test;

 /* Count the rows in the input dataset so the loop knows how many list numbers exist */
 proc sql noprint;
  select count(*) into :n
   from testcomb2
  ;
 quit;

 %do i=1 %to &n;

  /* Pull the text string for this list_number into macro variable CHANGELIST */
  data _null_;
   set testcomb2;
   where list_number = &i ;
   call symputx('changelist',combos2);
  run;

  /* Run the analysis macro for this value */
  %US13_TARGET_ANALYSIS;

 %end;

%mend;

%test
%ODSOn
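For reference, here is a minimal sketch of the batched variant described above.  It assumes list_number runs contiguously from 1 to the row count, that the called macro appends its results to a WORK dataset named PMS, and that renaming each finished batch to PMS_&b is acceptable; the macro name, batch size, and PMS_&b naming are purely illustrative.

%macro test_batched(batchsize=100000);

 /* Total number of list_number values to process */
 proc sql noprint;
  select count(*) into :n trimmed
   from testcomb2;
 quit;

 %let nbatches = %sysevalf(&n / &batchsize, ceil);

 %do b=1 %to &nbatches;
  %let first = %eval((&b - 1) * &batchsize + 1);
  %let last  = %sysfunc(min(%eval(&b * &batchsize), &n));

  %do i=&first %to &last;
   data _null_;
    set testcomb2;
    where list_number = &i ;
    call symputx('changelist',combos2);
   run;

   %US13_TARGET_ANALYSIS;
  %end;

  /* Close out this batch: park its results under a numbered name so the
     next batch appends to a brand-new, empty PMS dataset */
  proc datasets lib=work nolist;
   change pms = pms_&b;
  quit;
 %end;

%mend test_batched;

%test_batched(batchsize=100000)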

 

15 REPLIES
Tom
Super User

PROC APPEND is the right tool for adding data to a dataset; it eliminates having to constantly read and rewrite the same data over and over.

%macro mymac ...
...
%do ...
.... create dataset with the new observations, call it NEXT for example...
proc append base=all_results data=next force;
run;
%end;
%mend ;

* Make sure ALL_RESULTS is empty/non-existent before starting;
proc delete data=all_results;
run;
%mymac;

* Now analyze ALL_RESULTS ;
mfp
Calcite | Level 5

Thanks Tom.  Perhaps I did not explain correctly: it's the act of appending that slows things down as the number of appended observations gets into the hundreds of thousands.  Is there a way to stop appending once the observation count hits a limit (100,000) and start a new results dataset?  Alternatively, can the # of iterations be set to stop after that limit, and then the loop restarted at observation 100,001 through 200,000, etc.?  I could also limit the size of the input dataset to 100,000.  In all three alternatives, the idea is to limit the size of the appended dataset to 100,000 obs, regardless of whether you limit the input data, the # of iterations, or the size of the appending results dataset.

Tom
Super User

You need to provide a lot more information about what you are actually doing to get anything very useful.

 

It is not hard to limit the number of observations read from a dataset: you could use the OBS= dataset option, or stop after a set number of iterations of the data step.
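For example, a minimal sketch of reading just one 100,000-row slice with the FIRSTOBS= and OBS= dataset options (the slice boundaries and output name here are only illustrative):

/* OBS= names the last observation to read, not a count, so this
   pulls rows 100,001 through 200,000 of the input */
data chunk2;
 set testcomb2(firstobs=100001 obs=200000);
run;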

PaigeMiller
Diamond | Level 26

You haven't shown us the code in 

%US13_TARGET_ANALYSIS;	

so it's really hard to say how we can speed this up. I do agree with @Tom that PROC APPEND is the way to go. I also am sure that you can split the data up as you requested, and I am pretty sure that you won't need to do that, but we can't be sure without seeing your code. 

--
Paige Miller
mfp
Calcite | Level 5

I am not trying to speed up the called macro by changing the data steps; I am trying to speed it up by limiting the results dataset to one that is no bigger than 100,000 obs.  Of course, I also don't want to overwrite the dataset with each batch of 100,000, so it has to have a slightly different name.  Here is the last step of the called macro, where the appending takes place:

 

proc append FORCE base=PMS data=TARGETMEANS2;
run; 

%end;

%MEND US13_TARGET_ANALYSIS;

 

Reeza
Super User

So it has to have a slightly different name.  

No it doesn't.  You don't use it beyond that step, do you?  You append it to the final data set, so it's only the final data set you need, unless you want to keep all the intermediary data sets for some reason.  Make sure to clean it up so you don't accidentally re-use it, but you don't necessarily need a unique name.

mfp
Calcite | Level 5

 

Just the opposite, Reeza: that is the final dataset.  I absolutely want to keep it forever, along with all of the other 100,000-obs datasets that are created.  It gets finalized in the step after the end of the iteration loop:

 

%US13_TARGET_ANALYSIS;	
                                                                                                                                        
%end;                                                                                                                                   

%mend;                                                                                                                                  
                                                                                                                                        
%test
%ODSOn

**************************************************************************************;
** G7: TURN PRINTING BACK ON AND SAVE RESULTS                                       **;
**************************************************************************************;
proc printto;
run;

data company.&Total;  
set pms;
run;

In sum, it takes 2 hours to run 100,000 obs through the loop.  But to run 200,000 it takes 5 hours, 300,000 takes 9 hours, etc.  So it's much faster to run 3 sets of 100,000, which will take 6 hours, than 1 big set of 300,000, which will take 9 hours.

 

 

Reeza
Super User

But you're appending it to a final data set while you loop. You only keep the final appended data set, so why do you need the intermediate data sets?

 

Here's a macro that splits the data set by size and gives the pieces unique names.

You could just modify the macro to run the process in the middle and then drop the split piece afterwards, as in the version below (the added steps are the macro call, the PROC APPEND, and the PROC SQL drop).  The link below has the original code to just split the files, if that's what you'd like to do.

https://gist.github.com/statgeek/abc3c6ce1dbeedb84fe7f11da0603cda

 

%macro split (dsn=, size=, outDsnPrefix=Split);

    %*Get number of records and calculate the number of files needed;
    data _null_;
        set &dsn. nobs=_nobs;
        call symputx('nrecs', _nobs);
        n_files=ceil(_nobs/ &size.);
        call symputx('nfiles', n_files);
        stop;
    run;

    %*Set the start and end of data set to get first data set;
    %let first=1;
    %let last=&size.;
    
    %*Loop to split files;
    %do i=1 %to &nfiles;
    
        %*Split file by number of records;
        data &outDsnPrefix.&i.;
            set &dsn. (firstobs=&first obs=&last);
        run;

        %RunYourMacroHERE;

        proc append base=final data=&outDsnPrefix.&i. force;
        run;

        proc sql;
            drop table &outDsnPrefix.&i.;
        quit;

        %*Increment counters to have correct first/last;
        %let first = %eval(&last+1);
        %let last = %eval((&i. + 1)*&size.);
    %end;
%mend split;

*Example call;
%split(dsn=sashelp.cars, size=50, outDsnPrefix=Split);
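For the case in this thread, the call would presumably look something like the line below, with %RunYourMacroHERE replaced by the inner loop that sets CHANGELIST and calls %US13_TARGET_ANALYSIS (the dataset name and prefix are assumptions):

%split(dsn=testcomb2, size=100000, outDsnPrefix=chunk);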

 

Patrick
Opal | Level 21

@mfp 

Using Proc Append, the elapsed times for adding 100T rows to a table with 1M rows versus one with 100M rows won't differ that much. I've done a test in my environment and the elapsed time only doubled.

I haven't tested it, but I guess that when there is an index on the base table, run times will increase more. So if there is an index and you'll be adding a lot of slices, it's eventually better to first drop the index and re-create it once you're done adding data.
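If there is an index, a sketch of that drop/re-create pattern with PROC DATASETS might look like this; the library, dataset, and the indexed variable MYVAR are purely hypothetical:

/* Drop the hypothetical simple index before the bulk of the appends */
proc datasets lib=work nolist;
 modify pms;
 index delete myvar;
quit;

/* ... run all the appends here ... */

/* Re-create the simple index on MYVAR once the data is complete */
proc datasets lib=work nolist;
 modify pms;
 index create myvar;
quit;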

 

If you've got SAS logs for processing 100T, 200T and 300T records and you say that elapsed time doesn't increase linearly, then investigate the SAS logs to determine which step(s) actually grow disproportionately. Is that really the append or something else? Let us know and then share at least the relevant code and log for those steps.
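One simple way to get per-step timings into those logs is the standard FULLSTIMER system option, set before the looping macro runs:

/* Adds detailed real time, CPU time and memory notes to every step in the log */
options fullstimer;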

mfp
Calcite | Level 5

Hmmm!  Is it possible the writing of the SAS log is causing the issue?  I did note that the log file was several hundred MB in size when I ran about 500T obs?

Patrick
Opal | Level 21

@mfp wrote:

Hmmm!  Is it possible the writing of the SAS log is causing the issue?  I did note that the log file was several hundred MB in size when I ran about 500T obs?


It certainly impacts performance, and you probably should reduce the logging level. But even several hundred MB doesn't explain the difference in run-times you observe, unless you've got a very slow disk.
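A minimal sketch of cutting the log volume around the loop with standard system options (re-enable them afterwards):

/* Suppress notes, source echo and macro trace output while the loop runs */
options nonotes nosource nosource2 nomprint nomlogic nosymbolgen;

/* ... call the looping macro here ... */

/* Restore normal logging */
options notes source source2;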

Reeza
Super User

@mfp wrote:

Hmmm!  Is it possible the writing of the SAS log is causing the issue?  I did note that the log file was several hundred MB in size when I ran about 500T obs?


You're trying to process 500 trillion observations? You initially mentioned 1.6 million rows.

 

Turning off all messages to the log except errors is a starting point.

 

 

ChrisNZ
Tourmaline | Level 20

1. This is very odd. PROC APPEND should only take proportionally longer as the number of observations grows.

 

2. What are you appending to? a SAS data set?

 

3. This might answer your expressed need:

proc append FORCE base=PMS data=TARGETMEANS2(firstobs=&first_obs_nb. obs=&last_obs_nb);
run; 

Do a macro loop to manage the observations being appended.
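A minimal sketch of such a loop, assuming (as this reply does) that TARGETMEANS2 already holds all the rows to be appended, and using a purely illustrative macro name and 100,000-row slice size:

%macro append_in_slices(size=100000);
 %local nobs nslices s first last;

 /* Count the rows that need to be appended */
 proc sql noprint;
  select count(*) into :nobs trimmed from targetmeans2;
 quit;

 %let nslices = %sysevalf(&nobs / &size, ceil);

 %do s=1 %to &nslices;
  %let first = %eval((&s - 1) * &size + 1);
  %let last  = %eval(&s * &size);

  /* OBS= past the end of the dataset is harmless; SAS stops at the last row */
  proc append base=pms data=targetmeans2(firstobs=&first obs=&last) force;
  run;
 %end;
%mend append_in_slices;

%append_in_slices(size=100000)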

PaigeMiller
Diamond | Level 26

@mfp wrote:

I am not trying to speed up the called macro by changing the data steps; I am trying to speed it up by limiting the results dataset to one that is no bigger than 100,000 obs.  Of course, I also don't want to overwrite the dataset with each batch of 100,000, so it has to have a slightly different name.  Here is the last step of the called macro, where the appending takes place:

 

proc append FORCE base=PMS data=TARGETMEANS2;
run; 

%end;

%MEND US13_TARGET_ANALYSIS;

 


Okay, you want to keep the existing data sets, etc., but you have completely ignored the request to see your code so we can know what you are doing and make suggestions for improvement. So we don't know what you are doing, and we can't advise.

--
Paige Miller
