Re: Parallel processing for a do loop

quarterpastix · Posted 10-24-2017 03:13 AM

I'm trying to reduce the computational time of a program which uses large datasets. The outline of the program is as follows:

1.create an empty base table A.

2. do the following for I=0 to 35;

3. create a table B_I;

4. append the table B_I to A;

5. end do;

Can someone suggest me some possible techniques to reduce the computational time of the code or if the code can be run parallel by breaking it into small steps?

Kurt_Bremser · Posted 10-24-2017 03:31 AM

Write the program in such a manner that table A is created in one step.

You could only parallelize the creation of the individual datasets, but the append/concatenation has to be done by one process, as only one process can have write access to table A at any given time.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

RW9 · Posted 10-24-2017 04:44 AM

Thats all well and good, but you haven't really shown us anything. From what you post:

data a;
  /* create empty table part */
  length var1 $20 var2 8;

  /* append data part */
  do i=1 to 35;
    /* insert tableb creation here */
  end;
run;

By creatinig data separately and then appending you are creating 35 * read/writes plus headers for each of them.

quarterpastix · Posted 10-24-2017 05:10 AM

The code roughly looks like this.

%macro account ();
	
	PROC SQL;
		create table temp.base(
		    var1 length=8,
			var2 length=8,
			var3 length=8,
			var4 length=8,
			var5 length=8,
			var6 length=8,
			var7 length=8,
			);
	QUIT;

	%do i = 0 %to &number.;

		DATA _NULL_;
			*--- determine year-month of loop in sas date;
			YYYY_MM_SAS = INTNX("MONTH", "&MORT_S_DATE."D, &I.);
			CALL SYMPUTX ('YYYY_MM_SAS', YYYY_MM_SAS);

			*--- determine month of loop in YYYY_MM format;
			CALL SYMPUTX ('YYYY_MM', tranwrd(put(YYYY_MM_SAS, yymmd8.),"-","_"));

			*--- determine month of loop in YYYYMM format;
			CALL SYMPUTX ('YYYYMM', put(YYYY_MM_SAS, yymmn6.));
		RUN;

		DATA add&YYYY_MM._PERF;
			LENGTH 
				var1 length=8
				var2 length=8
				var3 length=8
				var4 length=8
				var5 length=8
				var6 length=8
				var7 length=8
			;
			SET arrears_credit (keep=var1 var2 
				where=(year_month=&YYYYMM.));
			SET ACCOUNT_&YYYY_MM. (KEEP =  var 1 var3 var4 var5 var6 var7
				)
				KEY = var1 / UNIQUE;

			
		RUN;

		PROC APPEND BASE = base DATA = add&YYYY_MM._PERF;
		RUN;

		PROC SQL;
			DROP TABLE add&YYYY_MM._PERF;
		QUIT;

	%end;
%mend account;

RW9 · Posted 10-24-2017 05:35 AM

Yes, unfortunately a common issue this one, and all caused by the fact that date is a part of the dataset name. Think how much simpler the world would be if you had one dataset, with a fixed name and structure, and a column which contained the date, no looping or appending, just simple data. I don't know where this thinking came in but it does seem to account for more than half the questions on this forum.

Anyways (not tested):

%let mort_s_date=12JAN2014;
%let number=5;

data _null_;
  call execute('data base;');
  do i=1 to &number.;
    tmp=intnx("month","&mort_s_date."d, &i.);
    call execute('set arrears_credit (keep=var1 var2 where=(yearmonth='||put(tmp,yymm6.)||'));');
    call execute('set account_'||put(year(tmp),z4.)||'_'||put(month(tmp),z2.)||' (keep=var1 var3 var4 var5 var6 var7);');
  end;
run;

Parallel processing for a do loop