Hi All,
I am using SAS 9.4M4 on a Unix platform.
I have a situation where a SAS program has nested loops. The input data is not huge, only 20-30 million observations, but due to the nested loops it is taking almost 20-30 hours.
Below is the sample code:
data _null_;
set ds1;
call symput('a',compress(_n_));
run;
%do j=1 %to &a;
data _null_;
set ds(firstobs=&j obs=&j);
call symput('a',compress(a));
call symput('b',compress(b));
run;
data a;
set ds2(where=(a="&a" and b="&b"));
run;
proc sort data=a nodupkey out=c; by person; run;
data _null_;
set c;
call symput('c',compress(_n_));
run;
%do i=1 %to &c;
****Inside this loop there are a bunch of manipulation steps which are not time-consuming individually, but the overall time spent in the nested loops is a lot. At the end it appends everything into one final DS.
Hence I was wondering if there is a way to parallelize the nested loops so that the iterations can be spawned at the same time and I can append the results later. Appreciate your help and advice.
Thank you
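(One common way to spawn independent SAS sessions in parallel on the same machine is MP CONNECT, which requires SAS/CONNECT to be licensed. A minimal sketch, with hypothetical task names and the per-chunk work left as comments:)

```sas
/* Sketch only: assumes SAS/CONNECT is licensed.   */
/* Each RSUBMIT block runs in its own SAS session. */
options sascmd="!sascmd" autosignon;  /* spawn child sessions on this host */

rsubmit task1 wait=no;
  /* process the first half of the iterations here */
endrsubmit;

rsubmit task2 wait=no;
  /* process the second half of the iterations here */
endrsubmit;

waitfor _all_ task1 task2;  /* block until both sessions finish */
signoff _all_;              /* close the remote sessions */

/* afterwards, PROC APPEND the per-session results into one final DS */
```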
Why?
What is this actually supposed to do? This looks like a terribly over-complicated way to count how many records have some values.
Provide some small examples of the data sets and the desired result, including the manipulation rules you didn't show.
One of the most time-intensive activities is reading from a data set and then writing to a data set. Here you are opening the same data set many times (just how large is that not-shown value of &a?), then selecting and overwriting another data set many times.
One suspects you might be using the macro language where Proc SQL would perhaps be more appropriate to match values from one data set to another.
What is it the inner steps are doing?
Why not just combine the two datasets and do the calculation using BY groups?
proc sql;
create table both as
select ds1.a,ds1.b,ds2.person,ds2.*
from ds1 inner join ds2
on ds1.a = ds2.a and ds1.b=ds2.b
order by 1,2,3
;
quit;
data want;
set both;
by a b person;
.... calculations ...
run;
This is the least efficient way to retrieve the number of observations into a macro variable:
data _null_;
set ds1;
call symput('a',compress(_n_));
run;
You read the whole dataset and call the CALL SYMPUT routine in every observation. Use DICTIONARY.TABLES instead:
proc sql noprint;
select nobs into :a
from dictionary.tables
where libname = "WORK" and memname = "DS1";
quit;
Technically, this only reads the first dataset page and retrieves the value from there.
See this for a test, run on ODA:
data big;
set sashelp.cars;
do i = 1 to 10000;
output;
end;
run;
data _null_;
set big;
call symput("a",_n_);
run;
proc sql noprint;
select nobs into :a from dictionary.tables
where libname ="WORK" and memname = "BIG";
quit;
Log excerpts:
76   data _null_;
77   set big;
78   call symput("a",_n_);
79   run;

NOTE: Numeric values have been converted to character values at the places given by: (Line):(Column).
      78:17
NOTE: There were 4280000 observations read from the data set WORK.BIG.
NOTE: DATA statement used (Total process time):
      real time           1.44 seconds
      user cpu time       1.29 seconds

81   proc sql noprint;
82   select nobs into :a from dictionary.tables
83   where libname ="WORK" and memname = "BIG";
84   quit;

NOTE: PROCEDURE SQL used (Total process time):
      real time           0.00 seconds
      user cpu time       0.01 seconds
But that's just one little and IMO unnecessary part, as you're much better off joining the tables and running the calculations in one step with BY processing. But for that we need to know what you actually do in the innermost loop.
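Another common idiom that avoids reading any data at all is the OPEN/ATTRN function pair; NLOBS is retrieved from the dataset header and counts non-deleted observations. A sketch (the macro name is made up):

```sas
/* Sketch: get the observation count from the dataset header only. */
%macro nobs(ds);
  %local dsid n rc;
  %let dsid = %sysfunc(open(&ds));          /* open the dataset          */
  %let n    = %sysfunc(attrn(&dsid,nlobs)); /* non-deleted observations  */
  %let rc   = %sysfunc(close(&dsid));       /* release the handle        */
  &n
%mend nobs;

%let a = %nobs(work.ds1);
```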
Hi @ballardw @Tom @Kurt_Bremser
Please find the complete code for your evaluation...
%MACRO TEST;
LIBNAME test 'path/location/';

DATA DS1;
  SET test.inp_1;
RUN;

DATA DS2;
  SET test.inp_2;
RUN;

PROC SORT DATA=DS1;
  BY var;
RUN;

PROC SORT DATA=DS2;
  BY var;
RUN;

DATA ds3;
  MERGE ds1(in=a) ds2(in=b);
  BY var;
  IF a AND b;
RUN;

PROC SORT DATA=ds3 nodupkey out=ds4;
  BY var2;
RUN;

DATA _null_;
  SET ds4;
  CALL SYMPUT('var3',compress(_n_));
RUN;

%DO j=1 %TO &var3; **********50 times;
  DATA _null_;
    SET ds3(firstobs=&j obs=&j);
    CALL SYMPUT('var4',compress(var4));
    CALL SYMPUT('var5',compress(var5));
  RUN;

  DATA ds5;
    SET ds3;
    WHERE var4="&var4" and var5="&var5";
  RUN;

  PROC SORT DATA=ds5 nodupkey out=ds6;
    BY var4 var5 var6;
  RUN;

  DATA _null_;
    SET ds6;
    CALL SYMPUT('var7',compress(_n_));
  RUN;

  %DO i=1 %TO &var7; ***************100 times;
    DATA _null_;
      SET ds6(firstobs=&i obs=&i);
      CALL SYMPUT('var4',compress(var4));
      CALL SYMPUT('var5',compress(var5));
      CALL SYMPUT('var8',compress(var8));
    RUN;

    DATA ds7;
      SET ds5;
      WHERE var4="&var4" and var5="&var5" and var8="&var8";
    RUN;

    DATA ds7;
      SET ds7;
      DO i=1 TO count;
        OUTPUT;
      END;
      IF a > . THEN
        CALL SYMPUT('a',compress(a));
      IF b > . THEN
        CALL SYMPUT('b',compress(b));
    RUN;

    PROC SUMMARY data=ds7;
      VAR date;
      OUTPUT out=ds8 min=mindate max=maxdate;
    RUN;

    ******
    ODS FOR GRAPH
    PROC CAPABILITY
    PROC APPEND;
    ******;
  %END;
%END;
%MEND;
%TEST;
PROC SUMMARY data=ds7;
  VAR date;
  OUTPUT out=ds8 min=mindate max=maxdate;
RUN;
******
ODS FOR GRAPH
PROC CAPABILITY
Look into BY group processing here before engaging in complicated and inefficient macro programming.
Your code might be more legible if you pasted it using the appropriate icon.
As it is, it is hard to understand what the purpose of this incredible succession of data steps and macro variable assignments is.
Maybe you just need 2 steps?
1. Run proc sort to remove dupkeys in DS3 by VAR4 VAR5 VAR8
2. Run proc summary by VAR4 VAR5 VAR8?
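A sketch of those two steps, using the variable names from the posted code (the output dataset names are assumptions):

```sas
/* Step 1: sort once so BY-group processing can be used */
proc sort data=ds3 out=ds3_sorted;
  by var4 var5 var8;
run;

/* Step 2: one pass computes every group's date range,
   replacing the 50 x 100 macro-loop iterations */
proc summary data=ds3_sorted;
  by var4 var5 var8;
  var date;
  output out=ranges min=mindate max=maxdate;
run;
```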