DATA Step, Macro, Functions and more

data step: generating many subsets

Accepted Solution Solved
Reply
Contributor
Posts: 34
Accepted Solution

data step: generating many subsets

Hi,

I am from an input file using a do loop in a data step.  (please see the  partial code below). In the test code I am asking for 30 000 subsets but SAS generates only about 22 000 than stops. What could be a reason for this? To complete my task I will need to generate 250 000 subsets from an input file. How I can do this? Should I try  proc SQL instead of a data step?  Thank you!

The file in set instruction is more than 1 Gb and the generated subsets are only 17Kb.

Data %do i=1 %to 30000; inputdata&i end;;

set inputdata;

%do i=1 %to 30000;

If BSTR=&i then output input&i;

%end;

run;


Accepted Solutions
Solution
‎02-15-2012 02:03 PM
Super User
Posts: 5,085

data step: generating many subsets

Lida,

While all this is likely to work, you will still encounter similar problems in the bottom half of your program.  You will encounter resource issues when you try to merge 250,000 data sets.

A better approach would collect the output data sets vertically instead of horizontally.  For example:

data lifetable;

   set &output_table._&i (keep=sex agegroup ex exa);

   /* Extra, but possibly not needed:  BSTR = &i;  */

run;

proc append data=lifetable base=&output_data;

run;

/* PROC SQL as above, to drop &output_table._&i  */

%end;  /* All loop iterations are complete */

proc summary data=&output_data nway;

   var ex exa;

   class sex agegroup;

   output out=&output_data (drop=_freq_ _type_)

             mean = LEmean HALEmean

             var = LEvar HALEvar;

run;

It would be up to you if you wanted to replace &output_data with the summarized version, or if you wanted to keep all the detail from each bootstrap sample and use a separate data set to hold the summary.  But don't try to merge 250,000 data sets and be wary of creating a data set with 1M variables when you have an alternative.

It looks like you already have PROC SQL dropping any earlier version of &output_data, which is a good thing to do.

Good luck.

View solution in original post


All Replies
PROC Star
Posts: 7,363

data step: generating many subsets

You will have to post more of your code and the part of the log showing the error (if any) you are receiving would help.

It appears that your code is embedded in a macro but, as posted, wouldn't run.

I think more info is needed before anyone can help.

Respected Advisor
Posts: 3,124

data step: generating many subsets

You would need to share more code to the boundary of your Macro definition. %do does not work this way (open code).

Regards,

Haikuo

Contributor
Posts: 34

data step: generating many subsets

Yes, you are right: this is a partial code from a macro defenition.

%macro test (inputdata=,b=);

data %do i=1 %to &b; inputdata&i end;;

set &inputdata;

%do i=1 %to 30000;

If BSTR=&i then output input&i;

%end;

run;

/* some other calculations*/

%mend test;

%test (inputdata=mydata,b=30000);

I cannot show you the log now because I am running a program now and saving a log into a file. However, I can see through the windows explorer that the macro generated only about 22 000 subsets out of 30000. It continues to run using the generated subsets to perform other calculations.

Thanks.

Super User
Posts: 5,085

data step: generating many subsets

There could easily be a hardware limitation.  SAS may not have the resources to track 30,000 data sets in a single data step.  You may be forced to run the program several times, creating 20,000 data sets each time and using different ranges for &I each time.

As a secondary suggestion, I would add this line inside your second %DO loop:

%if &i > 1 %then else;

It will turn your set of 30,000 IF/THEN statements into IF/THEN/ELSE statements.  (Of course, if &I ranges from 20001 to 40000, then the comparison becomes %if &i > 20001 %then elseSmiley Wink

Good luck.

Contributor
Posts: 34

data step: generating many subsets

Yes, my plan B is to run this by parts. But I am still hoping to find a way to run it only one time.

I will check your other suggestion. thanks!

Super User
Posts: 5,085

data step: generating many subsets

I would be interested in seeing how much difference that makes in the total run time.

The other posts are valid, though.  If you can post what is involved in the "some other calculations" section of your macro, there may be another way to approach the problem.

Are we same in assuming that the first "end;" should be "%end;"

Good luck.

Super User
Super User
Posts: 6,502

Re: data step: generating many subsets

You should explain why you feel a need to make 250,000 individual data sets. 

Why not just use BSTR as a BY variable in whatever processing you are doing?

I doubt that SAS would be able to generate 30,000 datasets in a single pass this way.

If you move the loop to outside the data step then it could work.  Create an index on BSTR for the input dataset to make it efficient.

%do i=1 %to 30000;

data inputdate&i;

   set inputdata;

   where BSTR=&i ;

run;

%end;

Contributor
Posts: 34

data step: generating many subsets

Well, in this case I will read the file (1.5 Gb) in the set instruction 30000 times. Correct? I am trying to minimize I/O.

Respected Advisor
Posts: 3,124

data step: generating many subsets

Well, if you want to stick to your original design, I think you already have a pretty good idea on the up-limit to your hardware. So you can definitely use that info to break up your one task into many and finish it up.

I am also curious about the reason you need so many tables. if you don't mind sharing the whole picture, maybe someone on the forum can figure out an alternative way to deal with it without producing 30,000 tables.

Regards,

Haikuo

Contributor
Posts: 34

data step: generating many subsets

I am calculating a variance for life expectancy using the bootstrap methodology. BSTR is a bootstrap number. Later in the program I include another macro which builds a life table. That macro does not process data by By variable. Therefore I need to subset the original dataset and input the data into the submacro. At the end of the program I collect all 250000 values for life expectancy and calculate variance. The alternative way is to modify a submacro by adding a by instruction. I am trying to avoid this.

When my test code will stop running, I will send you the whole program. May be someone in the forum can suggest the alternative way.

Thank you!

Contributor
Posts: 34

data step: generating many subsets

OK. This is what I am doing:

I am going to try Index this time...

Any other ideas? Any input is greatly appriciated. Thank you for your help!!

**************************************************;

* macro for LT calculations;

***************************************************;

%macro LE_HALE_estimation (input_data=,

b=, output_data=, mortrates=, HUI=);

proc sort data=&input_data;

by BSRP;

run;

%include '…\LifeTable20Plus_rates.sas';

PROC SQL;

     DROP TABLE &output_data;

   QUIT;

data %do i=1 %to &b; inputdata&i %end;;

set &input_data;

%do i=1 %to &b;

if BSRP=&i then output inputdata&i;

%end;

run;

%do i=1 %to &b;

%LifeTable20Plus_rates(

            input_mortality=inputdata&i,

        input_population=inputdata&i,

            output_liftable=&output_data._&i,

            chiang_ax=Chiang20PlusConstantsBySex (WHERE=(Sex IN ('F','M'))),

            age_group=AgeGroup,

            sex=Sex,

            mortrates=&mortrates,

            HUI=&HUI,

            print=);

data lifetable&i;

set &output_data._&i (keep=sex agegroup ex exa);

rename ex=ex&i exa=exa&i;

run;

proc sort data=lifetable&i;

by sex agegroup;

run;

PROC SQL;

     DROP TABLE &output_data._&i;

   QUIT;

%end;

data &output_data;

merge lifetable1-lifetable&b;

by sex agegroup;

run;

PROC DATASETS LIBRARY=WORK NOLIST NOWARN;

   DELETE lifetable1-lifetable&b;

quit;

data &output_data;

set &output_data;

LEmean=mean(of ex1-ex&b);

LEvar=var(of ex1-ex&b);

HALEmean=mean(of exa1-exa&b);

HALEvar=var(of exa1-exa&b);

run;

%mend LE_HALE_estimation;

/*generate b number of LTs*/

%LE_HALE_estimation (input_data=new.resultsForVar,

b=25000, output_data=new.LT_NormalWgt,mortrates=ratenormal,HUI=HUInormal);

Wneh I run this code for b=25---noproblem.

when b=25000 or 30000. I have an error; (I can see my log now)

ERROR: The SAS System stopped processing this step because of insufficient memory.
WARNING: The data set WORK.INPUTDATA1 may be incomplete.  When this step was stopped there were 0 observations and 28 variables.

_________________________________________________

Super User
Posts: 17,842

data step: generating many subsets

A suggestion:

Why not change the order of operations in your macro so that you create your subset, run the macro code and then do your other step rather than create and store all 30,000 in one go.

Another option is you've used a where clause in setting your data with the sex=M/F why not use that type of logic for your dataset with a where clause with a loop in a diff location.  Indexing the file will allow the where clause to be fairly efficient.

  input_mortality=&inputdata (where=BSRP=&i)),

        input_population=inputdata&i,

            output_liftable=&output_data._&i,

            chiang_ax=Chiang20PlusConstantsBySex (WHERE=(Sex IN ('F','M'))),

            age_group=AgeGroup,

            sex=Sex,

            mortrates=&mortrates,

            HUI=&HUI,

            print=

Contributor
Posts: 34

data step: generating many subsets

It could work. Thanks. will try.

Solution
‎02-15-2012 02:03 PM
Super User
Posts: 5,085

data step: generating many subsets

Lida,

While all this is likely to work, you will still encounter similar problems in the bottom half of your program.  You will encounter resource issues when you try to merge 250,000 data sets.

A better approach would collect the output data sets vertically instead of horizontally.  For example:

data lifetable;

   set &output_table._&i (keep=sex agegroup ex exa);

   /* Extra, but possibly not needed:  BSTR = &i;  */

run;

proc append data=lifetable base=&output_data;

run;

/* PROC SQL as above, to drop &output_table._&i  */

%end;  /* All loop iterations are complete */

proc summary data=&output_data nway;

   var ex exa;

   class sex agegroup;

   output out=&output_data (drop=_freq_ _type_)

             mean = LEmean HALEmean

             var = LEvar HALEvar;

run;

It would be up to you if you wanted to replace &output_data with the summarized version, or if you wanted to keep all the detail from each bootstrap sample and use a separate data set to hold the summary.  But don't try to merge 250,000 data sets and be wary of creating a data set with 1M variables when you have an alternative.

It looks like you already have PROC SQL dropping any earlier version of &output_data, which is a good thing to do.

Good luck.

☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 20 replies
  • 372 views
  • 6 likes
  • 6 in conversation