BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
AllSoEasy
Obsidian | Level 7

Hello all,

I have a tricky task to inquire about:

I have a regression model with 14 variables, and for various products I need to run every possible form of the model -- each possible form being every possible permutation of the model variables (because some products will be better fit by a model with some subset of the 14 variables). I obtain parameter estimates on the model using "proc model," my issue now is that I essentially need to have/execute 14! (87,178,291,200) different proc model statements. Obviously this will be impossible for me to manually code by hand. I was wondering if there is some possible way to use macro loops to dynamically build the model that needs to be run, and then loop through all combinations. For example, if I have a macro function for each variable in the model, named accordingly, and then within the macro loops I build out the current model to be run by essentially concatenating the names of the macro functions according to what iteration of the loop (it would be 14 nested loops) I am in, and then dynamically run the proc model statement for the respective model.

I hope I explained this well enough, please let me know if I need to better explain what it is that I am trying to do. If you understand what I am trying to do then please let me know if you have any ideas/suggestions at all, or if you believe that is just not within the capabilities of SAS code.

Thank you all!

-Ryan

1 ACCEPTED SOLUTION

Accepted Solutions
Vince28_Statcan
Quartz | Level 8


Hi AllSoEasy,

I should probably have provided tips rather than hard code it for you but anyway the program I joined should do what you want. You will need to remove the (obs=100) as a dataset option in the proc sql; (I needed to test if output was correct) and use %workdone instead of %workdone2(). Workdone2 macro was only adapted to output what is parsed by workdone to see that there was no syntax error. I hope I didn't miss out on special cases with + symbols when all variables in a series are not in the model.

If workdone() crashes, I recommend using workdone2() to generate a file with the entire code generated by the macro and then to break it down in a few jobs (10, 50, 100 maybe?) and just run each job instead of running all 2^14 proc models at once.

Vincent

*edit in case its not obvious with the comments in the code. the parms_#_############## datasets output by the macro allow you to track what variables were in the model. The first # indicates the number of variables in the model. Then the last 14 # represent wheter variables are included (1) or not (0) in the order that they appeared in your example (so B1,...B8,A1,...A6)

View solution in original post

22 REPLIES 22
Reeza
Super User

Can you show your code for one model, with the output captured that you'd need for that model? For example, are you interested in the parameters plus the p-values or any model fit statistics?

I don't recommend trying to loop through all 87+million output Smiley Happy.

Once you have the single model built, post it here and we can help you turn it into a macro. It's relatively straightforwards, but it will take a while to run.

AllSoEasy
Obsidian | Level 7

Reeza,

This is my proc model statement for the full model form with all variables (some variables don't need to be permuted i.e. B9, Gamma, A0 are always in the model):

proc model data=to_forecast_parms outparms=parms_all_Asymm;

  dif_&term_rate. =B1*dif_10YrTr_up + B2*dif_10YrTr_down +  B3*lag_dif_10YrTr_up + B4*lag_dif_10YrTr_down +

  B5*dif_3MoTr_up + B6*dif_3MoTr_down + B7*lag_dif_3MoTr_up + B8*lag_dif_3MoTr_down + B9*lag_dif_&term_rate. +

  Gamma*(A0 + A1*lag_3MoTr + A2*lag_10YrTr + A3*(lag_3MoTr*lag_3MoTr) + A4*(lag_10YrTr*lag_10YrTr) +

  A5*(lag_3MoTr*lag_3MoTr*lag_3MoTr) + A6*(lag_10YrTr*lag_10YrTr*lag_10YrTr) - lag_&term_rate.);

  fit dif_&term_rate. / dw=1 collin white;

run;

It's already within a macro where I just hard-coded some of the most significant model forms, "&term_rate." is the macro function parameter which is the variable being modeled in each pass through of the model forms.

B1,B2,B3,B4,B5,B6,B7.B8.A1,A2,A3,A4,A5,A6  are the coefficients I need permute each combination of, and the only output I require are the parameter estimates,p-values, R-square/Adj R-square, RMSE, and the value for white's heteroscedasticity test if possible.

And yes, I expect it to take a very long time to run, but this is fine for my purposes. As long as it runs in under 24 hours it's fine, even a bit longer won't hurt.

I look forward to seeing your suggestion. Thank you!

Vince28_Statcan
Quartz | Level 8

hold on, when you mean permuting variables, you mean either they are in or they are out but there isn't a difference in the output from proc model if you just swap the order the variables show up right? I'm sorry if this is an idiotic question, I haven't had to use proc model yet but I'm definitely comfortable with programming the macro regardless.

The reason why I ask is obviously that you have significantly fewer than 14! cases to do if it's either in or out. you go down to 2^14 and this shouldn't be an issue for computing time

Vince28_Statcan
Quartz | Level 8

It is possible to program via macros but unless you have a huge server dedicated or an extremely small model (like where your basic single occurence takes sub 1s), it won't ever end. You would need to give us at very least your basic proc model statement with the 14 variable names that you wish to swap around and an example of how you want the output to be saved.

AllSoEasy
Obsidian | Level 7

Sorry if I wasn't clear enough about the output, as long as I can grab the results from the outparms= dataset, and at least the R-square value and as I go through the macros (or even after the fact as long as the datasets for each model are uniquely named) append the results to a master dataset where each observation of the dataset is the parameters estimates and the r-square value

i.e. the master dataset may have column headers such as:

b1, b2, b3, b4, b5, b6, b7, b8, b9, gamma, a0, a1, a2, a3, a4, a5, a6, R-square

with a row containing the values for each model permutation (the model forms that don't include certain coefficients will just have that coefficient with a value of '0' in the file).

Does this make sense? Forgive me if I am not being clear.

Thanks !!

Reeza
Super User

I don't understand how you'd want to permutate the model, mostly because I don't understand proc model, so showing a few variations of the model would help.

You really don't want to keep that many datasets hanging around, its better to build your table as part of the process, so make sure you know what output you want to capture ahead of time.

In fact, usually, I'd turn off the ODS destination so no output was generated because that would take time and resources as well.

Will this info be sufficient?

Model b1, b2, b3, b4, b5, b6, b7, b8, b9, gamma, a0, a1, a2, a3, a4, a5, a6, R-square

AllSoEasy
Obsidian | Level 7

@Vince28: Sorry, I mispoke I suppose, you are correct, the order of the variables does not matter, by permutations I really meant including the variable or not, which is much less than 14!, but is still far too many to manually code by hand.

-As for the model forms, basically written out my full model is this:

term_equation.png

"10YrTr" and "3MoTr" being U.S. treasury interest rates as my two independent variables, and superscript +/- is for a positive change verse a negative change in these variables from the previous obseration (t-1) to current (t). Time "t-1" represents a lag of one month, time "t" represents the current month. "R" is the dependent Y variable which could be a number of interest rates that I amt rying to model.

Permutations include removing some of these terms that may not reflect behaviors significant for modeling a certain rate.

Variations of the model would be like for example:

/*ALL Variables: */

proc model data=to_forecast_parms outparms=parms_all_Asymm;

  dif_&term_rate. =B1*dif_10YrTr_up + B2*dif_10YrTr_down +  B3*lag_dif_10YrTr_up + B4*lag_dif_10YrTr_down +

  B5*dif_3MoTr_up + B6*dif_3MoTr_down + B7*lag_dif_3MoTr_up + B8*lag_dif_3MoTr_down + B9*lag_dif_&term_rate. +

  Gamma*(A0 + A1*lag_3MoTr + A2*lag_10YrTr + A3*(lag_3MoTr*lag_3MoTr) + A4*(lag_10YrTr*lag_10YrTr) +

  A5*(lag_3MoTr*lag_3MoTr*lag_3MoTr) + A6*(lag_10YrTr*lag_10YrTr*lag_10YrTr) - lag_&term_rate.);

  fit dif_&term_rate. / dw=1 collin white;

run;

/*No Instantaneous 10YrTr: */

proc model data=to_forecast_parms outparms=parms_all_Asymm;

  dif_&term_rate. = B3*lag_dif_10YrTr_up + B4*lag_dif_10YrTr_down +

  B5*dif_3MoTr_up + B6*dif_3MoTr_down + B7*lag_dif_3MoTr_up + B8*lag_dif_3MoTr_down + B9*lag_dif_&term_rate. +

  Gamma*(A0 + A1*lag_3MoTr + A2*lag_10YrTr + A3*(lag_3MoTr*lag_3MoTr) + A4*(lag_10YrTr*lag_10YrTr) +

  A5*(lag_3MoTr*lag_3MoTr*lag_3MoTr) + A6*(lag_10YrTr*lag_10YrTr*lag_10YrTr) - lag_&term_rate.);

  fit dif_&term_rate. / dw=1 collin white;

run;

/*No Instantaneous 3MoTr */

proc model data=to_forecast_parms outparms=parms_all_Asymm;

  dif_&term_rate. = B3*lag_dif_10YrTr_up + B4*lag_dif_10YrTr_down +

  B7*lag_dif_3MoTr_up + B8*lag_dif_3MoTr_down + B9*lag_dif_&term_rate. +

  Gamma*(A0 + A1*lag_3MoTr + A2*lag_10YrTr + A3*(lag_3MoTr*lag_3MoTr) + A4*(lag_10YrTr*lag_10YrTr) +

  A5*(lag_3MoTr*lag_3MoTr*lag_3MoTr) + A6*(lag_10YrTr*lag_10YrTr*lag_10YrTr) - lag_&term_rate.);

  fit dif_&term_rate. / dw=1 collin white;

run;

/*No Instantaneous or lagged 10YrTr */

proc model data=to_forecast_parms outparms=parms_all_Asymm;

  dif_&term_rate. = B7*lag_dif_3MoTr_up + B8*lag_dif_3MoTr_down + B9*lag_dif_&term_rate. +

  Gamma*(A0 + A1*lag_3MoTr + A2*lag_10YrTr + A3*(lag_3MoTr*lag_3MoTr) + A4*(lag_10YrTr*lag_10YrTr) +

  A5*(lag_3MoTr*lag_3MoTr*lag_3MoTr) + A6*(lag_10YrTr*lag_10YrTr*lag_10YrTr) - lag_&term_rate.);

  fit dif_&term_rate. / dw=1 collin white;

run;

/*No Instantaneous or lagged 10YrTr or mean reversion */

proc model data=to_forecast_parms outparms=parms_all_Asymm;

  dif_&term_rate. = B7*lag_dif_3MoTr_up + B8*lag_dif_3MoTr_down + B9*lag_dif_&term_rate.;

  fit dif_&term_rate. / dw=1 collin white;

run;

/*No Instantaneous 3MoTr or mean reversion */

proc model data=to_forecast_parms outparms=parms_all_Asymm;

  dif_&term_rate. = B3*lag_dif_10YrTr_up + B4*lag_dif_10YrTr_down +

  B7*lag_dif_3MoTr_up + B8*lag_dif_3MoTr_down + B9*lag_dif_&term_rate.;

  fit dif_&term_rate. / dw=1 collin white;

run;

___________________________________________________________________________

These are just a couple of examples, I am just excluding some behaviors in each of the forms that may  not affect a particular rate that I am trying to model.

@Reeza: "

Will this info be sufficient?

Model b1, b2, b3, b4, b5, b6, b7, b8, b9, gamma, a0, a1, a2, a3, a4, a5, a6, R-square"

-- I would like to add Adj-R-square, RMSE, and white's value for heterscedasicity test if possible - BUT the above is the minimum requirement, and should be sufficient if it is significantly harder to add the others.

I do agree with create the master dataset of all model estimates/statistics as the process goes, rather than keeping the datasets hanging around, anything I can do to optimize is beneficial.

Please let me know if I can provide any more information on what I am attempting.

AllSoEasy
Obsidian | Level 7

Also, notice how in my example proc model statement all of the "outparms=" statements have the same dataset name "parms_all_Asymm," -- this is just because I was typing the examples here on the forum rather than copy pasting all of my code; of course in the actual code the dataset names are unique and indicative of the model that they come from.


Thanks!

Vince28_Statcan
Quartz | Level 8


Hi AllSoEasy,

I should probably have provided tips rather than hard code it for you but anyway the program I joined should do what you want. You will need to remove the (obs=100) as a dataset option in the proc sql; (I needed to test if output was correct) and use %workdone instead of %workdone2(). Workdone2 macro was only adapted to output what is parsed by workdone to see that there was no syntax error. I hope I didn't miss out on special cases with + symbols when all variables in a series are not in the model.

If workdone() crashes, I recommend using workdone2() to generate a file with the entire code generated by the macro and then to break it down in a few jobs (10, 50, 100 maybe?) and just run each job instead of running all 2^14 proc models at once.

Vincent

*edit in case its not obvious with the comments in the code. the parms_#_############## datasets output by the macro allow you to track what variables were in the model. The first # indicates the number of variables in the model. Then the last 14 # represent wheter variables are included (1) or not (0) in the order that they appeared in your example (so B1,...B8,A1,...A6)

AllSoEasy
Obsidian | Level 7

Wow that's incredible. I didn't expect to be provided with working code like that! Many, many thanks to you Vince28, I really can' tell you how much I appreciate it. I am going to go over the code for a bit and try to understand everything you have done, and begin testing it with my data. I will be sure to post back if I have any questions.

But again this is such a huge help, I can't tell you how much I appreciate your efforts!!

AllSoEasy
Obsidian | Level 7

Hello again,

I'm having another issue -- this issue is in regards to this process being implemented by Vince28@Statcan's code (posted above), but isn't entirely isolated to this program because I've had this problem before when running processing-intensive SAS programs, especially with a great deal of nested looping.

Essentially the above code works, the piece of the code that generates all permutations of the proc model equation (2^14 permutations) runs quickly and is correct, I've verified that the permutations being executed in proc model in the "%workdone()" macro are syntactically correct. I've run portions of the program in pieces (i.e. after getting the data set of permutations, running only 100-200 of them in the model statement at a time). And everything works smoothly, however when I try to execute the loop in full (run 2^14 proc model statements), or even in batches of 1000s, my SAS program will freeze up at [seemingly] random points. The order of model permutations is the same every time, and so if it was one of the model permutations having a problem specifically, it would freeze/error out at the same point every time, and I am aggregating the output data from proc model into a SAS dataset, so I can see the last model ran every time it freezes. Sometimes it runs through 100 models, sometimes 500, sometimes 2000 before it freezes up, and it freezes on different processes (some judging off what EG says is currently running in the lower left-hand corner, sometimes it's stuck on "Running: proc datasets" or sometimes "Running: proc model" or sometimes "Running loop..." it just has no consistency in where the problem is occurring. There are no errors present in the SAS log, but sometimes SAS will display an error message that it has experienced a problem and must shut down - but not always. I know that it is frozen and not just processing because I can see the aggregated results file going from constantly increasing in size every second to being unchanged for hours, also I've let the program run for up 24 hours, and still be stuck at where it was at 10 minutes into execution.

As I stated my code is essentially Vince28's above posted code, however I added a bit of logic to aggregate the datasets while processing, and then using proc datasets statements to delete the temporary datasets to keep memory clean because ~60,000 temporary datasets would be created throughout the entire run. My modified code is still as follows:

/* step3 use all of this to throw it back into a series of proc model */

%macro workdone();

  %do i=1 %to &loopend;

  /* one proc per record */

  proc model data=to_forecast_parms outparms=parms_&&ind1_&i.._&&ind2_&i..;

  ods output "Nonlinear OLS Summary of Residual Errors" = stat_&&ind1_&i.._&&ind2_&i..;

  dif_&term_rate. = &&bvars&i B9*lag_dif_&term_rate. + gamma*(A0 &&avars&i - lag_&term_rate.) ;

  fit dif_&term_rate. / dw=1;

  run;

  data merge_data;

  merge stat_&&ind1_&i.._&&ind2_&i.. parms_&&ind1_&i.._&&ind2_&i..;

  Equation="dif_&term_rate. = &&bvars&i B9*lag_dif_&term_rate. + gamma*(A0 &&avars&i - lag_&term_rate.)";

  run;

  proc append base=forms.all_model_forms_data data=merge_data force; /*forms is a libname where my files are located */

  run;

  proc datasets gennum=all;

    delete stat_&&ind1_&i.._&&ind2_&i..;

  run;

  proc datasets gennum=all;

    delete parms_&&ind1_&i.._&&ind2_&i..;

  run;

  proc datasets gennum=all;

    delete merge_data;

  run;

  %end;

%mend;

Again, when running in batches I can execute any of the models at around 100 at a time at most, before I start to run into these freezing errors, there is no consistency on what model the error is occurring on, and there doesn't seem to be errors in the code. If anyone could help me out I would greatly appreciate it because I am truly at a loss..

Thanks!!

-Ryan

Astounding
PROC Star

It's anybody's guess as to where the resources are leaking ... so here is what I would try.

Get rid of all three PROC DATASETS.  Instead, just re-use the same data set name:

proc optmodel outparms=parms_out;

ods output ... = stats_out;

merge parms_out stats_out;

Even if it doesn't solve the bottleneck, the program will become easier to read.  It's not like these are permanent data sets where you need to distinguish one name from another.

Good luck.

Good luck.

Vince28_Statcan
Quartz | Level 8

Hi Ryan,

I truly don't know much about proc model but I would suspect the apparent lack of memory issue comes from one of two things:

Either your ODS destination is full from the line (or similarly out of log space and you'd need to use an option to put log in an external text file instead)

ods output "Nonlinear OLS Summary of Residual Errors" = stat_&&ind1_&i.._&&ind2_&i..;

If you did not specify ods listing close; prior to running the program, this might explain it.

Else, and this is why I say I don't know much about proc model, some functions generate catalog entries or other forms of behind the scenes metadata datasets that would pile up in your work environment and cause the out of memory issue.

Or at least that's all that comes to mind.

Good luck!

Vincent

AllSoEasy
Obsidian | Level 7

Vince28 & Astounding - Thank you both very much for your suggestions. I have attempted everything mentioned above and have been able to notice much of a difference --  still freezing. Again I appreciate your help a great deal, you've been more than helpful. I'm just so confused as to what SAS could be doing - my computer's processor is not even breaking a sweat, and there are over 2 GB's of RAM free, and SAS is just frozen, unable to continue executing, with no errors or any known reason for it ceasing to execute...I just don't understand it Smiley Sad

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 22 replies
  • 2637 views
  • 10 likes
  • 4 in conversation