I create the following simulated data:
%macro monteCarloSimulation();
  %let covariates=300; /* Number of covariates (independent variables) */
  %do mcno=1 %to 100; /* Number of simulated datasets = 100 */
    data logit_data;
      drop i j;
      array x{&covariates.} x1-x&covariates.;
      do i=1 to 1000;
        do j=1 to &covariates.;
          x{j}=ranuni(1);
        end;
        linpred=2+10*x17-8*x5+3*x2+7*x6-5*x3-12*x30+11*x130-12*x200+rand("NORMal");
        prob = exp(linpred)/ (1 + exp(linpred));
        y = (prob > 0.5);
        output;
      end;
      drop prob linpred;
    run;
    /* Here I would like to run stepwise forward regression and
       stepwise backward regression and store the corresponding
       AIC values to produce the table referenced below.
       This should be done for each table that I produce in the simulation.
       Note that 100 simulated tables are produced. */
  %end;
%mend monteCarloSimulation;
%monteCarloSimulation()
For each simulated dataset, I would like to calculate:
- AIC from a stepwise forward regression.
- AIC from a stepwise backward regression.
- If possible (I will read up on this later) AIC from a Lasso regression.
And then finally store the AIC values in a table of the format:
| | AIC_Forward_Stepwise_Regression | AIC_Backward_Stepwise_Regression | AIC_Lasso |
| Simulation1 | | | |
| Simulation2 | | | |
| . | | | |
| . | | | |
| . | | | |
| Simulation100 | | | |
Ideally, I would also like to finally produce some summary statistics for evaluating which model-selection scheme performs best:
| | Forward_Stepwise_Regression | Backward_Stepwise_Regression | Lasso |
| Mean AIC | | | |
| STD | | | |
| Median AIC | | | |
| 25% quantile | | | |
| 75% quantile | | | |
This would be easy to do in other programming languages, and I guess in SAS as well, but I am not used to doing statistical analysis in SAS (yet).
All help appreciated.
It's not clear to me what part of this process you are struggling with. Is it running regressions where you have the problem, or storing the AIC values, or creating the final table, or something else?
Step 1 in any macro writing process is to write working code with no macros and no macro variables, for one iteration. That's where you start. Show us that code that does stepwise regression on one iteration.
In addition to my comments above, @Rick_SAS has written blog posts about performing thousands of regressions, and no macros are needed. It's highly likely that this approach could be adapted to your Monte Carlo case (again with no macros). He may even have written a similar blog post about Monte Carlo simulations, but either way I'm sure there is no need for macros here.
Taking a further step back: as I understand it, the primary reason people run Monte Carlo simulations is to estimate the variability of estimators that have no closed-form formula for that variability. In your case, you seem to be running a Monte Carlo simulation with 300 covariates that are uncorrelated with each other. This corresponds to exactly zero real-world data sets. You will never find a real-world data set in which all the covariates are uncorrelated, or even only slightly correlated. Every real-world data set I know of has some correlations that are not close to zero, and some that are close to (or exactly equal to) ±1. So I question the value of such a Monte Carlo study; a more valuable study would use covariates with many correlations that are not near zero and possibly some that are near ±1. My advice is not to do this particular Monte Carlo study as you have it set up, unless it is a homework assignment.
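If you do want correlated covariates, here is a minimal sketch of one possible way to get them. It is an assumption on my part, not a prescription: every covariate shares a common normal factor, which gives each pair of covariates the same correlation `rho`. The data set name `CorrX`, the seed, and the value of `rho` are made up for illustration, and the covariates are normal here rather than uniform as in the question.

```sas
%let N   = 1000;   /* observations per sample (from the question) */
%let p   = 300;    /* number of covariates (from the question) */
%let rho = 0.5;    /* hypothetical common correlation, 0 <= rho < 1 */

data CorrX;
  call streaminit(2024);                 /* arbitrary seed for reproducibility */
  array x{&p} x1-x&p;
  do i = 1 to &N;
    z = rand("Normal");                  /* factor shared by all covariates */
    do j = 1 to &p;
      /* Var(x[j]) = rho + (1 - rho) = 1 and Cov(x[j], x[k]) = rho for j ne k */
      x{j} = sqrt(&rho)*z + sqrt(1 - &rho)*rand("Normal");
    end;
    output;
  end;
  drop i j z;
run;

/* Spot-check the sample correlations of the first few covariates */
proc corr data=CorrX nosimple;
  var x1-x5;
run;
```

A single shared factor is the simplest structure I can think of. Correlations near ±1, or blocks of related covariates, would need a different construction.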
The basic outline for this kind of simulation follows:
1. If you know how to use the DATA step to simulate one sample of size N from a logistic model, then put a DO loop around the outside so that you generate B samples, each of size N.
For an example of a linear model, see "Simulate many samples from a linear regression model." For a logistic model, see the ideas in this post, although the actual simulation in that post uses PROC IML.
2. Turn off ODS and use a BY-group analysis to analyze all B samples by using one call to a procedure.
3. Use PROC MEANS or UNIVARIATE to analyze the distribution of the statistic (such as AIC) that you are studying.
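As a rough sketch, the three steps might look like the code below for the forward and backward selections. I am assuming a few things here: the data set names (`Sim`, `FitFwd`, `FitBwd`, `AICFwd`, `AICBwd`, `AIC`), the seed, and the 0.05 entry/stay levels are all made up for illustration, and I am relying on the `FitStatistics` ODS table having `Criterion` and `InterceptAndCovariates` columns, which you should confirm with PROC CONTENTS on your own output. The binary response is drawn from a Bernoulli distribution.

```sas
%let B = 100;     /* number of simulated samples */
%let N = 1000;    /* observations per sample */
%let p = 300;     /* number of covariates */

/* Step 1: simulate all B samples in one DATA step, indexed by SampleID */
data Sim;
  call streaminit(12345);              /* arbitrary seed for reproducibility */
  array x{&p} x1-x&p;
  do SampleID = 1 to &B;
    do i = 1 to &N;
      do j = 1 to &p;
        x{j} = rand("Uniform");
      end;
      eta = 2 + 10*x17 - 8*x5 + 3*x2 + 7*x6 - 5*x3 - 12*x30 + 11*x130 - 12*x200;
      mu  = logistic(eta);             /* inverse logit */
      y   = rand("Bernoulli", mu);     /* binary response */
      output;
    end;
  end;
  drop i j eta mu;
run;

/* Step 2: turn off ODS display, then one procedure call per selection method */
ods graphics off;
ods exclude all;
ods noresults;

proc logistic data=Sim;
  by SampleID;
  model y(event='1') = x1-x&p / selection=forward slentry=0.05;
  ods output FitStatistics=FitFwd;
run;

proc logistic data=Sim;
  by SampleID;
  model y(event='1') = x1-x&p / selection=backward slstay=0.05;
  ods output FitStatistics=FitBwd;
run;

ods exclude none;
ods results;

/* Keep the AIC of the last fitted step (the selected model) in each sample */
data AICFwd;
  set FitFwd(where=(Criterion="AIC"));
  by SampleID;
  if last.SampleID;
  AIC_Forward = InterceptAndCovariates;
  keep SampleID AIC_Forward;
run;

data AICBwd;
  set FitBwd(where=(Criterion="AIC"));
  by SampleID;
  if last.SampleID;
  AIC_Backward = InterceptAndCovariates;
  keep SampleID AIC_Backward;
run;

data AIC;                              /* one row per simulated sample */
  merge AICFwd AICBwd;
  by SampleID;
run;

/* Step 3: summarize the distribution of AIC across the samples */
proc means data=AIC mean std median q1 q3;
  var AIC_Forward AIC_Backward;
run;
```

With 300 candidate covariates the selections can take a while, so it is worth trying a small `B` first. The lasso column is left out because I am not certain which procedure you would use for it.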
I would like to point out that your simulation from a logistic model is not correct. You put the "randomness" in the wrong location. Instead of
linpred = <linear combination> + rand("NORMal");
prob = exp(linpred)/ (1 + exp(linpred));
y = (prob > 0.5);
the correct formula is
linpred = <linear combination>;   /* linear predictor */
mu = logistic(linpred);           /* transform by inverse logit */
y = rand("Bernoulli", mu);        /* simulate binary response */
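One way to convince yourself that this recipe behaves as intended is to try it on a small model whose coefficients you know. The sketch below is only an illustration: the two-covariate model, its coefficients (2, -8, 3), the sample size, the seed, and the data set name `OneSample` are all hypothetical.

```sas
/* Simulate one large sample from a logistic model with known coefficients */
data OneSample;
  call streaminit(1);                  /* arbitrary seed for reproducibility */
  do i = 1 to 10000;
    x1 = rand("Uniform");
    x2 = rand("Uniform");
    eta = 2 - 8*x1 + 3*x2;             /* linear predictor, no added noise */
    mu  = logistic(eta);               /* transform by inverse logit */
    y   = rand("Bernoulli", mu);       /* simulate binary response */
    output;
  end;
  keep y x1 x2;
run;

/* Fit the same model and compare the estimates with the true values */
proc logistic data=OneSample;
  model y(event='1') = x1 x2;
run;
```

With a sample this large, the parameter estimates should land reasonably close to the values used in the simulation. If they do not, that is a sign the data-generating step is wrong.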