SasStatistics
Pyrite | Level 9

I create the following simulated data:

%macro monteCarloSimulation();

	%let covariates=300; /* Number of covariates (independent variables) */

	%do mcno=1 %to 100;   /* Number of simulated datasets = 100 */
		data logit_data;
		drop i j;
		array x{&covariates.} x1-x&covariates.;
		do i=1 to 1000;
		do j=1 to &covariates.;
		x{j}=ranuni(1);
		end;
		linpred=2+10*x17-8*x5+3*x2+7*x6-5*x3-12*x30+11*x130-12*x200+rand("NORMal");
		prob = exp(linpred)/ (1 + exp(linpred));
		y = (prob > 0.5);
		output;
		end;
		drop prob linpred;
		run;

		/* Here I would like to run stepwise forward regression
		and stepwise backward regression and store the corresponding AIC
		values to produce the table referenced below.
		This should be done for each table that I produce in the simulation.
		Note that 100 simulated tables are produced. */


	%end;

%mend monteCarloSimulation;

%monteCarloSimulation() 

For each simulated dataset, I would like to calculate:
- AIC from a stepwise forward regression. 
- AIC from a stepwise backward regression. 

- If possible (I will read up on this later) AIC from a Lasso regression. 

And then finally store the AIC values in a table of the format: 

                AIC_Forward_Stepwise_Regression   AIC_Backward_Stepwise_Regression   AIC_Lasso
Simulation1
Simulation2
.
.
.
Simulation100


Ideally, I would also like to finally produce some summary statistics for evaluating which model-selection scheme performs best: 

                Forward_Stepwise_Regression   Backward_Stepwise_Regression   Lasso
Mean AIC
STD
Median AIC
25% quantile
75% quantile


This would be easy to do in other programming languages, and I guess in SAS as well, but I am not used to doing statistical analysis in SAS (yet).

All help appreciated. 

5 REPLIES
PaigeMiller
Diamond | Level 26

It's not clear to me which part of this process you are struggling with. Is the problem running the regressions, storing the AIC values, creating the final table, or something else?

--
Paige Miller
SasStatistics
Pyrite | Level 9
1. Running regression.
2. Store the AIC values.
3. Creating the final table.

I have very little experience doing this in SAS.
PaigeMiller
Diamond | Level 26

Step 1 in any macro writing process is to write working code with no macros and no macro variables, for one iteration. That's where you start. Show us that code that does stepwise regression on one iteration.

--
Paige Miller
PaigeMiller
Diamond | Level 26

In addition to my comments above, @Rick_SAS has written blog posts about performing thousands of regressions with no macros. It's highly likely that this approach could be adapted to your Monte Carlo case (again, with no macros). He may even have written a similar blog post for Monte Carlo simulations; either way, I'm sure there is no need for macros here.

Taking a further step back: as I understand it, the primary reason people run Monte Carlo simulations is to estimate the variability of estimators that have no closed-form formula for that variability. In your case, you seem to be running a Monte Carlo simulation with 300 covariates that are uncorrelated with each other. This corresponds to exactly zero real-world data sets: you will never find a real-world data set where the covariates are uncorrelated (or even only slightly correlated). Every real-world data set I know of has some correlations that are not close to zero, and some that are close to (or exactly equal to) ±1. So I question the value of such a Monte Carlo study; a more valuable study would use covariates with many correlations that are not near zero and possibly some that are near ±1. My advice is not to do this particular Monte Carlo study as you have it set up, unless it is a homework assignment.

--
Paige Miller
Rick_SAS
SAS Super FREQ

The basic outline for this kind of simulation follows:

1. If you know how to use the DATA step to simulate one sample of size N from a logistic model, then put a DO loop around the outside so that you generate B samples, each of size N.

For an example of a linear model, see "Simulate many samples from a linear regression model." For a logistic model, see the ideas in this post, although the actual simulation in that post uses PROC IML.

2. Turn off ODS and use a BY-group analysis to analyze all B samples by using one call to a procedure.

3. Use PROC MEANS or UNIVARIATE to analyze the distribution of the statistic (such as AIC) that you are studying.
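
As a rough illustration of the three steps above, here is a minimal sketch. It is deliberately small (20 samples, 200 observations, 10 covariates) so that it runs quickly. The data set names (Sim, FitFwd, FitBwd, AIC_All), the BY variable SampleID, the seed, the coefficients, and the SLENTRY/SLSTAY values are all illustrative assumptions and are not taken from the original post. The sketch also assumes that the FitStatistics ODS table from PROC LOGISTIC contains the columns Criterion and InterceptAndCovariates; run PROC CONTENTS on the output data set to confirm this in your release.

```sas
/* Step 1: simulate B=20 samples, each of size N=200, in one DATA step */
data Sim;
   call streaminit(54321);                  /* illustrative seed */
   array x{10} x1-x10;
   do SampleID = 1 to 20;                   /* BY variable for the analysis */
      do i = 1 to 200;
         do j = 1 to 10;
            x{j} = rand("Uniform");
         end;
         linpred = -1 + 3*x2 - 4*x5 + 2*x7; /* illustrative coefficients */
         y = rand("Bernoulli", logistic(linpred));
         output;
      end;
   end;
   drop i j linpred;
run;

/* Step 2: suppress displayed output, then one procedure call per method */
ods graphics off;
ods exclude all;

proc logistic data=Sim;
   by SampleID;
   model y(event='1') = x1-x10 / selection=forward slentry=0.05;
   ods output FitStatistics=FitFwd;
run;

proc logistic data=Sim;
   by SampleID;
   model y(event='1') = x1-x10 / selection=backward slstay=0.05;
   ods output FitStatistics=FitBwd;
run;

ods exclude none;

/* Keep the AIC of the final model: the last AIC row within each sample */
data AIC_Fwd(keep=SampleID AIC_Forward);
   set FitFwd(where=(Criterion='AIC'));
   by SampleID;
   if last.SampleID;
   AIC_Forward = InterceptAndCovariates;
run;

data AIC_Bwd(keep=SampleID AIC_Backward);
   set FitBwd(where=(Criterion='AIC'));
   by SampleID;
   if last.SampleID;
   AIC_Backward = InterceptAndCovariates;
run;

/* One row per sample, one column per selection method */
data AIC_All;
   merge AIC_Fwd AIC_Bwd;
   by SampleID;
run;

/* Step 3: summarize the distribution of AIC across samples */
proc means data=AIC_All mean std median q1 q3;
   var AIC_Forward AIC_Backward;
run;
```

If forward selection enters no effects for some sample, that sample may contribute no AIC row, so it is worth checking that AIC_All has one complete row per SampleID before summarizing.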

 

I would like to point out that your simulation from a logistic model is not correct. You put the "randomness" in the wrong location. Instead of

linpred = <linear combination> + rand("NORMal");
prob = exp(linpred)/ (1 + exp(linpred));
y = (prob > 0.5);

the correct formula is

linpred = <linear combination>;     /* 2. linear model */
mu = logistic(linpred);             /* 3. transform by inverse logit */
y = rand("Bernoulli", mu);          /* 4. Simulate binary response */
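
Put together with the outer DO loop from step 1, a corrected simulation might look like the sketch below. The coefficients and the sizes (1000 observations, 300 covariates, 100 samples) are copied from the original post; the data set name Sim, the BY variable SampleID, the macro variable names, and the seed are assumptions made for illustration.

```sas
%let N = 1000;   /* observations per sample */
%let B = 100;    /* number of simulated samples */
%let p = 300;    /* number of covariates */

data Sim;
   call streaminit(12345);              /* set the seed once for all samples */
   array x{&p} x1-x&p;
   do SampleID = 1 to &B;               /* identifies each simulated sample */
      do i = 1 to &N;
         do j = 1 to &p;
            x{j} = rand("Uniform");     /* 1. independent U(0,1) covariates */
         end;
         /* 2. linear predictor, with no error term added */
         linpred = 2 + 10*x17 - 8*x5 + 3*x2 + 7*x6 - 5*x3
                     - 12*x30 + 11*x130 - 12*x200;
         mu = logistic(linpred);        /* 3. inverse logit gives P(y=1) */
         y = rand("Bernoulli", mu);     /* 4. the randomness enters here */
         output;
      end;
   end;
   drop i j linpred mu;
run;
```

Because the seed is set once and all samples are written to one data set, each sample is a different random draw, and the result is already sorted by SampleID for a BY-group analysis.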

 
