john1111
Obsidian | Level 7

I was reading a paper on Bootstrapping with models for count data. (Page 1170) www.researchgate.net/publication/51738951_Bootstrapping_with_Models_for_Count_Data

 

The author then says that...
...data similar to the observed data were simulated with the expected values of counts given by Eq. (4) and with the Poisson error inflated by the factor 2.75 using the zero inflated count model where an observed count is either the value zero with probability p or a random value from a Poisson distribution with probability 1 − p.
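
The zero-inflated mechanism described in that quote might be sketched in a DATA step as follows. This is only an illustration, not the paper's code: the values of p and lambda are assumptions, and dividing lambda by (1 - p) is one way to keep the expected count equal to lambda. How the paper calibrates the inflation factor of 2.75 is not shown here.

```sas
/* Zero-inflated Poisson draws: Y = 0 with probability p, else Poisson(mu) */
data ZIPDemo(keep=Y);
   call streaminit(54321);
   p      = 0.3;               /* assumed zero-inflation probability */
   lambda = 2;                 /* assumed target expected count      */
   mu     = lambda / (1 - p);  /* so that E[Y] = (1-p)*mu = lambda   */
   do i = 1 to 10000;
      if rand("Bernoulli", p) then Y = 0;
      else Y = rand("Poisson", mu);
      output;
   end;
run;

/* Compare the sample mean and variance of the simulated counts */
proc means data=ZIPDemo mean var;
   var Y;
run;
```

Under this construction the variance-to-mean ratio is 1 + p*mu, so the sample variance should come out larger than the sample mean.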

 

I am trying to reproduce this, so the first thing I did was fit the model to obtain the coefficients for Eq. (4) above. (Data attached)

 

proc genmod data = WORK.swim;
  class Swimmer(Ref="Occas") Location (ref="NonBeach") Age(ref="15-19") Sex(ref="Female")/param=ref;
  model Infections = Swimmer Location Age Sex / dist=poisson;
run;

 

 

Using the coefficients, I'm then trying to simulate data similar to the observed data, but I can't figure out how to simulate the independent variables, especially because I can't determine their distribution from the attached data.

 

 

 

 

 

%let N = 287;        
%let nCont = 5;    
 
data SimReg1(keep= Y x:);
call streaminit(54321);              
array x[&nCont];        
 
array beta[0:&nCont] _temporary_ (1.0234 -0.6115 -0.5345 -0.3744 -0.1897 -0.0899);
      
 
do i = 1 to &N; 
   /* x[1] = ...
      x[2] = ...      How do I distribute the independent variables so that when I re-run
      x[3] = ...      the model on the new data I get the same coefficients as above?
      x[4] = ...  */
 
   eta = beta[0];                       
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];       
   end;
   lambda=exp(eta);
   Y = rand("Poisson",lambda) ;                 
   output;
end;
run;

 

How do I distribute the independent variables so that when I re-run the model I get the same coefficients?

 

 

 

 

 

Accepted Solutions
Rick_SAS
SAS Super FREQ

It looks like you might have already looked at the blog posts "Simulate data for a linear regression model" and/or "Simulating data for a logistic model."  If I understand your question, the answer is that you need to use the WORK.swim data set to provide the design matrix for the explanatory variables. It sounds like the researcher simulated the RESPONSE variables multiple times for a FIXED set of explanatory variables. This is the usual thing to do in fixed-effect models.

 

 

If you have my book Simulating Data with SAS, there are examples in Section 11.3 (pp. 202-204). Basically, you use the SET statement to read the data and rely on the implicit DATA step loop (instead of DO i = 1 TO &N) to iterate over the observations. It is often convenient to use an ARRAY statement to read the explanatory variables into an array.
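
A minimal sketch of that pattern is shown here; it is not the code from the book. It assumes the explanatory variables are already available as numeric columns x1-x5 in a data set called SwimX (both names are hypothetical; the character variables in WORK.swim would need dummy coding first). The coefficients are copied from the beta array in the question.

```sas
%let NumSamples = 1000;

data PoisSim(keep=SampleID ObsNum x1-x5 Y);
   call streaminit(54321);
   set SwimX;                        /* hypothetical: one row per observed case */
   array x[5] x1-x5;                 /* hypothetical numeric design columns     */
   array beta[0:5] _temporary_ (1.0234 -0.6115 -0.5345 -0.3744 -0.1897 -0.0899);
   ObsNum = _N_;                     /* identifies the row of the fixed design  */
   eta = beta[0];                    /* linear predictor for this row           */
   do j = 1 to dim(x);
      eta = eta + beta[j] * x[j];
   end;
   lambda = exp(eta);                /* inverse of the log link                 */
   do SampleID = 1 to &NumSamples;   /* new responses, same explanatory values  */
      Y = rand("Poisson", lambda);
      output;
   end;
run;

/* Arrange the output so each simulated sample can be analyzed with BY SampleID */
proc sort data=PoisSim;
   by SampleID ObsNum;
run;
```

Each value of SampleID then identifies one simulated data set that shares the observed explanatory values.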

 

A complete example also appears in Tip #9 of "Ten Tips for Simulating Data with SAS" (Wicklin 2015, pp. 9-11). The example in that paper is for a linear regression, but the flow of control for the simulation is the same for generalized linear models. Instead of using the Explanatory data set shown in the paper, use WORK.swim.

john1111
Obsidian | Level 7

Just got your book and it's on point and very clear, thanks a lot. But there is one thing I don't understand:

 

Suppose I fit a model using two different techniques (as in the paper cited above), i.e., bootstrapping and ordinary MLE, and get different standard errors for the coefficient estimators. I then simulate data similar to the observed data using the MLE estimates (as is done in your book).

 

The author of the paper uses these simulated results to compare the two techniques:

 

"...It was found that the standard errors from the MLE analysis were on average about 5% too low, the standard errors from bootstrap resampling of cases were on average about 3% too high..."

 

My question is this, does it mean that this simulated data is assumed to be the real data so that we can check how it is deviating when analyzed using different techniques?

Rick_SAS
SAS Super FREQ

> My question is this, does it mean that this simulated data is assumed to be the real data so that we can check how it is deviating when analyzed using different techniques?

 

Yes. That is almost always the assumption of simulation studies. We draw many samples from (a model of) a population as a way to "repeat" the data collection scheme. We then analyze the simulated samples as if they were real data.

 

I wouldn't worry too much about the small differences (5% or 3%) between various estimates of the standard error.

I hesitate to criticize a paper that I have not studied carefully, but bear in mind that the standard error is not a property of the population (model); it is a property of the estimation method, and the author cannot know the "true" value of the standard error.

 

The reference value in the paper is the Monte Carlo estimate from 1000 simulated data sets. This is the standard deviation of the 1000 regression estimates. The standard deviation is itself a statistic that depends on the variance of the data, higher-order moments, and the number of Monte Carlo simulations. So the "baseline" on which he bases his results is not truth, but is an estimate that itself can vary by several percentage points.
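
A rough sketch of how such a Monte Carlo reference value could be computed follows. It assumes a data set named SimData (hypothetical) that is sorted by SampleID and contains the simulated response Y together with the original classification variables. The column names Parameter, Level1, Estimate, and StdErr are what I would expect in the ParameterEstimates table; it is worth confirming them with PROC CONTENTS.

```sas
/* Fit the same Poisson model to every simulated sample */
ods exclude all;                       /* suppress printed output for the many fits */
proc genmod data=SimData;              /* SimData: hypothetical, sorted by SampleID */
   by SampleID;
   class Swimmer(ref="Occas") Location(ref="NonBeach")
         Age(ref="15-19") Sex(ref="Female") / param=ref;
   model Y = Swimmer Location Age Sex / dist=poisson;
   ods output ParameterEstimates=PE;   /* one set of estimates per sample */
run;
ods exclude none;

/* Std of Estimate = Monte Carlo standard error of each coefficient;
   Mean of StdErr  = average model-based (MLE) standard error        */
proc means data=PE mean std missing;   /* MISSING keeps the Intercept row, whose Level1 is blank */
   where Parameter ne "Scale";
   class Parameter Level1;
   var Estimate StdErr;
run;
```

Comparing the standard deviation of Estimate with the mean of StdErr for each parameter gives the kind of "too low" or "too high" percentage that the paper reports.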

 

Thus a more conservative conclusion is that the parametric t-tests (which are based on an assumption of asymptotic normality of the regression estimates) produce a standard error on these data that is slightly smaller than the standard error from bootstrap resampling of cases.

john1111
Obsidian | Level 7
You are the best, thanks a lot
john1111
Obsidian | Level 7

Suppose my independent variables are categorical. How can I handle that in a DATA step, given that the CLASS statement does not work there? (See the commented line.)

 

 

%let NumSamples = 1000;
data RegSim(drop=eta);
   call streaminit(123);
   set Work.swim;
   ObsNum = _N_;
   /* class Swimmer(Ref="Occas") Location(ref="NonBeach") Age(ref="15-19") Sex(ref="Female") / param=ref; */
   /* The categorical variables below still need to be replaced by 0/1 dummy variables */
   eta = 1.0234 - (0.1665*Swimmer) - (0.5345*Location) - (0.3744*Age) - (0.1897*Age) - (0.0899*Sex);
   lambda = exp(eta);                  /* inverse of the log link */
   do SampleID = 1 to &NumSamples;
      Y = rand("Poisson", lambda);
      output;
   end;
run;

 

 

Rick_SAS
SAS Super FREQ

I would generate dummy variables by using one of several SAS procedures, such as PROC GLMMOD.  
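
As a sketch of the PROC GLMMOD route (the output data set names SwimDesign and SwimParm are made up for this example), something like the following should write the design matrix to a data set:

```sas
/* Write the design matrix for the model to a data set */
proc glmmod data=WORK.swim outdesign=SwimDesign outparm=SwimParm noprint;
   class Swimmer Location Age Sex;
   model Infections = Swimmer Location Age Sex;
run;

/* SwimParm maps each design column (Col1, Col2, ...) to an effect and level */
proc print data=SwimParm;
run;
```

Note that PROC GLMMOD uses the GLM parameterization, which creates one column for every level of each classification variable, including the reference levels. To match coefficients from a PARAM=REF fit, use only the columns for the non-reference levels (equivalently, give the reference-level columns a coefficient of zero).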

 

If you feel confident in your understanding of reference parameterization, you can generate the dummy values directly in the DATA step. This is more prone to error and the correct code depends on the parameterization:

 

eta = 1.0234 - 0.1665*(Swimmer="Freq")
             - 0.5345*(Location="Beach") - ... - 0.0899*(Sex="Male");
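
To show how the logical expressions behave, here is a small self-contained sketch on a made-up data set. It includes only the Swimmer and Sex terms, so the resulting eta is not the full linear predictor for the model; the data set and variable names Toy, ToyEta, d_Freq, and d_Male are hypothetical.

```sas
/* Made-up data: every combination of two of the classification variables */
data Toy;
   length Swimmer $5 Sex $6;
   input Swimmer $ Sex $;
   datalines;
Occas Female
Freq Female
Occas Male
Freq Male
;

data ToyEta;
   set Toy;
   d_Freq = (Swimmer = "Freq");   /* 1 if true, 0 if false (reference level "Occas")  */
   d_Male = (Sex = "Male");       /* 1 if true, 0 if false (reference level "Female") */
   eta    = 1.0234 - 0.1665*d_Freq - 0.0899*d_Male;   /* partial linear predictor */
   lambda = exp(eta);             /* inverse of the log link */
run;

proc print data=ToyEta;
run;
```

In SAS a comparison evaluates to 1 when true and 0 when false, so an observation at both reference levels contributes only the intercept to eta.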

 
