Solved: Re: How to simulate data from Poisson regression

john1111 · Posted 02-05-2018 06:07 AM

I was reading a paper on Bootstrapping with models for count data. (Page 1170) www.researchgate.net/publication/51738951_Bootstrapping_with_Models_for_Count_Data

The author then says that...
...data similar to the observed data were simulated with the expected values of counts given by Eq. (4) and with the Poisson error inflated by the factor 2.75 using the zero inflated count model where an observed count is either the value zero with probability p or a random value from a Poisson distribution with probability 1 − p.

I am trying to reproduce this, so the first thing I did was fit the model that produces Eq. (4) above. (Data attached.)

proc genmod data = WORK.swim;
  class Swimmer(Ref="Occas") Location (ref="NonBeach") Age(ref="15-19") Sex(ref="Female")/param=ref;
  model Infections = Swimmer Location Age Sex / dist=poisson;
run;

Using the coefficients, I'm then trying to simulate data similar to the observed data, but I can't figure out how to simulate the independent variables, especially because I can't determine their distribution from the attached data.

%let N = 287;
%let nCont = 5;

data SimReg1(keep= Y x:);
call streaminit(54321);
array x[&nCont];

array beta[0:&nCont] _temporary_ (1.0234 -0.6115 -0.5345 -0.3744 -0.1897 -0.0899);

do i = 1 to &N;
   /* How do I distribute the independent variables so that when I re-run
      the model on the new data I get the same coefficients as above?
      x[1] = ...
      x[2] = ...
      x[3] = ...
      x[4] = ...
   */

   eta = beta[0];
   do j = 1 to &nCont;
      eta = eta + beta[j] * x[j];
   end;
   lambda = exp(eta);
   Y = rand("Poisson", lambda);
   output;
end;
run;

How do I distribute the independent variables so that when I re-run the model on the simulated data, I get the same coefficients?

Rick_SAS · Posted 02-05-2018 01:52 PM

It looks like you might have already looked at the blog posts "Simulate data for a linear regression model" and/or "Simulating data for a logistic model." If I understand your question, the answer is that you need to use the WORK.swim data set to provide the design matrix for the explanatory variables. It sounds like the researcher simulated the RESPONSE variables multiple times for a FIXED set of explanatory variables. This is the usual thing to do in fixed-effect models.

If you have my book Simulating Data with SAS, there are examples in Section 11.3 (pp. 202-204). Basically, you use the SET statement to read the data and use the implicit DATA step loop (instead of DO i = 1 TO &N) to iterate over the observations. It is often convenient to use an ARRAY statement to read the explanatory variables into an array.
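A minimal sketch of that pattern, not the listing from the book. It assumes the explanatory variables are already numeric 0/1 dummy columns named x1-x5 (hypothetical names) and reuses the coefficients from the BETA array in the question. The tiny Design data set is made up purely so the step runs; in practice the SET statement would read the design built from WORK.swim, and the order of the columns would have to match the order of the coefficients.

```sas
/* Toy design matrix for illustration only; NOT the swim data.
   x1-x5 are hypothetical 0/1 dummy variables. */
data Design;
input x1-x5;
datalines;
0 0 0 0 0
1 0 0 0 1
0 1 1 0 0
1 1 0 1 1
;

data SimReg(keep=Y x1-x5);
call streaminit(54321);
set Design;                       /* implicit loop: one pass per observation */
array x[5] x1-x5;                 /* explanatory variables as an array */
array beta[0:5] _temporary_ (1.0234 -0.6115 -0.5345 -0.3744 -0.1897 -0.0899);
eta = beta[0];                    /* intercept */
do j = 1 to dim(x);
   eta = eta + beta[j]*x[j];      /* linear predictor */
end;
lambda = exp(eta);                /* Poisson mean via the inverse log link */
Y = rand("Poisson", lambda);      /* simulated response for this observation */
run;
```

Because there is no explicit OUTPUT statement, each observation read by SET is written once, with its simulated response Y.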

A complete example also appears in Tip #9 of "Ten Tips for Simulating Data with SAS" (Wicklin 2015, pp. 9-11). The example in that paper is for a linear regression, but the flow of control for the simulation is the same for generalized linear models. Instead of using the Explanatory data set shown in the paper, use WORK.swim.

john1111 · Posted 02-06-2018 05:16 AM

Just got your book and it's on point and very clear, thanks a lot. But there is one thing I don't understand:

Suppose I fit a model using two different techniques (as in the paper cited above), i.e., bootstrapping and ordinary MLE, and get different standard errors for the coefficient estimates. I then simulate data similar to the observed data using the MLE estimates (as is done in your book).

The author of the paper uses these simulated results to compare the two techniques:

"...It was found that the standard errors from the MLE analysis were on average about 5% too low, the standard errors from bootstrap resampling of cases were on average about 3% too high..."

My question is this, does it mean that this simulated data is assumed to be the real data so that we can check how it is deviating when analyzed using different techniques?

Rick_SAS · Posted 02-06-2018 08:29 AM

> My question is this, does it mean that this simulated data is assumed to be the real data so that we can check how it is deviating when analyzed using different techniques?

Yes. That is almost always the assumption of simulation studies. We draw many samples from (a model of) a population as a way to "repeat" the data collection scheme. We then analyze the simulated samples as if they were real data.

I wouldn't worry too much about the small differences (5% or 3%) between various estimates of the standard error.

I hesitate to criticize a paper that I have not studied carefully, but bear in mind that the standard error is not a property of the population (model); it is a property of the estimation method, and the author cannot know the "true" value of the standard error.

The reference value in the paper is the Monte Carlo estimate from 1000 simulated data sets. This is the standard deviation of the 1000 regression estimates. The standard deviation is itself a statistic that depends on the variance of the data, higher-order moments, and the number of Monte Carlo simulations. So the "baseline" on which he bases his results is not truth, but is an estimate that itself can vary by several percentage points.

Thus a more conservative conclusion is that the parametric tests (which are based on an assumption of asymptotic normality of the regression estimates) produce a standard error on these data that is slightly smaller than the standard error from bootstrap resampling of cases.
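The Monte Carlo reference value described above could be computed along these lines. This is a sketch under stated assumptions, not the paper's code: it assumes the simulated samples are stacked in a data set named RegSim that contains SampleID, the simulated response Y, and the original classification variables from WORK.swim. The output data set name PE is hypothetical.

```sas
/* Assumes RegSim holds all simulated samples, indexed by SampleID */
proc sort data=RegSim;
   by SampleID;
run;

ods exclude all;                  /* suppress printed output for each BY group */
proc genmod data=RegSim;
   by SampleID;                   /* refit the model to each simulated sample */
   class Swimmer(ref="Occas") Location(ref="NonBeach")
         Age(ref="15-19") Sex(ref="Female") / param=ref;
   model Y = Swimmer Location Age Sex / dist=poisson;
   ods output ParameterEstimates=PE;   /* PE is a hypothetical name */
run;
ods exclude none;

/* For each parameter: the standard deviation of Estimate across samples is the
   Monte Carlo reference; the mean of StdErr is the average model-based standard error */
proc means data=PE std mean;
   class Parameter Level1 / missing;   /* MISSING keeps rows with a blank Level1 */
   var Estimate StdErr;
run;
```

Comparing the standard deviation of Estimate with the mean of StdErr for each parameter shows how far the model-based standard errors sit from the Monte Carlo value, which is itself an estimate.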

john1111 · Posted 02-06-2018 09:34 AM

You are the best, thanks a lot

john1111 · Posted 02-06-2018 08:09 AM

Suppose my independent variables are categorical. How can I handle that in a DATA step, given that the CLASS statement does not work there? (See the commented-out line.)

%let NumSamples = 1000;
data RegSim(drop=eta);
call streaminit(123);
set Work.swim;
ObsNum = _N_;
   *class Swimmer(Ref="Occas") Location (ref="NonBeach") Age(ref="15-19") Sex(ref="Female")/param=ref;
eta = 1.0234 - (0.1665*Swimmer)- (0.5345*Location) - (0.3744*Age) -(0.1897*Age) - (0.0899*Sex);
lambda = exp(eta);   /* Poisson mean via the inverse log link */
do SampleID = 1 to &NumSamples;
   Y = rand("Poisson",lambda);
   output;
end;
run;

Rick_SAS · Posted 02-06-2018 08:41 AM

I would generate dummy variables by using one of several SAS procedures, such as PROC GLMMOD.
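A minimal sketch of that step, assuming the variable names from the PROC GENMOD call earlier in the thread. The output data set names SwimDesign and SwimParm are hypothetical.

```sas
/* Write the design matrix to a data set without printing it */
proc glmmod data=Work.swim outdesign=SwimDesign outparm=SwimParm noprint;
   class Swimmer Location Age Sex;
   model Infections = Swimmer Location Age Sex;
run;

/* SwimParm maps the design columns (Col1, Col2, ...) to effects and levels */
proc print data=SwimParm;
run;
```

Note that PROC GLMMOD uses the GLM parameterization: it creates an intercept column and one dummy column for every level of each classification variable, including the reference levels. To match PARAM=REF estimates, use only the columns for the non-reference levels (equivalently, treat the coefficient of each reference-level column as zero).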

If you feel confident in your understanding of reference parameterization, you can generate the dummy values directly in the DATA step. This is more prone to error and the correct code depends on the parameterization:

eta = 1.0234 - 0.1665*(Swimmer="Freq")
             - 0.5345*(Location="Beach") - ... - 0.0899*(Sex="Male");
