How to simulate data from Poisson regression

Posted 02-05-2018 06:07 AM

I was reading the paper "Bootstrapping with Models for Count Data" (page 1170): www.researchgate.net/publication/51738951_Bootstrapping_with_Models_for_Count_Data

The author says: *...data similar to the observed data were simulated with the expected values of counts given by Eq. (4) and with the Poisson error inflated by the factor 2.75 using the zero inflated count model where an observed count is either the value zero with probability p or a random value from a Poisson distribution with probability 1 − p.*

I am trying to reproduce this, so the first thing I did was fit the model that produces Eq. (4) above. (Data attached.)

```
proc genmod data = WORK.swim;
class Swimmer(Ref="Occas") Location (ref="NonBeach") Age(ref="15-19") Sex(ref="Female")/param=ref;
model Infections = Swimmer Location Age Sex / dist=poisson;
run;
```

Using the coefficients, I'm then trying to simulate data similar to the observed data, but I can't figure out how to simulate the independent variables, especially because I can't determine their distribution from the attached data.

```
%let N = 287;
%let nCont = 5;
data SimReg1(keep= Y x:);
   call streaminit(54321);
   array x[&nCont];
   array beta[0:&nCont] _temporary_ (1.0234 -0.6115 -0.5345 -0.3744 -0.1897 -0.0899);
   do i = 1 to &N;
      /* How do I distribute the independent variables so that when I
         re-run the model on the new data I get the same coefficients above?
         x[1] = ...
         x[2] = ...
         x[3] = ...
         x[4] = ..
      */
      eta = beta[0];
      do j = 1 to &nCont;
         eta = eta + beta[j] * x[j];
      end;
      lambda = exp(eta);              /* Poisson mean under the log link */
      Y = rand("Poisson", lambda);
      output;
   end;
run;
```

How do I distribute the independent variables so that when I re-run the model I get the same coefficients?

1 ACCEPTED SOLUTION

It looks like you might have already looked at the blog posts "Simulate data for a linear regression model" and/or "Simulating data for a logistic model." If I understand your question, the answer is that you need to use the WORK.swim data set to provide the design matrix for the explanatory variables. It sounds like the researcher simulated the RESPONSE variables multiple times for a FIXED set of explanatory variables. This is the usual thing to do in fixed-effect models.

If you have my book *Simulating Data with SAS*, there are examples in Section 11.3 (pp. 202-204). Basically, you use the SET statement to read the data and use the implicit DATA step loop (instead of DO i = 1 TO &N) to iterate over the observations. It is often convenient to use an ARRAY statement to read the explanatory variables into an array.
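A minimal sketch of that flow of control, not the code from the book. It assumes the explanatory variables have already been coded as numeric dummy columns `x1-x5` in a data set called `Work.Design`; both names are hypothetical. The coefficients are the ones listed in the question.

```sas
%let NumSamples = 1000;

data PoisSim(drop=j eta);
   call streaminit(54321);
   set Work.Design;                 /* implicit loop: one pass per observation */
   array x[5] x1-x5;                /* hypothetical numeric dummy variables */
   array beta[0:5] _temporary_
         (1.0234 -0.6115 -0.5345 -0.3744 -0.1897 -0.0899);
   ObsNum = _N_;                    /* remember the original row */
   eta = beta[0];                   /* linear predictor */
   do j = 1 to dim(x);
      eta = eta + beta[j]*x[j];
   end;
   lambda = exp(eta);               /* Poisson mean under the log link */
   do SampleID = 1 to &NumSamples;  /* simulate many responses for this fixed row */
      Y = rand("Poisson", lambda);
      output;
   end;
run;

/* Arrange the output so that each SampleID is one complete simulated data set */
proc sort data=PoisSim;
   by SampleID ObsNum;
run;
```

Each value of `SampleID` then identifies one simulated sample with the same design as the observed data, ready for a BY-group analysis.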

A complete example also appears in Tip #9 of "Ten Tips for Simulating Data with SAS" (Wicklin 2015, pp. 9-11). The example in that paper is for a linear regression, but the flow of control for the simulation is the same for generalized linear models. Instead of using the Explanatory data set shown in the paper, use WORK.swim.
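The zero-inflated model quoted in the question (zero with probability p, otherwise a Poisson draw) can be sketched with the same RAND calls. The values of `p` and `lambda` below are hypothetical placeholders, not values from the paper; in a regression setting `lambda` would be `exp(eta)` for each observation.

```sas
%let NumSamples = 1000;

data ZIPSim;
   call streaminit(54321);
   p      = 0.2;    /* hypothetical zero-inflation probability */
   lambda = 2.5;    /* hypothetical Poisson mean */
   do SampleID = 1 to &NumSamples;
      if rand("Bernoulli", p) then Y = 0;   /* structural zero, probability p */
      else Y = rand("Poisson", lambda);     /* Poisson count, probability 1 - p */
      output;
   end;
run;
```

Under this mixture the expected count is (1 - p)*lambda rather than lambda, so lambda has to be rescaled if the simulated means are supposed to match the fitted Poisson means.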

6 REPLIES

Just got your book and it's on point and very clear, thanks a lot. But there is one thing I don't understand.

Suppose I fit a model using different techniques (as in the paper cited above), i.e., bootstrapping and ordinary MLE, and get different standard errors for the coefficient estimates. I then simulate data similar to the observed data using the MLE estimates (as is done in your book).

The author of the paper uses these simulated results to compare the two techniques:

*"...It was found that the standard errors from the MLE analysis were on average about 5% too low, the standard errors from bootstrap resampling of cases were on average about 3% too high..."*

My question is this: does it mean that the simulated data are assumed to be the real data, so that we can check how the results deviate when the data are analyzed using different techniques?

*> My question is this: does it mean that the simulated data are assumed to be the real data, so that we can check how the results deviate when the data are analyzed using different techniques?*

Yes. That is almost always the assumption of simulation studies. We draw many samples from (a model of) a population as a way to "repeat" the data collection scheme. We then analyze the simulated samples as if they were real data.

I wouldn't worry too much about the small differences (5% or 3%) between various estimates of the standard error.

I hesitate to criticize a paper that I have not studied carefully, but bear in mind that the standard error is not a property of the population (model); it is a property of the estimation method, and the author cannot know the "true" value of the standard error.

The reference value in the paper is the Monte Carlo estimate from 1000 simulated data sets. This is the standard deviation of the 1000 regression estimates. The standard deviation is itself a statistic that depends on the variance of the data, higher-order moments, and the number of Monte Carlo simulations. So the "baseline" on which he bases his results is not truth, but is an estimate that itself can vary by several percentage points.

Thus a more conservative conclusion is that the parametric t-tests (which are based on an assumption of asymptotic normality of the regression estimates) produce a standard error on these data that is slightly larger than the standard error from either bootstrap method.
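As a rough illustration of how such a Monte Carlo reference value is obtained (a sketch, not the paper's code): fit the model to every simulated sample with a BY statement, then take the standard deviation of the estimates. The data set name `Sim` is hypothetical and is assumed to contain `SampleID`, the simulated response `Y`, and the original classification variables. The variable names `Parameter`, `Level1`, `Estimate`, and `StdErr` are the ones I expect in the GENMOD ParameterEstimates table; check them on your system.

```sas
proc sort data=Sim;
   by SampleID;
run;

ods exclude all;                      /* suppress printed output for each BY group */
proc genmod data=Sim;
   by SampleID;
   class Swimmer(ref="Occas") Location(ref="NonBeach")
         Age(ref="15-19") Sex(ref="Female") / param=ref;
   model Y = Swimmer Location Age Sex / dist=poisson;
   ods output ParameterEstimates=PE;  /* one set of estimates per sample */
run;
ods exclude none;

/* STD of Estimate  = Monte Carlo estimate of the standard error
   MEAN of StdErr   = average model-based (MLE) standard error    */
proc means data=PE mean std;
   class Parameter Level1 / missing;  /* MISSING keeps rows with a blank Level1 */
   var Estimate StdErr;
run;
```

Comparing the two columns for each parameter shows how far the model-based standard errors are from the Monte Carlo value, keeping in mind that the Monte Carlo value is itself an estimate.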

You are the best, thanks a lot

Suppose my independent variables are categorical. How can I handle that in a DATA step, given that the CLASS statement does not work there (see the commented-out line)?

```
%let NumSamples = 1000;
data RegSim(drop=eta);
   call streaminit(123);
   set Work.swim;
   ObsNum = _N_;
   *class Swimmer(Ref="Occas") Location (ref="NonBeach") Age(ref="15-19") Sex(ref="Female")/param=ref;
   eta = 1.0234 - (0.1665*Swimmer) - (0.5345*Location) - (0.3744*Age) - (0.1897*Age) - (0.0899*Sex);
   lambda = exp(eta);                 /* Poisson mean under the log link */
   do SampleID = 1 to &NumSamples;
      Y = rand("Poisson", lambda);
      output;
   end;
run;
```

I would generate dummy variables by using one of several SAS procedures, such as PROC GLMMOD.
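A minimal sketch of the PROC GLMMOD route, using the variable names from the PROC GENMOD call in the question. The output data set names `Design` and `Parm` are arbitrary choices.

```sas
/* Write the design matrix (Col1, Col2, ...) to a data set */
proc glmmod data=Work.swim outdesign=Design outparm=Parm noprint;
   class Swimmer Location Age Sex;
   model Infections = Swimmer Location Age Sex;
run;

/* Parm maps each design column to its effect and level */
proc print data=Parm;
run;
```

Note that PROC GLMMOD uses the GLM parameterization, which creates one column for every level of each classification variable, including the reference level. When you apply coefficients from a reference parameterization, the columns for the reference levels get a coefficient of zero.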

If you feel confident in your understanding of reference parameterization, you can generate the dummy values directly in the DATA step. This is more prone to error and the correct code depends on the parameterization:

```
eta = 1.0234 - 0.1665*(Swimmer="Freq")
- 0.5345*(Location="Beach") - ... - 0.0899*(Sex="Male");
```
