turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- SAS Procedures
- /
- How should I simulate Zero inflated data with extr...

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

03-21-2018 06:23 AM

I want to simulate Zero inflated Poisson data sets 1000 times. I am using predicted coefficients and independent variables that are set as in the code below. But I have to include more conditions;

1. I want 70% of the resulting counts(the simulated y) in the interval [5,14] be rounded of to 10 for each data set.

The thing is that the first code below simulates each observation 1000 times, then i'm using the second code to order the data so that I have 1000 data sets. So the rounding is required for each data set.

```
%let NumSamples = 1000; /* number of samples */
data Sim_data(drop=mu);
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples;
ObsNum = _N_;
mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
ypoi = ranpoi(1234,mu);
pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
if ranuni(1234)>pzero then do;
ypoizim = ypoi;
end;
else do;
ypoizim = 0;
end;
y=ypoizim;
output;
end;
run;
```

```
proc sort data=Sim_data;
by SampleID ObsNum;
run;
```

2. Again the same thing, rounding of 20% of resulting counts in the interval [25,35] to 30

Accepted Solutions

Solution

03-25-2018
03:37 PM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to john1111

03-21-2018 09:03 AM

Probably you want something similar to this:

```
%let NumSamples = 5; /* number of samples */
data Sim_data;
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples;
ObsNum = _N_;
mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
ypoi = rand("poisson", mu);
pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
if rand("Bernoulli", pzero) then
ypoizim = ypoi;
else
ypoizim = 0;
y = ypoizim;
/* I want 70% of the resulting counts(the simulated y) in the interval [5,14]
to be rounded of to 10 for each data set */
if (5 <= y <= 14) & rand("Bernoulli", 0.7) then
y10 = round(y, 10); /* round to nearest 10 */
else
y10 = y;
/* Again the same thing, rounding of 20% of resulting counts
in the interval [25,35] to 30 */
if (25 <= y <= 30) & rand("Bernoulli", 0.2) then
y30 = round(y, 10); /* round to nearest 10 */
else
y30 = y;
output;
end;
run;
```

Incidentally, you are setting the random number seed by using STREAMINIT, but then you are using the old-style random number functions instead of RAND. If you are doing serious work, you should use RAND, which has better statistical properties. I updated the code to use RAND.

All Replies

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to john1111

03-21-2018 08:27 AM

Calling @Rick_SAS

Solution

03-25-2018
03:37 PM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to john1111

03-21-2018 09:03 AM

Probably you want something similar to this:

```
%let NumSamples = 5; /* number of samples */
data Sim_data;
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples;
ObsNum = _N_;
mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
ypoi = rand("poisson", mu);
pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
if rand("Bernoulli", pzero) then
ypoizim = ypoi;
else
ypoizim = 0;
y = ypoizim;
/* I want 70% of the resulting counts(the simulated y) in the interval [5,14]
to be rounded of to 10 for each data set */
if (5 <= y <= 14) & rand("Bernoulli", 0.7) then
y10 = round(y, 10); /* round to nearest 10 */
else
y10 = y;
/* Again the same thing, rounding of 20% of resulting counts
in the interval [25,35] to 30 */
if (25 <= y <= 30) & rand("Bernoulli", 0.2) then
y30 = round(y, 10); /* round to nearest 10 */
else
y30 = y;
output;
end;
run;
```

Incidentally, you are setting the random number seed by using STREAMINIT, but then you are using the old-style random number functions instead of RAND. If you are doing serious work, you should use RAND, which has better statistical properties. I updated the code to use RAND.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Rick_SAS

03-22-2018 03:38 AM - edited 03-22-2018 04:21 AM

Thanks a lot for the solution. That is precisely what I was looking for. But I just want to confirm something. I am coming up with some heaped data model, then I am planning to use simulation to verify that the heaped model would perform better for parameter estimation. So I simulated my original count observations then rounded off a few observations to get data that is similar to the real and observed data. If I simulate the data like this is it statistically correct because I suspect that the 1000 different data sets may be so varied from the original data set that I may actually not be able to make real comparisons.

My suspicion comes from the fact that after simulation I can't see the heaping for each data set and again the zeros are just too many (Extreme, I mean I expect it to be zero inflated but its more than that). In general my simulated data sets looses their property to the extent that I can't run the model on it.

How can I ensure that the properties of my data are maintained. I mean, my original data had counts between 0 to 60. My new data has counts only between 0 to 13. My original data was heaped at 25 therefore obviously I won't see that because the simulated data has no 25. How can I simulate heaped data?

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to john1111

03-22-2018 06:05 AM

Could you explain what you mean by "a heaped model" and "heaped at 25"? Or post data, if possible.

The key problem in constructing a simulation study is how to simulate the data. Most people do what you did: propose a MODEL of the data and simulate from the MODEL. Your results will match the data only if the model is a good fit.If the observed data has properties that the model does not (skewness, outliers, etc), then the results of the study might not capture those properties.

If you can't find a good parametric model for the data, you can use a flexible family of distributions (see Chapter 16 of *SImulating Data with SAS*) or use the bootstrap, which is a nonparametric model (see Chapter 15). For more about ways to simulate data, see the diagram and discussion at https://blogs.sas.com/content/iml/2018/01/15/eyeball-distribution.html

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Rick_SAS

03-22-2018 06:35 AM

The data is as attached then I'm calling it heaped at point 5,10,12,15,20,15.

Because of the frequency of these numbers as shown in the following plot.

I have proposed a model and I want to use simulation to very either;

1. The model gives estimates of counts, y that are very close to the observed y.

OR

2. That the model provides better results based on standard error of the estimates.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to john1111

03-22-2018 08:42 AM

I see. So the response would be something like "how long since you visited a foreign country. Many people will reply "5 years" or "10 years" because they are estimating something that happened long ago.

I do not have any experience modeling data like these. I suggest you do a literature review. For example, the paper by van der Laan and Kuijvenhoven (2011) would be one way to treat this issue. A mixture model would be another. Good luck!