Solved: Re: How should I simulate Zero inflated data with extra conditions

john1111 · Posted 03-21-2018 06:23 AM

I want to simulate Zero inflated Poisson data sets 1000 times. I am using predicted coefficients and independent variables that are set as in the code below. But I have to include more conditions;

1. I want 70% of the resulting counts(the simulated y) in the interval [5,14] be rounded of to 10 for each data set.

The thing is that the first code below simulates each observation 1000 times, then i'm using the second code to order the data so that I have 1000 data sets. So the rounding is required for each data set.

%let NumSamples = 1000;  /* number of samples */
data Sim_data(drop=mu);
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples; 
ObsNum = _N_; 
mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
ypoi = ranpoi(1234,mu);
pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
if ranuni(1234)>pzero then do;
ypoizim = ypoi;
end;
else do;
ypoizim = 0;
end;
y=ypoizim;
output;
end;
run;

proc sort data=Sim_data; 
by SampleID ObsNum;
run;

2. Again the same thing, rounding of 20% of resulting counts in the interval [25,35] to 30

Rick_SAS · Posted 03-21-2018 09:03 AM

Probably you want something similar to this:

%let NumSamples = 5;  /* number of samples */
data Sim_data;
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples; 
   ObsNum = _N_; 
   mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
   ypoi = rand("poisson", mu);
   pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
   if rand("Bernoulli", pzero) then 
      ypoizim = ypoi;
   else
      ypoizim = 0;
   y = ypoizim;

   /* I want 70% of the resulting counts(the simulated y) in the interval [5,14] 
      to be rounded of to 10 for each data set */
   if (5 <= y <= 14) & rand("Bernoulli", 0.7) then
      y10 = round(y, 10);  /* round to nearest 10 */
   else 
      y10 = y;

   /* Again the same thing, rounding of 20% of resulting counts 
      in the interval [25,35] to 30 */
   if (25 <= y <= 30) & rand("Bernoulli", 0.2) then
      y30 = round(y, 10);  /* round to nearest 10 */
   else 
      y30 = y;
   output;
end;
run;

Incidentally, you are setting the random number seed by using STREAMINIT, but then you are using the old-style random number functions instead of RAND. If you are doing serious work, you should use RAND, which has better statistical properties. I updated the code to use RAND.

View solution in original post

Ksharp · Posted 03-21-2018 08:27 AM

Calling @Rick_SAS

Rick_SAS · Posted 03-21-2018 09:03 AM

Probably you want something similar to this:

%let NumSamples = 5;  /* number of samples */
data Sim_data;
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples; 
   ObsNum = _N_; 
   mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
   ypoi = rand("poisson", mu);
   pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
   if rand("Bernoulli", pzero) then 
      ypoizim = ypoi;
   else
      ypoizim = 0;
   y = ypoizim;

   /* I want 70% of the resulting counts(the simulated y) in the interval [5,14] 
      to be rounded of to 10 for each data set */
   if (5 <= y <= 14) & rand("Bernoulli", 0.7) then
      y10 = round(y, 10);  /* round to nearest 10 */
   else 
      y10 = y;

   /* Again the same thing, rounding of 20% of resulting counts 
      in the interval [25,35] to 30 */
   if (25 <= y <= 30) & rand("Bernoulli", 0.2) then
      y30 = round(y, 10);  /* round to nearest 10 */
   else 
      y30 = y;
   output;
end;
run;

Incidentally, you are setting the random number seed by using STREAMINIT, but then you are using the old-style random number functions instead of RAND. If you are doing serious work, you should use RAND, which has better statistical properties. I updated the code to use RAND.

john1111 · Posted 03-22-2018 03:38 AM

Thanks a lot for the solution. That is precisely what I was looking for. But I just want to confirm something. I am coming up with some heaped data model, then I am planning to use simulation to verify that the heaped model would perform better for parameter estimation. So I simulated my original count observations then rounded off a few observations to get data that is similar to the real and observed data. If I simulate the data like this is it statistically correct because I suspect that the 1000 different data sets may be so varied from the original data set that I may actually not be able to make real comparisons.

My suspicion comes from the fact that after simulation I can't see the heaping for each data set and again the zeros are just too many (Extreme, I mean I expect it to be zero inflated but its more than that). In general my simulated data sets looses their property to the extent that I can't run the model on it.

How can I ensure that the properties of my data are maintained. I mean, my original data had counts between 0 to 60. My new data has counts only between 0 to 13. My original data was heaped at 25 therefore obviously I won't see that because the simulated data has no 25. How can I simulate heaped data?

Rick_SAS · Posted 03-22-2018 06:05 AM

Could you explain what you mean by "a heaped model" and "heaped at 25"? Or post data, if possible.

The key problem in constructing a simulation study is how to simulate the data. Most people do what you did: propose a MODEL of the data and simulate from the MODEL. Your results will match the data only if the model is a good fit.If the observed data has properties that the model does not (skewness, outliers, etc), then the results of the study might not capture those properties.

If you can't find a good parametric model for the data, you can use a flexible family of distributions (see Chapter 16 of SImulating Data with SAS) or use the bootstrap, which is a nonparametric model (see Chapter 15). For more about ways to simulate data, see the diagram and discussion at https://blogs.sas.com/content/iml/2018/01/15/eyeball-distribution.html

john1111 · Posted 03-22-2018 06:35 AM

The data is as attached then I'm calling it heaped at point 5,10,12,15,20,15.

Because of the frequency of these numbers as shown in the following plot.

I have proposed a model and I want to use simulation to very either;

1. The model gives estimates of counts, y that are very close to the observed y.

OR

2. That the model provides better results based on standard error of the estimates.

Rick_SAS · Posted 03-22-2018 08:42 AM

I see. So the response would be something like "how long since you visited a foreign country. Many people will reply "5 years" or "10 years" because they are estimating something that happened long ago.

I do not have any experience modeling data like these. I suggest you do a literature review. For example, the paper by van der Laan and Kuijvenhoven (2011) would be one way to treat this issue. A mixture model would be another. Good luck!

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!