BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
john1111
Obsidian | Level 7

I want to simulate Zero inflated Poisson data sets 1000 times. I am using predicted coefficients and independent variables that are set as in the code below. But I have to include more conditions;

 

1. I want 70% of the resulting counts(the simulated y) in the interval [5,14] be rounded of to 10 for each data set.

 

    The thing is that the first code below simulates each observation 1000 times, then i'm using the second code to order the data so   that I have 1000 data sets. So the rounding is required for each data set.

 

%let NumSamples = 1000;  /* number of samples */
data Sim_data(drop=mu);
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples; 
ObsNum = _N_; 
mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
ypoi = ranpoi(1234,mu);
pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
if ranuni(1234)>pzero then do;
ypoizim = ypoi;
end;
else do;
ypoizim = 0;
end;
y=ypoizim;
output;
end;
run;

 

 

 

proc sort data=Sim_data; 
by SampleID ObsNum;
run;

2. Again the same thing, rounding of 20% of resulting counts in the interval [25,35] to 30

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

Probably you want something similar to this:

 

%let NumSamples = 5;  /* number of samples */
data Sim_data;
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples; 
   ObsNum = _N_; 
   mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
   ypoi = rand("poisson", mu);
   pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
   if rand("Bernoulli", pzero) then 
      ypoizim = ypoi;
   else
      ypoizim = 0;
   y = ypoizim;

   /* I want 70% of the resulting counts(the simulated y) in the interval [5,14] 
      to be rounded of to 10 for each data set */
   if (5 <= y <= 14) & rand("Bernoulli", 0.7) then
      y10 = round(y, 10);  /* round to nearest 10 */
   else 
      y10 = y;

   /* Again the same thing, rounding of 20% of resulting counts 
      in the interval [25,35] to 30 */
   if (25 <= y <= 30) & rand("Bernoulli", 0.2) then
      y30 = round(y, 10);  /* round to nearest 10 */
   else 
      y30 = y;
   output;
end;
run;

Incidentally, you are setting the random number seed by using STREAMINIT, but then you are using the old-style random number functions instead of RAND. If you are doing serious work, you should use RAND, which has better statistical properties. I updated the code to use RAND.

View solution in original post

6 REPLIES 6
Rick_SAS
SAS Super FREQ

Probably you want something similar to this:

 

%let NumSamples = 5;  /* number of samples */
data Sim_data;
set WORK.xxx;
call streaminit(1234);
do SampleID=1 to &NumSamples; 
   ObsNum = _N_; 
   mu = exp(0.6773 - 0.01036*Age + 0.2080*height);
   ypoi = rand("poisson", mu);
   pzero = cdf("LOGISTIC",-0.9792 +0.03796*Age -0.1488*height);
   if rand("Bernoulli", pzero) then 
      ypoizim = ypoi;
   else
      ypoizim = 0;
   y = ypoizim;

   /* I want 70% of the resulting counts(the simulated y) in the interval [5,14] 
      to be rounded of to 10 for each data set */
   if (5 <= y <= 14) & rand("Bernoulli", 0.7) then
      y10 = round(y, 10);  /* round to nearest 10 */
   else 
      y10 = y;

   /* Again the same thing, rounding of 20% of resulting counts 
      in the interval [25,35] to 30 */
   if (25 <= y <= 30) & rand("Bernoulli", 0.2) then
      y30 = round(y, 10);  /* round to nearest 10 */
   else 
      y30 = y;
   output;
end;
run;

Incidentally, you are setting the random number seed by using STREAMINIT, but then you are using the old-style random number functions instead of RAND. If you are doing serious work, you should use RAND, which has better statistical properties. I updated the code to use RAND.

john1111
Obsidian | Level 7

Thanks a lot for the solution. That is precisely what I was looking for. But I just want to confirm something. I am coming up with some heaped data model, then I am planning to use simulation to verify that the heaped model would perform better for parameter estimation. So I simulated my original count observations then rounded off a few observations to get data that is similar to the real and observed data. If I simulate the data like this is it statistically correct because I suspect that the 1000 different data sets may be so varied from the original data set that I may actually not be able to make real comparisons.

 

My suspicion comes from the fact that after simulation I can't see the heaping for each data set and again the zeros are just too many (Extreme, I mean I expect it to be zero inflated but its more than that). In general my simulated data sets looses their property to the extent that I can't run the model on it.

 

How can I ensure that the properties of my data are maintained. I mean, my original data had counts between 0 to 60. My new data has counts only between 0 to 13. My original data was heaped at 25 therefore obviously I won't see that because the simulated data has no 25. How can I simulate heaped data?

Rick_SAS
SAS Super FREQ

Could you explain what you mean by "a heaped model" and "heaped at 25"? Or post data, if possible.

 

The key problem in constructing a simulation study is how to simulate the data. Most people do what you did: propose a MODEL of the data and simulate from the MODEL. Your results will match the data only if the model is a good fit.If the observed data has properties that the model does not (skewness, outliers, etc), then the results of the study might not capture those properties. 

 

If you can't find a good parametric model for the data, you can use a flexible family of distributions (see Chapter 16 of SImulating Data with SAS) or use the bootstrap, which is a nonparametric model (see Chapter 15). For more about ways to simulate data, see the diagram and discussion at https://blogs.sas.com/content/iml/2018/01/15/eyeball-distribution.html

 

john1111
Obsidian | Level 7

The data is as attached then I'm calling it heaped at point 5,10,12,15,20,15.

Because of the frequency of these numbers as shown in the following plot.

 

I have proposed a model and I want to use simulation to very either;

            1. The model gives estimates of counts, y that are very close to the observed y.

                                   OR

            2. That the model provides better results based on standard error of the estimates.

 

heaped_plot.PNG

Rick_SAS
SAS Super FREQ

I see. So the response would be something like "how long since you visited a foreign country. Many people will reply "5 years" or "10 years" because they are estimating something that happened long ago.

 

I do not have any experience modeling data like these. I suggest you do a literature review.  For example, the paper by van der Laan and Kuijvenhoven (2011) would be one way to treat this issue. A mixture model would be another.  Good luck!

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 2763 views
  • 2 likes
  • 3 in conversation