BookmarkSubscribeRSS Feed
T_Reddy
Obsidian | Level 7

Hi there. I am trying to use the parametric bootstrap to construct prediction intervals for 5, 10 and 15 day forecasts of the cumulative cases of a disease . I have fitted a three parameter logistic model to cumulative cases and use these parameter estimates to simulate daily new cases (the derivative of cumulative cases which is a function of cumulative cases) for the 1000 datasets. My bootstrap confidence intervals do not contain the true value for the various predictions though and I am not sure what I am doing wrong. I would greatly appreciate any assistance.

/* Simulate multiple samples from a regression model */

data explanatory;set covid_window;keep time i_25 i_30 i_35 i_40;where time le 48;run;

proc print data=parm2_log3pl;var parm_log3pl est_log3pl;run;

%let NumSamples = 1000; /* number of samples */

data RegSim;

call streaminit(123);

set Explanatory; /* implicit loop over obs*/

ObsNum = _N_; /* observation number */

b1=2037.03 ;

b11=2555.72;

b2=0.3301;

b21=-0.2659;

b24=0.07861;

b23=0.001893;

b22=0.003788;

b3=22.8728;

b31=17.8590;

 

gamma=b2+(b21*i_25)+(b22*i_30)+(b23*i_35)+(b24*i_40);

eta=b3+(b31*i_25);

alpha=b1+(b11*i_25);

mu = alpha/(1+exp(-gamma*(time-eta)));

lambda = (gamma*mu)*(1-(mu/alpha));

do SampleID = 1 to &NumSamples;

Y = rand("Poisson",lambda);

output;

end;

run;

 

 

proc sort data=RegSim; /* sort for BY-group processing */

by SampleID ObsNum;

run;

 

/********* Now generate cumulative************/

data RegSim;set RegSim;by SampleID;

if first.SampleID then

do;

total_cases = 0;

end;

 

total_cases + Y;

run;

data pred_window;

do SampleID=1 to 1000;

do time=49 to 100;

output;

end;

end;

run;

data covid_window;set RegSim pred_window;run;

data covid_window;set covid_window;

i_25=.;if time le 25 then i_25=0; if time gt 25 then i_25=1;

i_30=.;if time le 30 then i_30=0; if time gt 30 then i_30=1;

i_35=.;if time le 35 then i_35=0; if time gt 35 then i_35=1;

i_40=.;if time le 40 then i_40=0; if time gt 40 then i_40=1;

run;

proc sort data=covid_window;by SampleID time;run;

proc nlmixed data=covid_window gconv=0;

parms b1=2000 b11=2916 b2=0.43 b21=-0.35 b24=0 b23=0 b22=0 b3=20 b31=18;

by SampleID;

 

gamma=b2+(b21*i_25)+(b22*i_30)+(b23*i_35)+(b24*i_40);

eta=b3+(b31*i_25);

alpha=b1+(b11*i_25);

mu = alpha/(1+exp(-gamma*(time-eta)));

model total_cases ~ poisson(mu);

predict mu out=outputBS_log3pl;

run;

proc sort data=outputBS_log3pl;by time sampleID;run;

proc univariate data=outputBS_log3pl noprint;

var pred;

by time;

output out=Pctls pctlpts = 2.5 97.5

pctlpre = mu

pctlname = lb_2_5 ub_97_5;

run;

 

10 REPLIES 10
Rick_SAS
SAS Super FREQ

Can you post a few observations from the EXPLANATORY data?

T_Reddy
Obsidian | Level 7

 time i_25 i_30 i_35 i_40

10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
110000
120000
130000
140000
150000
160000
170000
180000
190000
200000
210000
220000
230000
240000
250000
261000
271000
281000
291000
301000
311100
321100
331100
341100
351100
361110
371110
381110
391110
401110
411111
421111
431111
441111
451111
461111
471111
481111
Rick_SAS
SAS Super FREQ

According to my computations, the simulation is working fine. The parameter estimates are close to the parameter values and the confidence intervals encompass the parameter values. If you are seeing something strange, please describe it in detail for NUmSamples=3.

 

/* Simulate multiple samples from a regression model */
data explanatory;
input time i_25 i_30 i_35 i_40;
datalines;
1	0	0	0	0
2	0	0	0	0
3	0	0	0	0
4	0	0	0	0
5	0	0	0	0
6	0	0	0	0
7	0	0	0	0
8	0	0	0	0
9	0	0	0	0
10	0	0	0	0
11	0	0	0	0
12	0	0	0	0
13	0	0	0	0
14	0	0	0	0
15	0	0	0	0
16	0	0	0	0
17	0	0	0	0
18	0	0	0	0
19	0	0	0	0
20	0	0	0	0
21	0	0	0	0
22	0	0	0	0
23	0	0	0	0
24	0	0	0	0
25	0	0	0	0
26	1	0	0	0
27	1	0	0	0
28	1	0	0	0
29	1	0	0	0
30	1	0	0	0
31	1	1	0	0
32	1	1	0	0
33	1	1	0	0
34	1	1	0	0
35	1	1	0	0
36	1	1	1	0
37	1	1	1	0
38	1	1	1	0
39	1	1	1	0
40	1	1	1	0
41	1	1	1	1
42	1	1	1	1
43	1	1	1	1
44	1	1	1	1
45	1	1	1	1
46	1	1	1	1
47	1	1	1	1
48	1	1	1	1
;

%let NumSamples = 3; /* number of samples */

data RegSim;
call streaminit(123);
set Explanatory; /* implicit loop over obs*/
ObsNum = _N_; /* observation number */
b1=2037.03 ;
b11=2555.72;
b2=0.3301;
b21=-0.2659;
b24=0.07861;
b23=0.001893;
b22=0.003788;
b3=22.8728;
b31=17.8590;
gamma=b2+(b21*i_25)+(b22*i_30)+(b23*i_35)+(b24*i_40);
eta=b3+(b31*i_25);
alpha=b1+(b11*i_25);
mu = alpha/(1+exp(-gamma*(time-eta)));
lambda = (gamma*mu)*(1-(mu/alpha));
do SampleID = 1 to &NumSamples;
   Y = rand("Poisson",lambda);
   output;
end;
drop b: gamma eta alpha mu lambda;
run;

proc sort data=RegSim; /* sort for BY-group processing */
by SampleID ObsNum;
run;

proc sgplot data=Regsim;
series x=Time y=Y / group=SampleID;
run;
title;

/********* Now generate cumulative************/
data RegSimCum;set RegSim;by SampleID;
if first.SampleID then do;
   total_cases = 0;
end;
total_cases + Y;
run;

proc nlmixed data=RegSimCum gconv=0;
by SampleID;
   parms b1=2000 b11=2916 b2=0.43 b21=-0.35 b24=0 b23=0 b22=0 b3=20 b31=18;
   gamma=b2+(b21*i_25)+(b22*i_30)+(b23*i_35)+(b24*i_40);
   eta=b3+(b31*i_25);
   alpha=b1+(b11*i_25);
   mu = alpha/(1+exp(-gamma*(time-eta)));
   model total_cases ~ poisson(mu);
   predict mu out=outputBS_log3pl;
   ODS SELECT  ParameterEstimates;
run;
T_Reddy
Obsidian | Level 7

Thank you for the speedy response. The issue is that the confidence interval estimated does not contain the value of the actual point estimate (for example the prediction for day 53 is close to 3900 but both the lower and upper bound of the CI are above 4000). It could signal an issue with the algorithm possibly

 

Rick_SAS
SAS Super FREQ

There are MANY estimates in this problem, and I do not know whether you are talking about regression estimates, predicted values (in some data set?), or something else. Please provide a complete example that uses SAS code we can run. 

SteveDenham
Jade | Level 19

I think the problem may lie in the area of the data with which the model is being fit.  The model is fit to data up to day 48, and then is used in forecasting mode from day 49 to day 100.  To have one data point outside the 95% bounds calculated by bootstrapping doesn't sound so bad (assuming the true value at day 53 is what is being compared to the forecast interval at day 53).  I would like to think that the parameter estimates would differ considerably if you fit out to day 60 for instance. The data point that seems incorrectly forecast would likely have a large leverage, and cause a shift in the estimates, with a resulting difference in the forecast values and the confidence bounds compared to the original model.

 

Like @Rick_SAS says, please post data and code.

 

SteveDenham

T_Reddy
Obsidian | Level 7

Thank you for your response. I have been examining the predictions and it looks like the issue is how I have programmed the piecewise model (splines). I use a separate dummy variable for each period (i_25, i_30, I_35 etc) and this results in a discontinuity, I would need to reparameterize, Is there any simple way (like the effect statement) in Proc NLIN or NLMIXED?

 

Rick_SAS
SAS Super FREQ

There is no need to generate the spline bases yourself. You can get SAS to generate it for you. See the article "Visualize a regression with splines," which shows how to use the OUTDESIGN= option in PROC GLMSELECT to generate the spline basis.

T_Reddy
Obsidian | Level 7

This is indeed extremely useful! My model is nonlinear (mu = alpha/(1+exp(-gamma*(time-eta))). It does not appear that any of the procedures which support the effect statement will be able to accomodate our modeling approach. 

Rick_SAS
SAS Super FREQ

I'm not sure what you are saying. After you generate the spline basis, you can use those variables in any procedure.

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 1109 views
  • 5 likes
  • 3 in conversation