BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Fluvio1
Calcite | Level 5

I have adapted the bootstrap example from Oct 24, 2018 Rick Wicklin DO Loop and it successfully runs with my data. However, I am not sure how

to generate the 95% confidence intervals for each predicted observation?

 

I have attached the program and the dataset.  Please advise. Thanks.

1 ACCEPTED SOLUTION

Accepted Solutions
Rick_SAS
SAS Super FREQ

Yes, you can get the bootstrap CLM at each value of the explanatory variable.

First, read about how to perform case resampling in the blog post "Bootstrap regression estimates: Case resampling":

Then, you can read about how to obtain bootstrap estimates of the CLM in the blog post "Bootstrap confidence intervals for the predicted mean in a regression model"

 

For completeness, the following program implements the bootstrap method to estimate the CLM at a set of points that you can specify:

/* choose data to analyze. Optionally, sort. */
data Sample(keep=x y);
   /* rename vars to X and Y to make roles easier to understand */
   set Sashelp.Class(rename=(Weight=Y Height=X));  
run;
/* OPTIONAL: Sort by X variable for future plotting */
proc sort data=Sample;  by X;  run;

/* create a scoring set */
data ScoreData;
   do X = 50 to 75 by 5;
      ScoreId + 1;         /* include an ID variable */
      output;
   end;
run;

/* 1. Compute the statistics on the original data. Estimate CLM by using the asymptotic formula.
      Concatenate the scoring data and use the "missing response trick" 
      https://blogs.sas.com/content/iml/2014/02/17/the-missing-value-trick-for-scoring-a-regression-model.html
*/
data SampleAndScore;
   set Sample ScoreData(in=S);
   IsScore = S;
run;
proc reg data=SampleAndScore plots(only)=FitPlot(nocli);
   model Y = X;
   /* output the theoretical CI for each scoring observation for later comparison */
   output out=RegOut(where=(IsScore=1)) Pred=Pred LCLM=LowerP UCLM=UpperP;
run; quit;
 
/* 2. Generate many bootstrap samples by using PROC SURVEYSELECT */
%let NumSamples = 5000;       /* number of bootstrap resamples */ 
proc surveyselect data=Sample NOPRINT seed=1
     out=BootCases(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                (optional) use OUTHITS option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

/* 3. Compute the statistics for each bootstrap sample. */
proc reg data=BootCases outest=PE noprint;    /* fit the model and save the parameter estimates for each sample */
   by SampleID;
   freq NumberHits;
   model Y = X;
run;quit;
 
proc score data=ScoreData score=PE type=parms predict /* score each model on the scoring data */
           out=BootOut(rename=(MODEL1=Pred));
   by SampleID;
   var X;
run;

/* 4. Compute the bootstrap 95% prediction CL for each row in the score data */
/* Find the 2.5th and 97.5th percentiles for the bootstrap statistics each X value */
proc univariate data=BootOut noprint;
   class ScoreID;
   var Pred;
   output out=BootCL mean=BootMean pctlpre=Pctl_ pctlpts=2.5, 97.5; 
run;

/* merge the REG output and the bootstrap CL */
data All;
   merge RegOut BootCL;
   by ScoreID;
run;

proc print data=All; 
  var ScoreId X BootMean Pctl_2_5 Pctl_97_5;
run;
 
/* visualize the bootstrap mean and CLM */
title "Bootstrap CLM";
title2 "&NumSamples Bootstrap Samples";
proc sgplot data=All;
   series x=x y=Pred / legendlabel="OLS Pred" lineattrs=(color=gray);
   scatter x=x y=BootMean / legendlabel="Boot Pred" markerattrs=(symbol=X);
   highlow x=x low=Pctl_2_5 high=Pctl_97_5 / legendlabel="Boot CLM"
               lowcap=serif highcap=serif lineattrs=GraphError;
   yaxis label="Y";
run;

View solution in original post

10 REPLIES 10
ballardw
Super User

There is no reason to hide programs in attachments unless it is 100's of lines long. Just paste into a text box opened on the forum with the </> icon that appears above the message window.

 

******************************************************************************************************************************************************************************;
**************************************************SAS USER  Fluvio1*******SAS program:   Fluvio_Reg_Boot_CLM_1a.sas   ********************************************************;
****************************************************************Program for SAS support  07282020*****************************************************************************;

************************************************Compute OLS regression and develop upper and lower 95% confidence limits by bootstrapping*********************************;
**************************************************Code below adapted from  Oct 24, 2018 Rick Wicklin DO Loop ************************************************************;


%web_drop_table(WORK.IMPORT);


FILENAME REFFILE '/home/u43067642/Irreff_gen/FInal_POLS_digitizing_error/NEW_DATA_1.xlsx';

PROC IMPORT DATAFILE=REFFILE
	DBMS=XLSX
	OUT=WORK.New_data_1;
	GETNAMES=YES;
RUN;
proc contents data=New_data_1;
run;

/* 1. compute the statistics on the original data */
proc reg data=new_data_1 plot ;
   model Y = X / CLB covb ;                          /* original estimates */
run; quit;		
				
	title "Bootstrap Distribution of Regression Estimates";
title2 "Case Resampling";
%let NumSamples = 5000;       /* number of bootstrap resamples */
%let IntEst =0.05675;     /* original estimates for later visualization */
%let XEst   =  0.99636;
 
/* 2. Generate many bootstrap samples by using PROC SURVEYSELECT */
proc surveyselect data=new_data_1 NOPRINT seed=1
     out=BootCases(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                 use OUTHITS option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;	
				
/* 3. Compute the statistic for each bootstrap sample */
proc reg data=BootCases outest=PEBoot noprint;
   by SampleID;
   freq NumberHits;
   model y=x ;
run;quit;		
	
	/*Data new_data_1;
	length EXPboot_x EXPboot_y  8;
	set bootcases;
	if y='.' then do;
	delete;
	end;
	EXPboot_x=EXP(x);
	EXPboot_y=EXP(y);	
	run;*/		
	
	/* 4. Visualize bootstrap distribution */
proc sgplot data=PEBoot;
   label Intercept = "Estimate of Intercept" X = "Estimate of Coefficient of X";
   scatter x=Intercept y=X / markerattrs=(Symbol=CircleFilled) transparency=0.7;
   /* Optional: draw reference lines at estimates for original data */
   refline &IntEst / axis=x lineattrs=(color=blue);
   refline &XEst / axis=y lineattrs=(color=blue);
   xaxis grid; yaxis grid;
run;
	
proc stdize data=PEBoot vardef=N pctlpts=5 95  PctlMtd=ORD_STAT outstat=Pctls;
   var Intercept X;
run;
proc report data=Pctls nowd;
  where _type_ =: 'P';
  label _type_ = 'Confidence Limit';
  columns ('Bootstrap Confidence Intervals (B=5000)' _ALL_);
run;	
	

Many users here don't want to download Excel files because of virus potential, others have such things blocked by security software. Also if you give us Excel we have to create a SAS data set and due to the non-existent constraints on Excel data cells the result we end up with may not have variables of the same type (numeric or character) and even values.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

Fluvio1
Calcite | Level 5

I am not sure what "should calling Rick......" means ??

Rick_SAS
SAS Super FREQ

KSharp likes to alert experts that there is a post that might interest them. He uses an "@ mention" syntax to include their name in a thread, which sends an email to the expert. So KSharp has already "called" me by mentioning my name. 

 

I think we need to clarify what you are asking for. You can certainly get a 95% bootstrap CI for the predicted MEAN, but I don't think the bootstrap is the right tool to get a CI for individual responses. Which are you trying to find?  In PROC REG, this is the difference between the popular LCLM/UCLM options (CL for the conditional mean) and the rarely used LCLI/UCLI options. See SAS Help Center: Predicted and Residual Values

Fluvio1
Calcite | Level 5

Thanks Rick.  I looked at the SAS help center predicted and residual values but that does not cover the issue I have.  What I would like to do is bootstrap the confidence limits so that I have the 95% LCLM/UCLM values bootstrapped for each of the OLS predicted values.  I do not want  CLI  that would be used if I needed to predict individual responses.  I looked at the  TSpline example (89.5) and tried to adapt it to proc reg  but with no luck. Here is the output similar to what I would like to have  :

Fluvio1_1-1690823962962.png

 

 

 

 

Rick_SAS
SAS Super FREQ

Yes, you can get the bootstrap CLM at each value of the explanatory variable.

First, read about how to perform case resampling in the blog post "Bootstrap regression estimates: Case resampling":

Then, you can read about how to obtain bootstrap estimates of the CLM in the blog post "Bootstrap confidence intervals for the predicted mean in a regression model"

 

For completeness, the following program implements the bootstrap method to estimate the CLM at a set of points that you can specify:

/* choose data to analyze. Optionally, sort. */
data Sample(keep=x y);
   /* rename vars to X and Y to make roles easier to understand */
   set Sashelp.Class(rename=(Weight=Y Height=X));  
run;
/* OPTIONAL: Sort by X variable for future plotting */
proc sort data=Sample;  by X;  run;

/* create a scoring set */
data ScoreData;
   do X = 50 to 75 by 5;
      ScoreId + 1;         /* include an ID variable */
      output;
   end;
run;

/* 1. Compute the statistics on the original data. Estimate CLM by using the asymptotic formula.
      Concatenate the scoring data and use the "missing response trick" 
      https://blogs.sas.com/content/iml/2014/02/17/the-missing-value-trick-for-scoring-a-regression-model.html
*/
data SampleAndScore;
   set Sample ScoreData(in=S);
   IsScore = S;
run;
proc reg data=SampleAndScore plots(only)=FitPlot(nocli);
   model Y = X;
   /* output the theoretical CI for each scoring observation for later comparison */
   output out=RegOut(where=(IsScore=1)) Pred=Pred LCLM=LowerP UCLM=UpperP;
run; quit;
 
/* 2. Generate many bootstrap samples by using PROC SURVEYSELECT */
%let NumSamples = 5000;       /* number of bootstrap resamples */ 
proc surveyselect data=Sample NOPRINT seed=1
     out=BootCases(rename=(Replicate=SampleID))
     method=urs              /* resample with replacement */
     samprate=1              /* each bootstrap sample has N observations */
     /* OUTHITS                (optional) use OUTHITS option to suppress the frequency var */
     reps=&NumSamples;       /* generate NumSamples bootstrap resamples */
run;

/* 3. Compute the statistics for each bootstrap sample. */
proc reg data=BootCases outest=PE noprint;    /* fit the model and save the parameter estimates for each sample */
   by SampleID;
   freq NumberHits;
   model Y = X;
run;quit;
 
proc score data=ScoreData score=PE type=parms predict /* score each model on the scoring data */
           out=BootOut(rename=(MODEL1=Pred));
   by SampleID;
   var X;
run;

/* 4. Compute the bootstrap 95% prediction CL for each row in the score data */
/* Find the 2.5th and 97.5th percentiles for the bootstrap statistics each X value */
proc univariate data=BootOut noprint;
   class ScoreID;
   var Pred;
   output out=BootCL mean=BootMean pctlpre=Pctl_ pctlpts=2.5, 97.5; 
run;

/* merge the REG output and the bootstrap CL */
data All;
   merge RegOut BootCL;
   by ScoreID;
run;

proc print data=All; 
  var ScoreId X BootMean Pctl_2_5 Pctl_97_5;
run;
 
/* visualize the bootstrap mean and CLM */
title "Bootstrap CLM";
title2 "&NumSamples Bootstrap Samples";
proc sgplot data=All;
   series x=x y=Pred / legendlabel="OLS Pred" lineattrs=(color=gray);
   scatter x=x y=BootMean / legendlabel="Boot Pred" markerattrs=(symbol=X);
   highlow x=x low=Pctl_2_5 high=Pctl_97_5 / legendlabel="Boot CLM"
               lowcap=serif highcap=serif lineattrs=GraphError;
   yaxis label="Y";
run;
Fluvio1
Calcite | Level 5

Many thanks to Rick and other who contributed to this post.  I was able to modify the code to fit my application.

Fluvio1
Calcite | Level 5

I do have a further question.  The program provided by Rick (for bootstrapping OLS regression confidence limits) works well. However, I should have pointed out that the values of X and Y that I am regressing are in natural log units.  Given the properties of ln data, will this make any difference in the bootstrap ?

Rick_SAS
SAS Super FREQ

It doesn't make any difference, but if you want the confidence intervals on the original data scale, exponentiate the estimates.

That is, if

(L(x), U(x)) is the confidence interval for the mean of log(Y) when X=x, then 

(exp(L(x)), exp(U(x))) is the confidence interval for the mean of Y.

sbxkoenk
SAS Super FREQ

Hello @Fluvio1 ,

 

Response to your query has been given 40 minutes ago ... by @Rick_SAS .

 

I am just adding this page in case someone wonders where the tpspline plot in your last post comes from.
It took me a while to figure it out. 😉

 

But here's the code for that plot :

 

The TPSPLINE Procedure
Example 125.5 Computing a Bootstrap Confidence Interval
SAS/STAT User's Guide (for SAS Viya 3.5)
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_tpspline_examples05.htm

 

BR, Koen

 

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 1451 views
  • 11 likes
  • 5 in conversation