palolix
Quartz | Level 8

Dear SAS Community,

 

How do you back-transform a dependent variable in SAS after applying a Box-Cox transformation?

 

Here are the steps I followed to apply a Box-Cox transformation to my continuous dependent variable.


proc transreg data=one;
Where Variety='BL516' and Season=2021;
model boxcox(AvgFirm) = identity(Wks);
run;


data new_one;
set one;
new_AvgFirm = (AvgFirm**(1.5) - 1) / 1.5;
run;

 

proc reg data=new_one;
model new_AvgFirm = Wks;
run;

 

I would greatly appreciate your help!

 

Thanks

Caroline

 

Rick_SAS
SAS Super FREQ

You are using a transformation of the form

Z = (Y^p - 1) / p.

You can invert this transformation by solving for Y. You get

Y = (p*Z + 1)^(1/p).

So, if you want to know how the regression model for Z relates to Y, use PROC REG to output the predicted values for Z, then use a DATA step to apply the inverse transformation. For example, here is some code that uses the Sashelp.class data:


data new_one;
set sashelp.class;
new_weight = (weight**(1.5) - 1) / 1.5;
run;

proc reg data=new_one;
model new_weight = Height;
output out=RegOut P=new_pred;
run;

data pred;
set RegOut;
pred = (1.5*new_pred + 1)**(1/1.5);
run;

/* visualize the back-transformed model */
proc sort data=pred; by height; run;
 
proc sgplot data=pred;
scatter x=Height y=weight;
series x=Height y=pred;
run;
palolix
Quartz | Level 8

That was a big help, thank you so much Rick_SAS!

 

In the second data step, P=new_pred would be P=new_weight?

 

Thanks!

 

 

Rick_SAS
SAS Super FREQ

Yes. I was back-transforming the predicted values, but the same formula applies to the observed responses.
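
A minimal sketch of that point, assuming the same Sashelp.class data and the exponent p = 1.5 from the example above (the names roundtrip, z, weight_back, and diff are made up for illustration): transform the observed response, invert it, and check that the original values come back.

```sas
/* Round-trip check: transform the observed response, then invert it */
data roundtrip;
set sashelp.class;
p = 1.5;                            /* assumed Box-Cox exponent */
z = (weight**p - 1) / p;            /* forward transform  Z = (Y^p - 1)/p */
weight_back = (p*z + 1)**(1/p);     /* inverse transform  Y = (p*Z + 1)^(1/p) */
diff = abs(weight - weight_back);   /* should be zero up to rounding error */
run;

/* The largest discrepancy should be negligible */
proc means data=roundtrip max;
var diff;
run;
```

The inverse is only defined when p*Z + 1 is positive, which always holds for transformed observed values of a positive response but is not guaranteed for predicted values.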

palolix
Quartz | Level 8

Thank you. So if I want to get parameter estimates for the back-transformed outcome variable, I can run a new regression model using the back-transformed predicted values.

 

In this case:

proc reg data=pred;
model pred = height;
run;

 

Right?

 

 

Rick_SAS
SAS Super FREQ

Perhaps I am not understanding your question. "Parameter estimates for the back-transformed outcome variable" are not defined. The Box-Cox transformation is nonlinear, so you can't invert the transformation without destroying the linearity of the model.

 

Suppose the original response is Y, and Z is the result of the Box-Cox transformation. Then you fit the linear model

Z = b0 + b1*X.

But if you try to invert the transformation, you no longer get a linear model.  You get 

Y^p = (p*b0 +1) + (p*b1)*X

If you take the p-th root, you do not get a linear model for Y.
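
To make the nonlinearity concrete, here is a rough sketch that evaluates Y = (p*(b0 + b1*X) + 1)^(1/p) on a grid of X values. It assumes the Sashelp.class example and p = 1.5 from earlier in the thread; the data set names est and curve and the variables b0, b1, x, and y_hat are hypothetical.

```sas
/* Sketch only: p = 1.5 and the Sashelp.class variables follow the earlier example */
data new_one;
set sashelp.class;
new_weight = (weight**(1.5) - 1) / 1.5;
run;

/* OUTEST= stores the fitted coefficients: Intercept (b0) and Height (b1) */
proc reg data=new_one outest=est noprint;
model new_weight = Height;
run;
quit;

/* Evaluate the back-transformed curve on a grid of X values */
data curve;
set est(keep=Intercept Height rename=(Intercept=b0 Height=b1));
p = 1.5;
do x = 52 to 72 by 2;
   base = p*(b0 + b1*x) + 1;                 /* this is the fitted value of Y^p */
   if base > 0 then y_hat = base**(1/p);     /* the root needs a positive base */
   else y_hat = .;
   output;
end;
drop base;
run;

proc print data=curve noobs;
var x y_hat;
run;
```

The increments in y_hat are not constant across equal steps in x, which is the sense in which the model for Y is no longer linear.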

palolix
Quartz | Level 8

Thanks again for your reply, Rick_SAS, and sorry for the confusion. Right: to make predictions I will need to fit a linear model to the new response variable, Z = b0 + b1*X, and I suppose back-transforming the predictions will give me the geometric means.

 

 

Rick_SAS
SAS Super FREQ

I have written about the Box-Cox transformation in the article "The Box-Cox transformation for a dependent variable in a regression" on The DO Loop blog (sas.com).

As I state there, "the Box-Cox [model] can be hard to interpret" because you want to predict a variable Y, but the B-C transformation provides a linear model for Z.  You can use back-transformation on the predicted values, but not on the model except for special cases such as lambda=0, which corresponds to a log transformation.
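
As a hedged illustration of the lambda=0 special case: if log(Y) = b0 + b1*X, then exponentiating both sides gives Y = exp(b0)*exp(b1*X), so the model can be back-transformed into a multiplicative model for Y. The sketch below assumes the Sashelp.class data again; the names log_one, LogOut, log_weight, log_pred, and pred are made up for illustration.

```sas
/* Log transform: the lambda=0 case of Box-Cox */
data log_one;
set sashelp.class;
log_weight = log(weight);
run;

proc reg data=log_one;
model log_weight = Height;
output out=LogOut P=log_pred;     /* predictions on the log scale */
run;
quit;

/* Back-transform the predictions to the original scale */
data log_back;
set LogOut;
pred = exp(log_pred);
run;

proc sort data=log_back; by height; run;

proc sgplot data=log_back;
scatter x=Height y=weight;
series x=Height y=pred;
run;
```

Note that exp() of a mean on the log scale estimates the geometric mean (or the median, under a lognormal assumption) of Y rather than its arithmetic mean. That interpretation is specific to the log case and does not carry over to other values of lambda.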

palolix
Quartz | Level 8

Thank you very much for sharing that blog, very valuable information.

I tried the box-cox transformation suggested there but I am getting this warning: 

WARNING: Ordinary missing values were found or an UNTIE transformation or the UNTIE= option was
specified. The utility of the hypothesis tests are dubious since one parameter must be
estimated for each of these values. If you really want to do this, ensure that no
observations are duplicated -- combine duplicate observations and use a FREQ statement.
If you do not, the parameter count may be too large and the tests overly conservative.
However, it is best to avoid this situation altogether.

 

This is the code I am using:

proc transreg data=one ss2 details plots=(boxcox);
Where Variety='BL516';
model boxcox(AvgFirm/convenient lambda=-2 to 2 by 0.1) = identity(Weeks);
output out=TransOut residual;
run;

 

It may be due to unbalanced data (dep var was not measured at all months every season), not enough observations for the dependent variable (only 5 obs at each level of Weeks), and levels of the indep var Weeks are not spaced equally (week 0, 1, 3, and 6).

 

Thank you very much

Rick_SAS
SAS Super FREQ

The warning indicates that the problem might be related to missing responses among observations that share the same value of the explanatory variable. Study the following example. I don't have your data, but if they look like the following, delete the observations that have missing responses. These obs don't affect the model fit in any case.

 

data test;
input AvgFirm Weeks;
datalines;
1 1
2 3
3 4
. 4
;

proc transreg data=test ss2 details plots=(boxcox);
model boxcox(AvgFirm/convenient lambda=-2 to 2 by 0.1) = identity(Weeks);
output out=TransOut residual;
run;
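
One way to drop the observations with missing responses without creating a new data set is a WHERE statement. This is a sketch that assumes the TEST data set defined above; with only three complete observations it merely shows the syntax, and on the real data the same condition could be combined with the existing Variety filter.

```sas
/* Exclude observations whose response is missing before fitting */
proc transreg data=test ss2 details plots=(boxcox);
where not missing(AvgFirm);
model boxcox(AvgFirm/convenient lambda=-2 to 2 by 0.1) = identity(Weeks);
output out=TransOut residual;
run;
```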
palolix
Quartz | Level 8

It worked! Thank you so much Rick_SAS!! I only had 3 missing values in total so I didn't think it was going to be so problematic. 

Now I am puzzled after realizing that the distribution of the residuals doesn't improve after the transformation.

Before:

[Image: palolix_8-1730227117237.png]

After fitting the linear regression model using the new response variable:

[Image: palolix_9-1730227173052.png]


Rick_SAS
SAS Super FREQ

Those diagnostic panels look like PROC REG. Please post your complete SAS program, including calls to PROC REG, PROC TRANSREG, and DATA steps.  Remember to use the "Running Man icon" (Insert SAS Code) so that the SAS code is nicely formatted.

palolix
Quartz | Level 8

Ok, sure. Here are the data steps:

 

/*Original Linear reg model for BL516  */
proc reg data=one;
Where Variety='BL516';
model AvgFirm=Weeks/dw clb;
run;

 

*If residuals not normally distributed perform box-cox transformation;
 
proc transreg data=one ss2 details plots=(boxcox);
Where Variety='BL516';
    model boxcox(AvgFirm/convenient lambda=-2 to 2 by 0.1) = identity(Weeks);
output out=TransOut residual;
run;
 

*create new dataset that uses box-cox transformation to create new y;

data new_one;
set one;
Where Variety='BL516';
new_AvgFirm = (AvgFirm**(1.4) - 1) / 1.4; /*The output from previous step tells me that the selected value to use for lambda is 1.4*/
run;


*fit simple linear regression model using new response variable;
proc reg data=new_one;
Where Variety='BL516';
model new_AvgFirm = Weeks;
run;

 

 
Rick_SAS
SAS Super FREQ

Just a few last remarks:

  1. The R-square and adjusted R-square values are better for the new_AvgFirm variable, so the B-C transform did improve the fit, but only by a little.
  2. The original residuals indicate heteroscedasticity (compare the range of the residuals for each week) and an extreme outlier. You might want to double-check the response value for the obs that has a high Cook's D value.
  3. You are treating WEEKS as continuous. The Residual-by-predicted-value plot indicates that there is a trend that you are not capturing in the model. You might want to add a WEEKS**2 term or treat WEEKS as categorical (see below).
  4. The diagnostic plots indicate that WEEKS might be better modeled as a CLASS variable. That will give you more parameters but should result in a better model. You probably don't need to use B-C at all if you change from PROC REG to PROC GLM and treat WEEKS as categorical.

In summary, I don't think your problem is the B-C transformation. I think you have other modeling issues. Good luck with your project.  
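
A minimal sketch of the two alternatives mentioned in remarks 3 and 4, assuming the data set ONE and the variables AvgFirm, Weeks, and Variety from the program posted above. It has not been run against the actual data, so treat it as a starting point rather than a finished analysis.

```sas
/* Option A: treat Weeks as categorical (one mean per level of Weeks) */
proc glm data=one plots=diagnostics;
where Variety='BL516';
class Weeks;
model AvgFirm = Weeks / solution;
run;
quit;

/* Option B: keep Weeks continuous and add a quadratic term */
proc glm data=one plots=diagnostics;
where Variety='BL516';
model AvgFirm = Weeks Weeks*Weeks / solution;
run;
quit;
```

Comparing the residual plots from the two fits with those from the original PROC REG model should show whether the unexplained trend has been captured.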

palolix
Quartz | Level 8

Thank you so much for your great support on this! That makes a lot of sense so I will remove that outlier and switch to proc glm to treat weeks as a categorical variable. 

 

Thanks a lot!
