Hank
Fluorite | Level 6

Hi,


In a multiple regression model with cost data as the dependent variable (Y), I have used proc transreg (model BoxCox) in SAS to find the appropriate Box-Cox transformation of Y (so that the residuals are normally distributed).

model BoxCox(Y) = identity(x1 x2 x3 x4 x5 x6);

The result was lambda = -0.25. So I transformed my dependent variable with the formula:

(((Y**(-0.25))-1) / (-0.25))

and ran proc reg with the Box-Cox-transformed dependent variable and my independent variables. I have read that the back-transformation (inverse) of the Box-Cox is:

x = (lambda*z + 1)^(1/lambda),

where z is the transformed variable and lambda = -0.25 in my case.
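As a quick sanity check of these two formulas, here is a minimal Python sketch (the cost value 1500 is made up, not from my data):

```python
# Forward Box-Cox transform and its inverse for lambda = -0.25.
lam = -0.25

def boxcox(y, lam):
    # z = (y**lambda - 1) / lambda
    return (y ** lam - 1) / lam

def inv_boxcox(z, lam):
    # y = (lambda*z + 1)**(1/lambda)
    return (lam * z + 1) ** (1 / lam)

y = 1500.0                                   # hypothetical cost value
z = boxcox(y, lam)
assert abs(inv_boxcox(z, lam) - y) < 1e-6    # round trip recovers y
```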

How do I interpret the coefficients and standard errors from the proc reg?

Do I back-transform all the beta-coefficients?

For example, one of my significant variables has beta = -0.01068 with standard error = 0.00326.

How do I interpret that? Any feedback/comments much appreciated :)

Best regards,

Hank

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

For each unit change in the x variable, the transformed Y variable changes by -0.01068, i.e. decreases by 0.01068.  Since this is a non-linear transform, you should plug in low, median and high values of the X variable and back-transform the predictions to get some idea of how the Y variable decreases in response to changes in the X variable.

Steve Denham


17 REPLIES
SteveDenham
Jade | Level 19

For each unit change in the x variable, the transformed Y variable changes by -0.01068, i.e. decreases by 0.01068.  Since this is a non-linear transform, you should plug in low, median and high values of the X variable and back-transform the predictions to get some idea of how the Y variable decreases in response to changes in the X variable.
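That suggestion can be sketched numerically like this (a toy in Python; only beta = -0.01068 and lambda = -0.25 come from the thread, while the baseline and the X values are made up):

```python
# Evaluate the fitted model on the Box-Cox scale at low/median/high predictor
# values, then back-transform to the original cost scale.
lam = -0.25
beta = -0.01068
baseline = 3.2   # hypothetical: intercept plus other terms held at their means

def inv_boxcox(z):
    return (lam * z + 1) ** (1 / lam)

for x in (10.0, 50.0, 90.0):       # low / median / high (made-up range)
    z = baseline + beta * x        # prediction on the Box-Cox scale
    y = inv_boxcox(z)              # prediction on the original cost scale
    print(f"x = {x:5.1f}   transformed = {z:.4f}   cost = {y:.2f}")
```

Because the back-transform is non-linear, the same unit change in x produces a much larger cost change at low x than at high x, which is exactly why a single back-transformed coefficient is not meaningful.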

Steve Denham

Hank
Fluorite | Level 6

Thanks for the help, much appreciated :) I have a similar question that maybe you can help clarify.

If the transformation is applied only to a subset of the independent variables, say half of the x-variables are transformed with the square root, how do I interpret that in relation to the response variable? Should I first back-transform the beta-coefficients of those x-variables and use the back-transformed values in relation to the response variable?


On to the final question. I found another post by you (https://communities.sas.com/message/125380#125380) where you wrote:

"If you are working on developing a predictive equation with only a single predictor, take a good look at PROC TRANSREG. This would enable you to model the dependent variable as a logit, and the independent variable in a variety of ways--class, optimal transforms, non-optimal transforms, nonlinear transforms (such as Box-Cox or penalized B-splines)."

What if I apply different transformations to both the y- and the x-variables? As a rule of thumb (if there is one), how should I think when translating the results (beta-coefficients and standard errors) back into original terms?

For example, if the response (y) is transformed via Box-Cox and the rest of the variables via the logit transformation, how would I go about translating the results?

Best regards,

Hank

SteveDenham
Jade | Level 19

I really try not to think of the relationship on the original scale for both independent and dependent variables.  The transformed data are the ones that show a relationship, and only if it is a linear transformation is the original scale meaningful for the coefficients.  If someone has a more non-linear worldview, maybe they can visualize what the coefficient might mean after back-transforming both sides of the equation.  To me, the only way to see this would be to plug in multiple values for the independent variable and see what happens.

Steve Denham

Rick_SAS
SAS Super FREQ

I agree with Steve. On the other hand, you are always free to use the chain rule if you want to slog through the computations.

If you've transformed Y -> F(Y) and X -> g(X) and found that F(Y) = alpha + beta*g(X), then take derivatives with respect to X on both sides:

dF/dy * dy/dx = beta * dg/dx

which means that

dy/dx = beta * (dg/dx) / (dF/dy)
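A numeric check of this formula, using the thread's Box-Cox (lambda = -0.25) as F and a square root as an example g; alpha and beta are made-up values:

```python
import math

# Verify dy/dx = beta * (dg/dx) / (dF/dy) against a finite difference,
# with F(y) = (y**lam - 1)/lam and g(x) = sqrt(x).
lam, alpha, beta = -0.25, 2.0, 0.05

def F_inv(z):
    return (lam * z + 1) ** (1 / lam)

def y_of_x(x):
    # Solve F(y) = alpha + beta*g(x) for y
    return F_inv(alpha + beta * math.sqrt(x))

x0 = 25.0
y0 = y_of_x(x0)

# Analytic slope: dF/dy = y**(lam - 1), dg/dx = 0.5 / sqrt(x)
slope = beta * (0.5 / math.sqrt(x0)) / (y0 ** (lam - 1))

# Central finite difference for comparison
h = 1e-6
fd = (y_of_x(x0 + h) - y_of_x(x0 - h)) / (2 * h)
assert abs(slope - fd) < 1e-4
```

Note that the slope depends on both x0 and y0, so it must be re-evaluated at each point of interest rather than reported as a single number.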

Hank
Fluorite | Level 6

@Rick and Steve: Thanks a lot for the feedback.

As a general thought, would you consider using proc nlin to estimate a regression model, instead of trying to fit the data with proc reg via transformations? :)

Best regards,

/Hank

Rick_SAS
SAS Super FREQ

I wouldn't. Least squares regression has many nice properties, including being able to estimate many coefficients without worrying about convergence of some optimization algorithm. If it makes sense to do a linear analysis, I grab that opportunity.

SteveDenham
Jade | Level 19

I love non-linear regression, but I prefer to have a known process that might be generating the nonlinear response.  Without knowing the process, and reflecting on what you have given here, I assume that you are in an exploratory mode.  Using TRANSREG to identify significant relationships that can be linearized is bound to be more productive.  It also opens up semi-parametric methods (splines) that are generally not used enough, in my opinion.

If at all possible, get a copy of Frank Harrell's Regression Modeling Strategies for some good approaches.

Steve Denham

Hank
Fluorite | Level 6

Thanks once again for the input, and the book recommendation. :)

I read more about the proc transreg procedure, and an example here:

SAS/STAT(R) 9.2 User's Guide, Second Edition

In my data, I have cost data as the response (which in the literature is usually log-transformed) and a variety of continuous and discrete explanatory variables. Some are dummies and others take values from 0-100, often with a concentration of values near 70-100 or at 0.

As I wrote in the initial question, I did a Box-Cox transformation of the response:

model BoxCox(Y) = identity(x1 x2 x3 x4 x5 x6);

This generated a model in which the residuals are normally distributed, but R2 is not as high as I think it could be, and my fear is that I could miss some relationships that are nonlinear in the explanatory variables.

Thanks to your and Rick's excellent help, I'm now thinking of doing something like the example in the link above (shown below):

-Namely, using mspline on the response and spline on my predictors that range from 0-100.

But then the nice interpretation breaks down, and explaining the coefficients one way or another (to myself as well) gets harder. Showing the change in low/mid/high values of the response from a change in (the original value of) x is still quite a good way of explaining the relationship to non-professionals. But with different transformations on the independent variables (as shown below), I now find it really difficult to even say which independent variable explains the most, and how large its effect on the response is relative to the other independent variables.

Do you have any idea how to make sense of the different relationships when explaining them to non-statisticians?

Best regards,

Hank

******************** example from the SAS documentation ********************

* Fit the Nonparametric Model;

proc transreg data=Gas solve test nomiss plots=all;
   ods exclude where=(_path_ ? 'MV');
   model mspline(NOx / nknots=9) = spline(EqRatio / nknots=9)
         monotone(CpRatio) opscore(Fuel);
run;

Variable            DF  Coefficient  Type II SS  Mean Square  F Value  Pr > F  Label
Intercept            1   -15.274649     57.1338      57.1338  1227.60  <.0001  Intercept
Pspline.EqRatio_1    1    35.102914     62.7478      62.7478  1348.22  <.0001  Equivalence Ratio (PHI) 1
Pspline.EqRatio_2    1   -19.386468     64.6430      64.6430  1388.94  <.0001  Equivalence Ratio (PHI) 2
Identity(CpRatio)    1     0.032058      1.4445       1.4445    31.04  <.0001  Compression Ratio (CR)
Opscore(Fuel)        5     0.158388      5.5619       1.1124    23.90  <.0001  Fuel
SteveDenham
Jade | Level 19

Hmm.  Time to back away from the splines, at least for now.

Let's go back to the original Box-Cox transformation, with lambda = -0.25.  What does that imply as a transformation?  First, it is negative, so there is a reciprocal involved, and second, the absolute value is 0.25, which is taking a square root twice.  Thus, I would expect that the original distribution of Y is such that there are a LOT of values near zero, with a sharp drop-off as you move to the right, and that the distribution probably "stops" at some value.  Is that anything close to correct?
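That reading of lambda = -0.25 can be verified in one line (the value 81 is arbitrary):

```python
import math

# |lambda| = 0.25 is a double square root, and the negative sign a reciprocal:
# y**(-0.25) == 1 / sqrt(sqrt(y))
y = 81.0
assert abs(y ** (-0.25) - 1 / math.sqrt(math.sqrt(y))) < 1e-12
```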

Now, you say that you believe the Rsquared for your model is "not as high as you think it could be."  There could be a couple of reasons for that.  First, there may be more noise in your data than you thought.  Second, you may be missing a key variable, or a key interaction between the variables you do have.  This is where subject knowledge MUST be used in specifying your model.

Steve Denham

Hank
Fluorite | Level 6

Hi again,

Thanks for your willingness to help, it means a lot. Smiley Happy

Above, I have posted the histogram of my original response data and the result of the Box-Cox transformation. As you can see, the response looks near-normal but is positively skewed. The real mess in the data lies in the independent variables, as indicated in the Box-Cox picture above by the many almost flat curves/lines.

First, I tried log-transforming the response, but it failed all the normality tests on the regression residuals. With this Box-Cox transformation, the residuals are normally distributed.

Due to the many almost flat curves in the Box-Cox picture, I was thinking that maybe spline regression on the predictors would do the trick. What are your thoughts on the pictures above? Best regards, Hank

SteveDenham
Jade | Level 19

I begin to see why the Rsquared isn't what you had hoped.  Those flat lines indicate that there doesn't seem to be much of a relationship between these variables and the (transformed) response.  At this point, splines might be an approach, since it looks like a fishing expedition.  There may be some linear combination of the predictors that has a relationship with the response.  However, splines are generally fit within a predictor that is, well, "clumpy" (I'm sure that is a real statistical term in some universe).  All I see are flat lines, like western Kansas flat lines :).  Any "hill" at all will be the driver of a fit.

I have an idea, but am not really sure of a theoretical basis for it.  Suppose you Box-Cox transform your response variable, and then try PROC PLS to see if there are a limited number of "hidden components" based on the predictors.  Plus you get a cross-validation fit.
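For readers outside SAS, a rough pure-Python toy of that idea: Box-Cox the response, then extract a single PLS component NIPALS-style. The data here are simulated stand-ins; PROC PLS would also provide the cross-validation this sketch omits.

```python
import math, random

lam = -0.25   # lambda from the thread

def boxcox(y):
    return (y ** lam - 1) / lam

random.seed(1)
n, p = 40, 3
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
# Hypothetical positive cost response driven by a hidden mix of predictors
y = [math.exp(3 + 0.4 * x[0] - 0.2 * x[2] + random.gauss(0, 0.1)) for x in X]
z = [boxcox(v) for v in y]

# Center each column of X and the transformed response
def center(col):
    m = sum(col) / len(col)
    return [v - m for v in col]

cols = [center(list(c)) for c in zip(*X)]
Xc = [list(row) for row in zip(*cols)]
zc = center(z)

# First PLS weight vector: w proportional to X'z, normalized to unit length
w = [sum(Xc[i][j] * zc[i] for i in range(n)) for j in range(p)]
norm = math.sqrt(sum(v * v for v in w))
w = [v / norm for v in w]

# Score t = X w, then simple regression of z on t
t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
b = sum(ti * zi for ti, zi in zip(t, zc)) / sum(ti * ti for ti in t)
print("weights:", [round(v, 3) for v in w], "slope:", round(b, 4))
```

The weight vector is the "hidden component": a single linear combination of the predictors that carries most of their covariance with the transformed response.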

Steve Denham

Hank
Fluorite | Level 6

Hi,

I seem to have missed your latest post. I will look up proc pls first thing tomorrow (it's Friday afternoon in this part of the world, and there is a world outside of econometric modelling, sometimes...). It seems like an interesting way to estimate a predictive partial least squares model. Your illustrative example of the relationship between those flat lines and splines helped me a lot in understanding how both models work (as well as explaining the high fit of my spline model described above). Thanks :)

Best regards,

Hank


