BookmarkSubscribeRSS Feed
Top_Katz
Quartz | Level 8

Hi!  I am trying to understand the univariate Box Cox Transformation.  I saw from @Rick_SAS blog post ("The univariate Box-Cox transformation") that SAS PROC TRANSREG computes this through a regression with a zeroed independent variable, and saw this confirmed by @WarrenKuhfeld in another discussion on this site ("Univariate Box-Cox Transformation").  So, for each lambda you try on the dependent variable, you'd get an intercept and a set of residuals (Is that the case?).  How do you decide which lambda to choose?

 

I have seen the following description of univariate Box Cox on another commercial software package site:

"The Box-Cox transformation estimates lambda values that minimize the standard deviation of W, a standardized transformed variable."

(Not naming names, but think of Mickey Mouse's girlfriend and a once-popular diet cola.)

 

So, is finding the lambda that minimizes the standard deviation of the transformed variable equivalent to what PROC TRANSREG does to find lambda?  The minimum standard deviation would provide the shallowest sloping normal line in the Q-Q plot, would it not?  But would that create the closest fit of the residuals to the normal line?  Does PROC TRANSREG try to find the closest fit of the residuals to the Q-Q plot normal line?

 

Thanks for any insight you provide!

14 REPLIES 14
WarrenKuhfeld
Ammonite | Level 13
I wrote that code decades ago and have since retired. I can't say I recall details beyond those in the documentation. https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.4/statug/statug_transreg_details02.htm

 

Top_Katz
Quartz | Level 8

Hi @WarrenKuhfeld !  Thank you for the quick response.  It's helpful to be told where to look by someone who knows.  Much appreciated.

Rick_SAS
SAS Super FREQ

No, it has nothing to do with minimizing standard deviation or fitting a Q-Q plot. In the article, "The Box-Cox transformation for a dependent variable in a regression," I say that the Box-Cox transformation finds the lambda value that maximizes the likelihood between the transformed data and the normal log-likelihood function for the residuals. (In practice, we might prefer a "convenient" value of lambda rather than the optimal value.)

 

That is, for each lambda, you compute the residuals and use them to evaluate the normal log-likelihood function. For all the lambda values you try, pick the one for which the log-likelihood is maximal.

 

Please reread the article, "The univariate Box-Cox transformation" and pay particular attention to the graph labeled "Box-Cox Analysis for <varname>". In the two paragraphs after the graph, I explain that the univariate Box-Cox transformation optimizes the normal log-likelihood function. 

Top_Katz
Quartz | Level 8

Hi @Rick_SAS !  Thank you for the quick response.  And for pointing out where to look.  It seems pretty clear now, but I somehow didn't realize what it meant when I first saw it.  Much appreciated!

 

Okay, now, apologies if this is a dumb follow-up question, but doesn't finding the lambda that maximizes the normal log likelihood of an intercept only regression amount to the same thing as minimizing the sum of squares of the residuals around the average of the lambda-transformed data values?  If so, would that not be the lambda that produces the minimum standard deviation (of the lambda-transformed data)?

 

Thanks!

Rick_SAS
SAS Super FREQ

>  doesn't finding the lambda that maximizes the normal log likelihood of an intercept only regression amount to the same thing as minimizing the sum of squares of the residuals around the average of the lambda-transformed data values?  If so, would that not be the lambda that produces the minimum standard deviation (of the lambda-transformed data)?

 

No. The BC transformation does not preserve the scale of the data. The transformed data can have more or less variance than the original data, so there is no relationship between the variance of the transformed data and the original variance. The BC transform tries to make residuals normal, which is quite different from making the std dev small.

 

I think this is easiest to see by using an example. Suppose your original data are on the order of 10 with variance greater than 1. Then

  • For lambda=2, the BC-transformed data are on the order of 50. The transformed variance will be bigger than the original variance.
  • For lambda=-2, the BC-transformed data are on the order of 0.5. The transformed variance will be smaller than the original variance.

Thus, for the example, the variance increases when lambda > 1 and decreases when lambda < 1. For these data, there is no lambda that optimizes the StdDev of the transformed data, since the variance approaches 0 as lambda -> -infinity.

 

data Have;
input Y @@;
Ym2 = (Y**(-2) - 1) / (-2);  /* BC lambda=-2 */
Y2  = (Y**2 - 1) / 2;        /* BC lambda= 2 */
_zero = 0;
datalines;
10.2 9 8.6 11.6 9.2 12.5 9 8 11 10.9
;

proc means data=Have mean std;
   var Ym2 Y Y2;
run;

proc transreg data=Have details maxiter=0 nozeroconstant;
   model BoxCox(Y / convenient lambda=-2 to 2 by 0.01) = identity(_zero);
run;
Top_Katz
Quartz | Level 8

Hi @Rick_SAS !  Thank you so much for responding.  I ran your example, but I guess it would help me to see the equations, because when I tried to replicate some of the log-likelihood values by hand, I got very different results.  First of all, is this the normal log likelihood formula?

 

-(N * (0.5) * ln(2 * pi))  -(N * ln(sigma))  -((0.5 * sum((x[j] - mu)**2) / (sigma**2))

 

where x[j] are the data values, N is the number of values, and mu and sigma are the parameters to determine.

 

Then, if mu = (sum(x[j]) / N)  and  sigma_ml = sqrt(sum((x[j] - mu)**2) / N), when you substitute back into the log likelihood you get:

 

-(N * (0.5) * ln(2 * pi))  -(N * ln(sigma_ml))  -(0.5 * N)

 

In this case, for each lambda, the x[j] values are the lambda transformed values of the original set of x values.  I tried the logpdf function in SAS, and got numbers corresponding to the formula I gave above.  But that isn't what came out of PROC TRANSREG for your example.  For your original set of Y values, I get ll = -17.4666119.  But PROC TRANSREG says ll = -3.80403 for lambda = +1.  For your Ym2, I got ll = 51.408472367 and for your Y2, I got ll = -40.72112937.  As you remark, the BC formula without the geometric mean term puts the data on different scales for different lambdas.  But if I use the geometric mean term in the BC formula for your example data, for lambda = -2 I get ll = -17.38343536 and for lambda = +2 I get ll = -17.79049346, which are in the same ballpark.  For lambda = -2 PROC TRANSREG says ll = -3.72085 and for lambda = +2, PROC TRANSREG says ll = -4.12791.  My code is below.

 

When you look at the log likelihood formula with the maximum likelihood parameter value substitutions, it seems clear that the lambda that minimizes the value of sigma_ml will maximize the log likelihood.  So, what am I getting wrong?

 

 

data	handy	;
	set	have	end	=	last	;
	keep	n_obs	n_fact	y_gmn	lln	
					y_mean	y_std	y_stdn	y_ll	y_lln
					ym_mean	ym_std	ym_stdn	ym_ll	ym_lln
					yp_mean	yp_std	yp_stdn	yp_ll	yp_lln
					yl_mean	yl_std	yl_stdn	yl_ll	yl_lln
					yml_mean	yml_std	yml_stdn	yml_ll	yml_lln
					ypl_mean	ypl_std	ypl_stdn	ypl_ll	ypl_lln
			;
	array	ty{1:10}	_temporary_	;
	array	tym{1:10}	_temporary_	;
	array	typ{1:10}	_temporary_	;
	array	tyl{1:10}	_temporary_	;
	array	tyml{1:10}	_temporary_	;
	array	typl{1:10}	_temporary_	;
	ty{_N_}		=	Y	;
	tym{_N_}	=	Ym2	;
	typ{_N_}	=	Y2	;
	if	last	then	do	;
		n_obs	=	_N_	;
		n_fact	=	sqrt((n_obs-1)/n_obs)	;
		y_gmn	=	geomean(of	ty{*})	;
		lln		=	n_obs	*	log(2	*	constant("pi"))	/	2	;
		do	ix	=	1	to	n_obs	;
			tyl{ix}	=	ty{ix}	-	1	;
			tyml{ix}	=	((ty{ix})**(-2)	-	1)	/	(-2	*	(y_gmn)**(-3))	;
			typl{ix}	=	((ty{ix})**(2)	-	1)	/	(2	*	(y_gmn)**(+1))	;
		end	;
		y_mean	=	mean(of	ty{*})	;
		y_std	=	std(of	ty{*})	;
		y_stdn	=	y_std	*	n_fact	;
		ym_mean	=	mean(of	tym{*})	;
		ym_std	=	std(of	tym{*})	;
		ym_stdn	=	ym_std	*	n_fact	;
		yp_mean	=	mean(of	typ{*})	;
		yp_std	=	std(of	typ{*})	;
		yp_stdn	=	yp_std	*	n_fact	;
		yl_mean	=	mean(of	tyl{*})	;
		yl_std	=	std(of	tyl{*})	;
		yl_stdn	=	yl_std	*	n_fact	;
		yml_mean	=	mean(of	tyml{*})	;
		yml_std		=	std(of	tyml{*})	;
		yml_stdn	=	yml_std	*	n_fact	;
		ypl_mean	=	mean(of	typl{*})	;
		ypl_std		=	std(of	typl{*})	;
		ypl_stdn	=	ypl_std	*	n_fact	;
		y_ll	=	0	;
		y_lln	=	0	;
		ym_ll	=	0	;
		ym_lln	=	0	;
		yp_ll	=	0	;
		yp_lln	=	0	;
		yl_ll	=	0	;
		yl_lln	=	0	;
		yml_ll	=	0	;
		yml_lln	=	0	;
		ypl_ll	=	0	;
		ypl_lln	=	0	;
		do	ix	=	1	to	n_obs	;
			y_ll	+	logpdf('NORMAL',ty{ix},y_mean,y_std)	;
			y_lln	+	logpdf('NORMAL',ty{ix},y_mean,y_stdn)	;
			ym_ll	+	logpdf('NORMAL',tym{ix},ym_mean,ym_std)	;
			ym_lln	+	logpdf('NORMAL',tym{ix},ym_mean,ym_stdn)	;
			yp_ll	+	logpdf('NORMAL',typ{ix},yp_mean,yp_std)	;
			yp_lln	+	logpdf('NORMAL',typ{ix},yp_mean,yp_stdn)	;
			yl_ll	+	logpdf('NORMAL',tyl{ix},yl_mean,yl_std)	;
			yl_lln	+	logpdf('NORMAL',tyl{ix},yl_mean,yl_stdn)	;
			yml_ll	+	logpdf('NORMAL',tyml{ix},yml_mean,yml_std)	;
			yml_lln	+	logpdf('NORMAL',tyml{ix},yml_mean,yml_stdn)	;
			ypl_ll	+	logpdf('NORMAL',typl{ix},ypl_mean,ypl_std)	;
			ypl_lln	+	logpdf('NORMAL',typl{ix},ypl_mean,ypl_stdn)	;
		end	;

		yl_ll_h		=	-lln	-(0.5*(n_obs-1))	-	(n_obs*log(yl_std))	;
		yl_lln_h	=	-lln	-(0.5*n_obs)	-	(n_obs*log(yl_stdn))	;
		yml_ll_h	=	-lln	-(0.5*(n_obs-1))	-	(n_obs*log(yml_std))	;
		yml_lln_h	=	-lln	-(0.5*n_obs)	-	(n_obs*log(yml_stdn))	;
		ypl_ll_h	=	-lln	-(0.5*(n_obs-1))	-	(n_obs*log(ypl_std))	;
		ypl_lln_h	=	-lln	-(0.5*n_obs)	-	(n_obs*log(ypl_stdn))	;

		put	n_obs=	n_fact=	y_gmn=	lln=	;
		put	y_mean=	y_std=	y_stdn=	y_ll=	y_lln=	;
		put	ym_mean=	ym_std=	ym_stdn=	ym_ll=	ym_lln=	;
		put	yp_mean=	yp_std=	yp_stdn=	yp_ll=	yp_lln=	;
		put	yl_mean=	yl_std=	yl_stdn=	yl_ll=	yl_lln=	;
		put	yml_mean=	yml_std=	yml_stdn=	yml_ll=	yml_lln=	;
		put	ypl_mean=	ypl_std=	ypl_stdn=	ypl_ll=	ypl_lln=	;
		put	yl_ll_h=	yl_lln_h=	;
		put	yml_ll_h=	yml_lln_h=	;
		put	ypl_ll_h=	ypl_lln_h=	;
		output	;
	end	;
run	;

title2	"proc print data = &syslast."	;
proc	print	data	=	&syslast.	;
run	;
title2	;

 

Rick_SAS
SAS Super FREQ

You originally asked if the optimal lambda for Box-Cox transformation minimizes the StdDev of the transformed data. I believe the answer is no, and I provided an example. Your latest question seems to be trying to verify the LL computation in PROC TRANSREG. I regret that I don't have time right now to read through your DATA step code. Perhaps someone else will assist.

 

If you want to evaluate loglikelihood functions in SAS, I think there are better ways than writing a DATA step. I recommend the following techniques:

Be aware that statistical software sometimes omits the constant term when reporting the LL values. In the formula for the normal LL, the term 
-N/2 * ln(2 * pi)
is a constant. It can be dropped without affecting the location of the optimal (mu, sigma) values. I don't know if TRANSREG does this, but it is fairly common in numerical optimization. 

 

Good luck!

Top_Katz
Quartz | Level 8

Hi @Rick_SAS !  Thank you once again for taking the time to assist me.  Mucho appreciated, and I totally understand if you're done with this for now.

 

I would like to post some follow-up.

 

1.  You mentioned my "original question" and my "latest question."  Sorry if I was unclear, but my question didn't really change.  I'm still trying to figure out whether Box-Cox is equivalent to minimizing the standard deviation.  I'm now pretty sure that it IS equivalent, once we understand which standard deviation we are referring to:  the STD for the BC formula WITH THE GEOMETRIC MEAN.

 

2.  No, my log likelihood formula was not missing a factor of 1/2.  It's just that I wrote it a little differently than you usually see it.  I wrote:   -(N * ln(sigma) ) 

Usually you see:  -(N * 0.5 * ln(sigma**2)). 

They're the same.  And my formula produced the same numbers as the SAS LOGPDF function did.

 

3.  I don't think your example proves what you think it does concerning standard deviation.  You didn't include the geometric mean term in your BC formula.  Not everyone does.  It's just a constant.  So, you're correct about the standard deviation without the geometric mean term.  But if you don't somehow include the geometric mean in the search for lambda I don't think you'll get the right result.  If you include the geometric mean term, then the lambda that maximizes the log likelihood is also the lambda that minimizes the maximum likelihood standard deviation estimate (which differs from the usual unbiased STD estimator by using N instead of N-1).  When I run your data by hand, I still get different values of the log likelihood than PROC TRANSREG reports, but the maximum occurs at -0.68, just like PROC TRANSREG found.  I don't think that's a coincidence.  If I run without the geometric mean, the maximum likelihood over the range from -2 to +2 occurs at -2, and so does the minimum standard deviation.  I don't think that's the correct BC lambda, and if we tried a wider range we'd get an even lower value.

 

This has been a really helpful discussion for me.  I still would like to find out why my log likelihood numbers are different from the ones in PROC TRANSREG.  Perhaps the references you provided will help there.  And I see that @WarrenKuhfeld provided the reference for where he got the method, so I'll try to locate that one.  My code to run log likelihoods from lambda -2 to +2 is below, both with and without the geometric mean term.

 

data	have_llln_search_2	;
	set	have	end	=	last	;
	keep	lambda	lambda_mean	lambda_std_ml	lambda_ll_n	lambda_n_logsigml
					lambda_u_mean	lambda_u_std_ml	lambda_u_ll_n	lambda_u_n_logsigml	;
	array	ty{1:10}	_temporary_	;
	array	tyul{1:10}	_temporary_	;
	array	tyl{1:10}	_temporary_	;
	ty{_N_}		=	Y	;
	if	last	then	do	;
		n_obs	=	_N_	;
		n_fact	=	sqrt((n_obs-1)/n_obs)	;
		y_gmn	=	geomean(of	ty{*})	;
		lln		=	n_obs	*	log(2	*	constant("pi"))	/	2	;
		min_yl_stdn	=	1000	;
		max_yl_lln	=	-1000	;
		lambda_minimize	=	999	;
		lambda_maximize	=	999	;
		min_yul_stdn	=	1000	;
		max_yul_lln	=	-1000	;
		lambdau_minimize	=	999	;
		lambdau_maximize	=	999	;
		do	lambda	=	-2	to	+2	by	0.01	;
			if	lambda	ne	0	then	do	;
				do	ix	=	1	to	n_obs	;
					tyl{ix}		=	((ty{ix})**(lambda)	-	1)	/	(lambda	*	(y_gmn)**(lambda-1))	;
					tyul{ix}	=	((ty{ix})**(lambda)	-	1)	/	lambda	;
				end	;
			end	;
			else	do	;
				do	ix	=	1	to	n_obs	;
					tyl{ix}		=	(log(ty{ix})	/	y_gmn)	;
					tyul{ix}	=	log(ty{ix})	;
				end	;
			end	;
			yl_mean	=	mean(of	tyl{*})	;
			yl_std	=	std(of	tyl{*})	;
			yl_stdn	=	yl_std	*	n_fact	;
			yl_lln	=	0	;
			yul_mean	=	mean(of	tyul{*})	;
			yul_std		=	std(of	tyul{*})	;
			yul_stdn	=	yul_std	*	n_fact	;
			yul_lln		=	0	;
			do	ix	=	1	to	n_obs	;
				yl_lln	+	logpdf('NORMAL',tyl{ix},yl_mean,yl_stdn)	;
				yul_lln	+	logpdf('NORMAL',tyul{ix},yul_mean,yul_stdn)	;
			end	;
			lambda_mean		=	yl_mean	;
			lambda_std_ml	=	yl_stdn	;
			lambda_ll_n		=	yl_lln	;
			lambda_n_logsigml	=	-n_obs	*	log(yl_stdn)	;
			lambda_u_mean		=	yul_mean	;
			lambda_u_std_ml		=	yul_stdn	;
			lambda_u_ll_n		=	yul_lln	;
			lambda_u_n_logsigml	=	-n_obs	*	log(yul_stdn)	;
			output	;
			if	(yl_stdn	<	min_yl_stdn)	then	do	;
				min_yl_stdn		=	yl_stdn	;
				lambda_minimize	=	lambda	;
			end	;
			if	(yl_lln		>	max_yl_lln)	then	do	;
				max_yl_lln		=	yl_lln	;
				lambda_maximize	=	lambda	;
			end	;
			if	(yul_stdn	<	min_yul_stdn)	then	do	;
				min_yul_stdn		=	yul_stdn	;
				lambdau_minimize	=	lambda	;
			end	;
			if	(yul_lln		>	max_yul_lln)	then	do	;
				max_yul_lln		=	yul_lln	;
				lambdau_maximize	=	lambda	;
			end	;
		end	;
		put	min_yl_stdn=	lambda_minimize=	;
		put	max_yl_lln=		lambda_maximize=	;
		put	min_yul_stdn=	lambdau_minimize=	;
		put	max_yul_lln=	lambdau_maximize=	;
	end	;
run	;

title2	"proc print data = &syslast."	;
proc	print	data	=	&syslast.	;
run	;
title2	;

 

WarrenKuhfeld
Ammonite | Level 13

I don't have access to Draper and Smith any more, and I am not going to get deeply involved in it, but I did want to point out the statement in the documentation to which I referred you in case it helps:

 

To divide by the geometric mean, specify the GEOMETRICMEAN t-option.

Top_Katz
Quartz | Level 8

Hi @WarrenKuhfeld !  Thank you for the great tip.  I ran @Rick_SAS 's PROC TRANSREG example with and without the geometricmean option, and printed out the output.  The lambda is the same and all the log likelihood values are the same, so internally it seems to be doing the same thing to arrive at the optimal lambda.  The only thing different is the set of final transformed Y values, which differ by the geometric mean constant (raised to the appropriate power).

WarrenKuhfeld
Ammonite | Level 13

The reference stated in the documentation, Draper and Smith 1981, pp. 225–226, is where I got the method from. Consult that source if you want to reproduce TRANREG's results.

Top_Katz
Quartz | Level 8

Hi @WarrenKuhfeld !  Thank you for the reference.  Although my log likelihood numbers are different from the ones in PROC TRANSREG, I'm at least reassured that on the example I ran, I got the same answer for the BC lambda as the one from PROC TRANSREG.

Rick_SAS
SAS Super FREQ

I found the Minitab reference that the OP quoted:
Methods and formulas for Box-Cox Transformation - Minitab

If you want to investigate their claims, you should use their formulas. The formulas in my example are NOT the ones that Minitab uses. 

Top_Katz
Quartz | Level 8

Hi @Rick_SAS !  Thank you for following up.  Yes, it looks like the reference that mentioned standard deviation uses the geometric mean in their BC formula, and your example didn't.  Interestingly enough, in your blog post from last year, you added the geometricmean option on your PROC TRANSREG statement.  I think that only affects the final transformed values of Y, but not the way PROC TRANSREG computes the optimal lambda.  I'll have to see if I can find the Draper and Smith paper the PROC TRANSREG code is based on.  They don't seem to be using the log likelihood for either the geometric mean version or the non-geometric mean version of the BC formula.  When I used the geometric mean BC formula log likelihood on your example data, I got the same lambda result as PROC TRANSREG (-0.68), and the maximum log likelihood occurred for the same lambda (again, -0.68) as the minimum standard deviation (computed using N rather than N-1).

 

By the way, in my data step code I got the lambda = 0 term wrong, dividing by the geometric mean when I should have multiplied.  I got a discontinuity in the output.  But it wasn't because of that error -- my code never got to the lambda = 0 subsection, because the loop of -2 to +2 by 0.01 is never exactly zero in SAS, so it must have used the power law for a very small absolute value of lambda.  Live and learn :-).

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 1877 views
  • 7 likes
  • 3 in conversation