How Correlation Relates to Linear Regression and Factor Analysis


 

The purpose of this blog is to illustrate the relationship between a Pearson correlation coefficient and a slope parameter from a simple regression model.

 

How Does Correlation Relate to Linear Regression (and Factor Analysis)?

 

For well over a decade, I’ve been teaching a course in multivariate methods, which begins with a discussion of both principal components analysis (PCA) and Exploratory Factor Analysis (EFA).  Most of the course participants have been comfortable with PCA, but when it comes to factor analysis, they often feel challenged.  I show them all the related matrix algebra, but I know that’s not really the problem.  Any basic statistics text can provide that.  I want them to understand conceptually what is actually happening when we say that we “infer factors from an observed covariance matrix”.

 

The matrix algebra of exploratory factor analysis looks exactly like that of linear regression and you can think of exploratory factor analysis as a series of simultaneous regression models. However, it is not so obvious why correlation matrices in factor analysis have anything to do with the implied regression models. So, I’m going to present here the first step in a two-step approach to understanding exploratory factor analysis - an explanation of the relationship between regression coefficients and Pearson correlation coefficients.  The beauty of presenting it this way is that even if you don’t care about factor analysis at all, but only want to understand linear regression a bit better, there’s something here for you.  I will mostly avoid mathematical formulas because, as I said, you can find those anywhere.  Instead, I’ll describe concepts and use various statistical procedures in SAS® to explain my points.

 

Let me start with a fairly simple set of regression models.  I’ll be using the baseball data set in the sashelp library.  I’ll regress the variable Salary on the explanatory variables nRBI (runs batted in) and nHome (number of home runs), using data from American Major League Baseball in the 1986 season.  It’s not important that you know what those measures are, but it might make it a bit more interesting if you do.

 

Most of my course participants are aware that linear regression is related to Pearson correlations, but they might have forgotten how.  I’ll start with a correlation matrix of all three variables to be used in my regression models. Documentation for PROC CORR can be found here.

 

proc corr data=sashelp.baseball nosimple;
   var Salary nRBI nHome;
run;

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
(cell entries: correlation, p-value, N)

                                       Salary             nRBI               nHome
Salary   1987 Salary in $ Thousands    1.00000            0.51723  <.0001    0.39885  <.0001
                                       263                263                263
nRBI     RBIs in 1986                  0.51723  <.0001    1.00000            0.85394  <.0001
                                       263                322                322
nHome    Home Runs in 1986             0.39885  <.0001    0.85394  <.0001    1.00000
                                       263                322                322

 

I’ll just mention that the variables are all correlated with one another, to one degree or another.

 

Now I’ll regress Salary on each of the explanatory variables in separate models and just show tables relevant to this discussion.  PROC REG documentation can be found here.
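As an aside, if you run these steps yourself and want to see only the tables discussed here rather than the full PROC REG output, one option (sketched here using standard PROC REG ODS table names) is to submit an ODS SELECT statement before each PROC REG step.

/* Show only the fit statistics and parameter estimates from the next     */
/* procedure step. FitStatistics and ParameterEstimates are standard      */
/* PROC REG ODS table names; by default the selection applies only to     */
/* the next procedure step.                                                */
ods select FitStatistics ParameterEstimates;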

 

proc reg data=sashelp.baseball;
   model Salary=nRBI;
run;

Root MSE 386.82654 R-Square 0.2675
Dependent Mean 535.92588 Adj R-Sq 0.2647
Coeff Var 72.17911    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 61.43175 54.13626 1.13 0.2575
nRBI RBIs in 1986 1 9.07446 0.92941 9.76 <.0001

 

proc reg data=sashelp.baseball;
   model Salary=nHome;
run;

Root MSE 414.47484 R-Square 0.1591
Dependent Mean 535.92588 Adj R-Sq 0.1559
Coeff Var 77.33809    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 294.86052 42.78033 6.89 <.0001
nHome Home Runs in 1986 1 20.38591 2.90120 7.03 <.0001

 

At this point, it’s not entirely obvious how the regression results are related to the correlation results, but we’ll get there.  I promise.  Let me just point out that each of the explanatory variables, nRBI and nHome, has a p-value <.0001, as seen in the parameter estimates table for its respective model.  These values would be considered “statistically significant” in most organizations.  The next step is to combine the models by including both measures in a single model.

 

proc reg data=sashelp.baseball;
   model Salary=nRBI nHome;
run;

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 14600270 7300135 49.02 <.0001
Error 260 38718842 148919    
Corrected Total 262 53319113      

 

Root MSE 385.89976 R-Square 0.2738
Dependent Mean 535.92588 Adj R-Sq 0.2682
Coeff Var 72.00618    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 34.66715 56.87140 0.61 0.5427
nRBI RBIs in 1986 1 11.33615 1.76860 6.41 <.0001
nHome Home Runs in 1986 1 -7.73751 5.15245 -1.50 0.1344

 

At this point, some questions arise.  Why are the regression coefficients (parameter estimates) for nRBI and nHome so very different from what they were in the individual models?  In addition, why is the coefficient for nHome now negative?  This seems to indicate that for each extra home run a player hits, his expected salary is reduced by over $7,700!  Wow, if this were really true, I imagine we’d never see another home run hit in Major League Baseball®.  However, the p-value for this estimate is high enough that few researchers would consider it “statistically significant”.  Still, this is perplexing.  The standard explanation is that the parameter estimate and p-value for each explanatory variable are estimated while adjusting for the effect of the other explanatory variable.  But what does that mean?
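Before answering that conceptually, here is a concrete sketch of “adjusting”: the PARTIAL statement of PROC CORR computes the correlation of Salary with nRBI after removing the linear effect of nHome, which parallels what the coefficient of nRBI does in the two-variable regression model above.

/* Partial correlation of Salary with nRBI, adjusting for nHome.        */
proc corr data=sashelp.baseball nosimple;
   var Salary nRBI;
   partial nHome;
run;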

 

Here’s where correlation plays one of its roles in a linear regression model.  If the parameter estimate of an explanatory variable changes after adjusting for another explanatory variable, you can infer that the two variables are correlated, to some degree at least.  If my friend, Peter, were to leave SAS®, that would take some adjustment on my part, because we have a great relationship (a COR-relationship, if you will).  However, if the night security guard were to leave SAS®, my life wouldn’t be affected at all, unless, I guess, someone stole my computer at night now that there was no security guard.  Oh, you get the point.  Anyway, don’t just trust my word.  Let’s see what the numbers say.

 

To demonstrate my assertion that adjusted parameter estimates are no different from raw (unadjusted) parameter estimates when the explanatory variables are uncorrelated, let me create perfectly uncorrelated variables based on nRBI and nHome.  I’ll do this by using principal components analysis.  Principal components are linear combinations of the input variables, and the components are constructed to be perfectly uncorrelated with one another.  I’ll use PROC PRINCOMP with an OUT= option to produce two new uncorrelated measures from the two correlated variables.  By the way, there are several missing salaries in the Baseball data, so I am limiting the analysis to those records that have non-missing values (there are no missing values for either nHome or nRBI).  Documentation for PROC PRINCOMP can be found here.

 

proc princomp data=sashelp.baseball out=work.bases prefix=measure noprint;
   var nRBI nHome;
   where Salary ne .;
run;

 

Let’s see the correlations among the two new measures and Salary.

 

proc corr data=work.bases nosimple;
   var Salary Measure1 Measure2;
run;

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum Label
Salary 263 535.92588 451.11868 140949 67.50000 2460 1987 Salary in $ Thousands
measure1 263 0 1.36072 0 -2.08519 3.85143  
measure2 263 0 0.38527 0 -0.90799 1.09361  

 

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0
(cell entries: correlation, p-value)

                                       Salary             measure1           measure2
Salary   1987 Salary in $ Thousands    1.00000            0.47605  <.0001    0.21727  0.0004
measure1                               0.47605  <.0001    1.00000            0.00000  1.0000
measure2                               0.21727  0.0004    0.00000  1.0000    1.00000

 

The new measures are both correlated with Salary, but, as planned, they are perfectly uncorrelated with one another.  Keep in mind, however, that the two new measures are linear combinations of nHome and nRBI.  The pair of measures contains all the information in those original variables, and we will see evidence of that in the next regression models.

 

Notice that the means of both measures are zero, but their standard deviations (and therefore variances) differ.

 

So, what happens when we run the regression models?

 

proc reg data=work.bases;
   model Salary=Measure1;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 535.92588 24.50978 21.87 <.0001
measure1   1 157.82356 18.04668 8.75 <.0001

 

The raw (unadjusted) parameter estimate for Measure1 is 157.82356.

 

proc reg data=work.bases;
   model Salary=Measure2;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 535.92588 27.20462 19.70 <.0001
measure2   1 254.40347 70.74567 3.60 0.0004

 

The raw parameter estimate for Measure2 is 254.40347.

 

/* Measure: is shorthand for all variables whose names begin with Measure */
/* (here, Measure1 and Measure2).                                         */
proc reg data=work.bases;
   model Salary=Measure: ;
run;

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 14600270 7300135 49.02 <.0001
Error 260 38718842 148919    
Corrected Total 262 53319113      

 

Root MSE 385.89976 R-Square 0.2738
Dependent Mean 535.92588 Adj R-Sq 0.2682
Coeff Var 72.00618    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 535.92588 23.79560 22.52 <.0001
measure1   1 157.82356 17.52082 9.01 <.0001
measure2   1 254.40347 61.88050 4.11 <.0001

 

The adjusted parameter estimates for Measure1 and Measure2 are precisely the same as the unadjusted parameter estimates.  Without correlation among the predictor variables, the apparent conundrum of the changing parameter estimates doesn’t exist.  This is partly why balanced and complete designs are used in designed experiments: a balanced and complete design ensures that the independent variables are uncorrelated (see the sketch below).
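Here is a small sketch of that point, using hypothetical data rather than the baseball data: in a balanced and complete 2x2 design with effect-coded factors, the two factors are exactly uncorrelated.

/* Hypothetical balanced and complete 2x2 design with effect coding.   */
/* Because every combination of A and B appears equally often, the     */
/* Pearson correlation between A and B is exactly zero.                */
data work.balanced;
   do A = -1, 1;
      do B = -1, 1;
         do rep = 1 to 5;   /* equal replication in every cell */
            output;
         end;
      end;
   end;
run;

proc corr data=work.balanced nosimple;
   var A B;
run;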

 

Another interesting outcome shown in these tables is that the F value and p-value (Pr > F) from the Analysis of Variance table are precisely the same as the values obtained when using the original variables, nRBI and nHome.  The R-Square and Adjusted R-Square are also the same.  Linear transformations that preserve all the information in the original variables, such as the ones used to create the principal component scores, do not affect the explanatory power of a model.

 

Now, let’s go deeper into the relationship between Pearson correlation coefficients and regression parameters.  This is where the magic happens.  Okay, so it’s not magic, but would you be excited by more talk about matrices and linear transformations?  To this point, I have created explanatory variables that are perfectly uncorrelated.  In the next step, I’ll standardize all variables, including the dependent variable, Salary, so that each has a mean of zero and a variance and standard deviation of one (often known as “z-score standardization”).

 

I’ll use PROC STDIZE with METHOD=STD, which is the z-score method.  Documentation for PROC STDIZE can be found here.

 

proc stdize method=std data=work.bases out=work.bases2;
   var Salary measure1 measure2;
run;
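
As a cross-check, an equivalent z-score standardization could also be done with PROC STANDARD (work.bases3 is just a hypothetical name for the output data set).

/* Equivalent z-score standardization with PROC STANDARD.            */
/* MEAN=0 and STD=1 rescale each variable to mean zero and unit      */
/* standard deviation, matching METHOD=STD in PROC STDIZE.           */
proc standard data=work.bases mean=0 std=1 out=work.bases3;
   var Salary measure1 measure2;
run;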

Let’s look at the new correlation matrix.

 

proc corr data=work.bases2;
   var Salary Measure1 Measure2;
run;

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum Label
Salary 263 0 1.00000 0 -1.03837 4.26512 1987 Salary in $ Thousands
measure1 263 0 1.00000 0 -1.53242 2.83043  
measure2 263 0 1.00000 0 -2.35675 2.83852  

 

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0
(cell entries: correlation, p-value)

                                       Salary             measure1           measure2
Salary   1987 Salary in $ Thousands    1.00000            0.47605  <.0001    0.21727  0.0004
measure1                               0.47605  <.0001    1.00000            0.00000  1.0000
measure2                               0.21727  0.0004    0.00000  1.0000    1.00000

 

Are you surprised that the Pearson correlation coefficients are all identical to the ones I obtained using the unstandardized variables?  Well, the not-so-well-kept secret is that Pearson correlations are simply covariances of variables that have been z-score standardized.  If you don’t standardize the variables yourself, they are effectively standardized during the calculation of the Pearson correlations anyway (the sketch below shows one way to check this).
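As a quick check of that claim, you can ask PROC CORR for the covariance matrix of the standardized variables with the COV option; it should match the correlation matrix shown above.

/* With z-scored variables, the covariance matrix printed by the COV  */
/* option should be identical to the correlation matrix.              */
proc corr data=work.bases2 cov nosimple;
   var Salary measure1 measure2;
run;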

 

Let’s see how this affects the regression models.

 

proc reg data=work.bases2;
   model Salary=Measure1;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.54097E-16 0.05433 0.00 1.0000
measure1   1 0.47605 0.05443 8.75 <.0001

 

The first thing you might notice is that the parameter estimate for Measure1, 0.47605, is exactly the same as the Pearson correlation coefficient of Measure1 with Salary.  What is less obvious is that the intercept is now zero, or at least it should be.  The only reason it isn’t exactly zero is finite numerical precision.  The value 1.54097E-16 is vanishingly close to zero and would be exactly zero if the parameters could be computed with perfect precision.  Oh, well.

 

proc reg data=work.bases2;
   model Salary=Measure2;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.18586E-16 0.06030 0.00 1.0000
measure2   1 0.21727 0.06042 3.60 0.0004

 

Similarly, the parameter estimate for Measure2 is the same as its Pearson correlation coefficient with Salary.  These results are not a coincidence.  They are a consequence of the linear transformations involved in z-score standardization of both the X and Y variables in a regression model: a simple regression model (one explanatory variable) fit to standardized variables has an intercept of zero and a parameter estimate equal to the Pearson correlation coefficient of that variable with the dependent variable.
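In fact, you don’t even have to standardize the data yourself to see this.  In general, the simple-regression slope equals the Pearson correlation multiplied by the ratio of the standard deviations of Y and X, so on standardized variables the slope and the correlation coincide.  As a sketch, the STB option on the MODEL statement asks PROC REG for standardized estimates directly; for a one-predictor model, the standardized estimate should equal the predictor’s Pearson correlation with the dependent variable.

/* The STB option adds standardized estimates to the parameter       */
/* estimates table. For this one-predictor model, the standardized   */
/* estimate for nRBI should match its correlation with Salary.       */
proc reg data=sashelp.baseball;
   model Salary=nRBI / stb;
run;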

 

Now, let me put this all together in one final model.

 

proc reg data=work.bases2;
   model Salary=Measure1 Measure2;
run;

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 71.74296 35.87148 49.02 <.0001
Error 260 190.25704 0.73176    
Corrected Total 262 262.00000      

 

Root MSE 0.85543 R-Square 0.2738
Dependent Mean 1.41838E-16 Adj R-Sq 0.2682
Coeff Var 6.031008E17    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.30845E-16 0.05275 0.00 1.0000
measure1   1 0.47605 0.05285 9.01 <.0001
measure2   1 0.21727 0.05285 4.11 <.0001

 

Once again, the F value and p-value from the Analysis of Variance table, as well as the R-Square and Adjusted R-Square values, are identical to those from the earlier model in which I used the unstandardized variables nRBI and nHome as explanatory variables.  The linear transformations involved in creating the two principal component scores didn’t change the predictive power of the model.  Nor did standardization.

 

If you’re interested in learning how all of this relates to exploratory factor analysis, tune in to my next blog post.  The important points to take from this post for factor analysis are: (1) Pearson correlation coefficients are equal to regression parameter estimates (with a Y-intercept of zero) when all variables are standardized to zero mean and unit variance, and (2) the adjusted parameter estimates (and therefore the adjusted correlations with the Y variable) are the same as the raw, unadjusted parameter estimates when the explanatory variables are uncorrelated.
