The purpose of this blog is to illustrate the relationship between a Pearson correlation coefficient and a slope parameter from a simple regression model.
How Does Correlation Relate to Linear Regression (and Factor Analysis)?
For well over a decade, I’ve been teaching a course in multivariate methods, which begins with a discussion of both principal components analysis (PCA) and exploratory factor analysis (EFA). Most of the course participants have been comfortable with PCA, but when it comes to factor analysis, they often feel challenged. I show them all the related matrix algebra, but I know that’s not really the problem; any basic statistics text can provide that. I want them to understand conceptually what is actually happening when we say that we “infer factors from an observed covariance matrix”.
The matrix algebra of exploratory factor analysis looks exactly like that of linear regression, and you can think of exploratory factor analysis as a series of simultaneous regression models. However, it is not so obvious why the correlation matrices in factor analysis have anything to do with the implied regression models. So, I’m going to present here the first step in a two-step approach to understanding exploratory factor analysis: an explanation of the relationship between regression coefficients and Pearson correlation coefficients. The beauty of presenting it this way is that even if you don’t care about factor analysis at all and only want to understand linear regression a bit better, there’s something here for you. I will mostly avoid mathematical formulas because, as I said, you can find those anywhere. Instead, I’ll describe the concepts and use various statistical procedures in SAS® to illustrate my points.
Let me start with a fairly simple set of regression models. I’ll be using the Baseball data set in the SASHELP library, with data from American Major League Baseball in the 1986 season. I’ll regress the variable Salary on two explanatory variables, nRBI (runs batted in) and nHome (number of home runs). It’s not important that you know what those measures are, but it might make things a bit more interesting if you do.
Most of my course participants are aware that linear regression is related to Pearson correlations, but they might have forgotten how. I’ll start with a correlation matrix of all three variables to be used in my regression models. Documentation for PROC CORR can be found here.
proc corr data=sashelp.baseball
nosimple;
var Salary nRBI nHome;
run;
Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0 / Number of Observations

Variable                            Salary          nRBI            nHome
Salary  1987 Salary in $ Thousands  1.00000         0.51723 <.0001  0.39885 <.0001
                                    263             263             263
nRBI    RBIs in 1986                0.51723 <.0001  1.00000         0.85394 <.0001
                                    263             322             322
nHome   Home Runs in 1986           0.39885 <.0001  0.85394 <.0001  1.00000
                                    263             322             322
I’ll just point out that the three variables are all correlated with one another, to one degree or another.
Now I’ll regress Salary on each of the explanatory variables in separate models and just show tables relevant to this discussion. PROC REG documentation can be found here.
proc reg data=sashelp.baseball;
model Salary=nRBI;
run;
Root MSE          386.82654    R-Square    0.2675
Dependent Mean    535.92588    Adj R-Sq    0.2647
Coeff Var          72.17911

Parameter Estimates
Variable   Label          DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept       1            61.43175        54.13626     1.13    0.2575
nRBI       RBIs in 1986    1             9.07446         0.92941     9.76    <.0001
proc reg data=sashelp.baseball;
model Salary=nHome;
run;
Root MSE          414.47484    R-Square    0.1591
Dependent Mean    535.92588    Adj R-Sq    0.1559
Coeff Var          77.33809

Parameter Estimates
Variable   Label              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept           1           294.86052        42.78033     6.89    <.0001
nHome      Home Runs in 1986   1            20.38591         2.90120     7.03    <.0001
At this point, it’s not entirely obvious how the regression results are related to the correlation results, but we’ll get there. I promise. Let me just point out that each of the explanatory variables, nRBI and nHome, has a p-value <.0001 in the parameter estimates table of its respective model. These values would be considered “statistically significant” in most organizations. The next step is to combine the models by including both measures in a single model.
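As a preview of the connection, a simple-regression slope is just the Pearson correlation rescaled by the ratio of the two standard deviations. Here is a quick sketch of that identity in Python on synthetic data (the post’s analyses are in SAS; this is only an illustration, not the baseball data):

```python
import numpy as np

# Synthetic stand-ins for a predictor and a salary-like response.
rng = np.random.default_rng(5)
x = rng.normal(loc=50, scale=20, size=263)
y = 9.0 * x + rng.normal(scale=300, size=263)

r = np.corrcoef(x, y)[0, 1]      # Pearson correlation of x and y
slope = np.polyfit(x, y, 1)[0]   # simple-regression slope of y on x

# The slope equals r times sd(y)/sd(x).
rescaled_r = r * y.std(ddof=1) / x.std(ddof=1)
print(slope, rescaled_r)
```

When both variables are standardized, both standard deviations are one, which is why the standardized slope and the correlation coincide exactly later in the post.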
proc reg data=sashelp.baseball;
model Salary=nRBI nHome;
run;
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2        14600270      7300135    49.02  <.0001
Error           260        38718842       148919
Corrected Total 262        53319113

Root MSE          385.89976    R-Square    0.2738
Dependent Mean    535.92588    Adj R-Sq    0.2682
Coeff Var          72.00618

Parameter Estimates
Variable   Label              DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept           1            34.66715        56.87140     0.61    0.5427
nRBI       RBIs in 1986        1            11.33615         1.76860     6.41    <.0001
nHome      Home Runs in 1986   1            -7.73751         5.15245    -1.50    0.1344
At this point, some questions arise. Why are the regression coefficients (parameter estimates) for nRBI and nHome so different from what they were in the individual models? And why is the coefficient for nHome now negative? It seems to indicate that for each extra home run a player hits, his expected salary is reduced by over $7,700! Wow, if this were really true, I imagine we’d never see another home run hit in Major League Baseball®. However, the p-value for this estimate is so high that few researchers would consider it “statistically significant”. Still, this is perplexing. The standard explanation is that the parameter estimates and p-values are estimated separately for each explanatory variable, adjusting for the effect of the other explanatory variable. But what does that mean?
Here’s where correlation plays one of its roles in a linear regression model. If the parameter estimate of an explanatory variable changes after adjusting for another explanatory variable, you can infer that the two variables are correlated, to some degree at least. If my friend Peter were to leave SAS®, that would take some adjustment on my part, because we have a great relationship (a COR-relationship, if you will). However, if the night security guard were to leave SAS®, my life wouldn’t be affected at all, unless, I guess, someone stole my computer at night now that there was no security guard. Oh, you get the point. Anyway, don’t just take my word for it. Let’s see what the numbers say.
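Before rerunning the SAS models, the claim can be checked numerically. Below is a minimal Python sketch on synthetic data (all names and numbers here are invented for illustration): two correlated predictors, of which only the first truly drives the response. The simple-regression slope for the second predictor borrows the first predictor’s effect, and it collapses once both predictors are in the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Two correlated predictors; only x1 truly affects y.
x1 = rng.normal(size=n)
x2 = 0.85 * x1 + np.sqrt(1 - 0.85**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.85
y = 9.0 * x1 + rng.normal(scale=5.0, size=n)

def ols(X, y):
    """Least-squares coefficients (intercept first)."""
    Xd = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

b_simple = ols(x2.reshape(-1, 1), y)[1]            # x2 alone
b_adjusted = ols(np.column_stack([x1, x2]), y)[2]  # x2 adjusted for x1

# b_simple is large and positive (it borrows x1's effect);
# b_adjusted shrinks toward x2's true coefficient of zero.
print(b_simple, b_adjusted)
```

If x1 and x2 were uncorrelated, the two estimates for x2 would agree, which is exactly what the principal-components exercise below demonstrates.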
To demonstrate my assertion that adjusted parameter estimates are no different from raw parameter estimates when the explanatory variables are uncorrelated, let me create perfectly uncorrelated variables based on nRBI and nHome. I’ll do this using principal components analysis. Principal components are linear combinations of the input variables, constructed so that the components are perfectly uncorrelated. I’ll use PROC PRINCOMP with an OUT= option to produce two new uncorrelated measures from the two correlated variables. By the way, there are several missing salaries in the Baseball data, so I am limiting the analysis to records with non-missing salaries (there are no missing values for either nHome or nRBI). Documentation for PROC PRINCOMP can be found here.
proc princomp data=sashelp.baseball
out=work.bases
prefix=measure
noprint;
var nrbi nhome;
where Salary ne .;
run;
Let’s see the correlations among the two new measures and Salary.
proc corr data=work.bases
nosimple;
var Salary Measure1 Measure2;
run;
Simple Statistics
Variable    N       Mean    Std Dev     Sum   Minimum   Maximum  Label
Salary    263  535.92588  451.11868  140949  67.50000      2460  1987 Salary in $ Thousands
measure1  263          0    1.36072       0  -2.08519   3.85143
measure2  263          0    0.38527       0  -0.90799   1.09361

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0

Variable                            Salary          measure1        measure2
Salary  1987 Salary in $ Thousands  1.00000         0.47605 <.0001  0.21727 0.0004
measure1                            0.47605 <.0001  1.00000         0.00000 1.0000
measure2                            0.21727 0.0004  0.00000 1.0000  1.00000
The new measures are both correlated with Salary but, as planned, they are perfectly uncorrelated with one another. Keep in mind, however, that the two new measures are linear combinations of nHome and nRBI. Together, the pair of measures contains all the information of those original variables, and we will see evidence of that in the next regression models.
Notice that the means of both measures are zero, but their standard deviations (and therefore variances) differ.
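The decorrelation that PROC PRINCOMP performs can be sketched outside SAS as well. Here is a minimal Python version on synthetic data, assuming only the standard construction: component scores are the centered data projected onto the eigenvectors of its covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Two correlated inputs, standing in for nRBI and nHome.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Project the centered data onto the eigenvectors of its covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigvecs  # columns are the component scores

# The scores' covariance matrix is diagonal: the components are uncorrelated,
# but their variances (the eigenvalues) differ, just like Measure1 and Measure2.
cov_scores = np.cov(scores, rowvar=False)
print(cov_scores)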
So, what happens when we run the regression models?
proc reg data=work.bases;
model Salary=Measure1;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1           535.92588        24.50978    21.87    <.0001
measure1               1           157.82356        18.04668     8.75    <.0001
The raw (unadjusted) parameter estimate for Measure1 is 157.82356.
proc reg data=work.bases;
model Salary=Measure2;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1           535.92588        27.20462    19.70    <.0001
measure2               1           254.40347        70.74567     3.60    0.0004
The raw parameter estimate for Measure2 is 254.40347.
proc reg data=work.bases;
model Salary=Measure: ;
run;
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2        14600270      7300135    49.02  <.0001
Error           260        38718842       148919
Corrected Total 262        53319113

Root MSE          385.89976    R-Square    0.2738
Dependent Mean    535.92588    Adj R-Sq    0.2682
Coeff Var          72.00618

Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1           535.92588        23.79560    22.52    <.0001
measure1               1           157.82356        17.52082     9.01    <.0001
measure2               1           254.40347        61.88050     4.11    <.0001
The adjusted parameter estimates for Measure1 and Measure2 are precisely the same as the unadjusted parameter estimates. Without correlation among the predictor variables, the apparent conundrum of the changing parameter estimates doesn’t exist. This is partly why a “balanced and complete design” is used in experimental design: it ensures that the independent variables are uncorrelated.
Another interesting outcome shown in these tables is that the F value and p-value (Pr > F) from the Analysis of Variance table are precisely the same as the values obtained when using the original variables, nRBI and nHome. The R-Square and Adjusted R-Square are also the same. Linear transformations, such as the ones used to create principal component scores, do not affect the explanatory power of a model.
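That invariance is easy to verify directly: any invertible linear transformation of the predictors spans the same column space, so the fitted values, and therefore R-square and the overall F test, are unchanged. A small Python sketch on synthetic data (not the SAS output above):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300
X = rng.normal(size=(n, 2))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

def r_squared(X, y):
    """R-square from an ordinary least-squares fit with an intercept."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Mix the predictors with an arbitrary invertible matrix: same column space,
# same fitted values, same R-square.
A = np.array([[2.0, 1.0], [0.5, 3.0]])
r2_orig = r_squared(X, y)
r2_mixed = r_squared(X @ A, y)
print(r2_orig, r2_mixed)
```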
Now, let’s go deeper into the relationship between Pearson correlation coefficients and regression parameters. This is where the magic happens. Okay, so it’s not magic, but would you be excited by more talk about matrices and linear transformations? To this point, I have created explanatory variables that are perfectly uncorrelated. In the next step, I’ll standardize all the variables, including the dependent variable, Salary, so that each has a mean of zero and a variance and standard deviation of one (often known as “z-score standardization”).
I’ll use PROC STDIZE with METHOD=STD, which performs z-score standardization. Documentation for PROC STDIZE can be found here.
proc stdize method=std
data=work.bases
out=work.bases2;
var Salary measure1 measure2;
run;
Let’s look at the new correlation matrix.
proc corr data=work.bases2;
var Salary Measure1 Measure2;
run;
Simple Statistics
Variable    N  Mean  Std Dev  Sum   Minimum   Maximum  Label
Salary    263     0  1.00000    0  -1.03837   4.26512  1987 Salary in $ Thousands
measure1  263     0  1.00000    0  -1.53242   2.83043
measure2  263     0  1.00000    0  -2.35675   2.83852

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0

Variable                            Salary          measure1        measure2
Salary  1987 Salary in $ Thousands  1.00000         0.47605 <.0001  0.21727 0.0004
measure1                            0.47605 <.0001  1.00000         0.00000 1.0000
measure2                            0.21727 0.0004  0.00000 1.0000  1.00000
Are you surprised that the Pearson correlation coefficients are all identical to the ones I obtained using the unstandardized variables? Well, the not-so-well-kept secret is that Pearson correlations are simply covariances of variables that have been z-score standardized. If you don’t standardize the variables yourself, they are effectively standardized in the process of calculating the Pearson correlations anyway.
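That “secret” is simple to confirm numerically. A quick Python check on synthetic data: the covariance of two z-scored variables equals their Pearson correlation.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

def zscore(v):
    """Standardize to mean zero and unit standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

r = np.corrcoef(x, y)[0, 1]                  # Pearson correlation
cov_z = np.cov(zscore(x), zscore(y))[0, 1]   # covariance of the z-scores

print(r, cov_z)  # the two values agree up to floating-point error
```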
Let’s see how this affects the regression models.
proc reg data=work.bases2;
model Salary=Measure1;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1         1.54097E-16         0.05433     0.00    1.0000
measure1               1             0.47605         0.05443     8.75    <.0001
The first thing you might notice is that the parameter estimate for Measure1, 0.47605, is exactly the same as the Pearson correlation coefficient of Measure1 with Salary. What is less obvious is that the intercept is now zero, or at least it should be. The only reason it isn’t is finite numerical precision: the value 1.54097E-16 is infinitesimally close to zero and would be exactly zero if the parameters were estimated with perfect precision. Oh, well.
proc reg data=work.bases2;
model Salary=Measure2;
run;
Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1         1.18586E-16         0.06030     0.00    1.0000
measure2               1             0.21727         0.06042     3.60    0.0004
Similarly, the parameter estimate for Measure2 is the same as its Pearson correlation coefficient with Salary. These results are not simply coincidence. They are a consequence of the linear transformations involved in z-score standardization of both the X and Y variables in a regression model: a simple regression model (one explanatory variable) will have an intercept of zero and a parameter estimate equal to the Pearson correlation coefficient of that variable with the dependent variable.
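The same identity is easy to reproduce outside SAS. A minimal Python sketch on synthetic data: z-score both variables, fit a simple regression, and the slope equals the Pearson correlation while the intercept vanishes.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=250)
y = 2.0 * x + rng.normal(size=250)

# z-score both the predictor and the response.
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Fit zy = intercept + slope * zx by least squares.
slope, intercept = np.polyfit(zx, zy, 1)
r = np.corrcoef(x, y)[0, 1]

# slope equals the Pearson correlation; intercept is zero up to
# floating-point precision, like the 1.5E-16-style intercepts above.
print(slope, intercept, r)
```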
Now, let me put this all together in one final model.
proc reg data=work.bases2;
model Salary=Measure1 Measure2;
run;
Analysis of Variance
Source           DF  Sum of Squares  Mean Square  F Value  Pr > F
Model             2        71.74296     35.87148    49.02  <.0001
Error           260       190.25704      0.73176
Corrected Total 262       262.00000

Root MSE          0.85543        R-Square    0.2738
Dependent Mean    1.41838E-16    Adj R-Sq    0.2682
Coeff Var         6.031008E17

Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept   1         1.30845E-16         0.05275     0.00    1.0000
measure1               1             0.47605         0.05285     9.01    <.0001
measure2               1             0.21727         0.05285     4.11    <.0001
Once again, the F value and p-value from the Analysis of Variance table, along with the R-Square and Adjusted R-Square values, are identical to those from the first combined model I ran, where I used the unstandardized variables nHome and nRBI as explanatory variables. The linear transformations involved in creating the two principal component scores didn’t change the predictive power of the model. Nor did standardization.
If you’re interested in learning how all of this relates to exploratory factor analysis, tune in to my next blog post. The important points to take from this post for factor analysis are these: Pearson correlation coefficients are equal to regression parameter estimates (with a Y-intercept of zero) when all variables are on the standardized metric of zero mean and unit variance, and the adjusted parameter estimates (and therefore adjusted correlations with the Y variable) are the same as the raw, unadjusted parameter estimates when the explanatory variables are uncorrelated.