How Does Correlation Relate to Linear Regression (and Factor Analysis)?
The purpose of this blog is to illustrate the relationship between a Pearson correlation coefficient and a slope parameter from a simple regression model.
For well over a decade, I’ve been teaching a course in multivariate methods, which begins with a discussion of both principal components analysis (PCA) and Exploratory Factor Analysis (EFA). Most of the course participants have been comfortable with PCA, but when it comes to factor analysis, they often feel challenged. I show them all the related matrix algebra, but I know that’s not really the problem. Any basic statistics text can provide that. I want them to understand conceptually what is actually happening when we say that we “infer factors from an observed covariance matrix”.
The matrix algebra of exploratory factor analysis looks exactly like that of linear regression, and you can think of exploratory factor analysis as a series of simultaneous regression models. However, it is not so obvious why the correlation matrices in factor analysis have anything to do with the implied regression models. So, I’m going to present here the first step in a two-step approach to understanding exploratory factor analysis: an explanation of the relationship between regression coefficients and Pearson correlation coefficients. The beauty of presenting it this way is that even if you don’t care about factor analysis at all, but only want to understand linear regression a bit better, there’s something here for you. I will mostly avoid mathematical formulas because, as I said, you can find those anywhere. Instead, I’ll describe concepts and use various statistical procedures in SAS® to explain my points.
Let me start with a fairly simple set of regression models. I’ll be using the baseball data set in the sashelp library. I’ll regress the variable Salary on explanatory variables, nRBI (runs batted in) and nHome (number of home runs), using data from American Major League Baseball in the 1986 season. It’s not important that you know what those measures are, but it might make it a bit more interesting if you do.
Most of my course participants are aware that linear regression is related to Pearson correlations, but they might have forgotten how. I’ll start with a correlation matrix of all three variables to be used in my regression models. Documentation for PROC CORR can be found here.
proc corr data=sashelp.baseball
nosimple;
var Salary nRBI nHome;
run;
Pearson Correlation Coefficients, Prob > |r| under H0: Rho=0, Number of Observations |
 | Salary | nRBI | nHome |
Salary 1987 Salary in $ Thousands | 1.00000 (N=263) | 0.51723 <.0001 (N=263) | 0.39885 <.0001 (N=263) |
nRBI RBIs in 1986 | 0.51723 <.0001 (N=263) | 1.00000 (N=322) | 0.85394 <.0001 (N=322) |
nHome Home Runs in 1986 | 0.39885 <.0001 (N=263) | 0.85394 <.0001 (N=322) | 1.00000 (N=322) |
I’ll just mention that the variables are all correlated with one another to one degree or another.
Now I’ll regress Salary on each of the explanatory variables in separate models and just show tables relevant to this discussion. PROC REG documentation can be found here.
proc reg data=sashelp.baseball;
model Salary=nRBI;
run;
Root MSE | 386.82654 | R-Square | 0.2675 |
Dependent Mean | 535.92588 | Adj R-Sq | 0.2647 |
Coeff Var | 72.17911 |
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 61.43175 | 54.13626 | 1.13 | 0.2575 |
nRBI | RBIs in 1986 | 1 | 9.07446 | 0.92941 | 9.76 | <.0001 |
proc reg data=sashelp.baseball;
model Salary=nHome;
run;
Root MSE | 414.47484 | R-Square | 0.1591 |
Dependent Mean | 535.92588 | Adj R-Sq | 0.1559 |
Coeff Var | 77.33809 |
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 294.86052 | 42.78033 | 6.89 | <.0001 |
nHome | Home Runs in 1986 | 1 | 20.38591 | 2.90120 | 7.03 | <.0001 |
At this point, it’s not entirely obvious how the regression results are related to the correlation results, but we’ll get there. I promise. Let me just point out that each of the explanatory variables, nRBI and nHome, has a p-value of <.0001, as seen in the parameter estimates table for its respective model. These values would be considered “statistically significant” in most organizations. The next step is to combine the models by including both measures in a single model.
proc reg data=sashelp.baseball;
model Salary=nRBI nHome;
run;
Analysis of Variance |
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 2 | 14600270 | 7300135 | 49.02 | <.0001 |
Error | 260 | 38718842 | 148919 | ||
Corrected Total | 262 | 53319113 |
Root MSE | 385.89976 | R-Square | 0.2738 |
Dependent Mean | 535.92588 | Adj R-Sq | 0.2682 |
Coeff Var | 72.00618 |
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 34.66715 | 56.87140 | 0.61 | 0.5427 |
nRBI | RBIs in 1986 | 1 | 11.33615 | 1.76860 | 6.41 | <.0001 |
nHome | Home Runs in 1986 | 1 | -7.73751 | 5.15245 | -1.50 | 0.1344 |
At this point, some questions arise. Why are the regression coefficients (parameter estimates) for nRBI and nHome so very different from what they were in the individual models? In addition, why is the coefficient for nHome now negative? This seems to indicate that for each extra home run a player hits, his expected salary is reduced by over $7,700! Wow, if this were really true, I imagine we’d never see another home run hit in Major League Baseball®. However, the p-value for this estimate is so high that few researchers would consider it “statistically significant”. Still, this is perplexing. The standard explanation is that the parameter estimates and p-values are computed for each explanatory variable while adjusting for the effect of the other explanatory variable. But what does that mean?
Here’s where correlation plays one of its roles in a linear regression model. If the parameter estimate of an explanatory variable changes after adjusting for another explanatory variable, you can infer that the two variables are correlated, to some degree at least. If my friend, Peter, were to leave SAS®, that would take some adjustment on my part, because we have a great relationship (a COR-relationship, if you will). However, if the night security guard were to leave SAS®, my life wouldn’t be affected at all, unless, I guess, someone stole my computer at night now that there were no security guard. Oh, you get the point. Anyway, don’t just trust my word. Let’s see what the numbers say.
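Before that, here is one optional, concrete way to see what “adjusting” does. The idea (sometimes called the Frisch-Waugh-Lovell result) is to regress nRBI on nHome, keep the residuals, which are the part of nRBI that nHome cannot explain, and then regress Salary on those residuals. The slope of that second regression should reproduce the adjusted estimate for nRBI (11.33615) from the combined model. This sketch is my own addition; the data set and variable names work.adjust and resid_nRBI are made up for the illustration.
proc reg data=sashelp.baseball noprint;
   where Salary ne .;                     /* use the same 263 records as the combined model */
   model nRBI=nHome;                      /* carve out the part of nRBI related to nHome */
   output out=work.adjust r=resid_nRBI;   /* keep the leftover (residual) part of nRBI */
run;
proc reg data=work.adjust;
   model Salary=resid_nRBI;               /* slope should match the adjusted nRBI estimate */
run;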
In order to demonstrate my assertion that adjusted parameter estimates are no different from raw parameter estimates when the explanatory variables are uncorrelated, let me create perfectly uncorrelated variables based on nRBI and nHome. I’ll do this by using principal components analysis. Principal components are linear combinations of the input variables, constructed so that the components are perfectly uncorrelated with one another. I’ll use PROC PRINCOMP with an OUT= option to produce two new uncorrelated measures from the two correlated variables. By the way, there are several missing salaries in the Baseball data, so I am limiting the analysis to those records that have non-missing values (there are no missing values for either nHome or nRBI). Documentation for PROC PRINCOMP can be found here.
proc princomp data=sashelp.baseball
out=work.bases
prefix=measure
noprint;
var nrbi nhome;
where Salary ne .;
run;
Let’s see the correlations among the two new measures and Salary.
proc corr data=work.bases
nosimple;
var Salary Measure1 Measure2;
run;
Simple Statistics |
Variable | N | Mean | Std Dev | Sum | Minimum | Maximum | Label |
Salary | 263 | 535.92588 | 451.11868 | 140949 | 67.50000 | 2460 | 1987 Salary in $ Thousands |
measure1 | 263 | 0 | 1.36072 | 0 | -2.08519 | 3.85143 | |
measure2 | 263 | 0 | 0.38527 | 0 | -0.90799 | 1.09361 | |
Pearson Correlation Coefficients, N = 263, Prob > |r| under H0: Rho=0 |
 | Salary | measure1 | measure2 |
Salary 1987 Salary in $ Thousands | 1.00000 | 0.47605 <.0001 | 0.21727 0.0004 |
measure1 | 0.47605 <.0001 | 1.00000 | 0.00000 1.0000 |
measure2 | 0.21727 0.0004 | 0.00000 1.0000 | 1.00000 |
The new measures are both correlated with Salary, but, as was my plan, they are perfectly uncorrelated with one another. Keep in mind, however, that the two new measures are linear combinations of nHome and nRBI. Together, they contain all the information in those original variables, and we will see evidence of that in the next regression models.
Notice that the means of both measures are zero, but their standard deviations (and therefore variances) differ.
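By the way, if you want to see exactly what those linear combinations are, you could rerun PROC PRINCOMP without the NOPRINT option (a small tweak of my own, not part of the flow above). It prints the eigenvectors, which are the weights applied to the standardized nRBI and nHome values to build each component score.
proc princomp data=sashelp.baseball
              prefix=measure;
   var nRBI nHome;      /* the Eigenvectors table shows each measure's weights */
   where Salary ne .;
run;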
So, what happens when we run the regression models?
proc reg data=work.bases;
model Salary=Measure1;
run;
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 535.92588 | 24.50978 | 21.87 | <.0001 |
measure1 | | 1 | 157.82356 | 18.04668 | 8.75 | <.0001 |
The raw (unadjusted) parameter estimate for Measure1 is 157.82356.
proc reg data=work.bases;
model Salary=Measure2;
run;
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 535.92588 | 27.20462 | 19.70 | <.0001 |
measure2 | | 1 | 254.40347 | 70.74567 | 3.60 | 0.0004 |
The raw parameter estimate for Measure2 is 254.40347.
proc reg data=work.bases;
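/* The colon in "Measure:" on the MODEL statement below is a SAS name-prefix */
/* shortcut: it expands to every variable whose name begins with Measure,    */
/* which here means measure1 and measure2.                                    */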
model Salary=Measure: ;
run;
Analysis of Variance |
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 2 | 14600270 | 7300135 | 49.02 | <.0001 |
Error | 260 | 38718842 | 148919 | ||
Corrected Total | 262 | 53319113 |
Root MSE | 385.89976 | R-Square | 0.2738 |
Dependent Mean | 535.92588 | Adj R-Sq | 0.2682 |
Coeff Var | 72.00618 |
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 535.92588 | 23.79560 | 22.52 | <.0001 |
measure1 | | 1 | 157.82356 | 17.52082 | 9.01 | <.0001 |
measure2 | | 1 | 254.40347 | 61.88050 | 4.11 | <.0001 |
The adjusted parameter estimates for Measure1 and Measure2 are precisely the same as the unadjusted parameter estimates. Without correlation among the predictor variables, the apparent conundrum of the changing parameter estimates doesn’t exist. This is partly why “balanced and complete” designs are used in experimental research: a balanced and complete design ensures that the independent variables are uncorrelated.
Another interesting outcome shown in these tables is that the F-value and p-value (Pr > F) from the Analysis of Variance table are precisely the same as the values that occurred when using the original variables, nRBI and nHome. The R-Square and Adjusted R-Square are also the same. Linear transformations, such as the ones used to create the principal component scores, do not affect the explanatory power of a model.
Now, let’s go deeper into the relationship between Pearson correlation coefficients and regression parameters. This is where the magic happens. Okay, so it’s not magic, but would you be excited by more talk about matrices and linear transformations? To this point, I have created explanatory variables that are perfectly uncorrelated. In the next step, I’ll standardize all variables, including the dependent variable, Salary, so that each has a mean of zero and a variance and standard deviation of one (often known as “z-score standardization”).
I’ll use PROC STDIZE with its STD method, which performs z-score standardization. Documentation for PROC STDIZE can be found here.
proc stdize method=std
data=work.bases
out=work.bases2;
var Salary measure1 measure2;
run;
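Incidentally, METHOD=STD simply subtracts each variable’s mean and divides by its standard deviation. If you prefer, the same z-scoring can be done with PROC STANDARD; this alternative, and the output data set name work.bases2_check, are my own sketch rather than part of the original steps.
proc standard data=work.bases
              mean=0 std=1
              out=work.bases2_check;   /* center to mean 0, scale to std dev 1 */
   var Salary measure1 measure2;
run;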
Let’s look at the new correlation matrix.
proc corr data=work.bases2;
var Salary Measure1 Measure2;
run;
Simple Statistics |
Variable | N | Mean | Std Dev | Sum | Minimum | Maximum | Label |
Salary | 263 | 0 | 1.00000 | 0 | -1.03837 | 4.26512 | 1987 Salary in $ Thousands |
measure1 | 263 | 0 | 1.00000 | 0 | -1.53242 | 2.83043 | |
measure2 | 263 | 0 | 1.00000 | 0 | -2.35675 | 2.83852 | |
Pearson Correlation Coefficients, N = 263, Prob > |r| under H0: Rho=0 |
 | Salary | measure1 | measure2 |
Salary 1987 Salary in $ Thousands | 1.00000 | 0.47605 <.0001 | 0.21727 0.0004 |
measure1 | 0.47605 <.0001 | 1.00000 | 0.00000 1.0000 |
measure2 | 0.21727 0.0004 | 0.00000 1.0000 | 1.00000 |
Are you surprised that the Pearson correlation coefficients are all identical to the ones I obtained using the unstandardized variables? Well, the not-so-well-kept secret is that Pearson correlations are simply covariances of variables that have been z-score standardized. If you don’t standardize the variables yourself, they are effectively standardized during the calculation of the Pearson correlations anyway.
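If you want to check that claim directly (an extra step of my own, not shown in the original output), the COV option of PROC CORR prints the covariance matrix alongside the correlations. Run on the z-scored data, the two matrices should match cell for cell.
proc corr data=work.bases2
          cov nosimple;   /* COV adds the covariance matrix to the output */
   var Salary measure1 measure2;
run;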
Let’s see how this affects the regression models.
proc reg data=work.bases2;
model Salary=Measure1;
run;
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 1.54097E-16 | 0.05433 | 0.00 | 1.0000 |
measure1 | | 1 | 0.47605 | 0.05443 | 8.75 | <.0001 |
The first thing you might notice is that the parameter estimate for Measure1, 0.47605, is exactly the same as the Pearson correlation coefficient of Measure1 with Salary. What is less obvious is that the intercept is now zero, or at least it should be. The only reason it isn’t exactly zero is limited numerical precision: the value 1.54097E-16 is vanishingly close to zero and would be exactly zero if the parameters could be computed with perfect precision. Oh, well.
proc reg data=work.bases2;
model Salary=Measure2;
run;
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 1.18586E-16 | 0.06030 | 0.00 | 1.0000 |
measure2 | | 1 | 0.21727 | 0.06042 | 3.60 | 0.0004 |
Similarly, the parameter estimate for Measure2 is the same as its Pearson correlation coefficient with Salary. These results are not simply coincidence. They follow from the linear transformations involved in z-score standardizing both the X and Y variables in a regression model: a simple regression model (one explanatory variable) fit to standardized data will have an intercept of zero and a parameter estimate equal to the Pearson correlation coefficient between the explanatory variable and the dependent variable.
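For anyone who does want the small bit of algebra behind this (my addition; the post otherwise sticks to output), the simple-regression slope and intercept can be written as

b_1 = r_{xy}\,\frac{s_y}{s_x}, \qquad b_0 = \bar{y} - b_1\,\bar{x}

and after z-score standardization s_x = s_y = 1 and \bar{x} = \bar{y} = 0, so b_1 = r_{xy} and b_0 = 0, which is exactly what the tables above show.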
Now, let me put this all together in one final model.
proc reg data=work.bases2;
model Salary=Measure1 Measure2;
run;
Analysis of Variance |
Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
Model | 2 | 71.74296 | 35.87148 | 49.02 | <.0001 |
Error | 260 | 190.25704 | 0.73176 | ||
Corrected Total | 262 | 262.00000 |
Root MSE | 0.85543 | R-Square | 0.2738 |
Dependent Mean | 1.41838E-16 | Adj R-Sq | 0.2682 |
Coeff Var | 6.031008E17 |
Parameter Estimates |
Variable | Label | DF | Parameter Estimate | Standard Error | t Value | Pr > |t| |
Intercept | Intercept | 1 | 1.30845E-16 | 0.05275 | 0.00 | 1.0000 |
measure1 | | 1 | 0.47605 | 0.05285 | 9.01 | <.0001 |
measure2 | | 1 | 0.21727 | 0.05285 | 4.11 | <.0001 |
Once again, the F value and p-value from the Analysis of Variance table, as well as the R-Square and Adjusted R-Square values, are identical to the values from the earlier model that used the original, unstandardized variables nHome and nRBI as explanatory variables. The linear transformations involved in creating the two principal component scores didn’t change the predictive power of the model. Nor did standardization.
If you’re interested in learning how all of this relates to exploratory factor analysis, tune in later for my next blog post. The important points to take from this post for factor analysis are these: Pearson correlation coefficients are equal to regression parameter estimates (with a Y-intercept of zero) when all variables are on the standardized metric of zero mean and unit variance, and the adjusted parameter estimates (and therefore the adjusted correlations with the Y variable) are the same as the raw, unadjusted parameter estimates when the explanatory variables are uncorrelated.