How Correlation Relates to Linear Regression and Factor Analysis


 

The purpose of this blog is to illustrate the relationship between a Pearson correlation coefficient and a slope parameter from a simple regression model.

 

How Does Correlation Relate to Linear Regression (and Factor Analysis)?

 

For well over a decade, I’ve been teaching a course in multivariate methods, which begins with a discussion of both principal components analysis (PCA) and Exploratory Factor Analysis (EFA).  Most of the course participants have been comfortable with PCA, but when it comes to factor analysis, they often feel challenged.  I show them all the related matrix algebra, but I know that’s not really the problem.  Any basic statistics text can provide that.  I want them to understand conceptually what is actually happening when we say that we “infer factors from an observed covariance matrix”.

 

The matrix algebra of exploratory factor analysis looks exactly like that of linear regression and you can think of exploratory factor analysis as a series of simultaneous regression models. However, it is not so obvious why correlation matrices in factor analysis have anything to do with the implied regression models. So, I’m going to present here the first step in a two-step approach to understanding exploratory factor analysis - an explanation of the relationship between regression coefficients and Pearson correlation coefficients.  The beauty of presenting it this way is that even if you don’t care about factor analysis at all, but only want to understand linear regression a bit better, there’s something here for you.  I will mostly avoid mathematical formulas because, as I said, you can find those anywhere.  Instead, I’ll describe concepts and use various statistical procedures in SAS® to explain my points.

 

Let me start with a fairly simple set of regression models.  I’ll be using the baseball data set in the sashelp library.  I’ll regress the variable Salary on the explanatory variables nRBI (runs batted in) and nHome (number of home runs), using data from American Major League Baseball in the 1986 season.  It’s not important that you know what those measures are, but it might make it a bit more interesting if you do.

 

Most of my course participants are aware that linear regression is related to Pearson correlations, but they might have forgotten how.  I’ll start with a correlation matrix of all three variables to be used in my regression models. Documentation for PROC CORR can be found here.

 

proc corr data=sashelp.baseball nosimple;
   var Salary nRBI nHome;
run;

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
(cell entries: correlation, p-value, N)

                                       Salary             nRBI               nHome
Salary   1987 Salary in $ Thousands    1.00000            0.51723  <.0001    0.39885  <.0001
                                       263                263                263
nRBI     RBIs in 1986                  0.51723  <.0001    1.00000            0.85394  <.0001
                                       263                322                322
nHome    Home Runs in 1986             0.39885  <.0001    0.85394  <.0001    1.00000
                                       263                322                322

 

I’ll just mention that the variables are all correlated with one another, to one degree or another.

 

Now I’ll regress Salary on each of the explanatory variables in separate models and just show tables relevant to this discussion.  PROC REG documentation can be found here.
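As an aside, if you run these steps yourself and want to see only the tables discussed here rather than the full PROC REG output, one option (sketched here using standard PROC REG ODS table names) is to submit an ODS SELECT statement before each PROC REG step.

/* Show only the fit statistics and parameter estimates from the next     */
/* procedure step. FitStatistics and ParameterEstimates are standard      */
/* PROC REG ODS table names; by default the selection applies only to     */
/* the next procedure step.                                                */
ods select FitStatistics ParameterEstimates;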

 

proc reg data=sashelp.baseball;
   model Salary=nRBI;
run;

Root MSE 386.82654 R-Square 0.2675
Dependent Mean 535.92588 Adj R-Sq 0.2647
Coeff Var 72.17911    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 61.43175 54.13626 1.13 0.2575
nRBI RBIs in 1986 1 9.07446 0.92941 9.76 <.0001

 

proc reg data=sashelp.baseball;
   model Salary=nHome;
run;

Root MSE 414.47484 R-Square 0.1591
Dependent Mean 535.92588 Adj R-Sq 0.1559
Coeff Var 77.33809    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 294.86052 42.78033 6.89 <.0001
nHome Home Runs in 1986 1 20.38591 2.90120 7.03 <.0001

 

At this point, it’s not entirely obvious how the regression results are related to the correlation results, but we’ll get there.  I promise.  Let me just point out that each of the explanatory variables, nRBI and nHome, has a p-value <.0001, as seen in the parameter estimates table for its respective model.  These values would be considered “statistically significant” in most organizations.  The next step is to combine the models by including both measures in a single model.

 

proc reg data=sashelp.baseball;
   model Salary=nRBI nHome;
run;

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 14600270 7300135 49.02 <.0001
Error 260 38718842 148919    
Corrected Total 262 53319113      

 

Root MSE 385.89976 R-Square 0.2738
Dependent Mean 535.92588 Adj R-Sq 0.2682
Coeff Var 72.00618    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 34.66715 56.87140 0.61 0.5427
nRBI RBIs in 1986 1 11.33615 1.76860 6.41 <.0001
nHome Home Runs in 1986 1 -7.73751 5.15245 -1.50 0.1344

 

At this point, some questions arise.  Why are the regression coefficients (parameter estimates) for nRBI and nHome so very different from what they were in the individual models?  In addition, why is the coefficient for nHome now negative?  This seems to indicate that for each extra home run a player hits, his expected salary is reduced by over $7,700!  Wow, if this were really true, I imagine we’d never see another home run hit in Major League Baseball®.  However, the p-value for this estimate is high enough that few researchers would consider it “statistically significant”.  Still, this is perplexing.  The standard explanation is that the parameter estimate and p-value for each explanatory variable are estimated while adjusting for the effect of the other explanatory variable.  But what does that mean?
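Before answering that conceptually, here is a concrete sketch of “adjusting”: the PARTIAL statement of PROC CORR computes the correlation of Salary with nRBI after removing the linear effect of nHome, which parallels what the coefficient of nRBI does in the two-variable regression model above.

/* Partial correlation of Salary with nRBI, adjusting for nHome.        */
proc corr data=sashelp.baseball nosimple;
   var Salary nRBI;
   partial nHome;
run;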

 

Here’s where correlation plays one of its roles in a linear regression model.  If the parameter estimate of an explanatory variable changes after adjusting for another explanatory variable, you can infer that the two variables are correlated, to some degree at least.  If my friend, Peter, were to leave SAS®, that would take some adjustment on my part, because we have a great relationship (a COR-relationship, if you will).  However, if the night security guard were to leave SAS®, my life wouldn’t be affected at all, unless, I guess, someone stole my computer at night now that there was no security guard.  Oh, you get the point.  Anyway, don’t just trust my word.  Let’s see what the numbers say.

 

To demonstrate my assertion that adjusted parameter estimates are no different from raw (unadjusted) parameter estimates when the explanatory variables are uncorrelated, let me create perfectly uncorrelated variables based on nRBI and nHome.  I’ll do this by using principal components analysis.  Principal components are linear combinations of the input variables, and the components are constructed to be perfectly uncorrelated with one another.  I’ll use PROC PRINCOMP with an OUT= option to produce two new uncorrelated measures from the two correlated variables.  By the way, there are several missing salaries in the Baseball data, so I am limiting the analysis to those records that have non-missing values (there are no missing values for either nHome or nRBI).  Documentation for PROC PRINCOMP can be found here.

 

proc princomp data=sashelp.baseball out=work.bases prefix=measure noprint;
   var nRBI nHome;
   where Salary ne .;
run;

 

Let’s see the correlations among the two new measures and Salary.

 

proc corr data=work.bases nosimple;
   var Salary Measure1 Measure2;
run;

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum Label
Salary 263 535.92588 451.11868 140949 67.50000 2460 1987 Salary in $ Thousands
measure1 263 0 1.36072 0 -2.08519 3.85143  
measure2 263 0 0.38527 0 -0.90799 1.09361  

 

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0
(cell entries: correlation, p-value)

                                       Salary             measure1           measure2
Salary   1987 Salary in $ Thousands    1.00000            0.47605  <.0001    0.21727  0.0004
measure1                               0.47605  <.0001    1.00000            0.00000  1.0000
measure2                               0.21727  0.0004    0.00000  1.0000    1.00000

 

The new measures are both correlated with Salary, but, as planned, they are perfectly uncorrelated with one another.  Keep in mind, however, that the two new measures are linear combinations of nHome and nRBI.  The pair of measures contains all the information in those original variables, and we will see evidence of that in the next regression models.

 

Notice that the means of both measures are zero, but their standard deviations (and therefore variances) differ.

 

So, what happens when we run the regression models?

 

proc reg data=work.bases;
   model Salary=Measure1;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 535.92588 24.50978 21.87 <.0001
measure1   1 157.82356 18.04668 8.75 <.0001

 

The raw (unadjusted) parameter estimate for Measure1 is 157.82356.

 

proc reg data=work.bases;
   model Salary=Measure2;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 535.92588 27.20462 19.70 <.0001
measure2   1 254.40347 70.74567 3.60 0.0004

 

The raw parameter estimate for Measure2 is 254.40347.

 

/* Measure: is shorthand for all variables whose names begin with Measure */
/* (here, Measure1 and Measure2).                                         */
proc reg data=work.bases;
   model Salary=Measure: ;
run;

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 14600270 7300135 49.02 <.0001
Error 260 38718842 148919    
Corrected Total 262 53319113      

 

Root MSE 385.89976 R-Square 0.2738
Dependent Mean 535.92588 Adj R-Sq 0.2682
Coeff Var 72.00618    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 535.92588 23.79560 22.52 <.0001
measure1   1 157.82356 17.52082 9.01 <.0001
measure2   1 254.40347 61.88050 4.11 <.0001

 

The adjusted parameter estimates for Measure1 and Measure2 are precisely the same as the unadjusted parameter estimates.  Without correlation among the predictor variables, the apparent conundrum of the changing parameter estimates doesn’t exist.  This is partly why balanced and complete designs are used in designed experiments: a balanced and complete design ensures that the independent variables are uncorrelated (see the sketch below).
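Here is a small sketch of that point, using hypothetical data rather than the baseball data: in a balanced and complete 2x2 design with effect-coded factors, the two factors are exactly uncorrelated.

/* Hypothetical balanced and complete 2x2 design with effect coding.   */
/* Because every combination of A and B appears equally often, the     */
/* Pearson correlation between A and B is exactly zero.                */
data work.balanced;
   do A = -1, 1;
      do B = -1, 1;
         do rep = 1 to 5;   /* equal replication in every cell */
            output;
         end;
      end;
   end;
run;

proc corr data=work.balanced nosimple;
   var A B;
run;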

 

Another interesting outcome shown in these tables is that the F value and p-value (Pr > F) from the Analysis of Variance table are precisely the same as the values obtained when using the original variables, nRBI and nHome.  The R-Square and Adjusted R-Square are also the same.  Linear transformations that preserve all the information in the original variables, such as the ones used to create the principal component scores, do not affect the explanatory power of a model.

 

Now, let’s go deeper into the relationship between Pearson correlation coefficients and regression parameters.  This is where the magic happens.  Okay, so it’s not magic, but would you be excited by more talk about matrices and linear transformations?  To this point, I have created explanatory variables that are perfectly uncorrelated.  In the next step, I’ll standardize all variables, including the dependent variable, Salary, so that each has a mean of zero and a variance and standard deviation of one (often known as “z-score standardization”).

 

I’ll use PROC STDIZE with METHOD=STD, which is the z-score method.  Documentation for PROC STDIZE can be found here.

 

proc stdize method=std data=work.bases out=work.bases2;
   var Salary measure1 measure2;
run;
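
As a cross-check, an equivalent z-score standardization could also be done with PROC STANDARD (work.bases3 is just a hypothetical name for the output data set).

/* Equivalent z-score standardization with PROC STANDARD.            */
/* MEAN=0 and STD=1 rescale each variable to mean zero and unit      */
/* standard deviation, matching METHOD=STD in PROC STDIZE.           */
proc standard data=work.bases mean=0 std=1 out=work.bases3;
   var Salary measure1 measure2;
run;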

Let’s look at the new correlation matrix.

 

proc corr data=work.bases2;
   var Salary Measure1 Measure2;
run;

Simple Statistics

Variable N Mean Std Dev Sum Minimum Maximum Label
Salary 263 0 1.00000 0 -1.03837 4.26512 1987 Salary in $ Thousands
measure1 263 0 1.00000 0 -1.53242 2.83043  
measure2 263 0 1.00000 0 -2.35675 2.83852  

 

Pearson Correlation Coefficients, N = 263
Prob > |r| under H0: Rho=0
(cell entries: correlation, p-value)

                                       Salary             measure1           measure2
Salary   1987 Salary in $ Thousands    1.00000            0.47605  <.0001    0.21727  0.0004
measure1                               0.47605  <.0001    1.00000            0.00000  1.0000
measure2                               0.21727  0.0004    0.00000  1.0000    1.00000

 

Are you surprised that the Pearson correlation coefficients are all identical to the ones I obtained using the unstandardized variables?  Well, the not-so-well-kept secret is that Pearson correlations are simply covariances of variables that have been z-score standardized.  If you don’t standardize the variables yourself, they are effectively standardized during the calculation of the Pearson correlations anyway (the sketch below shows one way to check this).
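As a quick check of that claim, you can ask PROC CORR for the covariance matrix of the standardized variables with the COV option; it should match the correlation matrix shown above.

/* With z-scored variables, the covariance matrix printed by the COV  */
/* option should be identical to the correlation matrix.              */
proc corr data=work.bases2 cov nosimple;
   var Salary measure1 measure2;
run;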

 

Let’s see how this affects the regression models.

 

proc reg data=work.bases2;
   model Salary=Measure1;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.54097E-16 0.05433 0.00 1.0000
measure1   1 0.47605 0.05443 8.75 <.0001

 

The first thing you might notice is that the parameter estimate for Measure1, 0.47605, is exactly the same as the Pearson correlation coefficient of Measure1 with Salary.  What is less obvious is that the intercept is now zero, or at least it should be.  The only reason it isn’t exactly zero is finite numerical precision.  The value 1.54097E-16 is vanishingly close to zero and would be exactly zero if the parameters could be computed with perfect precision.  Oh, well.

 

proc reg data=work.bases2;
   model Salary=Measure2;
run;

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.18586E-16 0.06030 0.00 1.0000
measure2   1 0.21727 0.06042 3.60 0.0004

 

Similarly, the parameter estimate for Measure2 is the same as its Pearson correlation coefficient with Salary.  These results are not a coincidence.  They are a consequence of the linear transformations involved in z-score standardization of both the X and Y variables in a regression model: a simple regression model (one explanatory variable) fit to standardized variables has an intercept of zero and a parameter estimate equal to the Pearson correlation coefficient of that variable with the dependent variable.
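In fact, you don’t even have to standardize the data yourself to see this.  In general, the simple-regression slope equals the Pearson correlation multiplied by the ratio of the standard deviations of Y and X, so on standardized variables the slope and the correlation coincide.  As a sketch, the STB option on the MODEL statement asks PROC REG for standardized estimates directly; for a one-predictor model, the standardized estimate should equal the predictor’s Pearson correlation with the dependent variable.

/* The STB option adds standardized estimates to the parameter       */
/* estimates table. For this one-predictor model, the standardized   */
/* estimate for nRBI should match its correlation with Salary.       */
proc reg data=sashelp.baseball;
   model Salary=nRBI / stb;
run;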

 

Now, let me put this all together in one final model.

 

proc reg data=work.bases2;
   model Salary=Measure1 Measure2;
run;

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F
Model 2 71.74296 35.87148 49.02 <.0001
Error 260 190.25704 0.73176    
Corrected Total 262 262.00000      

 

Root MSE 0.85543 R-Square 0.2738
Dependent Mean 1.41838E-16 Adj R-Sq 0.2682
Coeff Var 6.031008E17    

 

Parameter Estimates

Variable Label DF Parameter Estimate Standard Error t Value Pr > |t|
Intercept Intercept 1 1.30845E-16 0.05275 0.00 1.0000
measure1   1 0.47605 0.05285 9.01 <.0001
measure2   1 0.21727 0.05285 4.11 <.0001

 

Once again, the F value and p-value from the Analysis of Variance table, as well as the R-Square and Adjusted R-Square values, are identical to those from the earlier model in which I used the unstandardized variables nRBI and nHome as explanatory variables.  The linear transformations involved in creating the two principal component scores didn’t change the predictive power of the model.  Nor did standardization.

 

If you’re interested in learning how all of this relates to exploratory factor analysis, tune in to my next blog post.  The important points to take from this post for factor analysis are: (1) Pearson correlation coefficients are equal to regression parameter estimates (with a Y-intercept of zero) when all variables are standardized to zero mean and unit variance, and (2) the adjusted parameter estimates (and therefore the adjusted correlations with the Y variable) are the same as the raw, unadjusted parameter estimates when the explanatory variables are uncorrelated.
