## Confounding: What and How


Imagine that you are given a new data set and a request to create a model that predicts the value of a response using several predictor variables. You start your exploration into the relationships between the predictors and the response. Someone then asks you whether any of your variables are confounded with one another. Would you know what they are asking? How would you determine whether there is confounding in your data? In this post, we will discuss the what and the how of confounding. We will present two possible ways to detect confounding with nominal/categorical variables and briefly mention Simpson's Paradox.

It is important not to confuse statistical confounding with the presence of a statistical interaction. Statistical confounding occurs when a covariate is associated with both the response and another predictor variable. The estimate of the effect of the primary predictor variable on the response is distorted because it is mixed with the effect of the confounder. Confounding can be detected by noting changes in the parameter estimates when the covariate is added to and removed from the model. A statistical interaction occurs when the effect of one covariate varies across the levels of another covariate. Interactions can be detected by hypothesis testing of a higher-order term that involves both covariates.
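The change-in-estimate idea can be seen with a small simulation. This sketch is not from the original post; the data set and variable names (sim, x, y, z) are hypothetical. The confounder Z drives both the predictor X and the response Y, so the estimate for X shifts noticeably when Z is dropped:

```sas
/* Hedged sketch: simulating confounding. Z is associated with both the
   predictor X and the response Y. */
data sim;
   call streaminit(27513);
   do i = 1 to 500;
      z = rand('normal');              /* confounder */
      x = 0.8*z + rand('normal');      /* predictor associated with Z */
      y = 2*x + 3*z + rand('normal');  /* response affected by both */
      output;
   end;
run;

proc reg data=sim;
   model y = x z;   /* adjusted estimate for X, near its true value of 2 */
   title 'With the Confounder Z';
run;

proc reg data=sim;
   model y = x;     /* crude estimate for X, inflated by the omitted Z */
   title 'Without the Confounder Z';
run;
```

Comparing the two estimates for X illustrates the detection strategy described above.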


Let’s look at a visual representation of confounding. In the first image, where gender is in our model alone, we see that the average response is four units higher for females compared to males. In the second image, we include age in the analysis with gender. Now the average response for females is approximately 1.3 units lower than for males, a comparison made at the average age of 35. When age is present versus absent from the model, the relationship between gender and the response changes. Depending on the extent of this change, this could be indicative of confounding.

How does this compare to an interaction visually? In this image, we see a positive relationship between age and the response for males and a negative relationship between age and the response for females. This is a change in the relationship between age and the response at different levels of gender. This is an interaction.
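A plot like the one described can be produced with PROC SGPLOT. This is a hedged sketch: the data set demo and its variables (age, response, gender) are hypothetical stand-ins for the example in the images.

```sas
/* Hedged sketch: visualizing a possible interaction. Separate fitted lines
   per gender that cross or diverge suggest an interaction. */
proc sgplot data=demo;
   reg x=age y=response / group=gender;  /* one regression line per gender */
   title 'Response by Age, Grouped by Gender';
run;
```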

So how do we assess confounding? There are two ways to proceed. One is to use stratified contingency tables, and the other is to perform the Delta method.

With stratified contingency tables, we compare the crude odds ratios with the adjusted odds ratios. Crude odds ratios are calculated when the potentially confounding variable is absent from the analysis; the adjusted odds ratio is calculated when it is present. In the following diagram, the crude odds ratio is found when just the predictor UI is assessed against the response variable LOW. The adjusted odds ratio is found when we stratify on SMOKE.

We can get this information from a PROC FREQ crosstabulation table. Typically, the row variable is the predictor of interest, and the column variable is the response. On the TABLES statement, the row variable is named first and the column variable second. The RELRISK option produces the table containing the crude odds ratio.

```sas
proc freq data=birth;
   tables UI*LOW / relrisk;
   title "Association Between Uterine Irritability and Low Birth Weight";
run;
```

The adjusted odds ratio is calculated from the stratified tables, where the stratification is made on the variable SMOKE. This is also available in PROC FREQ output. SMOKE is the variable we think may be confounded with UI. On the TABLES statement, we start with SMOKE as the stratification variable and then follow with the UI*LOW part as before. The order is stratification*row*column. The RELRISK option presents the odds ratios for each stratified grouping, but it is the CMH option that generates the Cochran-Mantel-Haenszel statistic and the adjusted odds ratio.

```sas
proc freq data=birth;
   tables SMOKE*UI*LOW / cmh relrisk;
   title "Association Between UI and Low Birth Weight, Stratified by Smoking";
run;
```

If there is a large change between the crude and the adjusted odds ratios, particularly in the significance, then we conclude that there is confounding between our two predictor variables. In our example, the crude odds ratio is 2.5778 and the adjusted odds ratio is 2.4570. Neither 95 percent confidence interval contains the value 1. I do not see evidence of confounding between SMOKE and UI.
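For reference, a crude odds ratio is just the cross-product ratio of a 2x2 table. The sketch below shows the arithmetic; the cell counts are hypothetical and are not taken from the post's output.

```sas
/* Hedged sketch: computing a crude odds ratio by hand from a 2x2 table.
   The counts a, b, c, d are hypothetical. */
data crude_or;
   a = 14;   /* exposed,   event    */
   b = 14;   /* exposed,   no event */
   c = 45;   /* unexposed, event    */
   d = 116;  /* unexposed, no event */
   odds_ratio = (a*d) / (b*c);   /* cross-product ratio */
   put odds_ratio= 8.4;
run;
```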

Alternatively, we can leave PROC FREQ and perform this assessment of confounding using our regression procedures, like PROC LOGISTIC. This has us performing the Delta method. In this case, we run the regression analysis both with and without the potential confounding predictor. The leading ODS statement requests that the parameter estimates tables from all the following logistic procedures be saved to data sets. The first will be named parms; subsequent ones will be named parms1, and so on. We request a logistic regression both with and without the variable SMOKE. This checks for confounding not only with UI but also with all the other variables on the MODEL statement.

```sas
ods output parameterestimates(match_all persist=proc)=work.parms;

proc logistic data=birth;
   class ETH(ref='3') SMOKE PTL HT UI FTV(param=ordinal) / param=ref ref=first;
   model LOW(event='Yes') = AGE LWT ETH SMOKE PTL HT UI FTV;
   title 'Full Model';
run;

proc logistic data=birth;
   class ETH(ref='3') SMOKE PTL HT UI FTV(param=ordinal) / param=ref ref=first;
   model LOW(event='Yes') = AGE LWT ETH PTL HT UI FTV;
   title 'Model Removing Smoke';
run;

ods output close;
```

We save our output from the regressions, sort, and then use PROC COMPARE to see the change in the values of the parameter estimates and also in the p-values. The variable ClassVal0 appears when you have categorical variables in your model; it contains the level of the categorical variable.

```sas
proc sort data=work.parms;
   by variable classval0;
run;

proc sort data=work.parms1;
   by variable classval0;
run;

proc compare base=work.parms compare=work.parms1;
   id variable classval0;
   var estimate probchisq;
run;
```

The PROC COMPARE code compares the parameter estimates (estimate) and the p-values (probchisq) between the two saved output data sets, parms and parms1. We are looking for a change in the parameter estimates of more than 10 percent; the direction of the change is not the focus. We can also look for a 10 percent change in the p-values, but in that case we also need to see a change in significance in addition to the percent change.
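As an alternative to reading the percent differences off the PROC COMPARE report, the change can be computed directly. This is a hedged sketch: it assumes work.parms and work.parms1 are sorted by Variable and ClassVal0 as above, and the renamed variables (est_full, est_reduced, and so on) are names I have made up for illustration.

```sas
/* Hedged sketch: computing the percent change in each estimate and p-value
   between the full and reduced models. */
data delta;
   merge work.parms (rename=(estimate=est_full    probchisq=p_full))
         work.parms1(rename=(estimate=est_reduced probchisq=p_reduced));
   by variable classval0;
   pct_change_est = abs((est_reduced - est_full) / est_full) * 100;
   pct_change_p   = abs((p_reduced   - p_full)   / p_full)   * 100;
run;

proc print data=delta noobs;
   var variable classval0 est_full est_reduced pct_change_est pct_change_p;
run;
```

Estimates with pct_change_est above 10 are the ones to examine for possible confounding.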

In our example, look at UI. The change in the parameter estimate is only 6 percent; however, the change in the p-value is 23 percent. Despite the change in the p-value being larger than 10 percent, the significance did not change between the two runs. We would say that confounding is not detected.

If confounding is determined to be present, then what do you do? The truer relationship between your predictor of interest and the response is seen when the confounding variable is present in the model, regardless of the significance of the confounding variable. You must fight the urge to remove the potentially non-significant confounding predictor from the model. What could happen if you do remove or ignore this confounder? You may find yourself dealing with Simpson’s Paradox. This occurs when you get one result when the confounder is ignored and a different result when you account for it. For an excellent example of Simpson’s Paradox, see the blog post by Rick Wicklin.
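To make the reversal concrete, here is a small sketch using the commonly cited kidney-stone figures (not from this post): treatment A has the higher success rate within each stone-size stratum, yet the lower success rate when the strata are pooled.

```sas
/* Hedged sketch: Simpson's Paradox. Within each stratum A beats B,
   but the pooled comparison (273/350 vs 289/350) reverses. */
data simpson;
   input stratum $ treatment $ success total;
   rate = success / total;
   datalines;
small A 81  87
small B 234 270
large A 192 263
large B 55  80
;
run;

proc print data=simpson noobs;
   format rate percent8.1;
run;
```

The lopsided mix of easy and hard cases across the two treatments is the confounder driving the reversal.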

Before checking for confounding, we first need to rule out an interaction between the two variables. To check for an interaction, we run a regression analysis with the interaction effect in the model. If the interaction is significant, we do not need to follow up with a check for confounding.

```sas
proc logistic data=birth;
   class ETH(ref='3') SMOKE PTL HT UI FTV(param=ordinal) / param=ref ref=first;
   model LOW(event='Yes') = AGE LWT ETH SMOKE PTL HT UI FTV SMOKE*UI;
   title 'Testing Interaction of UI and SMOKE';
run;
```

The interaction is not significant, so we can proceed to check for confounding using either of the two methods described above.

With these two methods for determining the presence of confounding explained, I hope that you are no longer confounded by the concept of confounding. Always keep this topic in mind when you are performing your regressions as you never know where confounding may be lurking.

Find more articles from SAS Global Enablement and Learning here.
