NKormanik
Barite | Level 11

In the documentation for Proc HPLogistic, there is limited explanation of how to interpret the variable coefficients that result.

 

Example 5.1 Model Selection :: SAS/STAT(R) 12.3 User's Guide: High-Performance Procedures

 

Wondering if someone can please shed more light.

 

Output 5.1.6: Parameter Estimates

Parameter   Estimate   Standard Error   DF      t Value   Pr > |t|
Intercept    0.8584        0.5503       Infty     1.56     0.1188
x2          -0.2502        0.1146       Infty    -2.18     0.0290
x8           1.7840        0.7908       Infty     2.26     0.0241

 

So, for instance, the x8 coefficient of 1.7840, and x2 of -0.2502.  How do you interpret those?  How do they affect y?

 

Thanks much!

Nicholas Kormanik

 


7 REPLIES 7
FreelanceReinh
Jade | Level 19

Hello @NKormanik,

 

I think the interpretation is the same as for PROC LOGISTIC. In this particular example of a binary logistic regression, the model estimates the logit log(P(Y=0)/P(Y=1)) as 0.8584 - 0.2502*x2 + 1.7840*x8. As a consequence, a unit increase in x2 -- everything else being the same -- would create an odds ratio of exp(-0.2502)=0.778...<1, meaning a lower "risk" (or "chance") of having Y=0 with larger values of x2. Similarly, if only x8 were increased by, say, 0.2, the odds ratio (compared to a population without that increase) would be estimated as exp(0.2*1.7840)=1.428...>1, meaning a higher risk of Y=0 with those increased x8 values.
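If you want to verify the odds-ratio arithmetic, it takes only a few lines outside of SAS (a minimal Python sketch; the coefficients come from Output 5.1.6):

```python
import math

# Coefficients from the HPLOGISTIC parameter estimates (Output 5.1.6)
b_x2 = -0.2502
b_x8 = 1.7840

# Odds ratio for a one-unit increase in x2, everything else the same:
or_x2 = math.exp(b_x2)        # ≈ 0.779 < 1: lower odds of Y=0

# Odds ratio for a 0.2-unit increase in x8, everything else the same:
or_x8 = math.exp(0.2 * b_x8)  # ≈ 1.429 > 1: higher odds of Y=0

print(or_x2, or_x8)
```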

 

Note, however, that this is only a model, so that the real odds ratios for concrete populations might differ considerably from these estimates. See also the confidence intervals for the coefficients (and how changes of them within the intervals would affect the odds ratio calculations), which can be requested with the CL option of the MODEL statement.

NKormanik
Barite | Level 11

Makes sense somewhat, @FreelanceReinh and @StatDave 

 

I've read such in other documentation.

 

But..., if we can simplify the results in the example as much as possible -- let's say, for a freshman in high school...

 

The other variables in the model -- x1, x3, x4, x5, x6, x7, x9, x10 -- since excluded from the chosen final model, we are to consider not having much value, as far as arriving at a high probability of achieving y.  (Let's say y is, ohh, eventually earning a PhD, just for instance.)

 

ONLY x2 and x8 statistically point to whether or not success might be reached.

 

And then the coefficients:

x8 1.7840
x2 -0.2502

 

High levels of the variable x8 indicate better chances of Y?

 

Low levels of x2 indicate better chances of Y?

 

x8, having a much larger coefficient than x2, is far more important in possibly affecting outcome?

 

 

FreelanceReinh
Jade | Level 19

@NKormanik wrote:

The other variables in the model -- x1, x3, x4, x5, x6, x7, x9, x10 -- since excluded from the chosen final model, we are to consider not having much value, as far as arriving at a high probability of achieving y.  (Let's say y is, ohh, eventually earning a PhD, just for instance.)

 

ONLY x2 and x8 statistically point to whether or not success might be reached.


  1. The maximization involved in binary logistic regression as performed by PROC (HP)LOGISTIC (i.e., maximum likelihood estimation) is not aimed at "arriving at a high probability of achieving y" (y=0 in your example). It is rather aimed at finding those coefficients which maximize the likelihood of obtaining the observed results, y=0 for some observations and y=1 for others, given the values of the predictors (such as x2 and x8) in the data. (So, the wording "statistically point to whether or not success might be reached" is more accurate.) The optimization is limited to models of the form "logit = linear function of predictors" -- no other possible relationships between Y and X1, ..., X10 are considered.
  2. The process of (forward) variable selection (which is somewhat controversial) is governed by certain statistical criteria which may exclude a predictor because it doesn't add much explanatory value to the variables already included in the model. It is possible that one of the excluded variables could be a valuable model variable if it replaced one of the others. (Consider the extreme case that two predictors are in fact identical or the less extreme case that they're highly correlated.)
  3. Different option settings in the code can lead to different variable selections. For example, x4 would be included in the model as a third predictor if the default entry criterion SLENTRY=0.05 was changed ("relaxed") to SLENTRY=0.1.

 


@NKormanik wrote:

And then the coefficients:

x8 1.7840
x2 -0.2502

 

High levels of the variable x8 indicate better chances of Y?

 

Low levels of x2 indicate better chances of Y?

"... better chances of Y=0" in your example. As mentioned earlier, the condition "everything else being the same" is important (but possibly hard to fulfill in reality). In the presence of potentially strong interactions between predictors such as x2 and x8 the combination of "promising" values of x2 and x8 might perform surprisingly poorly. Also, the estimated coefficients alone do not tell whether the model fits well to the data. And, in general, their confidence intervals could be so wide that they include both positive and negative values.
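The interaction caveat can be made numerically concrete with a hypothetical sketch. Note that the interaction coefficient b28 below is invented purely for illustration; the fitted model in this thread contains no interaction term:

```python
import math

def prob(logit):
    """Convert a logit to an event probability."""
    return 1 / (1 + math.exp(-logit))

# Main-effect coefficients from the example; b28 is INVENTED
# to show how an interaction could flip the picture.
b0, b2, b8 = 0.8584, -0.2502, 1.7840
b28 = -3.0

x2, x8 = 2.0, 1.0  # hypothetical predictor values

p_additive = prob(b0 + b2 * x2 + b8 * x8)                  # high
p_interact = prob(b0 + b2 * x2 + b8 * x8 + b28 * x2 * x8)  # surprisingly low
```

With this (made-up) interaction, predictor values that look "promising" from the main effects alone produce a very low event probability.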

 


@NKormanik wrote:

x8, having a much larger coefficient than x2, is far more important in possibly affecting outcome?


No, you cannot draw this conclusion. Note that for continuous predictors like x1, ..., x10 a change of their measurement unit (i.e., really an arbitrary change) has a proportional impact on the model coefficients. For example, measuring x8 in one hundredths of the original unit means that all values of x8 in dataset getStarted must be multiplied by 100. Not surprisingly, this doesn't change the (absolute or relative) significance of x8 in the model at all. Yet the coefficient of x8 is now 1.7840/100=0.017840 (consistent with the meaning of a "unit change" in the new measurement unit), which has a smaller absolute value than the coefficient of x2 (which is unchanged).
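The rescaling argument is easy to check numerically (hypothetical x2 and x8 values; only the measurement unit of x8 changes):

```python
# Original model: logit = 0.8584 - 0.2502*x2 + 1.7840*x8
x2, x8 = 1.0, 0.5  # hypothetical predictor values

logit_original = 0.8584 - 0.2502 * x2 + 1.7840 * x8

# Measure x8 in hundredths of the original unit: every data value is
# multiplied by 100, the coefficient is divided by 100, and the model's
# predictions are unchanged.
x8_rescaled = 100 * x8
logit_rescaled = 0.8584 - 0.2502 * x2 + (1.7840 / 100) * x8_rescaled

# The rescaled |coefficient| of x8 is now smaller than that of x2,
# yet x8's role in the model is exactly the same as before.
```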

NKormanik
Barite | Level 11

Binary logistic regression model

Used to model a binary (two-level) response — for example, yes or no.
How to fit it: This model can be fit by many procedures, including the SAS/STAT procedures LOGISTIC and GENMOD (using asymptotic or exact conditional methods), CATMOD (using weighted least squares or maximum likelihood (ML)), HPLOGISTIC, PROBIT, GAM, GAMPL, GLIMMIX, SURVEYLOGISTIC, FMM and HPFMM (using ML or Bayesian estimation), GEE (beginning in SAS 9.4 TS1M2, uses Generalized Estimating Equations and provides robust standard error estimates), MCMC (using Bayesian estimation), ADAPTIVEREG, NLMIXED, HPNLMOD, HPGENSELECT; and the SAS/ETS® procedures MDC and QLIM. The GAM, GAMPL, and ADAPTIVEREG procedures can fit more flexible logistic models by using spline or loess smoothers. The GLIMMIX, NLMIXED, MCMC, and (beginning in SAS 9.4 TS1M4) QLIM procedures allow the inclusion of random effects in the model. Longitudinal or repeated measures data can be modeled using the REPEATED statement in GENMOD or CATMOD, and using the RANDOM statement in GLIMMIX, GEE, MCMC, NLMIXED, or QLIM procedures.

 

Look at all the above possibilities, just for finding a binomial model.  If one seeks the best model for predicting whether someone (y) will complete a PhD, or not, which of the above methods is worth spending time with?

 

Seems such a straight-forward request.

 

One document I came across said that logistic regression was a better starting place than linear regression, in the study of statistics.  Hmmm...

 

If you feel you've said enough, @FreelanceReinh , then I'll just mark your last response as the solution.  Thanks! 

FreelanceReinh
Jade | Level 19

@NKormanik wrote:

If one seeks the best model for predicting whether someone (y) will complete a PhD, or not, which of the above methods is worth spending time with?

 

Seems such a straight-forward request.


Finding "the best model" would be much easier if there were a single numeric criterion with increasing values for "better" models. Unfortunately, things aren't as simple as that. For example, a yes/no prediction can be wrong in two ways: false positive or false negative. Depending on the context, it might be more important to avoid the former than the latter or vice versa. But it's always a compromise because you cannot simply reduce both at the same time (cf. sensitivity vs. specificity, ROC curves and the c statistic in PROC LOGISTIC output). To assess the quality of a logistic regression model you can also look at the global test statistics or the model fit statistics. But a model that fits (too) well to the training data, typically with many predictors, might perform poorly on new data (the problem of "overfitting"). Now you can use training and validation data to address this issue, but, again, there's a multitude of criteria to rate model performance on the validation set. Finally, all these purely statistical measures ignore whether the selected predictors in the model are plausible from a subject matter point of view.
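The two error types are easy to see in a toy confusion-matrix calculation (made-up labels and predictions, unrelated to the SAS example):

```python
# 1 = event (e.g., completed the PhD), 0 = non-event
actual    = [1, 1, 1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives

sensitivity = tp / (tp + fn)  # share of real events caught
specificity = tn / (tn + fp)  # share of non-events correctly rejected
```

Raising the classification cutoff typically trades sensitivity for specificity, which is exactly why no single number crowns "the best model."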

 

I haven't used all of the SAS procedures mentioned in your quote from the documentation (e.g., none requiring a SAS/ETS license). For logistic regression I used PROC LOGISTIC and PROC CATMOD. Special types of predictors (e.g., random effects) or characteristics of the data (e.g., complex survey data) will restrict the set of applicable procedures. Certain features such as automatic variable selection are available in some procedures (e.g., PROC LOGISTIC), but not in others (e.g., PROC CATMOD). Various procedures (e.g., PROC GLIMMIX) are designed for a broader range of statistical models, not just binary logistic regression. As a generic starting point I would recommend PROC LOGISTIC. For a real research project you may want to consult a statistician (at an early stage of the project).

 

 

StatDave
SAS Super FREQ

The model you are fitting has logit(y) as the response function, so a positive parameter estimate for a predictor means that it increases logit(y) by that amount for each unit increase in the predictor. Since logit(y) monotonically increases with the event probability defined on y, that means that increasing the predictor increases the event probability (though not by the amount of the parameter estimate - for that you need a marginal effect as provided by the Margins macro; see this note). Likewise, a negative parameter estimate decreases the event probability. For assessing variable importance, see this note.
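The point that a positive coefficient raises the event probability, but not by the coefficient itself, can be illustrated like this (hypothetical predictor values; coefficients from Output 5.1.6):

```python
import math

def event_prob(x2, x8):
    """Event probability implied by the fitted logit from Output 5.1.6."""
    logit = 0.8584 - 0.2502 * x2 + 1.7840 * x8
    return 1 / (1 + math.exp(-logit))

# A unit increase in x8 adds 1.7840 to the logit, which increases the
# probability -- but by far less than 1.7840, since probabilities are
# bounded by 1:
p_low  = event_prob(x2=1.0, x8=0.0)
p_high = event_prob(x2=1.0, x8=1.0)
```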

StatDave
SAS Super FREQ

See also the interpretation, in terms of the probability of the event, shown in the Getting Started section of the PROC LOGISTIC documentation since PROC LOGISTIC fits the same model.

