In the documentation for PROC HPLOGISTIC, there is little explanation of how to interpret the resulting variable coefficients.
Example 5.1 Model Selection :: SAS/STAT(R) 12.3 User's Guide: High-Performance Procedures
Wondering if someone can please shed more light.
Output 5.1.6: Parameter Estimates
| Parameter | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
|---|---|---|---|---|---|
| Intercept | 0.8584 | 0.5503 | Infty | 1.56 | 0.1188 |
| x2 | -0.2502 | 0.1146 | Infty | -2.18 | 0.0290 |
| x8 | 1.7840 | 0.7908 | Infty | 2.26 | 0.0241 |
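As a quick sanity check on these rows (a sketch in Python, not SAS output): with DF = Infty, the t value is the estimate divided by its standard error, and the p-value comes from the standard normal distribution.

```python
import math

def two_sided_p(estimate, stderr):
    """Two-sided p-value for t = estimate/stderr against a standard
    normal distribution (appropriate here because DF = Infty)."""
    t = estimate / stderr
    return math.erfc(abs(t) / math.sqrt(2))

# x2 row of Output 5.1.6: Estimate = -0.2502, Standard Error = 0.1146
t_x2 = -0.2502 / 0.1146
p_x2 = two_sided_p(-0.2502, 0.1146)
print(round(t_x2, 2))  # ≈ -2.18, matching the t Value column
print(round(p_x2, 3))  # ≈ 0.029, matching Pr > |t|
```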
So, for instance, the x8 coefficient of 1.7840, and x2 of -0.2502. How do you interpret those? How do they affect y?
Thanks much!
Nicholas Kormanik
@NKormanik wrote:
If one seeks the best model for predicting whether someone (y) will complete a PhD, or not, which of the above methods is worth spending time with?
Seems such a straightforward request.
Finding "the best model" would be much easier if there were a single numeric criterion with increasing values for "better" models. Unfortunately, things aren't as simple as that. For example, a yes/no prediction can be wrong in two ways: false positive or false negative. Depending on the context, it might be more important to avoid the former than the latter or vice versa. But it's always a compromise because you cannot simply reduce both at the same time (cf. sensitivity vs. specificity, ROC curves and the c statistic in PROC LOGISTIC output). To assess the quality of a logistic regression model you can also look at the global test statistics or the model fit statistics. But a model that fits (too) well to the training data, typically with many predictors, might perform poorly on new data (the problem of "overfitting"). Now you can use training and validation data to address this issue, but, again, there's a multitude of criteria to rate model performance on the validation set. Finally, all these purely statistical measures ignore whether the selected predictors in the model are plausible from a subject matter point of view.
I haven't used all of the SAS procedures mentioned in your quote from the documentation (e.g., none requiring a SAS/ETS license). For logistic regression I used PROC LOGISTIC and PROC CATMOD. Special types of predictors (e.g., random effects) or characteristics of the data (e.g., complex survey data) will restrict the set of applicable procedures. Certain features such as automatic variable selection are available in some procedures (e.g., PROC LOGISTIC), but not in others (e.g., PROC CATMOD). Various procedures (e.g., PROC GLIMMIX) are designed for a broader range of statistical models, not just binary logistic regression. As a generic starting point I would recommend PROC LOGISTIC. For a real research project you may want to consult a statistician (at an early stage of the project).
Hello @NKormanik,
I think the interpretation is the same as for PROC LOGISTIC. In this particular example of a binary logistic regression the model estimates the logit log(P(Y=0)/P(Y=1)) as 0.8584 - 0.2502*x2 + 1.7840*x8. As a consequence, a unit increase in x2 -- everything else being the same -- would create an odds ratio of exp(-0.2502)=0.778...<1, meaning a lower "risk" (or "chance") of having Y=0 with larger values of x2. Similarly, if only x8 was increased by, say, 0.2, the odds ratio (compared to a population without that increase) would be estimated as exp(0.2*1.7840)=1.428...>1, meaning a higher risk of Y=0 with those increased x8 values.
Note, however, that this is only a model, so that the real odds ratios for concrete populations might differ considerably from these estimates. See also the confidence intervals for the coefficients (and how changes of them within the intervals would affect the odds ratio calculations), which can be requested with the CL option of the MODEL statement.
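The odds-ratio arithmetic in this answer is easy to verify with a short calculation (a sketch; the coefficient values are taken from Output 5.1.6 above):

```python
import math

# Coefficients from the fitted model for logit = log(P(Y=0)/P(Y=1))
b_x2 = -0.2502
b_x8 = 1.7840

# Odds ratio for a one-unit increase in x2, everything else the same
or_x2 = math.exp(b_x2)        # ≈ 0.78 < 1: lower odds of Y=0
# Odds ratio for a 0.2-unit increase in x8
or_x8 = math.exp(0.2 * b_x8)  # ≈ 1.43 > 1: higher odds of Y=0
print(or_x2, or_x8)
```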
Makes sense somewhat, @FreelanceReinh and @StatDave
I've read such in other documentation.
But..., if we can simplify the results in the example as much as possible -- let's say, for a freshman in high school...
The other variables in the model -- x1, x3, x4, x5, x6, x7, x9, x10 -- since they were excluded from the chosen final model, are we to consider them as not having much value for arriving at a high probability of achieving y? (Let's say y is, ohh, eventually earning a PhD, just for instance.)
ONLY x2 and x8 statistically point to whether or not success might be reached.
And then the coefficients:
| Parameter | Estimate |
|---|---|
| x8 | 1.7840 |
| x2 | -0.2502 |
High levels of the variable x8 indicate better chances of Y?
Low levels of x2 indicate better chances of Y?
x8, having a much larger coefficient than x2, is far more important in possibly affecting outcome?
@NKormanik wrote:
The other variables in the model -- x1, x3, x4, x5, x6, x7, x9, x10 -- since they were excluded from the chosen final model, are we to consider them as not having much value for arriving at a high probability of achieving y? (Let's say y is, ohh, eventually earning a PhD, just for instance.)
ONLY x2 and x8 statistically point to whether or not success might be reached.
@NKormanik wrote:
And then the coefficients:
x8 1.7840
x2 -0.2502
High levels of the variable x8 indicate better chances of Y?
Low levels of x2 indicate better chances of Y?
"... better chances of Y=0" in your example. As mentioned earlier, the condition "everything else being the same" is important (but possibly hard to fulfill in reality). In the presence of potentially strong interactions between predictors such as x2 and x8 the combination of "promising" values of x2 and x8 might perform surprisingly poorly. Also, the estimated coefficients alone do not tell whether the model fits well to the data. And, in general, their confidence intervals could be so wide that they include both positive and negative values.
@NKormanik wrote:
x8, having a much larger coefficient than x2, is far more important in possibly affecting outcome?
No, you cannot draw this conclusion. Note that for continuous predictors like x1, ..., x10 a change of their measurement unit (i.e., really an arbitrary change) has a proportional impact on the model coefficients. For example, measuring x8 in one hundredths of the original unit means that all values of x8 in dataset getStarted must be multiplied by 100. Not surprisingly, this doesn't change the (absolute or relative) significance of x8 in the model at all. Yet the coefficient of x8 is now 1.7840/100=0.017840 (consistent with the meaning of a "unit change" in the new measurement unit), which has a smaller absolute value than the coefficient of x2 (which is unchanged).
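The rescaling argument can be illustrated numerically (a sketch; only the coefficient of x8 is used): multiplying the x8 values by 100 divides its coefficient by 100, and the odds ratio for any given real-world change is unchanged.

```python
import math

b_x8 = 1.7840                 # coefficient with x8 in the original unit
b_x8_rescaled = b_x8 / 100    # coefficient after multiplying x8 values by 100

# A 0.2-unit change on the original scale is a 20-unit change on the
# new scale; both yield the same odds ratio because 0.2*b = 20*(b/100).
or_original = math.exp(0.2 * b_x8)
or_rescaled = math.exp(20 * b_x8_rescaled)
print(math.isclose(or_original, or_rescaled))  # True
```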
Binary logistic regression model
Used to model a binary (two-level) response — for example, yes or no.
How to fit it: This model can be fit by many procedures, including the SAS/STAT procedures LOGISTIC and GENMOD (using asymptotic or exact conditional methods), CATMOD (using weighted least squares or maximum likelihood (ML)), HPLOGISTIC, PROBIT, GAM, GAMPL, GLIMMIX, SURVEYLOGISTIC, FMM and HPFMM (using ML or Bayesian estimation), GEE (beginning in SAS 9.4 TS1M2, uses Generalized Estimating Equations and provides robust standard error estimates), MCMC (using Bayesian estimation), ADAPTIVEREG, NLMIXED, HPNLMOD, HPGENSELECT; and the SAS/ETS® procedures MDC and QLIM. The GAM, GAMPL, and ADAPTIVEREG procedures can fit more flexible logistic models by using spline or loess smoothers. The GLIMMIX, NLMIXED, MCMC, and (beginning in SAS 9.4 TS1M4) QLIM procedures allow the inclusion of random effects in the model. Longitudinal or repeated measures data can be modeled using the REPEATED statement in GENMOD or CATMOD, and using the RANDOM statement in GLIMMIX, GEE, MCMC, NLMIXED, or QLIM procedures.
Look at all the above possibilities, just for finding a binomial model. If one seeks the best model for predicting whether someone (y) will complete a PhD, or not, which of the above methods is worth spending time with?
Seems such a straightforward request.
One document I came across said that logistic regression was a better starting place than linear regression, in the study of statistics. Hmmm...
If you feel you've said enough, @FreelanceReinh , then I'll just mark your last response as the solution. Thanks!
The model you are fitting has logit(y) as the response function, so a positive parameter estimate for a predictor means that it increases logit(y) by that amount for each unit increase in the predictor. Since logit(y) monotonically increases with the event probability defined on y, that means that increasing the predictor increases the event probability (though not by the amount of the parameter estimate - for that you need a marginal effect as provided by the Margins macro. See this note). So, a negative parameter estimate decreases the event probability. For assessing variable importance, see this note.
See also the interpretation, in terms of the probability of the event, shown in the Getting Started section of the PROC LOGISTIC documentation since PROC LOGISTIC fits the same model.
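A small illustration of the point about the logit scale (a sketch with made-up logit values, not taken from the example): the inverse logit is monotone, so a positive coefficient raises the event probability, but the probability change is not equal to the coefficient.

```python
import math

def inv_logit(z):
    """Event probability corresponding to a logit value."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logits before and after a one-unit increase in a
# predictor with coefficient 1.7840 (baseline logit chosen arbitrarily)
z0 = 0.5
z1 = z0 + 1.7840

p0, p1 = inv_logit(z0), inv_logit(z1)
print(p1 > p0)   # True: the probability increases with the logit
print(p1 - p0)   # much smaller than the coefficient 1.7840
```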