Hi,
This can be tested experimentally, but I would like to prime the work with some discussion of variable distributions and primitive vs. advanced normalization. I have a distribution of times to be used in logistic regression. The majority of the times are short, so the distribution is skewed. I want to rank the times (i.e., bin them into groups of equal size) and use them as an ordinal variable.
Many thanks!
Personally, I would try regularization/standardization for variables before binning.
@pink_poodle wrote:
Hi,
This can be tested experimentally, but I would like to prime the work with some discussion of variable distributions and primitive vs. advanced normalization. I have a distribution of times to be used in logistic regression. The majority of the times are short, so the distribution is skewed. I want to rank the times (i.e., bin them into groups of equal size) and use them as an ordinal variable.
- Is it the same as setting a unit change in time to group size in the UNITS statement?
- Ranked times will have a uniform distribution. Is it legit to use such a variable in PROC LOGISTIC?
- Should I be setting some sort of a reference level, or can I use the ranked time as a continuous-ordinal variable?
Many thanks!
If your question is about how to use a continuous but positively skewed variable, TIME, as a predictor in a logistic regression model, then I suggest you simply use the variable, as is, in your model. Logistic regression makes no assumption about the distribution of the predictors, and categorizing a continuous variable simply throws away information that could be used to better model its effect. If the association between TIME and your response is complex, you can use a spline on TIME (in the EFFECT statement) to allow for as much flexibility as needed, or you can simply add higher-order polynomial terms such as quadratic, cubic, and so on. That cannot be done if the variable is categorized and used in the CLASS statement. Regarding the UNITS statement, it is only used to modify the odds ratio estimates computed after the model is fitted. The UNITS statement has no effect on how a predictor is used in the model.
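As a rough sketch of the options mentioned above, something like the following could be tried. The data set name WORK.VISITS, the binary response Y (coded 1 for the event), the 10-unit change in the UNITS statement, and the choice of 5 percentile-based knots are all assumptions made for illustration; they are not from the original post and should be adapted to your data.

```sas
/* Hypothetical data set WORK.VISITS: binary response Y (1 = event), */
/* continuous, positively skewed predictor TIME.                     */

/* 1. TIME used as is. UNITS only rescales the reported odds ratio;  */
/*    the fitted model is the same with or without it.               */
proc logistic data=work.visits;
   model y(event='1') = time;
   units time = 10;   /* odds ratio for a 10-unit increase in TIME (assumed value) */
run;

/* 2. Add a quadratic term to allow a curved relationship. */
proc logistic data=work.visits;
   model y(event='1') = time time*time;
run;

/* 3. Natural cubic spline on TIME for a more flexible shape. */
/*    The number of knots (5) is an arbitrary illustrative choice. */
proc logistic data=work.visits;
   effect time_spl = spline(time / naturalcubic basis=tpf(noint)
                                   knotmethod=percentiles(5));
   model y(event='1') = time_spl;
run;
```

In the first step, if b is the estimated coefficient for TIME, the UNITS statement reports exp(10*b) instead of exp(b). The parameter estimates and fit statistics do not change, which is why it is not equivalent to binning TIME into ranked groups.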