pink_poodle
Barite | Level 11

Hi,

This can be tested experimentally, but I would like to prime the work with some discussion of variable distributions and primitive vs. advanced normalization. I have a distribution of times to be used in logistic regression. The majority of the times are short, so the distribution is skewed. I want to rank the times (i.e., bin them into groups of equal size) and use them as an ordinal variable.

  • Is it the same as setting a unit change in time to group size in the UNITS statement?
  • Ranked times will have a uniform distribution - is it legit to use such a variable in PROC LOGISTIC?
  • Should I be setting some sort of a reference level, or can I use the ranked time as a continuous-ordinal variable?

Many thanks!

Accepted Solutions
StatDave
SAS Super FREQ

If you are saying that your question is about how to use a continuous but positively skewed variable, TIME, as a predictor in a logistic regression model, then I suggest you simply use the variable, as is, in your model. Logistic regression makes no assumption about the distribution of the predictors. And categorizing a continuous variable simply throws away information that can be used to better model its effect. If the association between TIME and your response is complex, you can even use a spline on TIME (in the EFFECT statement) to allow for as much flexibility as needed, or you could simply add higher-order polynomial terms like quadratic, cubic, and so on. That cannot be done if the variable is categorized and used in the CLASS statement. Regarding the UNITS statement, that is only used to modify the odds ratio estimates computed after the model is fitted. The UNITS statement has no effect on how a predictor is used in the model.
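
A minimal sketch of the three options described above, assuming a hypothetical data set WORK.VISITS with a 0/1 response named OUTCOME and the skewed predictor TIME in minutes (none of these names come from the original question). The spline settings are only one reasonable choice, not a recommendation:

```sas
/* Assumed names: WORK.VISITS, OUTCOME (coded 0/1), TIME (minutes). */

/* 1. TIME used as is. UNITS only adds an odds ratio for a       */
/*    10-minute increase; the fitted model is unchanged.         */
proc logistic data=work.visits;
   model outcome(event='1') = time;
   units time = 10;
run;

/* 2. A quadratic term to allow a curved effect of TIME. */
proc logistic data=work.visits;
   model outcome(event='1') = time time*time;
run;

/* 3. A natural cubic spline on TIME via the EFFECT statement,   */
/*    with 5 knots placed at percentiles of TIME.                */
proc logistic data=work.visits;
   effect spl = spline(time / naturalcubic basis=tpf(noint)
                              knotmethod=percentiles(5));
   model outcome(event='1') = spl;
run;
```

Comparing the fit statistics of the three runs is one way to judge whether the extra flexibility is needed.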


8 REPLIES
Reeza
Super User

Personally, I would try regularization/standardization for variables before binning. 



pink_poodle
Barite | Level 11

@Reeza, could you please be more specific?

 

Reeza
Super User
Are you familiar with the concept of variable standardization and why having variables in different units is problematic?

https://statisticsbyjim.com/regression/standardize-variables-regression/
pink_poodle
Barite | Level 11
Thank you, this is a very nice website. The times are all in minutes. If I subtract the mean from the times to center them, many will be negative.
Reeza
Super User
A negative standardized value means the observation is less than the average, which is informative in terms of comparisons and predictions but may not make intuitive sense initially. For explanations and odds ratios, you could convert it back to the original scale, though the nonlinear relationship makes it difficult to explain.

There are various standardization methods, and not all of them would give you negative values, e.g., the RANGE method. Here are a few from the PROC STDIZE documentation:
https://documentation.sas.com/doc/en/pgmsascdc/9.4_3.3/statug/statug_stdize_details01.htm
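
As a rough illustration of the difference between two of those methods, here is a sketch assuming a hypothetical data set WORK.VISITS with the variable TIME (these names are placeholders, not from the thread). METHOD=STD subtracts the mean and divides by the standard deviation, so below-average times come out negative; METHOD=RANGE subtracts the minimum and divides by the range, so values fall between 0 and 1:

```sas
/* Assumed names: WORK.VISITS and TIME are placeholders.         */
/* In each OUT= data set, TIME holds the standardized values.    */

/* z-scores: (time - mean) / std, negative for below-average times */
proc stdize data=work.visits method=std out=work.visits_z;
   var time;
run;

/* range scaling: (time - min) / (max - min), values in [0, 1] */
proc stdize data=work.visits method=range out=work.visits_r;
   var time;
run;
```

Both are linear rescalings, so neither changes the shape of the distribution; the skewness of TIME stays the same.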

Are you looking to build a model for predictive power?
pink_poodle
Barite | Level 11
I already built it with a continuous variable. Just trying to determine if there are additional benefits to be had from using time as an ordinal variable.
Reeza
Super User
Typically you lose information when binning/categorizing variables as already mentioned by @StatDave.
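
If you still want to check this on your own data, one possible comparison looks like the sketch below. It assumes a hypothetical data set WORK.VISITS with a 0/1 response OUTCOME and the predictor TIME; the choice of 10 groups is arbitrary. PROC RANK with GROUPS=10 assigns group numbers 0 through 9 of roughly equal size (ties can make the groups uneven):

```sas
/* Assumed names: WORK.VISITS, OUTCOME (coded 0/1), TIME. */

/* Bin TIME into 10 rank groups, numbered 0 to 9. */
proc rank data=work.visits groups=10 out=work.visits_ranked;
   var time;
   ranks time_decile;
run;

/* Binned version: TIME_DECILE as a CLASS variable, with the     */
/* lowest group as the reference level.                          */
proc logistic data=work.visits_ranked;
   class time_decile / param=ref ref=first;
   model outcome(event='1') = time_decile;
run;

/* Continuous version, for comparison. */
proc logistic data=work.visits_ranked;
   model outcome(event='1') = time;
run;
```

The AIC and SC values in the "Model Fit Statistics" table of each run give a quick way to compare the two parameterizations.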

