Solved: Re: Logistic regression (proc logistic) vars preparation questions

juanvg1972 · Posted 08-05-2016 03:35 AM

Hi,

I am using logistic regression to predict a binary target var (1,0).
One of the vars in my model is a classification var:

a_type = ("high", "medium", "low"); it is a predictor var, not the target.

I use proc logistic.

I don't know if it is recommended to transform this var into dummy vars like this:

a_type_high = (1,0)
a_type_medium = (1,0)
a_type_low = (1,0)

I suppose that kind of var is better for logistic regression, isn't it?
If I don't transform the vars, does the proc do the transformation automatically?

Another question: I also have several continuous/quantitative vars like sales (0-50) and mkt_exp (0-1000).
Do I have to do a normalization to transform each into a var with avg = 0 and std = 1? Is that needed?

Thanks

JasonXin · Posted 08-05-2016 10:34 AM

Hi,

When you list the variable in the CLASS statement of proc logistic, the CLASS option Param=EFFECT creates the dummy variables for you. The other commonly used option is Param=GLM. There is no hard-and-fast rule as to which one is better; for predictive modeling with proc logistic, these two are the most popular. There are about 8 other options you can explore if your work is more sensitive to the design matrix.

An issue that is actually more important for you is the missing-value status of the categorical variable. The CLASS statement has a MISSING option you can read about. In general, proc logistic has been optimized continuously so that users do not have to spend time coding these things manually.

As for normalization, the question really is whether the model is sensitive to the distribution of the input (interval) variables. If a variable is far from normal, you should not normalize it. Other factors include:

1. Your link function. Many link functions are distribution tolerant, but not all.
2. Sample size. Many modelers tend to ignore normality of input variables when the model universe is big.
3. Normality really matters only if the univariate study of the input variable is critical for fitting the model; in models like logistic regression, interactions among inputs are often more influential.
4. If you do normalize an interval input, the marginal improvement in its contribution to the model's overall predictive accuracy tends to be, first, hard to measure and, second, if measurable, insignificant.

Hope this helps. Thanks for using SAS.

Jason Xin
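
To make the CLASS statement advice concrete, here is a minimal sketch. The data set name work.customers and the target name target are assumptions for illustration only; a_type, sales and mkt_exp are the vars from the question.

```sas
/* Sketch only: work.customers and target are hypothetical names. */
proc logistic data=work.customers;
  /* CLASS builds the design (dummy) columns for a_type, so no manual
     a_type_high / a_type_medium / a_type_low vars are needed.
     MISSING treats a missing a_type as a level of its own instead of
     dropping those rows. */
  class a_type / param=effect missing;
  /* Model the probability that target = 1. */
  model target(event='1') = a_type sales mkt_exp;
run;
```

Swapping param=effect for param=glm changes how the levels are coded and how the estimates are read, but not which var goes into the model.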

Reeza · Posted 08-05-2016 03:48 AM

Categorical variables should be placed in the CLASS statement.

If it's your first time doing an analysis, I'd suggest finding a worked example, working through that, and then proceeding to your own data.

The documentation has a good example of analysis with categorical predictors.

Another resource:

http://www.ats.ucla.edu/stat/sas/dae/logit.htm

Normalization is up to you. If you choose to do so, look at proc stdize.
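
If you do decide to standardize, a minimal proc stdize sketch could look like this. The data set names work.customers and work.customers_std are hypothetical; sales and mkt_exp are the vars from the question.

```sas
/* Sketch only: input and output data set names are hypothetical. */
proc stdize data=work.customers out=work.customers_std method=std;
  /* METHOD=STD subtracts the mean and divides by the standard
     deviation, giving each listed var avg = 0 and std = 1. */
  var sales mkt_exp;
run;
```

The output data set can then be used in place of the original one in the DATA= option of proc logistic.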

Are you using SAS Enterprise Miner?

juanvg1972 · Posted 08-05-2016 11:04 AM

Thanks, I am using Enterprise Guide, not Miner.

I am not sure when to standardize or not.

Thanks for your help

Reeza · Posted 08-05-2016 10:48 AM

In the CLASS statement, look at the parameterization options. AFAIK param = Ref is the most common, and most easily interpretable way of specifying your variables. Make sure you review the design matrix and understand your output.
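
A minimal sketch of reference coding, assuming a hypothetical data set work.customers, a hypothetical target named target, and "low" chosen as the baseline level purely for illustration:

```sas
/* Sketch only: work.customers, target and the choice of 'low' as the
   reference level are assumptions. */
proc logistic data=work.customers;
  /* PARAM=REF creates one design column per non-reference level, so
     the a_type estimates compare high and medium against low. */
  class a_type (ref='low') / param=ref;
  model target(event='1') = a_type sales mkt_exp;
run;
```

The Class Level Information table in the output shows the design variables built for each level of a_type, which is one place to review the coding.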