I have categorical variables (Education and Gender) and numerical variables (Age, Income, and number of policies). Age and Income should be grouped into the bins shown below. I want to fit a logistic regression model to this data.

Which of the following two code versions is correct? In other words, should the categorical variables (Education and Gender) and the binned numeric variables (Age_Group and Income_group) be declared explicitly as class variables (Code 2), or can all of these variables simply be listed together in the MODEL statement (Code 1)?

Target  Age  Income    Education      Gender  Policy_count
1       26   500000    Graduate       F       2
0       38   300000    Graduate       M       4
0       42   1000000   Post Graduate  F       3
0       68   2200000   Post Graduate  M       5
0       18   65000     12th           M       1
0       71   3500000   Post Graduate  M       2
1       40   2400000   Post Graduate  M       2
1       43   5000000   Post Graduate  M       1
1       52   7000000   Post Graduate  M       5
0       61   10000000  PHD            F       7
0       33   650000    Graduate       M       3
0       14   80000     10th           M       1
0       58   200000    Graduate       M       4

Age_Group  Income_group
<20        <100000
20-30      100000-500000
30-40      500000-1000000
40-50      1000000-2000000
50-60      >2000000
>60

/*Code 1:*/
proc logistic data=test descending
plots(only)=(roc(id=obs) effect) PLOTS(MAXPOINTS=NONE)
namelen=34 outmodel=Logistic_Result;
model target=
Age_Group
Income_group
Policy_count
Gender
Education
/ selection=stepwise
slentry=0.05
slstay=0.05
outroc = ROC_Stats
lackfit rsq stb;
output out =pred p=phat ;
run;
/*Code 2:*/
PROC LOGISTIC DATA=test
Namelen=34 PLOTS(ONLY)=ALL;
CLASS age_group (PARAM=EFFECT) income_group (PARAM=EFFECT) Gender (PARAM=EFFECT) Education (PARAM=EFFECT);
Model target (event='1')= Policy_count
/ SELECTION=STEPWISE
SLE=0.05
SLS=0.05
LACKFIT
LINK=LOGIT
CLPARM=WALD
CLODDS=WALD
ALPHA=0.05;
RUN;
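For reference, here is a minimal sketch of how Age_Group and Income_group could be derived from Age and Income using the bins listed above. The bin edges in the table overlap (for example, 30 appears in both 20-30 and 30-40), so this sketch assumes each bin includes its lower bound and excludes its upper bound. The format names agefmt and incfmt and the output data set test_binned are hypothetical names chosen for illustration.

```sas
/* Sketch only: map Age and Income to the group labels listed above.   */
/* Assumption: a bin includes its lower bound and excludes its upper   */
/* bound (Age=30 goes to 30-40); the top bins include their lower      */
/* bound (Age=60 goes to >60). The post does not specify this.         */
proc format;
   value agefmt                      /* hypothetical format name */
      low     -<  20      = '<20'
      20      -<  30      = '20-30'
      30      -<  40      = '30-40'
      40      -<  50      = '40-50'
      50      -<  60      = '50-60'
      60      -   high    = '>60';
   value incfmt                      /* hypothetical format name */
      low     -<  100000  = '<100000'
      100000  -<  500000  = '100000-500000'
      500000  -<  1000000 = '500000-1000000'
      1000000 -<  2000000 = '1000000-2000000'
      2000000 -   high    = '>2000000';
run;

data test_binned;                    /* hypothetical output data set */
   /* Declare lengths first so the longest labels are not truncated */
   length Age_Group $5 Income_group $15;
   set test;
   Age_Group    = put(Age, agefmt.);
   Income_group = put(Income, incfmt.);
run;
```

With these assumptions, the first row of the data (Age 26, Income 500000) would get Age_Group 20-30 and Income_group 500000-1000000. Because the resulting group variables are character, they cannot be used as continuous predictors in the MODEL statement.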