Dear all,
Here's the code.
1.
proc logistic data=slide.sb_vm_training outmodel=slide.model;
CLASS N2 N3 N4 N5 N6 N7 N10 N11 N12 N13 /param=ref;
model dv = Prin1 Prin2 Prin3 factor1 factor2 factor3 factor4 factor5 factor6 factor7 factor8 /selection=stepwise ;
run;
2.
proc logistic data=slide.sb_vm_training outmodel=slide.model;
CLASS N2 N3 N4 N5 N6 N7 N10 N11 N12 N13 /param=effect;
model dv = Prin1 Prin2 Prin3 factor1 factor2 factor3 factor4 factor5 factor6 factor7 factor8 /selection=stepwise ;
run;
3.
proc logistic data=slide.sb_vm_training outmodel=slide.model;
CLASS N2 N3 N4 N5 N6 N7 N10 N11 N12 N13 /param=ref;
model dv = Prin1 Prin2 Prin3 factor1 factor2 factor3 factor4 factor5 factor6 factor7 factor8 /selection=stepwise ;
units Prin1=50000 Prin2=50000 Prin3=50000;
run;
I tried the three variants above, but none of them brought a single N variable into the model.
The variables with the prefix "N" are ordinal variables such as nationality and sex; most of them take values in the range 0 to 9.
Prin1-Prin3 are variables extracted by principal component analysis; their values lie between -1 million and +1 million.
factor1-factor8 are variables extracted by factor analysis; their values lie between -2 and +2.
Both sets are, in some way, summaries of continuous variables.
dv is the dependent variable: 1 means the customer will leave, 0 means the customer will stay.
The problem is that with stepwise selection only Prin1 and some of the factor variables remain; not one N variable is kept,
even though, from a business point of view, at least nationality is very useful for predicting whether a customer will leave.
Why does not even one N variable remain?
What is wrong with my PROC LOGISTIC code?
Thanks in Advance.
Dawn
You have specified the variables with an N prefix in the CLASS statement but not as independent variables in the MODEL statement.
PROC LOGISTIC will select only independent variables from the MODEL statement so that it will not select any of the N-prefix variables.
You also state that the N-prefix variables are ordinal variables but provide as examples only nominal variables (nationality, sex).
Generally, you should not perform principal components analysis or factor analysis on nominal variables but preferably only on interval/ratio/continuous variables.
Reference coding is preferred to effect coding in the PROC LOGISTIC CLASS statement because the former is easier to translate into measures of effect (like odds ratios) than the latter. Variable selection in regression procedures has been discussed previously in this forum and is somewhat problematic. Preferable would be some of the methods in PROC GLMSELECT, even though these methods are not optimized for dichotomous dependent variables like those used in logistic regression.
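As a minimal sketch of the fix described above, the N-prefix variables can be listed in the MODEL statement as well as in the CLASS statement, so that stepwise selection can consider them. The data set and variable names are taken from the question. The `event='1'` response option is an added assumption (it makes the procedure model the probability that the customer leaves), and whether every N variable belongs in the candidate list is a judgment the original poster would need to make.

```sas
/* Sketch only: names come from the question; event='1' is an assumption. */
proc logistic data=slide.sb_vm_training outmodel=slide.model;
   /* CLASS only declares how the N variables are coded (reference coding). */
   class N2 N3 N4 N5 N6 N7 N10 N11 N12 N13 / param=ref;
   /* The N variables must also appear in MODEL to be selection candidates. */
   model dv(event='1') = N2 N3 N4 N5 N6 N7 N10 N11 N12 N13
                         Prin1 Prin2 Prin3
                         factor1-factor8
         / selection=stepwise;
run;
```

With reference coding, each CLASS variable enters or leaves the model as a whole effect, so the selection summary should list the N variables by name rather than by individual level.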
1zmm,
Thanks for your patient explanation.
Yes, the variables with the N prefix are nominal variables,
and the Prin and factor variables were generated from continuous variables only.
After I posted this question I looked around the community and found that PROC LOGISTIC combined with a CLASS statement is not recommended; people suggest PROC GLMSELECT, as you said.
I have one more question, if you have time to answer it.
When I use the following code (45 continuous variables):
proc princomp data=slide.sb_vm10 cov outstat=temp_prin1;
var c1-c45;
run;
For example, variable A has a large range (-1M, 1M) and variable B has a small range (-1, 1).
It seems that the eigenvector coefficients (for Prin1, for example) are zero for variables like B.
Do you know how to deal with this?
Thanks in advance.
Of course the coefficient is zero, or nearly so. Variable A explains almost all of the total variation, so the amount of variation left for Variable B is negligible. If you look at the eigenvalues associated with the vectors this should be apparent.
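One way to see this before running the principal component analysis is to compare the variances of the input variables. The sketch below assumes the same data set and variable list as in the question; because the COV option makes PROC PRINCOMP work from the covariance matrix, variables whose variance is orders of magnitude larger than the rest can be expected to dominate the first components.

```sas
/* Sketch: compare spread across the 45 inputs used in the question. */
proc means data=slide.sb_vm10 n min max std var;
   var c1-c45;
run;
```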
The question comes down to RELATIVE variability, so perhaps rescaling would help. Not normalizing, as that will remove differences in variability. Just putting things on the same scale will help.
Although I don't really know how you intend to use the results in forecasting a timeseries.
Steve Denham
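The rescaling suggested above could be sketched as a simple change of units before the analysis. Everything specific here is hypothetical: the sketch assumes that c1-c10 are the large-range variables, that dividing by one million is a sensible unit change for them, and that the output data set may be called work.sb_vm10_rescaled. Unlike standardizing each variable to unit variance, this keeps the remaining differences in variability between variables.

```sas
/* Hypothetical: assume c1-c10 are the variables ranging over about -1M to 1M. */
data work.sb_vm10_rescaled;
   set slide.sb_vm10;
   array big {*} c1-c10;
   do i = 1 to dim(big);
      big{i} = big{i} / 1e6;   /* express in millions: a change of units only */
   end;
   drop i;
run;

/* Same analysis as in the question, now on the rescaled data. */
proc princomp data=work.sb_vm10_rescaled cov outstat=temp_prin1;
   var c1-c45;
run;
```

Which variables to rescale, and by what factor, depends on the units in which each one was recorded.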
Steve,
Glad to see your recommendation: "not normalizing", just "rescaling". I was just about to normalize. You saved me. Thanks.
Dawn