Hello
I am following the ANOVA and regression tutorial by SAS, and here is the code the tutor used to identify potential outliers and influential observations:
%let interval=Gr_Liv_area Basement_area Deck_porch_area Lot_area Age_sold Bedroom_abvGr Total_bathroom;
ods select none;
proc glmselect data=stat1.ameshousing3 plots=all;
Stepwise: model saleprice = &interval / selection=stepwise details=steps select=SL slentry=0.05 slstay=0.05;
run;
quit;
ods select all;
ods graphics on;
ods output RSTUDENTBYPREDICTED=Rstud
COOKSDPLOT=Cook
DFFITSPLOT=Dffits
DFBETASPANEL=Dfbs;
proc reg data=stat1.ameshousing3
plots(only label)=
(RSTUDENTBYPREDICTED COOKSD DFFITS DFBETAS);
Siglimit: model saleprice=&_GLSIND;
title 'siglimit model plots of diagnostics stats';
run;
quit;
My question is: how can I identify potential outliers and influential observations if I am working with a binary dependent variable and using PROC LOGISTIC? My binary dependent variable codes a bad customer as 0 and a good customer as 1. Can you please help? Thanks.
In practice, you can often use the binary response variable as the response variable in a linear regression model and it works surprisingly well. But don't tell anyone that I said that! 🙂
For linear regression, the influence diagnostics include the DFBETAS statistics, the DFFITS statistics, and Cook's distance (D). Some people also look at the leverage statistic (H). Similar "deletion diagnostics" statistics are available and documented in PROC LOGISTIC.
- The DFBETAS=_ALL_ option on the OUTPUT statement writes the DFBETAS statistics to the output data set.
- The H= option on the OUTPUT statement writes the leverage statistics to the output data set.
- There are various kinds of residuals in logistic models, so I'll let you read about the other options.
You can use the PLOTS=INFLUENCE option on the PROC LOGISTIC statement to get plots. You can use the INFLUENCE option on the MODEL statement to display a table.
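As a minimal sketch of how the options above fit together, the step below requests the influence plots, the influence table, and an output data set of diagnostics. The data set name work.customers, the response good_bad, and the predictors x1 and x2 are hypothetical placeholders, not names from the original post. The event level '1' assumes that good customers are coded as 1, as described in the question.

```sas
/* Sketch only: work.customers, good_bad, x1 and x2 are hypothetical names. */
ods graphics on;
proc logistic data=work.customers plots(only label)=(influence dfbetas);
   /* Model the probability of a good customer (coded 1);      */
   /* INFLUENCE prints the table of regression diagnostics.    */
   model good_bad(event='1') = x1 x2 / influence;
   /* Save per-observation diagnostics for later inspection.   */
   output out=work.diag
          predicted=PredProb     /* fitted event probability          */
          h=Leverage             /* diagonal of the hat matrix        */
          c=C cbar=CBar          /* confidence interval displacement  */
          reschi=PearsonResid    /* Pearson residual                  */
          resdev=DevResid        /* deviance residual                 */
          difchisq=DifChiSq      /* change in Pearson chi-square      */
          difdev=DifDev          /* change in deviance                */
          dfbetas=_all_;         /* one DFBETA variable per parameter */
run;
```

The LABEL plot option puts case numbers on the diagnostic plots, which makes it easier to match a point that stands out back to a row of work.diag.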
Did you check the documentation of PROC LOGISTIC, especially its examples?
Check the C and CBAR statistics (the logistic analogs of Cook's D) and the H (leverage) statistic.
proc logistic data=want outest=est(keep=intercept &varlist);
model good_bad(event='good')= &varlist
/outroc=x.roc lackfit scale=none aggregate rsquare firth corrb /* selection=stepwise sle=0.1 sls=0.1*/ ;
output out=output h=h c=c cbar=cbar predicted=PredProb;
run;
proc sort data=output out=check_c ;
by descending c;
run;
proc sort data=output out=check_h ;
by descending h;
run;
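One possible way to inspect the sorted data sets is to print the first few rows of each. The cutoff of 10 rows is an arbitrary choice for illustration, not a recommended threshold; the variable names come from the OUTPUT statement above.

```sas
/* Sketch only: list the 10 observations with the largest C values. */
proc print data=check_c(obs=10);
   var c cbar h PredProb;
   title 'Observations with the largest C values';
run;

/* Sketch only: list the 10 observations with the largest leverage. */
proc print data=check_h(obs=10);
   var h c cbar PredProb;
   title 'Observations with the largest leverage (H) values';
run;
title;   /* clear the title so it does not carry over to later steps */
```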
All good points from @Rick_SAS and @Ksharp .
I would add that DFBETAS, DFFITS and Cook's D from PROC REG really don't apply in the logistic case, where the response is binary, ordinal or nominal. These statistics from PROC REG assume you have continuous Y values, and I would not trust them if Y is not continuous. On the other hand, the H (leverage) statistic does not use the value of Y, so it doesn't matter if Y is continuous or not. The other diagnostic statistics from PROC LOGISTIC that have been mentioned all use the proper estimation method (maximum likelihood) to measure the effect on the regression line, which takes into account that the response is binary, ordinal or nominal.
Thank You for your help. It worked.