fatemeh
Quartz | Level 8

Hello all,

I have a multivariable data set with 60 observations. My response is a categorical variable with two levels (good, bad), and all independent variables are numerical. Because my response is categorical rather than numerical, can I apply PROC ROBUSTREG to detect influential points using this SAS code?

data mydata;
   set mydata;
   y=ranuni(3);
run;
proc robustreg data=mydata method=lts;
   model y = t1-t7 / diagnostics leverage;
run;
proc logistic data=mydataset descending;
   model Y=var1 var2 var3 var4 var5 var6 var7 / plcl plrl waldcl waldrl
                                                lackfit rsq
                                                influence iplots
                                                itprint;
   ods output influence=myinfluence;
run;

Can I change the cutoff value to the 90th percentile for detecting outliers and leverage points when applying logistic regression and robust regression?

Any help will be appreciated.

Accepted Solutions
Rick_SAS
SAS Super FREQ

First, make sure you have read the last section of this article about PROC ROBUSTREG: "Detecting outliers in SAS: Part 3: Multivariate location and scatter."

The article advises that you use the DIAGNOSTICS and LEVERAGE(MCDINFO) options on the MODEL statement in PROC ROBUSTREG. As you say, since high-leverage (influential) points are defined in the space of the explanatory variables, the Y variable does not matter, so you can use a random variable as the response. According to the ROBUSTREG documentation, you can control the cutoff by using the CUTOFFALPHA= suboption like this:

LEVERAGE(CUTOFFALPHA=0.1 MCDINFO)
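
For context, here is a minimal sketch of how that suboption might sit in a complete call. It assumes the explanatory variables are t1-t7 as in the question. The data set names work.lev and work.diag and the variable names y_random, robust_dist, and is_leverage are hypothetical. A separate placeholder response is used so that the real good/bad response is not overwritten.

```sas
/* Sketch only: work.lev, work.diag, and the new variable names are hypothetical. */
data work.lev;
   call streaminit(3);            /* seed for the RAND function */
   set mydata;
   y_random = rand("Normal");     /* placeholder response; leverage does not depend on it */
run;

proc robustreg data=work.lev method=lts;
   /* CUTOFFALPHA=0.1 requests a cutoff at the 90th percentile of the reference distribution */
   model y_random = t1-t7 / diagnostics leverage(cutoffalpha=0.1 mcdinfo);
   /* RD= saves the robust distance; LEVERAGE= saves an indicator for leverage points */
   output out=work.diag rd=robust_dist leverage=is_leverage;
run;
```

The OUTPUT statement is optional. It is included here only so that the flagged observations can be inspected or filtered in a later step.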

 

For the LOGISTIC model, I'm not sure what statistic you are trying to control. The influence diagnostics? I suggest you try changing the ALPHA= option on the PROC LOGISTIC statement. If that doesn't work, report back and we can think about it some more.
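
If the goal is a 90th-percentile cutoff on a logistic diagnostic, one possible approach (a sketch, separate from the ALPHA= suggestion above) is to save the diagnostics with the OUTPUT statement and compute the empirical percentile yourself. The data set names work.infl, work.cut, and work.flagged and the variable names hat, cdist, hat_p90, and high_leverage are hypothetical. The hat diagonal is used only as an example statistic.

```sas
/* Sketch only: output data set and variable names are hypothetical. */
proc logistic data=mydataset descending;
   model Y = var1-var7;
   /* H= saves the hat-matrix diagonal; C= saves a confidence interval displacement diagnostic */
   output out=work.infl h=hat c=cdist;
run;

/* Empirical 90th percentile of the hat diagonal */
proc means data=work.infl noprint;
   var hat;
   output out=work.cut p90=hat_p90;
run;

/* Flag observations whose hat value exceeds that percentile */
data work.flagged;
   if _n_ = 1 then set work.cut(keep=hat_p90);
   set work.infl;
   high_leverage = (hat > hat_p90);
run;
```

By construction, this flags roughly 10% of the observations regardless of how unusual they are, so the flagged points are candidates for inspection rather than confirmed outliers.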

 

 


fatemeh
Quartz | Level 8

Thanks a lot for this fantastic and helpful article, "Detecting outliers in SAS: Part 3: Multivariate location and scatter."

As mentioned in the article, y is defined as a random normal variable. Because high-leverage (influential) points are in the space of the explanatory variables, the Y variable does not matter:

y=rannor(1);

I have 3 questions:

1. Can I use "y=ranuni(1);" instead of the normal distribution?

2. About QUANTILE=n: what is the quantile? Is it the same kind of quantile that we get from PROC UNIVARIATE (for example, the 75% Q3), so that it picks the observations whose independent-variable values are larger than Q3 without using the MCD distance? I read the definitions of the quantile and alpha in the SAS 9.4 documentation, but they are not clear to me.

3. When I applied LEVERAGE(CUTOFFALPHA=0.1 MCDINFO), the log gave me this warning:

"WARNING: The behavior of the leverage CUTOFFALPHA option has
changed from previous releases. To revert to the
previous behavior, specify the same value for both the
CUTOFFALPHA and the MCDALPHA options."
Any help will be appreciated.

Rick_SAS
SAS Super FREQ

1. Yes. However, RANNOR and RANUNI are deprecated, so start using RAND("Normal") or RAND("Uniform") instead.

2. This is the critical value of the test statistic. If the squared Mahalanobis distance for an observation exceeds the critical value, you call it an outlier. Because the squared MD follows a distribution that is approximately chi-squared, you can use an extreme quantile of the chi-square distribution to set the critical value.

3. I believe the warning is telling you that long ago the CUTOFFALPHA= option was used for two purposes: leverage detection and the "final MCD reweighting step." Now there are two options that each control one thing. Since you are only interested in leverage detection, you can ignore the warning (or specify the MCDALPHA= option if you want the warning to go away).
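
Points 2 and 3 can be illustrated with a short sketch. It assumes seven explanatory variables (t1-t7, as in the question) and assumes the cutoff on the distance scale is the square root of the chi-square quantile described in point 2. The variable names p, alpha, crit_sq, and crit are hypothetical.

```sas
/* Point 2: critical value from the chi-square distribution (sketch) */
data _null_;
   p       = 7;                                    /* number of explanatory variables */
   alpha   = 0.1;                                  /* same value as CUTOFFALPHA= */
   crit_sq = quantile("ChiSquare", 1 - alpha, p);  /* cutoff for the squared distance */
   crit    = sqrt(crit_sq);                        /* cutoff on the distance scale */
   put crit_sq= crit=;
run;

/* Point 3: specify MCDALPHA= with the same value as CUTOFFALPHA=, as the warning text suggests */
proc robustreg data=mydata method=lts;
   model y = t1-t7 / diagnostics leverage(cutoffalpha=0.1 mcdalpha=0.1 mcdinfo);
run;
```

The DATA step writes the two critical values to the log, which makes it easy to compare them with the cutoff that PROC ROBUSTREG reports in its diagnostics output.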

 

fatemeh
Quartz | Level 8

Hello,

I really appreciate your help in answering my questions. I have another question: how can I find only the extremely high leverage points, or only the extremely low ones, when applying robust regression with the MCD algorithm? Because the leverage points include both extremely low and extremely high data points, is it correct to keep those leverage points that have at least one value larger than, for example, the 85th percentile (or whichever percentile is appropriate), so that we get the leverage points that are extremely high? Or is there another standard way to find them?

Rick_SAS
SAS Super FREQ

Yes, you have the correct understanding. By making the value of CUTOFFALPHA= very small, the quantile will be large and only very extreme outliers will be "detected."

Remember that the definition of an outlier depends on the distribution of the data. A small value such as CUTOFFALPHA=0.002 will classify a point as an outlier if the robust Mahalanobis distance to the robust mean is much greater than would be expected for multivariate normal data with that estimated mean/covariance.
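
To see how the cutoff grows as CUTOFFALPHA= shrinks, here is a small sketch that tabulates the chi-square-based cutoff for a few alpha values. It assumes seven explanatory variables and the square root of the chi-square quantile as the distance cutoff. The data set name work.cutoffs and its variable names are hypothetical.

```sas
/* Sketch only: compare distance cutoffs for several alpha values. */
data work.cutoffs;
   p = 7;                                   /* number of explanatory variables (assumed) */
   do alpha = 0.1, 0.025, 0.002;
      /* smaller alpha -> more extreme quantile -> larger cutoff */
      cutoff = sqrt(quantile("ChiSquare", 1 - alpha, p));
      output;
   end;
run;

proc print data=work.cutoffs noobs;
run;
```

The printed table has one row per alpha, and the cutoff column increases as alpha decreases, which is why only the most extreme points are flagged for small values.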
