New SAS User

Completely new to SAS or trying something new with SAS? Post here for help getting started.
_maldini_
Barite | Level 11
The auto_outliers approach looks great, but I'm not sure where to put the Macro syntax used in the example. I don't know how to use Macros. Do I put it in a PROC step?
PaigeMiller
Diamond | Level 26

@_maldini_ wrote:

@PaigeMiller 

 

I'm looking at the STDIZE Procedure documentation, and I'm not sure how this solves my problem. My goal is to identify outliers for multiple variables and then remove them, if they should be removed. PROC UNIVARIATE shows extreme values, but it doesn't help me determine whether they meet my definition of an outlier (e.g., 1.5*IQR or mean ± 3*STD).

 

If I'm understanding this correctly (big IF), the example in the STDIZE Procedure documentation uses PROC UNIVARIATE to find extreme values by group, then uses PROC STDIZE with the STD method to compute location and scale measures (Don't we already have those from PROC UNIVARIATE?).

 


PROC STDIZE does not require that you first run PROC UNIVARIATE. PROC STDIZE does its own calculations. It produces values that are standardized by whatever method you choose. The program you are working with, where the macro variables are causing problems, does the same calculation as the STD method in PROC STDIZE. Any standardized value produced by PROC STDIZE which is >3 or < -3 is a potential outlier, by the same calculations you were doing originally. With the STD method, there is no "tuning".
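As a minimal sketch of that idea (the data set sashelp.class, the prefixes z_ and raw_, and the output names are assumptions chosen for illustration, not part of the original program), standardizing one variable and then keeping the observations beyond ±3 might look like this:

```sas
/* METHOD=STD: standardized value = (value - mean) / standard deviation */
proc stdize data=sashelp.class method=std sprefix=z_ oprefix=raw_ out=class_z;
   var height;
run;

/* Keep only observations more than 3 standard deviations from the mean. */
/* The cutoff of 3 is the analyst's choice, not something PROC STDIZE sets. */
data potential_outliers;
   set class_z;
   if abs(z_height) > 3;
run;
```

The data set potential_outliers may well be empty for well-behaved data; that simply means no value met the chosen cutoff.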

 

Where does it determine that "64" is actually an outlier and not just an extreme value? Because of its impact on the dispersion ratio? I don't see formulas for dispersion ratio for figure 109.1 or the tuning constant. Do I need these?

 

Statistical methods do not determine whether something is an outlier. You, the human user, have to determine if it is an outlier. Your original program was looking for observations in which the data value was more than 3 standard deviations from the mean. This is exactly what PROC STDIZE is doing. What you choose to do with that information is up to you.

 

The nice thing about PROC STDIZE is that it works on 50 variables if that's how many you have; it works on 500 if that's how many you have, and so on. And no macro is needed. (Of course, the macros discussed probably use more sophisticated methods to determine if something is an outlier.)
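To illustrate the many-variables point, here is a hedged sketch that standardizes every numeric variable at once and counts, per observation, how many standardized values fall beyond ±3. The data set sashelp.cars, the prefixes, and the variable n_extreme are illustrative assumptions.

```sas
/* Standardize all numeric variables in one step; no macro required */
proc stdize data=sashelp.cars method=std sprefix=z_ oprefix=raw_ out=cars_z;
   var _numeric_;
run;

data flagged;
   set cars_z;
   /* Collect every standardized variable (names starting with z_) */
   array zs {*} z_: ;
   n_extreme = 0;
   do i = 1 to dim(zs);
      /* Missing values are never counted, since abs(.) > 3 is false */
      if abs(zs{i}) > 3 then n_extreme = n_extreme + 1;
   end;
   drop i;
   /* Keep observations with at least one value beyond the chosen cutoff */
   if n_extreme > 0;
run;
```

The same two steps work unchanged whether the input has 5 numeric variables or 500, as long as the prefixed names stay within the 32-character limit.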

--
Paige Miller
_maldini_
Barite | Level 11

@PaigeMiller If I'm understanding you correctly, the STD method in PROC STDIZE sets the cutoff for outliers at > mean + 3(std) or < mean – 3(std) and provides measures of location and scale that are "standardized" and thus resistant to outliers (Using whatever definition is associated w/ a given method). 

 

When I use the data and syntax in the documentation, however, the mean and std are the same when using PROC UNIVARIATE (only the Extreme Observations table is provided in the documentation) and PROC STDIZE (Figure 109.3). 

 

The extreme value in the example is 64 (Obs 23). 153.3 (mean w/ outlier) - 3(30.0667678)(std w/ outlier) = 63.1. So observation 23 does NOT meet the definition using the STD method, thus the mean is the same when using PROC UNIVARIATE and PROC STDIZE.

So I changed the value from 64 to 24, well below 63.1. When I do this, the standardized mean and std produced by PROC STDIZE are different from the unstandardized mean and std produced by PROC UNIVARIATE. This approach seems similar to trimming and winsorizing means, no?

 

The extreme values in PROC UNIVARIATE are helpful, but I don't know if they meet the definition of an outlier (as defined by me) unless I perform some calculations (e.g., 153.3 (mean w/ outlier) - 3(30.0667678)(std w/ outlier) = 63.1). I'm looking for an approach that tells me if certain values are outliers based on a definition I choose (similar to standardization methods in PROC STDIZE), so that I can evaluate those observations to determine what to do with them. It sounds like you think PROC STDIZE does this, but I don't see it in the documentation.

 

Thanks for your help.

 

 

Reeza
Super User
Looking for values that are more than 3 standard deviations from the mean (> mean + 3STD or < mean - 3STD) is mathematically equivalent to filtering on standardized values above 3 or below -3 when you standardize them using the Z-score method.
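The equivalence can be written out as follows, assuming a standard deviation s > 0 and using the mean (153.3) and standard deviation (30.0667678) quoted earlier in the thread for the value 64:

```latex
\[
x > \bar{x} + 3s \iff \frac{x - \bar{x}}{s} > 3,
\qquad
x < \bar{x} - 3s \iff \frac{x - \bar{x}}{s} < -3
\]
\[
z_{64} = \frac{64 - 153.3}{30.0667678} \approx -2.97
\]
```

Since -2.97 is just inside -3, the value 64 sits just above the lower limit of 63.1 and is not flagged by a ±3 rule.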


PaigeMiller
Diamond | Level 26

@PaigeMiller If I'm understanding you correctly, the STD method in PROC STDIZE sets the cutoff for outliers at > mean + 3(std) or < mean – 3(std) and provides measures of location and scale that are "standardized" and thus resistant to outliers (Using whatever definition is associated w/ a given method). 

 

No! PROC STDIZE does not set any cutoff for anything. You, the human user of the results, have to determine what cutoff for outliers makes sense to you. It could be ±4 or any other cutoff you want. You started this thread by talking about using ±3 cutoffs using mean and standard deviations.

 

The measures of location and scale using METHOD=STD are the ordinary mean and standard deviation; they are not resistant to outliers. Other choices for METHOD= are resistant to outliers.
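As one hedged example of an outlier-resistant choice, METHOD=MAD uses the median for location and the median absolute deviation from the median for scale. The data set, prefixes, and cutoff below are illustrative assumptions.

```sas
/* METHOD=MAD: location = median, scale = median absolute deviation */
proc stdize data=sashelp.class method=mad sprefix=r_ oprefix=raw_ out=class_robust;
   var height weight;
run;

/* The cutoff is still the analyst's decision; 3 is only an illustration */
data robust_candidates;
   set class_robust;
   if abs(r_height) > 3 or abs(r_weight) > 3;
run;
```

Values standardized this way are not Z-scores, so a cutoff that suits METHOD=STD does not necessarily carry over to METHOD=MAD.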

 

I'm looking for an approach that tells me if certain values are outliers based on a definition I choose (similar to standardization methods in PROC STDIZE), so that I can evaluate those observations to determine what to do with them. It sounds like you think PROC STDIZE does this, but I don't see it in the documentation.

 

Again, you started this thread looking at limits of mean±3stddev. That's what PROC STDIZE is giving you with METHOD=STD, it tells you how many standard deviations away from the mean the data point is. If that works for you, great, if you want more advanced methods then you should use those methods, which exist in both PROC UNIVARIATE and PROC STDIZE (and probably other places).

 

--
Paige Miller
_maldini_
Barite | Level 11

@PaigeMiller 

<That's what PROC STDIZE is giving you with METHOD=STD, it tells you how many standard deviations away from the mean the data point is.>

 

I don't see this in the documentation or the output. I see that PROC STDIZE computes location and scale measures - and that the methods other than STD compute measures resistant to outliers - but I don't see where it shows you how many standard deviations a given data point is from the mean. 

Reeza
Super User

@_maldini_ wrote:

@PaigeMiller 

but I don't see where it shows you how many standard deviations a given data point is from the mean. 


What Is a Z-Score?

A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point's score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.

 

https://www.investopedia.com/terms/z/zscore.asp

_maldini_
Barite | Level 11

@Reeza I understand what a Z-score is, but I don't understand what that has to do w/ PROC STDIZE. Are you suggesting another solution using Z-scores, or are you saying PROC STDIZE provides Z-scores? 

PaigeMiller
Diamond | Level 26

@_maldini_ 

 

From the documentation of PROC STDIZE

 

Overview: STDIZE Procedure

The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure.

 

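/* Sort by the BY variable so PROC STDIZE can process each group */
proc sort data=sashelp.class out=class;
by sex;
run;

/* METHOD=STD (the default): subtract the group mean, divide by the group standard deviation */
proc stdize data=class method=std sprefix=s_ oprefix=o_ out=s;
by sex;
var height weight;
run;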

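Variables s_height and s_weight in data set S are the standardized values, that is, the number of standard deviations each value lies from the mean of its BY group (sex). Variables o_height and o_weight hold the original values.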
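One way to confirm this, sketched here against the data set S created above, is to summarize the standardized variables by group. Each group should show a mean of approximately 0 and a standard deviation of 1, and the minimum and maximum show how far the most extreme values are from the mean.

```sas
/* Summarize the standardized variables within each BY group */
proc means data=s n mean std min max;
   by sex;
   var s_height s_weight;
run;
```

A minimum or maximum beyond whatever cutoff you choose (for example ±3) points to observations worth a closer look.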

--
Paige Miller
ballardw
Super User

Bet you copied code from somewhere on the web and didn't notice that the & character had been replaced with &amp;

If you did not actually mean to write &amp; in this

    if age lt (&amp;mean. - 3*&amp;stddev.) 
    or age gt (&amp;mean. + 3*&amp;stddev.) then output;

replace each &amp; with a plain & so that the macro variable references read &mean. and &stddev.
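For reference, a corrected version might look like the sketch below. The original program is not shown in this thread, so the PROC SQL step that creates the macro variables mean and stddev, and the use of sashelp.class, are assumptions made only so that the example runs on its own.

```sas
/* Assumed setup: store the mean and standard deviation of age */
/* in the macro variables &mean and &stddev                    */
proc sql noprint;
   select mean(age), std(age)
      into :mean, :stddev
      from sashelp.class;
quit;

/* The corrected condition: a plain & introduces each macro variable */
data outliers;
   set sashelp.class;
   if age lt (&mean. - 3*&stddev.)
      or age gt (&mean. + 3*&stddev.) then output;
run;
```

Macro variable references like these can appear anywhere in ordinary SAS code, including inside a DATA step, because they are resolved to text before the step runs.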

 

 

Reeza
Super User
The original source has the error. The OP's first post includes the link to the source.

