@_maldini_ wrote:
Looking at the STDIZE Procedure documentation. I'm not sure how this solves my problem. My goal is to identify outliers for multiple variables (PROC UNIVARIATE shows extreme values, but it doesn't help me determine whether they meet my definition of an outlier, e.g., beyond 1.5*IQR or beyond mean ± 3*STD) and then remove them, if they should be removed.
If I'm understanding this correctly (big IF), in the STDIZE Procedure documentation, it uses PROC UNIVARIATE to find extreme values by group. Then uses PROC STDIZE with the STD method to compute location and scale measures (Don't we already have those from PROC UNIVARIATE?).
PROC STDIZE does not require that you first run PROC UNIVARIATE. PROC STDIZE does its own calculations and produces values standardized by whatever method you choose. The program you are working with, the one where the macro variables are causing problems, does the same thing as the STD method in PROC STDIZE. Any standardized value produced by PROC STDIZE that is > 3 or < -3 is a potential outlier, by the same calculations you were doing originally. With the STD method, there is no "tuning".
Where does it determine that "64" is actually an outlier and not just an extreme value? Because of its impact on the dispersion ratio? I don't see formulas for dispersion ratio for figure 109.1 or the tuning constant. Do I need these?
Statistical methods do not determine whether something is an outlier. You, the human user, have to determine if it is an outlier. Your original program was looking for observations in which the data value was more than 3 standard deviations from the mean. This is exactly what PROC STDIZE is doing. What you choose to do with that information is up to you.
The nice thing about PROC STDIZE is that it works on 50 variables if that's how many you have; it works on 500 if that's how many you have, and so on. And no macro is needed. (Of course, the macros discussed probably use more sophisticated methods to determine whether something is an outlier.)
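A minimal sketch of that idea, not the code from this thread: it standardizes every numeric variable with METHOD=STD and then counts, per observation, how many standardized values fall outside ±3. The input data set SASHELP.CLASS, the z_ and o_ prefixes, the data set names, and the cutoff of 3 are all assumptions chosen for illustration; substitute your own data and your own cutoff.

```sas
/* METHOD=STD: standardized value = (x - mean) / standard deviation. */
/* OPREFIX= keeps the original values; SPREFIX= names the scores.    */
proc stdize data=sashelp.class out=std_scores
            method=std oprefix=o_ sprefix=z_;
   var _numeric_;              /* every numeric variable, no macro needed */
run;

data flagged;
   set std_scores;
   array z{*} z_:;             /* all variables whose names start with z_ */
   n_extreme = 0;
   do i = 1 to dim(z);
      /* 3 is an illustrative cutoff; the choice belongs to the analyst. */
      /* Missing scores are never counted, since abs(.) > 3 is false.    */
      if abs(z{i}) > 3 then n_extreme = n_extreme + 1;
   end;
   drop i;
run;

/* Review the candidates before deciding whether to remove anything. */
proc print data=flagged;
   where n_extreme > 0;
run;
```

The same DATA step works unchanged whether the VAR statement covers 2 variables or 500, because the array picks up every z_ variable by prefix.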
Paige Miller
@PaigeMiller If I'm understanding you correctly, the STD method in PROC STDIZE sets the cutoff for outliers at > mean + 3(std) or < mean – 3(std) and provides measures of location and scale that are "standardized" and thus resistant to outliers (Using whatever definition is associated w/ a given method).
When I use the data and syntax in the documentation, however, the mean and std are the same from PROC UNIVARIATE (only the Extreme Observations table is provided in the documentation) and from PROC STDIZE (Figure 109.3).
The extreme value in the example is 64 (Obs 23). 153.3 (mean w/ outlier) - 3(30.0667678)(std w/ outlier) = 63.1. So observation 23 does NOT meet the definition using the STD method, thus the mean is the same when using PROC UNIVARIATE and PROC STDIZE.
So I changed the value from 64 to 24, well below 63.1. When I do this, the standardized mean and std produced by PROC STDIZE are different from the unstandardized mean and std produced by PROC UNIVARIATE. This approach seems similar to trimming and winsorizing means, no?
The extreme values in PROC UNIVARIATE are helpful, but I don't know if they meet the definition of an outlier (as defined by me) unless I perform some calculations (e.g., 153.3 (mean w/ outlier) - 3(30.0667678)(std w/ outlier) = 63.1). I'm looking for an approach that tells me if certain values are outliers based on a definition I choose (similar to standardization methods in PROC STDIZE), so that I can evaluate those observations to determine what to do with them. It sounds like you think PROC STDIZE does this, but I don't see it in the documentation.
Thanks for your help.
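The hand calculation in the post above can be checked with a short DATA step. This is only a sketch: the mean (153.3) and standard deviation (30.0667678) are the values quoted in the post, and the variable names are made up for illustration.

```sas
data _null_;
   mean = 153.3;                 /* mean quoted in the post               */
   std  = 30.0667678;            /* standard deviation quoted in the post */
   lower = mean - 3*std;         /* lower limit, about 63.1               */
   upper = mean + 3*std;         /* upper limit                           */
   z_64  = (64 - mean) / std;    /* number of std the value 64 is away    */
   z_24  = (24 - mean) / std;    /* same for the altered value 24         */
   put lower= upper= z_64= z_24=;
run;
```

The value 64 sits just above the lower limit, so its standardized value is slightly inside -3, while 24 falls well outside it. Comparing a standardized value with ±3 is the same test as comparing the raw value with mean ± 3*std.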
@PaigeMiller If I'm understanding you correctly, the STD method in PROC STDIZE sets the cutoff for outliers at > mean + 3(std) or < mean – 3(std) and provides measures of location and scale that are "standardized" and thus resistant to outliers (Using whatever definition is associated w/ a given method).
No! PROC STDIZE does not set any cutoff for anything. You, the human user of the results, have to determine what cutoff for outliers makes sense to you. It could be ±4 or any other cutoff you want. You started this thread by talking about using ±3 cutoffs using mean and standard deviations.
With METHOD=STD, the measures of location and scale are the ordinary mean and standard deviation; they are not resistant to outliers. Other choices for METHOD= are resistant to outliers.
I'm looking for an approach that tells me if certain values are outliers based on a definition I choose (similar to standardization methods in PROC STDIZE), so that I can evaluate those observations to determine what to do with them. It sounds like you think PROC STDIZE does this, but I don't see it in the documentation.
Again, you started this thread looking at limits of mean ± 3 stddev. That's what PROC STDIZE gives you with METHOD=STD: it tells you how many standard deviations away from the mean each data point is. If that works for you, great; if you want more advanced methods, then use those, which exist in both PROC UNIVARIATE and PROC STDIZE (and probably other places).
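As one possible illustration of an outlier-resistant choice, the sketch below uses METHOD=MAD, which takes the median as the location measure and the median absolute deviation from the median as the scale measure. The data set SASHELP.CLASS, the r_ and o_ prefixes, and the output data set name are assumptions for the example. Note that a cutoff of ±3 on these scores does not mean the same thing as ±3 standard deviations, so the cutoff is again a judgment call.

```sas
/* PSTAT prints the location and scale measures used for each variable. */
proc stdize data=sashelp.class out=robust_scores
            method=mad pstat oprefix=o_ sprefix=r_;
   var height weight;
run;

/* r_height = (height - median) / MAD, and likewise for r_weight. */
proc print data=robust_scores;
   var name o_height r_height o_weight r_weight;
run;
```

METHOD=IQR is a similar option that pairs the median with the interquartile range as the scale.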
Paige Miller
<That's what PROC STDIZE is giving you with METHOD=STD, it tells you how many standard deviations away from the mean the data point is.>
I don't see this in the documentation or the output. I see that PROC STDIZE computes location and scale measures - and that the methods other than STD compute measures resistant to outliers - but I don't see where it shows you how many standard deviations a given data point is from the mean.
@_maldini_ wrote:
but I don't see where it shows you how many standard deviations a given data point is from the mean.
What Is a Z-Score?
A Z-score is a numerical measurement that describes a value's relationship to the mean of a group of values. Z-score is measured in terms of standard deviations from the mean. If a Z-score is 0, it indicates that the data point's score is identical to the mean score. A Z-score of 1.0 would indicate a value that is one standard deviation from the mean. Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.
@Reeza I understand what a Z-score is, but I don't understand what that has to do w/ PROC STDIZE. Are you suggesting another solution using Z-scores or are you saying PROC STDIZE provides Z-score?
From the documentation of PROC STDIZE
Overview: STDIZE Procedure
The STDIZE procedure standardizes one or more numeric variables in a SAS data set by subtracting a location measure and dividing by a scale measure.
proc sort data=sashelp.class out=class;
   by sex;
run;
proc stdize data=class sprefix=s_ oprefix=o_ out=s;
   by sex;
   var height weight;
run;
Variables s_height and s_weight in data set S are standardized values, the number of standard deviations from the mean.
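One way to convince yourself of this is to compute the z-score by hand and compare it with the PROC STDIZE result. The sketch below assumes the data sets CLASS and S created by the code above already exist; the names STATS, CHECK, m_height, sd_height, z_height, and diff are made up for the example.

```sas
/* Mean and standard deviation of Height within each Sex group. */
proc means data=class noprint;
   by sex;
   var height;
   output out=stats mean=m_height std=sd_height;
run;

data check;
   merge s stats(keep=sex m_height sd_height);
   by sex;
   z_height = (o_height - m_height) / sd_height;   /* z-score by hand  */
   diff     = s_height - z_height;                 /* should be near 0 */
run;

proc print data=check;
   var name sex o_height s_height z_height diff;
run;
```

If diff is zero (up to rounding) on every row, then s_height is exactly the number of standard deviations each Height value lies from its group mean.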
Paige Miller
I bet you copied the code from somewhere on the web and didn't notice that the & character had been replaced with the HTML entity &amp;.
If you did not actually intend to write &amp; in this
if age lt (&amp;mean. - 3*&amp;stddev.) or age gt (&amp;mean. + 3*&amp;stddev.) then output;
delete the amp; after each & so that the references read &mean. and &stddev.
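For reference, a minimal sketch of the macro-variable version of the check, written with plain & references. The original program is not shown in this part of the thread, so the PROC SQL step, the data set SASHELP.CLASS, and the output data set name are assumptions; only the IF statement comes from the post.

```sas
/* Store the mean and standard deviation of Age in macro variables.  */
/* FORMAT=BEST32. avoids the default rounding applied when numeric   */
/* values are written into macro variables.                          */
proc sql noprint;
   select mean(age) format=best32., std(age) format=best32.
      into :mean, :stddev
      from sashelp.class;
quit;

data outliers;
   set sashelp.class;
   /* Keep only rows more than 3 standard deviations from the mean.  */
   /* A missing Age would also pass the LT test and be output.       */
   if age lt (&mean. - 3*&stddev.) or age gt (&mean. + 3*&stddev.) then output;
run;
```

This handles one variable at a time, which is why the PROC STDIZE approach scales better when many variables need the same check.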