Can I remove outliers with this command?

Posted 11-03-2023 12:39 PM

My question is whether it is correct to consider both factors of my experiment when removing outliers, or whether I should consider only the main factor. This is the model I was thinking of using:

```
proc glm;
class BLOC factor1 factor2;
model GMDadap = BLOC factor1 factor1 * factor2;
output out = dois residual = x_res student = x_stu ;
run;

proc print;
run;

proc means n mean;
class factor1;
var gmdadap;
run;

proc univariate normal plot data=dois;
histogram x_stu/normal;
var x_stu ;
run;
```

Outliers for what variable(s)?

Remove for which step?

What constitutes an "outlier", as in rule(s), for each variable?

Hello, thank you for your response. In our case, the variables relate to the performance of confined cattle (for example, average weight gain over 96 days). We transform the residuals to the Student scale (studentized residuals) and treat as outliers any studentized residuals greater than 3 or smaller than -3. However, we perform this procedure only if the residuals fail the Shapiro-Wilk normality test, which we also run on the studentized residuals.

This is a reasonable way to approach outlier detection. Of course, there are plenty of other methods, including methods if the data is not normally distributed (such as box plot outliers), and even multivariate outlier detection. And possibly dozens of other methods.

--

Paige Miller

Hello, thanks for the answers. This question arose because some colleagues, when working with experiments in a factorial arrangement, consider only the main factor when removing outliers. However, I believe it is the interaction between the factors (which makes up our treatments) that must be considered. Is this line of thinking right?

There's no universally agreed upon method of detecting outliers. I think if you are going to fit a model with two factors, the outliers in Y ought to be detected via residuals, which means to me that all terms in the model should be used.
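As a rough sketch of that idea (not a prescribed workflow), the studentized residuals can come from the same model used for the analysis, so every term contributes to the residual, and the flagging can happen in a separate step. The input dataset name `work.trial` and the variable `outlier_flag` are assumptions made for illustration; the model and the cutoff of 3 are the ones described earlier in this thread.

```sas
/* Sketch only: WORK.TRIAL is a hypothetical dataset name. */
proc glm data=work.trial;
  class BLOC factor1 factor2;
  /* same model as in the original question, so all terms enter the residuals */
  model GMDadap = BLOC factor1 factor1*factor2;
  output out=dois student=x_stu;
run;
quit;

data flagged;
  set dois;
  /* 1 when |studentized residual| > 3, otherwise 0.
     A missing residual compares as false, so it is flagged 0. */
  outlier_flag = (abs(x_stu) > 3);
run;

/* list the flagged records so they can be reviewed and reported */
proc print data=flagged;
  where outlier_flag = 1;
run;
```

Flagging instead of deleting keeps the full dataset intact, so the decision to exclude a record can be made and documented case by case.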

--

Paige Miller

I have read through all of the answers and it occurs to me - why do you want to remove extreme values? Unless you can identify a data quality issue, extreme values in animal growth experiments identify critical subjects, whose history should be investigated. For growth, it may be a food or water availability issue, for example. In any case, removing so-called outliers is going to increase your ability to detect differences, as the new dataset will almost certainly have a smaller mean squared error. But at what cost? Would you lose the ability to find an important effect of your treatment, such as increasing animal to animal variability?

Do the outliers identify a particular subpopulation with a different distribution? I would have a difficult time accepting a conclusion of increased growth rate, as an example, if you removed extreme low values from a treated group or extreme high values from a control. Since publications do not ordinarily include raw data, cases of exclusion like these could not be identified, and perhaps a conclusion of effectiveness will not be repeatable.

And all of this also applies to high-leverage points in a regression analysis.

So, if you exclude outliers, you should, in my opinion, disclose which points were excluded and why they were excluded.

SteveDenham

In addition to Steve's comments, I'd like to suggest a small change to the syntax on the MODEL statement. I assume that you did not include the main effect for factor2 because in the experimental design factor2 is nested inside factor1. In that case, I encourage you to use the "nested syntax": **model GMDadap = BLOC factor1 factor2(factor1);**

Although the model is the same, the notation informs the reader that the main effect for factor2 is not necessary.

We only remove outliers when the data is not normal or when there are collection problems. We adopt this method to be confident in not making type I or II errors. Our new problem is the amount of data: we are using a new automatic animal feeding system, and it generates a lot of data (one record for each moment the animal ingests the diet or interacts with the system). The system reports values that cannot be true (e.g., more than 60 kg of dry matter ingested per day). We don't know what causes these errors, how to remove the unrealistic values, or whether we should remove them at all.

Ah ha! That 60 kg typifies the data quality issue that I mentioned. To my way of thinking, you could set an arbitrary cutoff and be more reasonable than by calculating a value using studentized residuals.

<dusts off Ph.D. notes on ruminant nutrition>The dry matter demand (DMD) in cattle varies due to several factors, but it turns out that the highest demand seen is in lactating multiparous dairy cows, where DMD may be as much as 4.5% of body weight per day. Since this is a study with confined cattle of unknown type or origin, a safe value would be to consider 5% of maximum body weight as a cutoff for the reasonable intake, so for a 500 kg animal, anything over 25 kg per day is almost certainly an error of some type. If you have as much noise in your automatic system as you imply, I suspect the 3 SD upper limit could exceed the 5%. So keep anything under whatever limit gets calculated based on body weight, and eliminate anything greater to get a reasonable biological estimate. <puts away notes>
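A minimal sketch of that body-weight cutoff in a DATA step might look like the following. The dataset name `intake_raw` and the variables `dmi_kg` (daily dry matter intake, kg) and `bw_kg` (the animal's maximum body weight, kg) are hypothetical; only the 5% rule comes from the paragraph above.

```sas
/* Sketch only: INTAKE_RAW, DMI_KG and BW_KG are hypothetical names. */
data intake_ok intake_bad;
  set intake_raw;
  /* upper limit = 5% of body weight, e.g. 0.05 * 500 kg = 25 kg/day */
  max_dmi = 0.05 * bw_kg;
  /* records that cannot be checked are set aside, not silently kept */
  if missing(dmi_kg) or missing(bw_kg) then output intake_bad;
  /* anything above the biological limit is treated as a system error */
  else if dmi_kg > max_dmi then output intake_bad;
  else output intake_ok;
run;
```

Writing the rejected records to their own dataset, rather than deleting them, makes it easy to report how many points were excluded and why.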

After applying the cutoff, you are dealing with a truncated distribution. There are some procedures in SAS that enable fitting truncated distributions, so you might consider one of those for the analysis part of your question.

SteveDenham

Hi Dr. Denham, thanks for your answer.

I am really happy that you shared the experience from your doctorate.

I understand and agree, but I don't understand the term "fitting truncated distributions". How do I do it?

Sincerely,

There are some PROCs that will enable you to fit a truncated normal distribution. I will leave the searching through the documentation to find some examples to you. Some starting places would be PROC QLIM in SAS/ETS and PROC FMM (and HPFMM) in SAS/STAT.
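For orientation only, a truncated normal regression in PROC QLIM could be sketched as below. This is an illustration under stated assumptions, not a recommended final model: the dataset `intake_ok`, the response `dmi_kg`, the regressor `bw_kg`, and the fixed upper bound of 25 (the 500 kg example from earlier) are all hypothetical, and the experimental factors and random effects are left out. Check the TRUNCATED option in the SAS/ETS documentation before relying on it.

```sas
/* Sketch only: dataset, variables and the bound of 25 are hypothetical. */
proc qlim data=intake_ok;
  /* regress daily intake on body weight */
  model dmi_kg = bw_kg;
  /* declare that values above 25 kg/day were excluded from the sample,
     so the likelihood uses a normal distribution truncated from above */
  endogenous dmi_kg ~ truncated(ub=25);
run;
```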

SteveDenham

Thanks, Dr. Denham, for your help.

I will get to work on this data.

Sincerely,

Hi, my new problem is:

In our trial, we need to control for many effects, for example: the animal, the animal's mother, the animal's father, season of birth (3-month range), and pen.

We are using a 2x2 factorial arrangement.

Each treatment is a combination of the 2 factors, applied to each animal.

Our problem is that all of these effects (animal, mother, father, season of birth, and pen) influence the response variable, e.g. average daily gain.

Previously, we used this model:

```
PROC MIXED;
CLASS mother father factor1 factor2 animal born_season pen time;
model PV= factor1|factor2|time / DDFM=KR;
RANDOM mother father animal born_season pen time;
REPEATED time/ TYPE = ar(1) SUBJECT = animal(factor1|factor2);
RUN;
```

But now we want to isolate the effects of factors 1 and 2. We think that to do this we need to include the other effects as covariates. The model looks like this:

```
PROC MIXED;
CLASS mother father factor1 factor2 animal born_season pen time;
model PV= factor1|factor2|time mother father animal born_season pen time / DDFM=KR;
RANDOM animal;
REPEATED time/ TYPE = ar(1) SUBJECT = animal(factor1|factor2);
RUN;
```

Do you agree with this approach?

That model statement probably is not what you want to do, as it restricts the inference space to only those members of the fixed effects, and ignores any interaction between factor1 or factor2 and the added covariates.

However, given the field you are working in, I would prefer to treat sire and dam as random effects, as they are likely representative of the population, and in other studies, handled as normal(0, variance of sire or dam). If you are able to accept that, then the following code may be of use:

```
PROC MIXED;
CLASS mother father factor1 factor2 animal born_season pen time;
model PV= factor1|factor2|time / DDFM=KR;
RANDOM mother father animal born_season pen ; /* removed time, to treat it solely as an R-side effect (correlated residuals) */
REPEATED time/ TYPE = ar(1) SUBJECT = animal(factor1|factor2);
RUN;
```

There are two other issues to address. The first is the subject for the REPEATED statement. You can only use the nested syntax for the subject. In other words, factor1|factor2 is going to result in an error. I notice that in the RANDOM statement you specify 'animal'. That would imply (at least to me) that each animal on study has a unique identifier. If that is the case, I suggest using SUBJECT = animal in the REPEATED statement. If the animals are not uniquely identified, then you should nest animal within the proper effect, which is almost certainly not the levels of factor1, factor2 or their interaction.

The second issue is the use of the older Kenward-Roger denominator degrees of freedom. This has been shown to have non-optimal qualities for algorithms that use second-order methods (NRRIDG, QUANEW, etc.). Consider the use of KR(FIRSTORDER) or KR2. If your data are unbalanced over the repeated measure, then you may see strange degrees of freedom associated with the use of KR2 (say 1 df for time). If that occurs, then you may need to explicitly state the degrees of freedom, and use an empirical shrinkage method.
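Putting both points together, one possible revision is sketched below. It assumes that every animal has a unique identifier (so `SUBJECT = animal` is valid) and uses a hypothetical dataset name `work.growth`; treat it as a starting point to adapt, not a final model.

```sas
/* Sketch only: WORK.GROWTH is a hypothetical dataset name, and ANIMAL
   is assumed to be a unique identifier for each animal on study. */
PROC MIXED data=work.growth;
CLASS mother father factor1 factor2 animal born_season pen time;
/* KR2 requests the newer Kenward-Roger degrees-of-freedom adjustment */
model PV= factor1|factor2|time / DDFM=KR2;
RANDOM mother father animal born_season pen;
/* subject is the animal itself, with no nesting in the treatment factors */
REPEATED time / TYPE = ar(1) SUBJECT = animal;
RUN;
```

If the denominator degrees of freedom look odd with KR2, replacing `DDFM=KR2` with `DDFM=KR(FIRSTORDER)` is the other alternative named above.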

SteveDenham
