Solved: Re: Can I remove outliers with this command?

Dicarlis · Posted 11-03-2023 12:39 PM

My question is whether I should consider both factors of my experiment when removing outliers, or only the main factor. This is the model I was planning to use:

proc glm;
class BLOC factor1 factor2;
model GMDadap = BLOC factor1 factor1 * factor2;
output out = dois residual = x_res student = x_stu ;
run;
proc print;
run;
proc means n mean;
class factor1;
var gmdadap;
run;
proc univariate normal plot data=dois;
histogram x_stu/normal;
var x_stu ;
run;

ballardw · Posted 11-03-2023 01:08 PM

Outliers for what variable(s)?

Remove for which step?

What constitutes an "outlier", as in rule(s), for each variable?

Dicarlis · Posted 11-03-2023 01:41 PM

Hello, thank you for your response. In our case, the variables are related to the performance of confined cattle (for example, average weight gain over 96 days). We transform the residuals to the Student scale (studentized residuals) and consider outliers to be studentized residuals greater than 3 or smaller than -3. However, we only perform this procedure if the residuals fail the Shapiro-Wilk normality test, which is also applied to the studentized residuals.

PaigeMiller · Posted 11-03-2023 02:17 PM

This is a reasonable way to approach outlier detection. Of course, there are plenty of other methods, including methods if the data is not normally distributed (such as box plot outliers), and even multivariate outlier detection. And possibly dozens of other methods.

--
Paige Miller

Dicarlis · Posted 11-03-2023 03:51 PM

Hello, thanks for the answers. This question came up because some colleagues, when working with experiments in a factorial arrangement, consider only the main factor when removing outliers. However, I believe that the interaction between the factors (which makes up our treatments) must be considered. Is this line of thinking right?

PaigeMiller · Posted 11-04-2023 10:43 AM

There's no universally agreed upon method of detecting outliers. I think if you are going to fit a model with two factors, the outliers in Y ought to be detected via residuals, which means to me that all terms in the model should be used.
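As a rough sketch of what "all terms in the model" could look like in practice, the code below fits the full model from the original post, saves the studentized residuals, and flags (rather than deletes) observations beyond the +/-3 rule mentioned earlier in the thread. The dataset names `trial`, `resid`, and `flagged` and the variable `outlier` are hypothetical and used only for illustration.

```sas
/* Sketch only: 'trial', 'resid', 'flagged' and 'outlier' are hypothetical names */
proc glm data=trial;
  class BLOC factor1 factor2;
  /* Full model, so residuals account for the block and both factors */
  model GMDadap = BLOC factor1 factor1*factor2;
  output out=resid student=x_stu;
run;
quit;

data flagged;
  set resid;
  /* 1 if the studentized residual is beyond +/-3, otherwise 0 */
  outlier = (abs(x_stu) > 3);
run;

/* List the flagged observations for review before deciding anything */
proc print data=flagged;
  where outlier = 1;
run;
```

Flagging instead of deleting keeps a record of which points were singled out, which makes it easier to investigate them and report any exclusions.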

--
Paige Miller

SteveDenham · Posted 12-12-2023 09:06 AM

I have read through all of the answers and it occurs to me - why do you want to remove extreme values? Unless you can identify a data quality issue, extreme values in animal growth experiments identify critical subjects, whose history should be investigated. For growth, it may be a food or water availability issue, for example. In any case, removing so-called outliers is going to increase your ability to detect differences, as the new dataset will almost certainly have a smaller mean squared error. But at what cost? Would you lose the ability to find an important effect of your treatment, such as increasing animal to animal variability?

Do the outliers identify a particular subpopulation with a different distribution? I would have a difficult time accepting a conclusion of increased growth rate, as an example, if you removed extreme low values from a treated group or extreme high values from a control. Since publications do not ordinarily include raw data, cases of exclusion like these could not be identified, and perhaps a conclusion of effectiveness will not be repeatable.

And all of this also applies to high-leverage points in a regression analysis.

So, if you exclude outliers, you should, in my opinion, disclose which points were excluded and why they were excluded.

SteveDenham

Rick_SAS · Posted 12-12-2023 09:24 AM

In addition to Steve's comments, I'd like to suggest a small change to the syntax on the MODEL statement. I assume that you did not include the main effect for factor2 because in the experimental design factor2 is nested inside factor1. In that case, I encourage you to use the "nested syntax"
model GMDadap = BLOC factor1 factor2(factor1);

Although the model is the same, the notation informs the reader that the main effect for factor2 is not necessary.

Dicarlis · Posted 12-14-2023 07:34 AM

Hi, thanks for all the responses!
We only remove outliers when the data are not normal or when there are collection problems. We adopt this method to be confident that we are not making type I or type II errors. Our new problem is the amount of data: we are using a new automatic animal feeding system, and it generates a lot of data (a record for each time the animal ingests the diet or interacts with the system). The system reports values that cannot be true (e.g., more than 60 kg of dry matter ingested per day). We don't know what causes these errors, how to remove the unrealistic values, or whether we should remove them at all.

SteveDenham · Posted 12-14-2023 09:37 AM

Ah ha! That 60 kg typifies the data quality issue that I mentioned. To my way of thinking, you could set an arbitrary cutoff and be more reasonable than by calculating a value using studentized residuals.

<dusts off Ph.D. notes on ruminant nutrition>The dry matter demand (DMD) in cattle varies due to several factors, but it turns out that the highest demand seen is in lactating multiparous dairy cows, where DMD may be as much as 4.5% of body weight per day. Since this is a study with confined cattle of unknown type or origin, a safe value would be to consider 5% of maximum body weight as a cutoff for the reasonable intake, so for a 500 kg animal, anything over 25 kg per day is almost certainly an error of some type. If you have as much noise in your automatic system as you imply, I suspect the 3 SD upper limit could exceed the 5%. So keep anything under whatever limit gets calculated based on body weight, and eliminate anything greater to get a reasonable biological estimate. <puts away notes>
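A minimal sketch of that body-weight cutoff, assuming one record per animal per day: the dataset `intake` and the variables `dmi_kg` (daily dry matter intake, kg) and `max_bw_kg` (the animal's maximum body weight, kg) are hypothetical names, not anything from the actual feeding system.

```sas
/* Sketch only: 'intake', 'dmi_kg' and 'max_bw_kg' are hypothetical names */
data intake_clean intake_excluded;
  set intake;
  /* Records that cannot be checked are set aside, not silently kept */
  if missing(dmi_kg) or missing(max_bw_kg) then output intake_excluded;
  else do;
    /* Cutoff: 5% of maximum body weight, e.g. 25 kg/day for a 500 kg animal */
    max_dmi = 0.05 * max_bw_kg;
    if dmi_kg > max_dmi then output intake_excluded;
    else output intake_clean;
  end;
run;
```

Writing the rejected records to their own dataset keeps them available for investigating the cause of the errors and for reporting what was excluded.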

After applying the cutoff, you are dealing with a truncated distribution. There are some procedures in SAS that enable fitting truncated distributions, so you might consider one of those for the analysis part of your question.

SteveDenham

Dicarlis · Posted 12-20-2023 12:33 PM

Hi Dr. Denham, thanks for your answer.
I am really happy that you shared the experience from your doctorate.
I understand and agree, but I don't understand the term "fitting truncated distributions". How do I do it?

Sincerely,

SteveDenham · Posted 12-21-2023 11:05 AM

There are some PROCs that will enable you to fit a truncated normal distribution. I will leave the searching through the documentation to find some examples to you. Some starting places would be PROC QLIM in SAS/ETS and PROC FMM (and HPFMM) in SAS/STAT.
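As one possible starting point, here is a sketch of a truncated regression in PROC QLIM, using the TRUNCATED option on the ENDOGENOUS statement. The dataset `intake_clean`, the variables `dmi_kg` and `bw_kg`, and the fixed upper bound of 25 (the 500 kg example above) are all assumptions for illustration; check the SAS/ETS documentation for the options available in your release.

```sas
/* Sketch only: dataset, variable names and the bound of 25 are assumptions */
proc qlim data=intake_clean;
  /* Daily intake modeled on body weight; add other regressors as needed */
  model dmi_kg = bw_kg;
  /* Response is observed only below the upper bound (truncated normal) */
  endogenous dmi_kg ~ truncated(ub=25);
run;
```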

SteveDenham

Dicarlis · Posted 12-22-2023 11:58 AM

Thanks, Dr. Denham, for your help.
I will get to work on this data.

Sincerely,

Dicarlis · Posted 02-05-2024 12:12 AM

Hi, my new problem is:
In our trial, we need to control for many effects, for example: animal, the animal's mother, the animal's father, season of birth (3-month range), and pen.
We are using a 2x2 factorial arrangement.
Each treatment is a combination of the 2 factors, applied to each animal.
Our problem is that all of these effects (animal, mother, father, season of birth, and pen) influence the response variable, e.g., average daily gain.
Previously we used this model:
PROC MIXED;
CLASS mother father factor1 factor2 animal born_season pen time;
model PV= factor1|factor2|time / DDFM=KR;
RANDOM mother father animal born_season pen time;
REPEATED time/ TYPE = ar(1) SUBJECT = animal(factor1|factor2);
RUN;

But now we want to isolate the effects of factors 1 and 2. We think that for this we need to include the other effects as covariates, so the model looks like this:
PROC MIXED;
CLASS mother father factor1 factor2 animal born_season pen time;
model PV= factor1|factor2|time mother father animal born_season pen time / DDFM=KR;
RANDOM animal;
REPEATED time/ TYPE = ar(1) SUBJECT = animal(factor1|factor2);
RUN;
Do you agree with this approach?

SteveDenham · Posted 02-05-2024 09:52 AM

That model statement probably is not what you want to do, as it restricts the inference space to only those members of the fixed effects, and ignores any interaction between factor1 or factor2 and the added covariates.

However, given the field you are working in, I would prefer to treat sire and dam as random effects, as they are likely representative of the population, and in other studies, handled as normal(0, variance of sire or dam). If you are able to accept that, then the following code may be of use:

PROC MIXED;
CLASS mother father factor1 factor2 animal born_season pen time;
model PV= factor1|factor2|time / DDFM=KR;
RANDOM mother father animal born_season pen ; /* removed time, to treat it solely as an R-side effect (correlated residuals) */
REPEATED time/ TYPE = ar(1) SUBJECT = animal(factor1|factor2);
RUN;

There are two other issues to address. The first is the subject for the REPEATED statement. You can only use the nested syntax for the subject. In other words, factor1|factor2 is going to result in an error. I notice that in the RANDOM statement you specify 'animal'. That would imply (at least to me) that each animal on study has a unique identifier. If that is the case, I suggest using SUBJECT = animal in the REPEATED statement. If the animals are not uniquely identified, then you should nest animal within the proper effect, which is almost certainly not the levels of factor1, factor2 or their interaction.

The second issue is the use of the older Kenward-Roger denominator degrees of freedom. This has been shown to have non-optimal qualities for algorithms that use second-order methods (NRRIDG, QUANEW, etc.). Consider the use of KR(FIRSTORDER) or KR2. If your data are unbalanced over the repeated measure, then you may see strange degrees of freedom associated with the use of KR2 (say 1 df for time). If that occurs, then you may need to explicitly state the degrees of freedom, and use an empirical shrinkage method.
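Putting both points together, a revised call might look like the sketch below. It assumes every animal has a unique identifier (so SUBJECT = animal is valid) and uses KR2 as one of the two suggested alternatives; the dataset name `trial` is hypothetical.

```sas
/* Sketch only: assumes unique animal IDs; 'trial' is a hypothetical dataset */
PROC MIXED data=trial;
CLASS mother father factor1 factor2 animal born_season pen time;
/* KR2 replaces the older KR option; KR(FIRSTORDER) is the other choice */
model PV= factor1|factor2|time / DDFM=KR2;
RANDOM mother father animal born_season pen;
/* Subject is the animal itself, not nested in factor1|factor2 */
REPEATED time / TYPE = ar(1) SUBJECT = animal;
RUN;
```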

SteveDenham