Re: bulk outlier-identification and variable transformation

emaneman · Posted 02-17-2022 04:49 PM

Dear all,

I have used SAS for two decades to analyse data from behavioural studies, but now I have some NLP data that presents new challenges, and I am hoping to get some advice. I have 100 variables that correspond to linguistic indices that have been extracted from a corpus. Many of them have very skewed distributions and/or univariate outliers (let's define them as +/- 3 SD). Is there a way to let SAS handle that in bulk, without me having to look at each variable and decide which transformation (log, SQRT, etc.) is needed, and to identify outliers or, even better, set them to missing? I use SAS Studio.

Many thanks for your attention.

Eman

sbxkoenk · Posted 02-18-2022 12:48 PM

Hello Eman,

With PROC STDIZE you can easily identify the observations that are more than 3 standard deviations from the mean (for all your numeric variables at once).
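
A minimal sketch of how the 3 SD rule could be applied in bulk, with flagged values set to missing. The data set name merged is taken from later in this thread; the numbered variable names x1-x100 and the output names zscores and cleaned are hypothetical stand-ins, so adapt them to the real index names.

```sas
/* Step 1: replace each index by its z-score (mean 0, SD 1).        */
/* x1-x100 is a HYPOTHETICAL numbered naming of the 100 indices.     */
proc stdize data=merged out=zscores method=std;
   var x1-x100;
run;

/* Step 2: one-to-one merge of the raw values with their z-scores   */
/* (both data sets have the same row order), then set raw values    */
/* with |z| > 3 to missing and count them per observation.          */
data cleaned;
   merge merged
         zscores(keep=x1-x100 rename=(x1-x100=z1-z100));
   array raw{*} x1-x100;
   array z{*}   z1-z100;
   n_outliers = 0;
   do i = 1 to dim(raw);
      if abs(z{i}) > 3 then do;
         raw{i} = .;
         n_outliers = n_outliers + 1;
      end;
   end;
   drop i z1-z100;
run;
```

Keep in mind that the mean and standard deviation used here are themselves pulled by the outliers, so on heavily skewed indices this rule may flag too few or too many values.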

With PROC BOXPLOT you can also identify outliers (for all your numeric variables at once), but the standard boxplot method to detect outliers is not robust against skewness.
Nevertheless, you can try it and check whether you are satisfied with the result.
Use the OUTBOX= option to create an output data set and check the observations that get these _TYPE_ values:

LOW Low outlier value
HIGH High outlier value
FARLOW Low far outlier value
FARHIGH High far outlier value
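
A hedged sketch of the OUTBOX= approach. PROC BOXPLOT needs a group variable, so this assumes a constant one is added to get a single box per index; the names one_group, all, and box_stats are hypothetical, and the four analysis variables are the ones mentioned later in this thread. It also assumes a schematic box style so that points beyond the fences are recorded as outliers.

```sas
/* Add a constant group variable so each index gets exactly one box. */
/* ONE_GROUP and ALL are hypothetical names.                         */
data one_group;
   set merged;
   all = 1;
run;

/* One schematic box plot per analysis variable; OUTBOX= saves the   */
/* box statistics and the individual outlier values.                 */
proc boxplot data=one_group;
   plot (anger admiration love grief)*all
        / boxstyle=schematic outbox=box_stats;
run;

/* Keep only the rows that describe outliers. */
proc print data=box_stats;
   where _type_ in ('LOW', 'HIGH', 'FARLOW', 'FARHIGH');
run;
```

With 100 variables this produces 100 charts, so it may be worth running it on a handful of indices first.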

For the automatic transformation, I would use PROC TRANSREG to do a 'Univariate Box-Cox' transformation.
You can do a grid-search and the procedure finds the best lambda for you.

Cheers,

Koen

PaigeMiller · Posted 02-18-2022 12:50 PM

PROC STDIZE also offers some robust scaling methods.
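
As an illustration of one such method, METHOD=MAD centers each variable on its median and scales it by the median absolute deviation, which is less affected by extreme values than the mean and standard deviation. This is only a sketch: the output name robust_z is hypothetical, and the cutoff applied to the resulting scores is a judgment call.

```sas
/* Robust standardization: location = median, scale = median         */
/* absolute deviation. ROBUST_Z is a hypothetical output name.       */
proc stdize data=merged out=robust_z method=mad;
   var anger admiration love grief;
run;
```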

--
Paige Miller

PaigeMiller · Posted 02-18-2022 12:53 PM

Why do you state that the distribution is skewed? What analysis are you going to do with these variables after outlier removal and transformation? Some analyses don't care about outliers or skewed distributions.

Are you going to fit some sort of model? What type of model? What PROC? All of these things are needed to understand what reasonable outlier removal and transformation would be. If it is a linear regression, skewed x-variables and skewed y-variables may not be a problem. For other types of models or other types of analysis, skewness may be a problem. Tell us.

--
Paige Miller

emaneman · Posted 02-22-2022 07:26 AM

Thank you all, and sorry for the delayed response. I was out of commission for several days.

I am going to run PROC GLM and PROC MIXED. While some procedures are quite robust and can handle outliers and non-normally distributed variables, in the journals where I aim to publish this, and most importantly in similar articles that have published data like mine, reviewers always require the details of the distributions to be reported, and saying "it doesn't matter" just makes them more suspicious and more stubborn in requesting the info...

sbxkoenk · Posted 02-22-2022 07:47 AM

Hello,

Agree.

You should know about your distribution(s) and potential outliers (univariate and multi-variate).

You should report on that as well.

But you should also know some techniques are more robust to outliers than others and some techniques can perfectly handle skewed inputs. No need to apply a transformation if it's not a necessity for the analysis you will be doing.

That being said:

What other information do you need from us?

Thanks,

Koen

emaneman · Posted 02-22-2022 07:54 AM

Hello Koen,

I am looking into TRANSREG. I tend to learn better by looking at examples, and those that are available in the online manual do not seem to include one for a univariate Box-Cox transformation. I will expand my search outside of the official SAS manuals for that.

Will that specific use of TRANSREG give me a pre- and post-transformation distribution chart for each variable?

Eman

sbxkoenk · Posted 02-22-2022 08:02 AM

Hello,

Go to this page

SAS® 9.4 and SAS® Viya® 3.5 Programming Documentation
SAS/STAT User's Guide
The TRANSREG Procedure
Box-Cox Transformations
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_transreg_details02.htm?homeO...

and start reading at

The next example shows how to find a Box-Cox transformation without an independent variable. This seeks to normalize the univariate histogram. ...

title 'Univariate Box-Cox';

Koen

emaneman · Posted 02-22-2022 08:22 AM

Hello Koen,

That is terrific. Thank you for pinpointing it so specifically; it saved me a lot of time.

I just tried it, with this syntax:

proc transreg data=merged maxiter=0 nozeroconstant ;
   model BoxCox(anger) = identity(constante);
   output;
run;

But I am getting this error message:

ERROR: 14 invalid values were encountered while attempting to transform variable anger.

I have 14 observations with a value of zero on the variable anger, so I gather it is trying a log transformation and getting undefined values for those 14 cases.

sbxkoenk · Posted 02-22-2022 08:35 AM

Hello,

Avoid testing lambda = 0, which is simply a log transformation.

Do a grid search on lambda below and above zero.
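
For reference, the textbook form of the Box-Cox family is shown below. This is the standard definition, not necessarily the exact scaling PROC TRANSREG applies internally.

```latex
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\log y, & \lambda = 0,
\end{cases}
\qquad y > 0 .
```

The lambda = 0 branch is log y, which is undefined at y = 0; that is consistent with the 14 invalid values reported for the 14 zero observations.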

Koen

emaneman · Posted 02-22-2022 10:00 AM

Great tip about restricting the lambda values.

proc transreg data=merged maxiter=0 nozeroconstant TSTANDARD=NOMISS;
   model BoxCox(anger / convenient lambda=-2 to -.05 by .05 lambda=.05 to 2 by 0.05) = identity(constante);
   output;
run;

This works and gives me an output that I can easily use. See the first screenshot (not preserved here).

However, when I try to put in multiple DVs:

proc transreg data=merged maxiter=0 nozeroconstant TSTANDARD=NOMISS;
   model BoxCox(anger admiration love grief /convenient lambda=-2 to -.05 by .05 lambda=.05 to 2 by 0.05 ) = identity(constante)  ;
   output;
run;

The output file is different, and I do not have transformed values for each separate variable. It looks like this: (screenshot not preserved)
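
One possible workaround, sketched under assumptions rather than taken from the thread: run the single-variable call that was reported to work once per variable, using a small macro loop. The macro name boxcox_each and the output data set names bc_<variable> are hypothetical; the PROC TRANSREG options are copied from the single-variable call above.

```sas
/* Run the univariate Box-Cox step once per variable.                */
/* BOXCOX_EACH and the BC_<variable> output names are hypothetical.  */
%macro boxcox_each(vars);
   %local i v;
   %do i = 1 %to %sysfunc(countw(&vars, %str( )));
      %let v = %scan(&vars, &i, %str( ));
      title "Univariate Box-Cox: &v";
      proc transreg data=merged maxiter=0 nozeroconstant TSTANDARD=NOMISS;
         model BoxCox(&v / convenient lambda=-2 to -.05 by .05 lambda=.05 to 2 by 0.05)
               = identity(constante);
         output out=bc_&v;   /* one output data set per variable */
      run;
   %end;
   title;
%mend boxcox_each;

%boxcox_each(anger admiration love grief)
```

Each output data set should then hold the transformed values for one variable, and the data sets can be merged back together afterwards if a single table is needed.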

PaigeMiller · Posted 02-22-2022 07:48 AM

For GLM and MIXED, the distribution of the variables is irrelevant. It is the distribution of the residuals that is important. Sometimes, reviewers are wrong.

https://blogs.sas.com/content/iml/2018/08/27/on-the-assumptions-and-misconceptions-of-linear-regress...

Transformations are usually applied to the data to make the residuals (approximately) normally distributed, not to make the raw data normally distributed.

Yes, outliers can have a high impact on the results, and probably should be identified (and potentially removed). This has already been discussed in this thread: PROC STDIZE and PROC BOXPLOT can do this.

--
Paige Miller

emaneman · Posted 02-22-2022 07:58 AM

I cannot agree more with the statement about reviewers...

I have had cases in which normalizing a skewed DV (e.g. by SQRT) significantly changed the outcome of a simple GLM model.

"Transformations are usually applied to the data to make the residuals (approximately) normally distributed, not to make the raw data normally distributed." Any chance you could give me an example of this? Let's say you have the following syntax:

PROC GLM;
CLASS A B;
MODEL DV = A|B;
RUN;

And that your residuals are not normally distributed. What would you do next?

SteveDenham · Posted 02-22-2022 08:01 AM

I would look at the data generating process to see what distributions may be appropriate. Tools for this would include graphical methods (QQ plots and histograms) and PROC TRANSREG.
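
A minimal sketch of the graphical checks mentioned here, applied to the residuals of the GLM example from the question. The data set name merged is borrowed from elsewhere in the thread, and glm_out, resid, and pred are hypothetical names.

```sas
/* Fit the model and save residuals and predicted values.            */
/* GLM_OUT, RESID and PRED are hypothetical names.                   */
proc glm data=merged;
   class A B;
   model DV = A|B;
   output out=glm_out r=resid p=pred;
run;
quit;

/* Histogram, Q-Q plot and normality tests for the residuals. */
proc univariate data=glm_out normal;
   var resid;
   histogram resid / normal;
   qqplot resid / normal(mu=est sigma=est);
run;
```

If the residuals look clearly non-normal, that would be the point at which to consider a transformation of DV or a different distributional assumption.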

SteveDenham

emaneman · Posted 02-22-2022 08:06 AM

Hello Steve, it very much looks like I need to learn the logic of TRANSREG and how to use it.

thank you.

Eman
