emaneman
Pyrite | Level 9

Dear all,

I have used SAS for two decades to analyse data from behavioural studies, but now I have some NLP data that presents new challenges, and I am hoping to get some advice. I have 100 variables that correspond to linguistic indices extracted from a corpus. Many of them have very skewed distributions and/or univariate outliers (let's define them as values beyond +/- 3 SD from the mean). Is there a way to let SAS handle that in bulk, without me having to look at each variable and decide which transformation (log, SQRT, etc.) is needed, and to identify outliers or, even better, set them to missing? I use SAS Studio.

Many thanks for your attention.

Eman 

sbxkoenk
SAS Super FREQ

Hello Eman ,

 

With PROC STDIZE you can easily identify the observations that are more than 3 standard deviations from the mean (for all your numeric variables at once).
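A minimal sketch of that idea, not a definitive recipe: standardize with PROC STDIZE, then use the z-scores to set the matching raw values to missing. The data set name `merged` comes from later in this thread; the variable names `idx1-idx100` and the output data set names are assumptions made for illustration, so substitute your own variable list.

```sas
/* Step 1: z-scores (mean 0, SD 1) for all listed variables at once. */
proc stdize data=merged out=merged_z method=std;
   var idx1-idx100;              /* hypothetical variable names */
run;

/* Step 2: one-to-one merge of raw values and z-scores (same row order), */
/* then blank out any raw value whose z-score is beyond +/- 3.           */
data merged_clean;
   merge merged
         merged_z(keep=idx1-idx100 rename=(idx1-idx100=z1-z100));
   array raw{*} idx1-idx100;
   array z{*}   z1-z100;
   do i = 1 to dim(raw);
      if abs(z{i}) > 3 then raw{i} = .;
   end;
   drop i z1-z100;
run;
```

The merge has no BY statement on purpose: PROC STDIZE keeps the observations in their original order, so row n of `merged_z` corresponds to row n of `merged`.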

 

With PROC BOXPLOT you can also identify outliers (for all your numeric variables at once), but the standard boxplot method for detecting outliers is not robust against skewness.
Nevertheless, you can try it and check whether you are satisfied with the result.
Use the OUTBOX= data set option and check the observations that get these _TYPE_ values:

  • LOW Low outlier value
  • HIGH High outlier value
  • FARLOW Low far outlier value
  • FARHIGH High far outlier value
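As a rough sketch of the OUTBOX= approach: PROC BOXPLOT needs a group variable, so the example adds a constant one. The variable names are taken from later in this thread; the data set names and the group variable `grp` are assumptions for illustration.

```sas
/* PROC BOXPLOT requires a group variable; use a constant so that */
/* each analysis variable gets a single box.                      */
data merged_g;
   set merged;
   grp = 1;
run;

/* Schematic box plots flag outliers; OUTBOX= saves them as rows. */
proc boxplot data=merged_g;
   plot (anger admiration love grief)*grp
        / boxstyle=schematic outbox=box_stats;
run;

/* Keep only the rows that describe outlying observations. */
proc print data=box_stats;
   where _TYPE_ in ('LOW', 'HIGH', 'FARLOW', 'FARHIGH');
run;
```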

 

For the automatic transformation, I would use PROC TRANSREG to do a 'Univariate Box-Cox' transformation.
You can do a grid-search and the procedure finds the best lambda for you.

 

Cheers,

Koen

PaigeMiller
Diamond | Level 26

PROC STDIZE also offers some robust scaling methods.
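For instance, a hedged sketch using METHOD=MAD, which centers each variable on its median and scales by the median absolute deviation, so the location and scale estimates are themselves less affected by outliers. The output data set names are assumptions, and the cutoff to apply to robust scores is a judgment call rather than a direct equivalent of 3 SD.

```sas
/* Robust standardization: location = median, scale = MAD. */
proc stdize data=merged out=merged_robust method=mad
            outstat=robust_stats;
   var anger admiration love grief;
run;

/* Inspect the location and scale that were used for each variable. */
proc print data=robust_stats;
run;
```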

--
Paige Miller
PaigeMiller
Diamond | Level 26

Why do you state that the distribution is skewed? What analysis are you going to do with these variables after outlier removal and transformation? Some analyses don't care about outliers or skewed distributions.

 

Are you going to fit some sort of model? What type of model? What PROC? All of these things are needed to understand what reasonable outlier removal and transformation would be. If it is a linear regression, skewed x-variables and skewed y-variables may not be a problem. For other types of models or other types of analysis, skewness may be a problem. Tell us.

--
Paige Miller
emaneman
Pyrite | Level 9

Thank you all, and sorry for the delayed response. I was out of commission for several days.

 

I am going to run PROC GLM and PROC MIXED. While some analyses are quite robust and can handle outliers and non-normally distributed variables, reviewers at the journals I aim to publish in always require the details of the distributions to be reported, as they have been in similar published articles with data like mine. Saying "it doesn't matter" just makes them more suspicious and more stubborn in requesting the information...

 

sbxkoenk
SAS Super FREQ

Hello,

 

Agree.

You should know about your distribution(s) and potential outliers (univariate and multi-variate).

You should report about that as well.

 

But you should also know some techniques are more robust to outliers than others and some techniques can perfectly handle skewed inputs. No need to apply a transformation if it's not a necessity for the analysis you will be doing.

 

This being said :

What other information do you need from us?

 

Thanks,

Koen

emaneman
Pyrite | Level 9

Hello Koen,

I am looking into TRANSREG. I tend to learn better by looking at examples, and those available in the online manual do not seem to include one about the univariate Box-Cox transformation. I will expand my search outside of the official SAS manuals for that.

Will that specific use of TRANSREG give me a pre and post-transformation distribution chart for each variable?

Eman 

sbxkoenk
SAS Super FREQ

Hello,

 

Go to this page

SAS® 9.4 and SAS® Viya® 3.5 Programming Documentation
SAS/STAT User's Guide
The TRANSREG Procedure
Box-Cox Transformations
https://go.documentation.sas.com/doc/en/pgmsascdc/9.4_3.5/statug/statug_transreg_details02.htm?homeO...

 

and start reading at

The next example shows how to find a Box-Cox transformation without an independent variable. This seeks to normalize the univariate histogram. ...

title 'Univariate Box-Cox';
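For readers who cannot follow the truncated link, here is a sketch along the lines of that documentation example, written from memory and not copied verbatim. It simulates skewed data, finds a Box-Cox transformation with only a constant on the right-hand side, and then draws histograms of the variable before and after transformation in a separate PROC UNIVARIATE step. The data set names and simulated values are illustrative assumptions; `Ty` assumes the default `T` prefix for transformed variables.

```sas
title 'Univariate Box-Cox';

/* Simulated right-skewed data; z is a constant placeholder predictor. */
data x;
   call streaminit(17);
   z = 0;
   do i = 1 to 500;
      y = rand('lognormal');
      output;
   end;
run;

/* No real predictor: lambda is chosen to normalize y itself. */
proc transreg data=x maxiter=0 nozeroconstant;
   model BoxCox(y) = identity(z);
   output out=x_bc;
run;

/* Histograms of the original (y) and transformed (Ty) variable. */
proc univariate data=x_bc noprint;
   histogram y Ty;
run;
```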

 

Koen

emaneman
Pyrite | Level 9

Hello Koen,

That is terrific. Thank you for pinpointing it so specifically; it saved me a lot of time.

I just tried it, with this syntax

proc transreg data=merged maxiter=0 nozeroconstant ;
   model BoxCox(anger) = identity(constante);
   output;
run;

But I am getting this error message:

ERROR: 14 invalid values were encountered while attempting to transform variable anger.

I have 14 observations that have value ZERO on variable anger, so I gather it is trying log transformation and getting undefined values for those 14 cases. 

sbxkoenk
SAS Super FREQ

Hello,

 

Avoid testing lambda = 0, which is simply a log transformation and is undefined for your zero values.

Do a grid search on lambda below and above zero.

 

Koen

emaneman
Pyrite | Level 9

Great tip about restricting the lambda values.

proc transreg data=merged maxiter=0 nozeroconstant TSTANDARD=NOMISS;
   model BoxCox(anger / convenient lambda=-2 to -.05 by .05 lambda=.05 to 2 by 0.05) = identity(constante);
   output;
run;

This works and gives me an output that I can easily use (see attachment: first one dv output.png).

However, when I try to specify multiple DVs:

proc transreg data=merged maxiter=0 nozeroconstant TSTANDARD=NOMISS;
   model BoxCox(anger admiration love grief /convenient lambda=-2 to -.05 by .05 lambda=.05 to 2 by 0.05 ) = identity(constante)  ;
   output;
run;

The output data set is different, and I do not get transformed values for each separate variable. It looks like this:

(attachment: multiple dvs output.png)
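One possible workaround, sketched here as an assumption and not as a confirmed fix, is to keep the single-variable call that already works and run it once per variable in a small macro loop, so each variable gets its own lambda and its own output data set. The macro name `boxcox_each` and the `bc_` output prefix are hypothetical; `merged` and `constante` are taken from the code above.

```sas
%macro boxcox_each(data=, vars=, const=constante);
   %local i v;
   %do i = 1 %to %sysfunc(countw(&vars, %str( )));
      %let v = %scan(&vars, &i, %str( ));

      /* Same call as the working single-variable example above. */
      proc transreg data=&data maxiter=0 nozeroconstant TSTANDARD=NOMISS;
         model BoxCox(&v / convenient lambda=-2 to -.05 by .05 lambda=.05 to 2 by 0.05)
               = identity(&const);
         output out=bc_&v;   /* one output data set per variable */
      run;
   %end;
%mend boxcox_each;

%boxcox_each(data=merged, vars=anger admiration love grief);
```

Each `bc_` data set should then contain the transformed version of its variable, which can be merged back together afterwards if a single table is needed.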

 

 

PaigeMiller
Diamond | Level 26

For GLM and MIXED, the distribution of the variables is irrelevant. It is the distribution of the residuals that is important. Sometimes, reviewers are wrong.

 

https://blogs.sas.com/content/iml/2018/08/27/on-the-assumptions-and-misconceptions-of-linear-regress...

 

Transformations are usually applied to the data to make the residuals (approximately) normally distributed, not to make the raw data normally distributed.
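To illustrate the difference, a hedged sketch: when the model terms are put on the right-hand side of a Box-Cox specification in PROC TRANSREG, lambda is chosen for the response given the model, which targets the residuals and not the raw distribution. The data set name `mydata` and the names `DV`, `A`, and `B` are placeholders, and the response is assumed to be strictly positive.

```sas
/* Box-Cox for the response given the two-way model: lambda is chosen */
/* to suit the residuals of DV = A|B, not the raw distribution of DV. */
proc transreg data=mydata;
   model BoxCox(DV / convenient lambda=-2 to 2 by 0.25) = class(A|B);
run;
```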

 

Yes, outliers can have a high impact on the results, and probably should be identified (and potentially removed). This has already been discussed in this thread, PROC STDIZE and PROC BOXPLOT can do this.

 

--
Paige Miller
emaneman
Pyrite | Level 9

I cannot agree more with the statement about reviewers...

I have had cases in which normalizing a skewed DV (e.g. by SQRT) significantly changed the outcome of a simple GLM model.

 

"Transformations are usually applied to the data to make the residuals (approximately) normally distributed, not to make the raw data normally distributed."  Any chance you could give me an example of this?  Let's say you have the following syntax:

PROC GLM;
CLASS A B;
MODEL DV = A|B;

And that your residuals are not normally distributed. What would you do next?

SteveDenham
Jade | Level 19

I would look at the data generating process to see what distributions may be appropriate.  Tools for this would include graphical methods (QQ plots and histograms) and PROC TRANSREG.
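A minimal sketch of the graphical checks, assuming a data set called `mydata` (a placeholder) that holds `DV`, `A`, and `B` from the question above: save the residuals from PROC GLM, then look at their histogram and QQ plot.

```sas
/* Fit the model and save predicted values and residuals. */
proc glm data=mydata plots=diagnostics;
   class A B;
   model DV = A|B;
   output out=glm_out predicted=pred residual=resid;
run;
quit;

/* Examine the residuals, not the raw DV. */
proc univariate data=glm_out normal;
   var resid;
   histogram resid / normal;
   qqplot resid / normal(mu=est sigma=est);
run;
```

If the residual plots show clear skewness or a fan shape in residuals versus predicted values, that is the point at which a transformation of the DV or a different response distribution is worth considering.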

 

SteveDenham

emaneman
Pyrite | Level 9

Hello Steve, it very much looks like I need to learn the logic of TRANSREG and how to use it.

thank you.

Eman
