BookmarkSubscribeRSS Feed
_maldini_
Barite | Level 11

I am trying to write syntax that will identify outliers. I am using these instructions as guidance. 

I am running into some errors in the DATA step, including "Apparent symbolic reference AMP not resolved." I know nothing about Macros in SAS.

This is my syntax:

 

/* 3 Easy Ways to Find Outliers in SAS */
/* https://sasexamplecode.com/3-easy-ways-to-find-outliers-in-sas/ */

/* 1. Test the Assumption of Normality */
/* The first step if to test the normality assumption. */

/* In SAS, you can use PROC UNIVARIATE to check if your data follow a normal distribution.  */
/* You do this by adding the NORMAL option to the UNIVARIATE statement. */


ods output TestsForNormality = work.normal_test;
ods output BasicMeasures = work.measures;
 
proc univariate data=meta.data_01 normal;
    var age;
 
    histogram age / normal;
run;
 
proc print data=work.normal_test noobs;
run;
 
proc print data=work.measures noobs;
run;


/* 2. Save the Mean and Standard Deviation as Macro Variables */
/* The second step to find outliers is to save the Mean and Standard Deviation as macro variables. */

/* PROC UNIVARIATE can also create a dataset with summary statistics such as the p-value of the normality test, the mean, and the standard deviation.  */
/* To do so, we use the ODS OUTPUT statement. */
/* See above */

/* To make your code reusable and to find the outliers more efficiently, we save the p-value of the Shapiro-Wilk test,  */
/* the mean, and the standard deviation as three macro variables with a SELECT INTO statement. */
/* See below */

proc sql;
    select pValue label= 'p-value' into :pvalue from work.normal_test where test = 'Shapiro-Wilk';
    select LocValue label = 'Mean' into :mean from work.measures where LocMeasure ='Mean';
    select VarValue label = 'Std Dev' into :stddev from work.measures where VarMeasure ='Std Deviation';
quit;

/* 3. Filter the Outliers */
/* The third step to find outliers in SAS is filtering all observations that are 3 standard deviations above or below the mean. */

data work.outliers_normaldistr;
    set meta.data_01;
 
    if age lt (&mean. - 3*&stddev.) 
	or age gt (&mean. + 3*&stddev.) then output;
run;
 
proc print data=work.outliers_normaldistr noobs;
run;

Here is the log:

 

 
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
68
69 data work.outliers_normaldistr;
70 set meta.data_01;
71
72 if age lt (&mean. - 3*&stddev.)
_ _____
386 180
76
WARNING: Apparent symbolic reference AMP not resolved.
WARNING: Apparent symbolic reference AMP not resolved.
72 if age lt (&mean. - 3*&stddev.)
_______
180
ERROR 386-185: Expecting an arithmetic expression.
 
ERROR 180-322: Statement is not valid or it is used out of proper order.
 
ERROR 76-322: Syntax error, statement will be ignored.
 
73 or age gt (&mean. + 3*&stddev.) then output;
_____ _______
180 180
WARNING: Apparent symbolic reference AMP not resolved.
WARNING: Apparent symbolic reference AMP not resolved.
ERROR 180-322: Statement is not valid or it is used out of proper order.
 
74 run;
 
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.OUTLIERS_NORMALDISTR may be incomplete. When this step was stopped there were 0 observations and 164
variables.
WARNING: Data set WORK.OUTLIERS_NORMALDISTR was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 1424.18k
OS Memory 36008.00k
Timestamp 11/17/2021 09:54:53 PM
Step Count 282 Switch Count 0
Page Faults 0
Page Reclaims 92
Page Swaps 0
Voluntary Context Switches 6
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 8
 
 
75
76 proc print data=work.outliers_normaldistr noobs;
77 run;
 
NOTE: No variables in data set WORK.OUTLIERS_NORMALDISTR.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 478.43k
OS Memory 35488.00k
Timestamp 11/17/2021 09:54:53 PM
Step Count 283 Switch Count 0
Page Faults 0
Page Reclaims 16
Page Swaps 0
Voluntary Context Switches 0
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 0
 
 
78
79 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
89

 

 Here is a sample data set: 

data WORK.DATA_04;
   infile datalines dsd truncover;
   input Participant_ID:$3. Group:$1. Sex:$2. Age:BEST12. Efflux_V1:BEST12. Efflux_V2:BEST12. Efflux_V3:BEST12. ApoA1_V1:BEST12. ApoA1_V2:BEST12. ApoA1_V3:BEST12. ApoC1_V1:BEST12. ApoC1_V2:BEST12. ApoC1_V3:BEST12.;
   format Age BEST12. Efflux_V1 BEST12. Efflux_V2 BEST12. Efflux_V3 BEST12. ApoA1_V1 BEST12. ApoA1_V2 BEST12. ApoA1_V3 BEST12. ApoC1_V1 BEST12. ApoC1_V2 BEST12. ApoC1_V3 BEST12.;
 datalines;
 1 A M 52 11.68 12.59 11.21 238.65 279.72 171.58 41.22 62.36 36.07
 10 B M 68 9.58 9.18 10.78 215.79 214.9 253.98 47.33 38 50.52
 11 A F 71 12.26 9.17 9.94 282.3 227.08 282.3 44.13 44.21 44.13
 12 B M 71 5.88 9.45 10.55 173.07 230.49 174.09 47.8 51.28 37.81
 13 A F 71 13.17 12.69 11.33 259.03 265.83 255.03 61.34 67.46 73.5
 14 B M 54 10.51 7.96 8.28 211.39 192.76 192.17 41.14 36.83 34.86
 15 A F 66 7.34 6.74 8.69 240.58 160.97 205.72 35.8 25.89 44.28
 16 B F 69 11.07 13.44 10.08 236.45 242.66 214.03 54.07 55.34 37.61
 17 A F 58 8.1 7.62 8.03 188.51 159.8 164.22 36.04 32.35 30.78
 18 B F 63 10.14 10.06 10.78 229.05 252.06 228.63 57.49 63.17 50.44
 ;;;;
Run;

 Finally, if possible, it would be helpful to list the outliers by Participant_ID. Is there a way to modify this syntax to do so?

 

Thank you.

25 REPLIES 25
PaigeMiller
Diamond | Level 26

Macro variable &AMP has never been defined, so you can't refer to it in your code. I see that the web site you link to doesn't provide guidance on that either.

 

But instead of trying to determine if a variable is more than ±3 standard deviations from the mean, you can use PROC STDIZE and then no macro variables are needed at all.

--
Paige Miller
sbxkoenk
SAS Super FREQ

Hello,

 

You are trying to find univariate outliers as I see it.

See these blogs for (other) code.

 

Detecting outliers in SAS: Part 1: Estimating location
By Rick Wicklin on The DO Loop January 20, 2012
https://blogs.sas.com/content/iml/2012/01/20/detecting-outliers-in-sas-part-1-estimating-location.ht...


Detecting outliers in SAS: Part 2: Estimating scale
By Rick Wicklin on The DO Loop January 27, 2012
https://blogs.sas.com/content/iml/2012/01/27/detecting-outliers-in-sas-part-2-estimating-scale.html

 

Let me know if you also want to find multi-variate outliers. That can be done with PROC ROBUSTREG, isolation forests, auto-encoders (NN), PROC PRINCOMP, and some other techniques I am forgetting about right now.

 

Good luck,

Koen

Reeza
Super User

When who ever owns that site uploaded the code, the code got converted to HTML incorrectly. 

Maybe this is a better reference for what you need? 

https://blogs.sas.com/content/sgf/2020/02/05/finding-possible-data-errors-using-the-auto_outliers-ma...

 

 

& is actually supposed to be just &

 

So this 

data work.outliers_normaldistr;
    set meta.data_01;
 
    if age lt (&mean. - 3*&stddev.) 
	or age gt (&mean. + 3*&stddev.) then output;
run;

Should be:

data work.outliers_normaldistr;
    set meta.data_01;
 
    if age lt (&mean. - 3*&stddev.) 
	or age gt (&mean. + 3*&stddev.) then output;
run;

@_maldini_ wrote:

I am trying to write syntax that will identify outliers. I am using these instructions as guidance. 

I am running into some errors in the DATA step, including "Apparent symbolic reference AMP not resolved." I know nothing about Macros in SAS.

This is my syntax:

 

/* 3 Easy Ways to Find Outliers in SAS */
/* https://sasexamplecode.com/3-easy-ways-to-find-outliers-in-sas/ */

/* 1. Test the Assumption of Normality */
/* The first step if to test the normality assumption. */

/* In SAS, you can use PROC UNIVARIATE to check if your data follow a normal distribution.  */
/* You do this by adding the NORMAL option to the UNIVARIATE statement. */


ods output TestsForNormality = work.normal_test;
ods output BasicMeasures = work.measures;
 
proc univariate data=meta.data_01 normal;
    var age;
 
    histogram age / normal;
run;
 
proc print data=work.normal_test noobs;
run;
 
proc print data=work.measures noobs;
run;


/* 2. Save the Mean and Standard Deviation as Macro Variables */
/* The second step to find outliers is to save the Mean and Standard Deviation as macro variables. */

/* PROC UNIVARIATE can also create a dataset with summary statistics such as the p-value of the normality test, the mean, and the standard deviation.  */
/* To do so, we use the ODS OUTPUT statement. */
/* See above */

/* To make your code reusable and to find the outliers more efficiently, we save the p-value of the Shapiro-Wilk test,  */
/* the mean, and the standard deviation as three macro variables with a SELECT INTO statement. */
/* See below */

proc sql;
    select pValue label= 'p-value' into :pvalue from work.normal_test where test = 'Shapiro-Wilk';
    select LocValue label = 'Mean' into :mean from work.measures where LocMeasure ='Mean';
    select VarValue label = 'Std Dev' into :stddev from work.measures where VarMeasure ='Std Deviation';
quit;

/* 3. Filter the Outliers */
/* The third step to find outliers in SAS is filtering all observations that are 3 standard deviations above or below the mean. */

data work.outliers_normaldistr;
    set meta.data_01;
 
    if age lt (&mean. - 3*&stddev.) 
	or age gt (&mean. + 3*&stddev.) then output;
run;
 
proc print data=work.outliers_normaldistr noobs;
run;

Here is the log:

 

 
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
68
69 data work.outliers_normaldistr;
70 set meta.data_01;
71
72 if age lt (&mean. - 3*&stddev.)
_ _____
386 180
76
WARNING: Apparent symbolic reference AMP not resolved.
WARNING: Apparent symbolic reference AMP not resolved.
72 if age lt (&mean. - 3*&stddev.)
_______
180
ERROR 386-185: Expecting an arithmetic expression.
 
ERROR 180-322: Statement is not valid or it is used out of proper order.
 
ERROR 76-322: Syntax error, statement will be ignored.
 
73 or age gt (&mean. + 3*&stddev.) then output;
_____ _______
180 180
WARNING: Apparent symbolic reference AMP not resolved.
WARNING: Apparent symbolic reference AMP not resolved.
ERROR 180-322: Statement is not valid or it is used out of proper order.
 
74 run;
 
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.OUTLIERS_NORMALDISTR may be incomplete. When this step was stopped there were 0 observations and 164
variables.
WARNING: Data set WORK.OUTLIERS_NORMALDISTR was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 1424.18k
OS Memory 36008.00k
Timestamp 11/17/2021 09:54:53 PM
Step Count 282 Switch Count 0
Page Faults 0
Page Reclaims 92
Page Swaps 0
Voluntary Context Switches 6
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 8
 
 
75
76 proc print data=work.outliers_normaldistr noobs;
77 run;
 
NOTE: No variables in data set WORK.OUTLIERS_NORMALDISTR.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.00 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
memory 478.43k
OS Memory 35488.00k
Timestamp 11/17/2021 09:54:53 PM
Step Count 283 Switch Count 0
Page Faults 0
Page Reclaims 16
Page Swaps 0
Voluntary Context Switches 0
Involuntary Context Switches 0
Block Input Operations 0
Block Output Operations 0
 
 
78
79 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;
89

 

 Here is a sample data set: 

data WORK.DATA_04;
   infile datalines dsd truncover;
   input Participant_ID:$3. Group:$1. Sex:$2. Age:BEST12. Efflux_V1:BEST12. Efflux_V2:BEST12. Efflux_V3:BEST12. ApoA1_V1:BEST12. ApoA1_V2:BEST12. ApoA1_V3:BEST12. ApoC1_V1:BEST12. ApoC1_V2:BEST12. ApoC1_V3:BEST12.;
   format Age BEST12. Efflux_V1 BEST12. Efflux_V2 BEST12. Efflux_V3 BEST12. ApoA1_V1 BEST12. ApoA1_V2 BEST12. ApoA1_V3 BEST12. ApoC1_V1 BEST12. ApoC1_V2 BEST12. ApoC1_V3 BEST12.;
 datalines;
 1 A M 52 11.68 12.59 11.21 238.65 279.72 171.58 41.22 62.36 36.07
 10 B M 68 9.58 9.18 10.78 215.79 214.9 253.98 47.33 38 50.52
 11 A F 71 12.26 9.17 9.94 282.3 227.08 282.3 44.13 44.21 44.13
 12 B M 71 5.88 9.45 10.55 173.07 230.49 174.09 47.8 51.28 37.81
 13 A F 71 13.17 12.69 11.33 259.03 265.83 255.03 61.34 67.46 73.5
 14 B M 54 10.51 7.96 8.28 211.39 192.76 192.17 41.14 36.83 34.86
 15 A F 66 7.34 6.74 8.69 240.58 160.97 205.72 35.8 25.89 44.28
 16 B F 69 11.07 13.44 10.08 236.45 242.66 214.03 54.07 55.34 37.61
 17 A F 58 8.1 7.62 8.03 188.51 159.8 164.22 36.04 32.35 30.78
 18 B F 63 10.14 10.06 10.78 229.05 252.06 228.63 57.49 63.17 50.44
 ;;;;
Run;

 Finally, if possible, it would be helpful to list the outliers by Participant_ID. Is there a way to modify this syntax to do so?

 

Thank you.


 

_maldini_
Barite | Level 11

@Reeza Thank you.

This worked (for age) and I will accept this as the solution. But...it leads to 2 other questions/issues:

 

  1. Why doesn't this work when I change the variable from age to Efflux_V1, for example?
  2. How can I adapt this syntax to incorporate multiple variables?

 

1. Why doesn't this work when I change the variable from age to Efflux_V1?

When I keep the syntax exactly the same, but change age to Efflux_V1, I get the following:

Screen Shot 2021-11-18 at 9.23.55 AM.png

12.07 does not meet the definition of an outlier (i.e., (&mean. - 3*&stddev.) , (&mean. + 3*&stddev.)).

 

2. How can I adapt this syntax to incorporate multiple variables?

Here is my attempt to simply add 3 more variables, in addition to age: 

ods output TestsForNormality = work.normal_test;
ods output BasicMeasures = work.measures;
 
proc univariate data=meta.data_01 normal;
    var 
    age
 	Efflux_V1
	Efflux_V2
	Efflux_V3
	;
 
/*     histogram age / normal; */
run;
 
proc print data=work.normal_test noobs;
run;
 
proc print data=work.measures noobs;
run;

proc sql;
    select pValue label= 'p-value' into :pvalue from work.normal_test where test = 'Shapiro-Wilk';
    select LocValue label = 'Mean' into :mean from work.measures where LocMeasure ='Mean';
    select VarValue label = 'Std Dev' into :stddev from work.measures where VarMeasure ='Std Deviation';
quit;

data work.outliers_normaldistr;
    set meta.data_01;
 
    if age lt (&mean. - 3*&stddev.) 
	or age gt (&mean. + 3*&stddev.) then output;
	
	if Efflux_V1 lt (&mean. - 3*&stddev.) 
	or Efflux_V1 gt (&mean. + 3*&stddev.) then output;

	if Efflux_V2 lt (&mean. - 3*&stddev.) 
	or Efflux_V2 gt (&mean. + 3*&stddev.) then output;
	
	if Efflux_V3 lt (&mean. - 3*&stddev.) 
	or Efflux_V3 gt (&mean. + 3*&stddev.) then output;
run;


proc print data=work.outliers_normaldistr noobs;
Var participant_id age Efflux_V1--Efflux_V3;
/* Var participant_id age; */
run;

Here is the output: 

Screen Shot 2021-11-18 at 9.13.32 AM.png

 I will continue to review the other resources you posted. Thank you for your time. 

Reeza
Super User
This code isn't designed to handle multiple variables and I wouldn't use it as such, instead I would highly recommend using one of the other solutions suggested by myself or Koen. Looking through the steps of this procedure, it captures the mean/stdev but does so for one at a time. So this solution to be re-written to handle multiple values would be cumbersome when there are better options out there.

Also this code outputs only the outliers. Usually you want to do something with them, ie cap, remove or examine, so you may want to explain what you'd like to do with the outputs.

As shown your output doesn't make sense because it's using the mean/std from the age variable most likely which have no relation to the other variables so that's why the output is all wrong.


PaigeMiller
Diamond | Level 26

How can I adapt this syntax to incorporate multiple variables?

 

PROC STDIZE

PROC STDIZE

PROC STDIZE

--
Paige Miller
_maldini_
Barite | Level 11

@PaigeMiller 

 

Looking at the STDIZE Procedure documentation. I'm not sure how this solves my problem. My goal is to identify outliers (PROC UNIVARIATE shows extreme values, but it doesn't help me determine if they meet my definition of an outlier, e.g. = 1.5*IQR, 3STD*mean) for multiple variables and then remove them, if they should be removed.

 

If I'm understanding this correctly (big IF), in the STDIZE Procedure documentation, it uses PROC UNIVARIATE to find extreme values by group. Then uses PROC STDIZE with the STD method to compute location and scale measures (Don't we already have those from PROC UNIVARIATE?). Then It uses various standardization methods (i.e., MAD, IQR, ABW) to produce location and scale measures that are resistant to outliers, and superior to the STD method, and somehow uses a tuning constant (derived how?) to do so. Then it uses a DATA step to delete the outlier. And it does this for one variable.

 

Where does it determine that "64" is actually an outlier and not just an extreme value? Because of its impact on the dispersion ratio? I don't see formulas for dispersion ratio for figure 109.1 or the tuning constant. Do I need these?

 

I'm not a professional statistician or a very sophisticated SAS user. Any 4th grade level explanation you could offer would be appreciated!

 

 

 

 

Reeza
Super User
That was one of three or four solutions suggested. Did you look at the auto_outliers I linked to that allows you to specify the criteria, put in your input data set and then generates an output file of all possible outliers?

https://blogs.sas.com/content/sgf/2020/02/05/finding-possible-data-errors-using-the-auto_outliers-ma...

You don't indicate what you want to do with these so we can't offer better suggestions than identifying them at this point.

_maldini_
Barite | Level 11
Not yet. I left that one to last. I got scared when I saw that it involved macros...The trimmed and winsorized means blog posts were great. Thank you.
My goal: Find some syntax that I can use repeatedly to find outliers - for multiple variables - based on a set of pre-defined criteria (e.g., 1.5*IQR, 3*STD). Evaluate those outliers and decide if they should remain in the data set (more based on whether they are errors / reasonable than on their influence on the measures of central tendency). And then delete them, if I determine they are errors or not reasonable.
sbxkoenk
SAS Super FREQ

Hello,

 

I am "afraid" a macro is unavoidable.

But nothing to be scared about ...

 

If you do not have Enterprise Miner or Model Studio, you need coding for this and with coding you will need macro's.

Check out if this entry brings you any further :

Macro to identify and output outliers for multiple variables in a dataset
https://communities.sas.com/t5/SAS-Programming/Macro-to-identify-and-output-outliers-for-multiple-va...

 

Good luck,

Koen

sbxkoenk
SAS Super FREQ

Hello @_maldini_ ,

 

you ask : Where does it determine that "64" is actually an outlier and not just an extreme value?

 

... Outlier versus "just" an extreme value : That is a difficult consideration to make. And a matter of definitions. It also depends on what you want to achieve.

If your subject matter is so "sensitive", I am afraid the simple rule with the standard box plot will not help you out.

There exist Robust / Adjusted Box-plots for Detecting Outliers in Skewed Distributions. You might need these.

What do you want to do with your data? If you will fit a model, you can also determine every observation's relative impact on the results.

[EDIT] I am NOT saying an observation with a high weight / impact should be considered an outlier, let alone it should be removed from the data.

 

And again : uni-variate outlier detection can be done with multiple methods, not just via mean and std and IQR-calculations.

 

Koen

_maldini_
Barite | Level 11

@Reeza Do I need to create or define the macro before running the syntax? How do I run a macro?

%Auto_Outliers(Dsn=Clean.Patients,
               Id=Patno,
               Var_List=HR SBP DBP,
               Trim=.1,
               N_Sd=2.5)
Reeza
Super User

@_maldini_ wrote:

@Reeza Do I need to create or define the macro before running the syntax? How do I run a macro?

%Auto_Outliers(Dsn=Clean.Patients,
               Id=Patno,
               Var_List=HR SBP DBP,
               Trim=.1,
               N_Sd=2.5)

1. Download the macro code from the page as instructed

2. Save it to your computer and note the location

3. Compile the macro and run it

 

So you code will look like the following in the end. 

 

%include 'path to file from Step 2.sas' / source2;

%auto_outliers(dsn=.....);

UCLA introductory tutorial on macro variables and macros
https://stats.idre.ucla.edu/sas/seminars/sas-macros-introduction/

Examples of common macro usage
https://communities.sas.com/t5/SAS-Communities-Library/SAS-9-4-Macro-Language-Reference-Has-a-New-Ap...

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 25 replies
  • 1453 views
  • 20 likes
  • 5 in conversation