I am trying to write syntax that will identify outliers. I am using these instructions as guidance.
I am running into some errors in the DATA step, including "Apparent symbolic reference AMP not resolved." I know nothing about Macros in SAS.
This is my syntax:
/* 3 Easy Ways to Find Outliers in SAS */ /* https://sasexamplecode.com/3-easy-ways-to-find-outliers-in-sas/ */ /* 1. Test the Assumption of Normality */ /* The first step if to test the normality assumption. */ /* In SAS, you can use PROC UNIVARIATE to check if your data follow a normal distribution. */ /* You do this by adding the NORMAL option to the UNIVARIATE statement. */ ods output TestsForNormality = work.normal_test; ods output BasicMeasures = work.measures; proc univariate data=meta.data_01 normal; var age; histogram age / normal; run; proc print data=work.normal_test noobs; run; proc print data=work.measures noobs; run; /* 2. Save the Mean and Standard Deviation as Macro Variables */ /* The second step to find outliers is to save the Mean and Standard Deviation as macro variables. */ /* PROC UNIVARIATE can also create a dataset with summary statistics such as the p-value of the normality test, the mean, and the standard deviation. */ /* To do so, we use the ODS OUTPUT statement. */ /* See above */ /* To make your code reusable and to find the outliers more efficiently, we save the p-value of the Shapiro-Wilk test, */ /* the mean, and the standard deviation as three macro variables with a SELECT INTO statement. */ /* See below */ proc sql; select pValue label= 'p-value' into :pvalue from work.normal_test where test = 'Shapiro-Wilk'; select LocValue label = 'Mean' into :mean from work.measures where LocMeasure ='Mean'; select VarValue label = 'Std Dev' into :stddev from work.measures where VarMeasure ='Std Deviation'; quit; /* 3. Filter the Outliers */ /* The third step to find outliers in SAS is filtering all observations that are 3 standard deviations above or below the mean. */ data work.outliers_normaldistr; set meta.data_01; if age lt (&mean. - 3*&stddev.) or age gt (&mean. + 3*&stddev.) then output; run; proc print data=work.outliers_normaldistr noobs; run;
Here is the log:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;6869 data work.outliers_normaldistr;70 set meta.data_01;7172 if age lt (&mean. - 3*&stddev.)_ _____386 18076WARNING: Apparent symbolic reference AMP not resolved.WARNING: Apparent symbolic reference AMP not resolved.72 if age lt (&mean. - 3*&stddev.)_______180ERROR 386-185: Expecting an arithmetic expression.ERROR 180-322: Statement is not valid or it is used out of proper order.ERROR 76-322: Syntax error, statement will be ignored.73 or age gt (&mean. + 3*&stddev.) then output;_____ _______180 180WARNING: Apparent symbolic reference AMP not resolved.WARNING: Apparent symbolic reference AMP not resolved.ERROR 180-322: Statement is not valid or it is used out of proper order.74 run;NOTE: The SAS System stopped processing this step because of errors.WARNING: The data set WORK.OUTLIERS_NORMALDISTR may be incomplete. When this step was stopped there were 0 observations and 164variables.WARNING: Data set WORK.OUTLIERS_NORMALDISTR was not replaced because this step was stopped.NOTE: DATA statement used (Total process time):real time 0.00 secondsuser cpu time 0.00 secondssystem cpu time 0.00 secondsmemory 1424.18kOS Memory 36008.00kTimestamp 11/17/2021 09:54:53 PMStep Count 282 Switch Count 0Page Faults 0Page Reclaims 92Page Swaps 0Voluntary Context Switches 6Involuntary Context Switches 0Block Input Operations 0Block Output Operations 87576 proc print data=work.outliers_normaldistr noobs;77 run;NOTE: No variables in data set WORK.OUTLIERS_NORMALDISTR.NOTE: PROCEDURE PRINT used (Total process time):real time 0.00 secondsuser cpu time 0.00 secondssystem cpu time 0.00 secondsmemory 478.43kOS Memory 35488.00kTimestamp 11/17/2021 09:54:53 PMStep Count 283 Switch Count 0Page Faults 0Page Reclaims 16Page Swaps 0Voluntary Context Switches 0Involuntary Context Switches 0Block Input Operations 0Block Output Operations 07879 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;89
Here is a sample data set:
data WORK.DATA_04; infile datalines dsd truncover; input Participant_ID:$3. Group:$1. Sex:$2. Age:BEST12. Efflux_V1:BEST12. Efflux_V2:BEST12. Efflux_V3:BEST12. ApoA1_V1:BEST12. ApoA1_V2:BEST12. ApoA1_V3:BEST12. ApoC1_V1:BEST12. ApoC1_V2:BEST12. ApoC1_V3:BEST12.; format Age BEST12. Efflux_V1 BEST12. Efflux_V2 BEST12. Efflux_V3 BEST12. ApoA1_V1 BEST12. ApoA1_V2 BEST12. ApoA1_V3 BEST12. ApoC1_V1 BEST12. ApoC1_V2 BEST12. ApoC1_V3 BEST12.; datalines; 1 A M 52 11.68 12.59 11.21 238.65 279.72 171.58 41.22 62.36 36.07 10 B M 68 9.58 9.18 10.78 215.79 214.9 253.98 47.33 38 50.52 11 A F 71 12.26 9.17 9.94 282.3 227.08 282.3 44.13 44.21 44.13 12 B M 71 5.88 9.45 10.55 173.07 230.49 174.09 47.8 51.28 37.81 13 A F 71 13.17 12.69 11.33 259.03 265.83 255.03 61.34 67.46 73.5 14 B M 54 10.51 7.96 8.28 211.39 192.76 192.17 41.14 36.83 34.86 15 A F 66 7.34 6.74 8.69 240.58 160.97 205.72 35.8 25.89 44.28 16 B F 69 11.07 13.44 10.08 236.45 242.66 214.03 54.07 55.34 37.61 17 A F 58 8.1 7.62 8.03 188.51 159.8 164.22 36.04 32.35 30.78 18 B F 63 10.14 10.06 10.78 229.05 252.06 228.63 57.49 63.17 50.44 ;;;; Run;
Finally, if possible, it would be helpful to list the outliers by Participant_ID. Is there a way to modify this syntax to do so?
Thank you.
Macro variable & has never been defined, so you can't refer to it in your code. I see that the web site you link to doesn't provide guidance on that either.
But instead of trying to determine if a variable is more than ±3 standard deviations from the mean, you can use PROC STDIZE and then no macro variables are needed at all.
Hello,
You are trying to find univariate outliers as I see it.
See these blogs for (other) code.
Detecting outliers in SAS: Part 1: Estimating location
By Rick Wicklin on The DO Loop January 20, 2012
https://blogs.sas.com/content/iml/2012/01/20/detecting-outliers-in-sas-part-1-estimating-location.ht...
Detecting outliers in SAS: Part 2: Estimating scale
By Rick Wicklin on The DO Loop January 27, 2012
https://blogs.sas.com/content/iml/2012/01/27/detecting-outliers-in-sas-part-2-estimating-scale.html
Let me know if you also want to find multi-variate outliers. That can be done with PROC ROBUSTREG, isolation forests, auto-encoders (NN), PROC PRINCOMP, and some other techniques I am forgetting about right now.
Good luck,
Koen
When who ever owns that site uploaded the code, the code got converted to HTML incorrectly.
Maybe this is a better reference for what you need?
& is actually supposed to be just &
So this
data work.outliers_normaldistr;
set meta.data_01;
if age lt (&mean. - 3*&stddev.)
or age gt (&mean. + 3*&stddev.) then output;
run;
Should be:
data work.outliers_normaldistr;
set meta.data_01;
if age lt (&mean. - 3*&stddev.)
or age gt (&mean. + 3*&stddev.) then output;
run;
@_maldini_ wrote:
I am trying to write syntax that will identify outliers. I am using these instructions as guidance.
I am running into some errors in the DATA step, including "Apparent symbolic reference AMP not resolved." I know nothing about Macros in SAS.
This is my syntax:
/* 3 Easy Ways to Find Outliers in SAS */ /* https://sasexamplecode.com/3-easy-ways-to-find-outliers-in-sas/ */ /* 1. Test the Assumption of Normality */ /* The first step if to test the normality assumption. */ /* In SAS, you can use PROC UNIVARIATE to check if your data follow a normal distribution. */ /* You do this by adding the NORMAL option to the UNIVARIATE statement. */ ods output TestsForNormality = work.normal_test; ods output BasicMeasures = work.measures; proc univariate data=meta.data_01 normal; var age; histogram age / normal; run; proc print data=work.normal_test noobs; run; proc print data=work.measures noobs; run; /* 2. Save the Mean and Standard Deviation as Macro Variables */ /* The second step to find outliers is to save the Mean and Standard Deviation as macro variables. */ /* PROC UNIVARIATE can also create a dataset with summary statistics such as the p-value of the normality test, the mean, and the standard deviation. */ /* To do so, we use the ODS OUTPUT statement. */ /* See above */ /* To make your code reusable and to find the outliers more efficiently, we save the p-value of the Shapiro-Wilk test, */ /* the mean, and the standard deviation as three macro variables with a SELECT INTO statement. */ /* See below */ proc sql; select pValue label= 'p-value' into :pvalue from work.normal_test where test = 'Shapiro-Wilk'; select LocValue label = 'Mean' into :mean from work.measures where LocMeasure ='Mean'; select VarValue label = 'Std Dev' into :stddev from work.measures where VarMeasure ='Std Deviation'; quit; /* 3. Filter the Outliers */ /* The third step to find outliers in SAS is filtering all observations that are 3 standard deviations above or below the mean. */ data work.outliers_normaldistr; set meta.data_01; if age lt (&mean. - 3*&stddev.) or age gt (&mean. + 3*&stddev.) then output; run; proc print data=work.outliers_normaldistr noobs; run;Here is the log:
1 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;6869 data work.outliers_normaldistr;70 set meta.data_01;7172 if age lt (&mean. - 3*&stddev.)_ _____386 18076WARNING: Apparent symbolic reference AMP not resolved.WARNING: Apparent symbolic reference AMP not resolved.72 if age lt (&mean. - 3*&stddev.)_______180ERROR 386-185: Expecting an arithmetic expression.ERROR 180-322: Statement is not valid or it is used out of proper order.ERROR 76-322: Syntax error, statement will be ignored.73 or age gt (&mean. + 3*&stddev.) then output;_____ _______180 180WARNING: Apparent symbolic reference AMP not resolved.WARNING: Apparent symbolic reference AMP not resolved.ERROR 180-322: Statement is not valid or it is used out of proper order.74 run;NOTE: The SAS System stopped processing this step because of errors.WARNING: The data set WORK.OUTLIERS_NORMALDISTR may be incomplete. When this step was stopped there were 0 observations and 164variables.WARNING: Data set WORK.OUTLIERS_NORMALDISTR was not replaced because this step was stopped.NOTE: DATA statement used (Total process time):real time 0.00 secondsuser cpu time 0.00 secondssystem cpu time 0.00 secondsmemory 1424.18kOS Memory 36008.00kTimestamp 11/17/2021 09:54:53 PMStep Count 282 Switch Count 0Page Faults 0Page Reclaims 92Page Swaps 0Voluntary Context Switches 6Involuntary Context Switches 0Block Input Operations 0Block Output Operations 87576 proc print data=work.outliers_normaldistr noobs;77 run;NOTE: No variables in data set WORK.OUTLIERS_NORMALDISTR.NOTE: PROCEDURE PRINT used (Total process time):real time 0.00 secondsuser cpu time 0.00 secondssystem cpu time 0.00 secondsmemory 478.43kOS Memory 35488.00kTimestamp 11/17/2021 09:54:53 PMStep Count 283 Switch Count 0Page Faults 0Page Reclaims 16Page Swaps 0Voluntary Context Switches 0Involuntary Context Switches 0Block Input Operations 0Block Output Operations 07879 OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK;89
Here is a sample data set:
data WORK.DATA_04; infile datalines dsd truncover; input Participant_ID:$3. Group:$1. Sex:$2. Age:BEST12. Efflux_V1:BEST12. Efflux_V2:BEST12. Efflux_V3:BEST12. ApoA1_V1:BEST12. ApoA1_V2:BEST12. ApoA1_V3:BEST12. ApoC1_V1:BEST12. ApoC1_V2:BEST12. ApoC1_V3:BEST12.; format Age BEST12. Efflux_V1 BEST12. Efflux_V2 BEST12. Efflux_V3 BEST12. ApoA1_V1 BEST12. ApoA1_V2 BEST12. ApoA1_V3 BEST12. ApoC1_V1 BEST12. ApoC1_V2 BEST12. ApoC1_V3 BEST12.; datalines; 1 A M 52 11.68 12.59 11.21 238.65 279.72 171.58 41.22 62.36 36.07 10 B M 68 9.58 9.18 10.78 215.79 214.9 253.98 47.33 38 50.52 11 A F 71 12.26 9.17 9.94 282.3 227.08 282.3 44.13 44.21 44.13 12 B M 71 5.88 9.45 10.55 173.07 230.49 174.09 47.8 51.28 37.81 13 A F 71 13.17 12.69 11.33 259.03 265.83 255.03 61.34 67.46 73.5 14 B M 54 10.51 7.96 8.28 211.39 192.76 192.17 41.14 36.83 34.86 15 A F 66 7.34 6.74 8.69 240.58 160.97 205.72 35.8 25.89 44.28 16 B F 69 11.07 13.44 10.08 236.45 242.66 214.03 54.07 55.34 37.61 17 A F 58 8.1 7.62 8.03 188.51 159.8 164.22 36.04 32.35 30.78 18 B F 63 10.14 10.06 10.78 229.05 252.06 228.63 57.49 63.17 50.44 ;;;; Run;Finally, if possible, it would be helpful to list the outliers by Participant_ID. Is there a way to modify this syntax to do so?
Thank you.
@Reeza Thank you.
This worked (for age) and I will accept this as the solution. But...it leads to 2 other questions/issues:
1. Why doesn't this work when I change the variable from age to Efflux_V1?
When I keep the syntax exactly the same, but change age to Efflux_V1, I get the following:
12.07 does not meet the definition of an outlier (i.e., (&mean. - 3*&stddev.) , (&mean. + 3*&stddev.)).
2. How can I adapt this syntax to incorporate multiple variables?
Here is my attempt to simply add 3 more variables, in addition to age:
ods output TestsForNormality = work.normal_test;
ods output BasicMeasures = work.measures;
proc univariate data=meta.data_01 normal;
var
age
Efflux_V1
Efflux_V2
Efflux_V3
;
/* histogram age / normal; */
run;
proc print data=work.normal_test noobs;
run;
proc print data=work.measures noobs;
run;
proc sql;
select pValue label= 'p-value' into :pvalue from work.normal_test where test = 'Shapiro-Wilk';
select LocValue label = 'Mean' into :mean from work.measures where LocMeasure ='Mean';
select VarValue label = 'Std Dev' into :stddev from work.measures where VarMeasure ='Std Deviation';
quit;
data work.outliers_normaldistr;
set meta.data_01;
if age lt (&mean. - 3*&stddev.)
or age gt (&mean. + 3*&stddev.) then output;
if Efflux_V1 lt (&mean. - 3*&stddev.)
or Efflux_V1 gt (&mean. + 3*&stddev.) then output;
if Efflux_V2 lt (&mean. - 3*&stddev.)
or Efflux_V2 gt (&mean. + 3*&stddev.) then output;
if Efflux_V3 lt (&mean. - 3*&stddev.)
or Efflux_V3 gt (&mean. + 3*&stddev.) then output;
run;
proc print data=work.outliers_normaldistr noobs;
Var participant_id age Efflux_V1--Efflux_V3;
/* Var participant_id age; */
run;
Here is the output:
I will continue to review the other resources you posted. Thank you for your time.
How can I adapt this syntax to incorporate multiple variables?
PROC STDIZE
PROC STDIZE
PROC STDIZE
Looking at the STDIZE Procedure documentation. I'm not sure how this solves my problem. My goal is to identify outliers (PROC UNIVARIATE shows extreme values, but it doesn't help me determine if they meet my definition of an outlier, e.g. = 1.5*IQR, 3STD*mean) for multiple variables and then remove them, if they should be removed.
If I'm understanding this correctly (big IF), in the STDIZE Procedure documentation, it uses PROC UNIVARIATE to find extreme values by group. Then uses PROC STDIZE with the STD method to compute location and scale measures (Don't we already have those from PROC UNIVARIATE?). Then It uses various standardization methods (i.e., MAD, IQR, ABW) to produce location and scale measures that are resistant to outliers, and superior to the STD method, and somehow uses a tuning constant (derived how?) to do so. Then it uses a DATA step to delete the outlier. And it does this for one variable.
Where does it determine that "64" is actually an outlier and not just an extreme value? Because of its impact on the dispersion ratio? I don't see formulas for dispersion ratio for figure 109.1 or the tuning constant. Do I need these?
I'm not a professional statistician or a very sophisticated SAS user. Any 4th grade level explanation you could offer would be appreciated!
Hello,
I am "afraid" a macro is unavoidable.
But nothing to be scared about ...
If you do not have Enterprise Miner or Model Studio, you need coding for this and with coding you will need macro's.
Check out if this entry brings you any further :
Macro to identify and output outliers for multiple variables in a dataset
https://communities.sas.com/t5/SAS-Programming/Macro-to-identify-and-output-outliers-for-multiple-va...
Good luck,
Koen
Hello @_maldini_ ,
you ask : Where does it determine that "64" is actually an outlier and not just an extreme value?
... Outlier versus "just" an extreme value : That is a difficult consideration to make. And a matter of definitions. It also depends on what you want to achieve.
If your subject matter is so "sensitive", I am afraid the simple rule with the standard box plot will not help you out.
There exist Robust / Adjusted Box-plots for Detecting Outliers in Skewed Distributions. You might need these.
What do you want to do with your data? If you will fit a model, you can also determine every observation's relative impact on the results.
[EDIT] I am NOT saying an observation with a high weight / impact should be considered an outlier, let alone it should be removed from the data.
And again : uni-variate outlier detection can be done with multiple methods, not just via mean and std and IQR-calculations.
Koen
@Reeza Do I need to create or define the macro before running the syntax? How do I run a macro?
%Auto_Outliers(Dsn=Clean.Patients,
Id=Patno,
Var_List=HR SBP DBP,
Trim=.1,
N_Sd=2.5)
@_maldini_ wrote:
@Reeza Do I need to create or define the macro before running the syntax? How do I run a macro?
%Auto_Outliers(Dsn=Clean.Patients, Id=Patno, Var_List=HR SBP DBP, Trim=.1, N_Sd=2.5)
1. Download the macro code from the page as instructed
2. Save it to your computer and note the location
3. Compile the macro and run it
So you code will look like the following in the end.
%include 'path to file from Step 2.sas' / source2;
%auto_outliers(dsn=.....);
UCLA introductory tutorial on macro variables and macros
https://stats.idre.ucla.edu/sas/seminars/sas-macros-introduction/
Examples of common macro usage
https://communities.sas.com/t5/SAS-Communities-Library/SAS-9-4-Macro-Language-Reference-Has-a-New-Ap...
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.