04-12-2015 05:20 AM
Hi SAS Experts -
I have a macro that calculates the acceptable range of a variable. The acceptable range is defined by:
Lower Limit = Q1 - 1.5*(Q3-Q1)
Upper Limit = Q3 + 1.5*(Q3-Q1)
This is the boxplot method of identifying outliers. The macro works, but it is inefficient: it loops over the variables, and for each one it runs PROC UNIVARIATE and then caps the values in a separate step. I would like PROC UNIVARIATE to run once for all the variables (not in a loop), save the output in a data set, and then cap all the variables with IF/THEN statements in a single pass.
Code:
options mprint symbolgen;
%macro outliers(input=, vars=, output= );
/* Work on a copy so the input data set is left untouched */
data &output;
set &input;
run;
/* Number of variables in the list */
%let n = %sysfunc(countw(&vars));
%do i= 1 %to &n;
%let val = %scan(&vars,&i);
/* Calculate the quartiles and inter-quartile range using proc univariate */
proc univariate data=&output noprint;
var &val;
output out=temp QRANGE= IQR Q1= First_Qtl Q3= Third_Qtl;
run;
/* Extract the quartiles and inter-quartile range into macro variables */
data _null_;
set temp;
call symput('QR', IQR);
call symput('Q1', First_Qtl);
call symput('Q3', Third_Qtl);
run;
%let ULimit=%sysevalf(&Q3 + 1.5 * &QR);
%let LLimit=%sysevalf(&Q1 - 1.5 * &QR);
/* Final dataset with outliers capped at the limits */
data &output;
set &output;
if &val < &Llimit then &val = &Llimit;
if &val > &Ulimit then &val = &Ulimit;
run;
%end;
%mend outliers;
%outliers(Input=abcd, Vars = a, output= test);
Thanks in anticipation!
10-09-2015 12:14 PM
I can outline an approach, but I don't have the time to give you all the details.
Consider this variation:
proc univariate data=&input noprint;
var &vars;
output out=ranges (keep=&vars) qrange=;
output out=q1 (keep=&vars) q1=;
output out=q3 (keep=&vars) q3=;
run;
That gives you three small output data sets (one observation apiece). You can investigate for yourself, but in the Q1 data set, each variable will be the Q1 value for that same original variable name.
Next step: transpose the three data sets so you have two columns in each (for example, original variable name, and the Q1 value). You're working with small data sets so the processing time will be minimal.
With all three data sets transposed, use a DATA step to read them in and write out IF/THEN statements to a file. Again, you're working with tiny data sets and the processing time will be minimal.
Finally, %include the IF/THEN statements in a DATA step to perform the calculations.
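The steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not a tested drop-in: SASHELP.CLASS with HEIGHT and WEIGHT stands in for &input and &vars, and the names q1_t, q3_t, varname, capcode and capped are all made up for the example. The statistics step swaps in PROC MEANS for PROC UNIVARIATE, because an OUTPUT statistic listed there with no names keeps the original variable names. The RANGES data set is left out since the inter-quartile range is simply Q3 - Q1. The missing-value guard on the lower cap is an addition of this sketch; the original macro does not have it.

```sas
/* Sketch only: SASHELP.CLASS and HEIGHT/WEIGHT stand in for &input and &vars. */
%let input = sashelp.class;
%let vars  = height weight;

/* One pass over the data; each OUTPUT statement keeps the original variable names. */
proc means data=&input noprint;
  var &vars;
  output out=q1 (drop=_type_ _freq_) q1=;
  output out=q3 (drop=_type_ _freq_) q3=;
run;

/* Transpose to one row per variable: VARNAME plus the statistic. */
proc transpose data=q1 out=q1_t (rename=(col1=q1)) name=varname;
run;
proc transpose data=q3 out=q3_t (rename=(col1=q3)) name=varname;
run;

proc sort data=q1_t; by varname; run;
proc sort data=q3_t; by varname; run;

/* Write one pair of capping statements per variable to a temporary file. */
filename capcode temp;

data _null_;
  merge q1_t q3_t;
  by varname;
  file capcode;
  lower = q1 - 1.5 * (q3 - q1);
  upper = q3 + 1.5 * (q3 - q1);
  /* Guard against missing values, which otherwise compare as less than any number. */
  put 'if not missing(' varname ') and ' varname '< ' lower best32.
      ' then ' varname '= ' lower best32. ';';
  put 'if ' varname '> ' upper best32. ' then ' varname '= ' upper best32. ';';
run;

/* Apply all generated capping statements in a single pass over the data. */
data capped;
  set &input;
  %include capcode;
run;
```

Only the final DATA step reads the full data set a second time; everything in between works on one-row or one-row-per-variable data sets, so the cost no longer grows with the number of variables the way the looped macro does.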
It is conceivable that ODS can save some of the work by producing an output data set with one row per variable and three statistics. I'm not familiar enough with the possible ODS outputs from univariate to know.
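For what it is worth, one way the ODS route might look is sketched below. It rests on assumptions that should be checked against your SAS release: that PROC UNIVARIATE exposes an ODS table named Quantiles with columns VarName, Quantile and Estimate, and that the quartile rows carry labels containing "Q1" and "Q3". NOPRINT has to be dropped, since it suppresses ODS tables. SASHELP.CLASS and the data set names quant, quartiles and limits are placeholders.

```sas
/* Suppress displayed output but still let ODS OUTPUT capture the table. */
ods exclude all;
ods output Quantiles=quant;   /* assumed table name; verify with ODS TRACE ON */
proc univariate data=sashelp.class;
  var height weight;
run;
ods exclude none;

/* Keep only the Q1 and Q3 rows (assumes the labels contain 'Q1' and 'Q3'). */
data quartiles;
  set quant;
  length stat $2;
  if index(Quantile, 'Q1') then stat = 'Q1';
  else if index(Quantile, 'Q3') then stat = 'Q3';
  else delete;
run;

proc sort data=quartiles;
  by VarName;
run;

/* Pivot to one row per variable with columns Q1 and Q3. */
proc transpose data=quartiles out=limits (drop=_name_);
  by VarName;
  id stat;
  var Estimate;
run;

/* Add the capping limits to each row. */
data limits;
  set limits;
  lower = Q1 - 1.5 * (Q3 - Q1);
  upper = Q3 + 1.5 * (Q3 - Q1);
run;
```

If this works as assumed, LIMITS already has one row per variable, so it could feed the step that writes the IF/THEN statements without the three separate transposes.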