Hi Team,
I am using a macro that cap outliers based on box-plot / turkey method. But the problem arises PROC MEANS truncates variable names and the same problem happens with PROC LOGISTIC.
Example variable name : prod_INVT_OLD_BAL_AVG_ACT_1_3M
This causes my macro to not working properly.
%macro box(input=, vars=, output= );
%let QR=;
%let Q1=;
%let Q3=;
%let varL=;
%let varH=;
%let n=%sysfunc(countw(&vars));
%do i= 1 %to &n;
%let val = %scan(&vars,&i);
%let QR = &QR &val._QRange;
%let Q1 = &Q1 &val._P25;
%let Q3 = &Q3 &val._P75;
%let varL = &varL &val.L;
%let varH = &varH &val.H;
%end;
/* Calculate the quartiles and inter-quartile range using proc univariate */
proc means data=&output nway noprint;
var &vars;
output out=temp QRANGE = P25= P75= / autoname;
run;
/* Extract the upper and lower limits into macro variables */
data temp;
set temp;
ID = 1;
array vara(&n) &QR;
array varb(&n) &Q1;
array varc(&n) &Q3;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(vara);
lower(i) = varb(i) - 3 * vara(i);
upper(i) = varc(i) + 3 * vara(i);
end;
drop i _type_ _freq_;
run;
data temp1;
set &input;
ID = 1;
run;
data &output;
merge temp1 temp;
by ID;
array vars(&n) &vars;
array lower(&n) &varL;
array upper(&n) &varH;
do i = 1 to dim(vars);
if not missing(vars(i)) then do;
if vars(i) < lower(i) then vars(i) = lower(i);
if vars(i) > upper(i) then vars(i) = upper(i);
end;
end;
drop &QR &Q1 &Q3 &varL &varH ID i;
run;
%mend;
%box(input=abcd, vars = a b c d, output= test);
This is not an easy solution, but it is very scalable and flexible. You would need to change your logic and processing, but you could do this using three output data sets rather than one:
proc means data=&output nway noprint;
var &vars;
output out=qrange (keep=&vars) qrange=;
output out=p25 (keep=&vars) p25=;
output out=p75 (keep&vars) p75=;
run;
By creating three data sets, you are free to re-use the original variable names to hold the calculated statistics. To aggregate them later, you would need to read them one data set at a time:
data stats;
array vars {&n} &vars;
array ranges {&n} range1-range25;
array q1 {&n} q1_1 - q1_&n;
array q3 {&n} q3_1 - q3_&n;
set qrange;
do _i_=1 to &n;
ranges{_i_} = vars{_i_};
end;
set p25;
do _i_=1 to &n;
q1{_i_} = vars{_i_};
end;
set p75;
do _i_=1 to &n;
q3{_i_} = vars{_i_};
end;
... more calculations as needed;
run;
In my mind, it's the right way to go because you never need to change any variable names. Good luck.
Do you know what the maximum length of a SAS variable name is? When you provide a 30 character variable and use autoname option to add enough characters to exceed that number of characters what should SAS do for a result?
This will be a problem with any fixed length limit of variable names though:
prod_INVT_OLD_BAL_AVG_ACT_1_3M
could be
ProdInvtOldBalAvgAct_1_3M
and save a few characters without losing information. Not that I want to type either of those very often.
I know the variable names in SAS can have a maximum length of 32 characters. I am asking for a solution. I cannot remove "_" from a variable name. Thanks for your time!
This is not an easy solution, but it is very scalable and flexible. You would need to change your logic and processing, but you could do this using three output data sets rather than one:
proc means data=&output nway noprint;
var &vars;
output out=qrange (keep=&vars) qrange=;
output out=p25 (keep=&vars) p25=;
output out=p75 (keep&vars) p75=;
run;
By creating three data sets, you are free to re-use the original variable names to hold the calculated statistics. To aggregate them later, you would need to read them one data set at a time:
data stats;
array vars {&n} &vars;
array ranges {&n} range1-range25;
array q1 {&n} q1_1 - q1_&n;
array q3 {&n} q3_1 - q3_&n;
set qrange;
do _i_=1 to &n;
ranges{_i_} = vars{_i_};
end;
set p25;
do _i_=1 to &n;
q1{_i_} = vars{_i_};
end;
set p75;
do _i_=1 to &n;
q3{_i_} = vars{_i_};
end;
... more calculations as needed;
run;
In my mind, it's the right way to go because you never need to change any variable names. Good luck.
What do you want in the output: The complete long variable name or the full suffix for the statistics requested with the autoname option? With variables named prod_INVT_OLD_BAL_AVG_ACT_1_3M you can't have both.
You can get lists of the variables generated by proc means like this:
proc sql ;
select NAME into : QR separated by ' '
from dictionary.columns
where memname='TEMP' and Libname='WORK'
and index(upcase(NAME),'QRANGE')>0;
select NAME into : Q1 separated by ' '
from dictionary.columns
where memname='TEMP' and Libname='WORK'
and index(upcase(NAME),'P25')>0;
select NAME into : Q3 separated by ' '
from dictionary.columns
where memname='TEMP' and Libname='WORK'
and index(upcase(NAME),'P75')>0;
quit;
Assuming none of the variables have the key words QRANGE, P25 or P75 as part of their names.
You wouldn't need the
%let QR = &QR &val._QRange;
%let Q1 = &Q1 &val._P25;
%let Q3 = &Q3 &val._P75;
any more.
Since you are using a macro you can make your own names for the variables you generate.
Call them VAR1 to VAR20 or SAM, JOE, FRED.
Your code has several mistakes beyond your stated issue, that would be more concerning to me:
1. proc means references the &output dataset when it should probably reference the &input dataset.
2. The following would only keep each macro variable retaining the last values, i.e. it doesn't do what you think it does.
%let n=%sysfunc(countw(&vars));
%do i= 1 %to &n;
%let val = %scan(&vars,&i);
%let QR = &QR &val._QRange;
%let Q1 = &Q1 &val._P25;
%let Q3 = &Q3 &val._P75;
%let varL = &varL &val.L;
%let varH = &varH &val.H;
%end;
EDIT:
I think this would have been your original source for the macro:
SAS Macro : Capping Outliers - Listen Data
Here's a quick sample I wrote that works differently. Unless you have huge data where multiple passes on the data is an issue a modification of this method should work well.
SAS Box-Plot/Tukey Method of Capping Outliers
*Calculate IQR and first/third quartiles;
proc means data=sashelp.class stackods n qrange p25 p75;
var weight height;
ods output summary=ranges;
run;
*create data with outliers to check;
data capped;
set sashelp.class;
if name='Alfred' then weight=220;
if name='Jane' then height=-30;
run;
*macro to cap outliers;
%macro cap(dset=,var=, lower=, upper=);
data &dset;
set &dset;
if &var>&upper then &var=&upper;
if &var<&lower then &var=&lower;
run;
%mend;
*create cutoffs and execute macro for each variable;
data cutoffs;
set ranges;
lower=p25-3*qrange;
upper=p75+3*qrange;
string = catt('%cap(dset=capped, var=', variable, ", lower=", lower, ", upper=", upper ,");");
call execute(string);
run;
Thank you so much. It's a nice code but it will be very heavy as my dataset consists of 1 million observations and 1K variables. But i learnt a lot of tricks from your code. Yes, i have copied code from that site. I have written an email to the author regd issues you highlighted just above. God bless you! 🙂
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.