Hello
For each custID+VAR I calculated UCL and LCL (control limits).
Then I want to calculate for each CustID+VAR the average without extreme values
Note-Extreme values are values lower LCL or higher UCL
What is the best way to do it
Data have;
input CustID mon X Y Z;
cards;
111 6 10 20 300 10 80
111 5 11 12 13 2 15
111 4 14 15 16 7 20
111 3 17 18 19 6 25
111 2 20 21 22 5 25
111 1 23 24 25 4 30
222 6 40 41 80 10 50
222 5 37 38 39 12 50
222 4 15 13 12 2 20
222 3 12 14 16 5 20
222 2 50 30 20 5 40
222 1 15 16 40 3 25
;
Run;
proc sort data=have;by CustID mon ;Run;
proc transpose data=have out=have2(Rename=(_NAME_=VAR COL1=value));
by CustID mon ;
var X Y Z;
quit;
proc summary data=have2 nway;
class CustID var;
var VALUE;
output out=summary (drop=_type_ _freq_ rename=(CV=CV6 mean=AVG6 N=N6 std=std6 max=max6 min=min6)) cv=CV mean= mean n=n std=std min=min max=max;
run;
data summary2;
set summary;
UCL=AVG6+std6;
LCL=AVG6-std6;
Run;
Data have;
input CustID VAR $ AVG_no_EXT;
cards;
This is the requyired data set
ADVICE: do NOT transpose long data sets. This is a really bad habit you have. Long data sets are already in the proper layout for analyzing.
proc summary data=have nway;
class custid;
var x y z;
output out=stats mean= std=/autoname;
run;
data want; /* Set extreme values to missing */
merge have stats;
by custid;
array data x y z;
array avg x_mean--z_mean;
array stddev x_stddev--z_stddev;
do i=1 to dim(data);
if data(i) > (avg(i)+stddev(i)) or data(i) < (avg(i)-stddev(i)) then data(i)=.;
end;
keep custid mon x y z;
run;
proc means data=want;
var x y z;
output out=means_no_extremes mean=;
run;
Even simpler, use PROC STDIZE (which is particularly helpful if you really have 75 variables instead of 3) and PROC STDIZE works with long (untransposed) data sets.
proc stdize data=have out=stdized sprefix=s_ oprefix=o_;
by custid;
var x y z;
run;
data want;
set stdized;
array s_ s_:; /* Array of standardized values */
array o o_:; /* Array of original data */
/* Determine which standardized values are beyond ±1, make the data missing in that case */
do i=1 to dim(data);
if abs(s_(i)) > 1 then o(i)=.;
end;
run;
proc summary data=want;
by custid;
var o_:;
output out=means_no_extremes mean=;
run;
Note 1: PROC STDIZE is clearly less programming than the first approach
Note 2: you do not have confidence limits or control limits, you are adding one standard deviation to the mean to get the upper limit, and subtracting one standarad deviation from the mean to get the lower limit. Do not call these confidence limits or control limits, they are not. You can call these upper and lower limits for the purpose of detecting outliers, if that's what you want. Even so, they are quite narrow for the purpose of detecting "extreme values", but how wide they should be is up to you.
ADVICE: do NOT transpose long data sets. This is a really bad habit you have. Long data sets are already in the proper layout for analyzing.
proc summary data=have nway;
class custid;
var x y z;
output out=stats mean= std=/autoname;
run;
data want; /* Set extreme values to missing */
merge have stats;
by custid;
array data x y z;
array avg x_mean--z_mean;
array stddev x_stddev--z_stddev;
do i=1 to dim(data);
if data(i) > (avg(i)+stddev(i)) or data(i) < (avg(i)-stddev(i)) then data(i)=.;
end;
keep custid mon x y z;
run;
proc means data=want;
var x y z;
output out=means_no_extremes mean=;
run;
Even simpler, use PROC STDIZE (which is particularly helpful if you really have 75 variables instead of 3) and PROC STDIZE works with long (untransposed) data sets.
proc stdize data=have out=stdized sprefix=s_ oprefix=o_;
by custid;
var x y z;
run;
data want;
set stdized;
array s_ s_:; /* Array of standardized values */
array o o_:; /* Array of original data */
/* Determine which standardized values are beyond ±1, make the data missing in that case */
do i=1 to dim(data);
if abs(s_(i)) > 1 then o(i)=.;
end;
run;
proc summary data=want;
by custid;
var o_:;
output out=means_no_extremes mean=;
run;
Note 1: PROC STDIZE is clearly less programming than the first approach
Note 2: you do not have confidence limits or control limits, you are adding one standard deviation to the mean to get the upper limit, and subtracting one standarad deviation from the mean to get the lower limit. Do not call these confidence limits or control limits, they are not. You can call these upper and lower limits for the purpose of detecting outliers, if that's what you want. Even so, they are quite narrow for the purpose of detecting "extreme values", but how wide they should be is up to you.
It is perfect, however final Avg without extreme should be done for each customer
proc means data=want;
by CustID;
var x y z;
output out=means_no_extremes mean=;
run;
Ok, good catch, and you were able to fix it. Actually, for the PROC STDIZE solution I did include BY CUSTID.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.