For those who have been following my posts, here is the complete problem. I've been working out, as we go, how the process would go, and wasn't really sure before.
Here is a data set. In this example, there are three variables, V1, V2, V3. Each variable covers the same set of counties (a1-a5). V_other is the value of V1, V2, V3, by county. "Kurtosis" is the kurtosis for each variable. In this example, V1 has the largest kurtosis (44). V3 has the second largest kurtosis (36) and V2 has the third largest (24). Kurtosis is basically an indicator of outliers. In this example, V1 has the highest kurtosis, and it looks like counties a4 and a5 may be outliers, but if course, I don't know that yet. V3 has the second highest kurtosis, and it looks like county a2 might be an outlier. And for V2, counties a1 and a3 might be outliers. I made up all this data, so the kurtosis isn't really what it would be if I calculated it, but just for this example.
Now I want to eventually get, for V1, V2 and V3, sets of counties that have no outliers, that is, kurtosis <= 3.
So I think the sequence of events would be, for each V1, V2, V3, sort the counties by v_other, remove the county with the largest value, and recalculate the kurtosis. If it's larger than 3, then repeat, until kurtosis is <= 3 If possible, also keep a list of counties that had to be removed.
The complications are that before conducting this analysis, I don't know which variable will have the largest kurtosis, I don't know, within each variable, which counties will be the most outlier, nor which and how many counties will have to be removed for each variable.
Clear? Any suggestions appreciated.
variable | county | Kurtosis | v_other | VarOrder |
v1 | a1 | 44 | 3 | 1 |
v1 | a2 | 44 | 4 | 1 |
v1 | a3 | 44 | 2 | 1 |
v1 | a4 | 44 | 12 | 1 |
v1 | a5 | 44 | 25 | 1 |
v3 | a1 | 36 | 1 | 2 |
v3 | a2 | 36 | 45 | 2 |
v3 | a3 | 36 | 2 | 2 |
v3 | a4 | 36 | 5 | 2 |
v3 | a5 | 36 | 3 | 2 |
v2 | a1 | 24 | 18 | 3 |
v2 | a2 | 24 | 2 | 3 |
v2 | a3 | 24 | 40 | 3 |
v2 | a4 | 24 | 4 | 3 |
v2 | a5 | 24 | 5 | 3 |
Thanks
Gene
Here is one approach using nested macros.
The inside macro, let us refer to it as %kur, would do what you say here in pseudo-code for one variable at a time:
The second macro is a %driver that iterates over variable list like so (notice that %kur is inside the %driver):
% let list = var1 var2 var3;
%macro driver(list);
%let n = %sysfunc(countw(&list));
%do j=1 %to &n;
%let a = %scan(&list, &j);
%kur(&a);
%end;
%mend;
As an operational task, you can, for each var_order:
Here's code using your data (for which only the second group starts out with kurtosis>3):
data have (drop=kurtosis);
input variable :$2. county :$2. Kurtosis var_value VarOrder;
datalines;
v1 a1 44 3 1
v1 a2 44 4 1
v1 a3 44 2 1
v1 a4 44 12 1
v1 a5 44 25 1
v3 a1 36 1 2
v3 a2 36 45 2
v3 a3 36 2 2
v3 a4 36 5 2
v3 a5 36 3 2
v2 a1 24 18 3
v2 a2 24 2 3
v2 a3 24 40 3
v2 a4 24 4 3
v2 a5 24 5 3
run;
data want (drop=_: i );
/* First pass, read a group and store in array _VALUE */
do _nx=1 by 1 until (last.varorder);
set have;
by varorder;
array _value{12};
_value{_nx}=var_value;
end;
array _withdraw {12} ; /* Indicate values to withdraw */
array _kurt_change {12}; /* Effect on kurtosis of each value */
/* Now do outer and inner loops to get kurtosis<=3 */
initial_kurtosis=kurtosis(of _value{*});
do final_kurtosis=initial_kurtosis by 0 while (final_kurtosis>3 and n(of _value{*})>4);
do i=1 to _nx;
if _value{i}^=. then do;
_withdraw{i}=_value{i};
_value{i}=.;
_kurt_change{i}=final_kurtosis-kurtosis(of _value{*});
_value{i}=_withdraw{i};
_withdraw{i}=.;
end;
end;
/* Find index corresponding to the maximum drop */
i=whichn(max(of _kurt_change{*}),of _kurt_change{*});
if _kurt_change{i}<0 then leave;
_withdraw{i}=_value{i};
_value{i}=. ;
call missing(of _kurt_change{*});
final_kurtosis=kurtosis(of _value{*});
end;
do i=1 by 1 until (last.varorder);
set have;
by varorder;
if _withdraw{i}^=. then drop_flag='Y';
else drop_flag='N';
output;
end;
format initial_kurtosis final_kurtosis 8.2 ;
run;
Notes:
if _kurt_change{i}<0 then leave;
right after setting i to the index of the maximum drop (but could it be negative?). But if it is known that withdrawing this observation always reduces kurtosis, then this statement can be eliminated.Thanks also for this, extensive solution. Let me work on this for a bit, to make sure I can follow.
@pink_poodle wrote:
@mkeintz, what does this mean:
do final_kurtosis=initial_kurtosis by 0
Thanks!
It means nothing (except you have defined an infinite loop).
But this:
do final_kurtosis=initial_kurtosis by 0 while (final_kurtosis>3 and n(of _value{*})>4);
means this:
final_kurtosis=initial_kurtosis;
do while (final_kurtosis>3 and n(of _value{*})>4);
In other words, it's just a way to collapse two statements into one.
I first saw this a while back in a comment from @hashman.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.