For those who have been following my posts, here is the complete problem. I've been working out the process as I go, and wasn't really sure of it before.
Here is a data set. In this example, there are three variables, V1, V2 and V3. Each variable covers the same set of counties (a1-a5). v_other is the value of the variable for that county. "Kurtosis" is the kurtosis for each variable, and it is basically an indicator of outliers. In this example, V1 has the largest kurtosis (44), V3 has the second largest (36) and V2 has the third largest (24). For V1, it looks like counties a4 and a5 may be outliers, but of course I don't know that yet. For V3, it looks like county a2 might be an outlier. And for V2, counties a1 and a3 might be outliers. I made up all this data, so the kurtosis values aren't what they would be if I actually calculated them; they are just for this example.
Eventually I want, for each of V1, V2 and V3, a set of counties that has no outliers, that is, kurtosis <= 3.
So I think the sequence of events would be: for each of V1, V2 and V3, sort the counties by v_other, remove the county with the largest value, and recalculate the kurtosis. If it's still larger than 3, repeat until kurtosis is <= 3. If possible, also keep a list of the counties that had to be removed.
The complications are that before conducting this analysis, I don't know which variable will have the largest kurtosis, which counties within each variable will be the most extreme outliers, or which and how many counties will have to be removed for each variable.
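To make the loop above concrete, here is a minimal sketch in Python. It is only an illustration under some assumptions that are not in the description: the function names `kurtosis` and `trim_until_kurtosis_ok` are made up, kurtosis is taken as the simple population form m4 / m2^2 (where a normal distribution gives 3), and a hypothetical `min_n` floor stops the loop from trimming a variable down to almost nothing. Each variable is handled independently, so nothing needs to be known in advance about which variable or county is the most extreme.

```python
from collections import defaultdict


def kurtosis(values):
    """Population (non-excess) kurtosis m4 / m2**2; a normal distribution gives 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    if m2 == 0:
        return float("nan")  # all values identical: kurtosis is undefined
    return m4 / m2 ** 2


def trim_until_kurtosis_ok(rows, threshold=3.0, min_n=4):
    """rows: iterable of (variable, county, value) tuples.

    For each variable, drop the county with the largest value until the
    kurtosis is <= threshold or only min_n counties remain (min_n is an
    assumed safety floor, not part of the original description).
    """
    by_var = defaultdict(list)
    for variable, county, value in rows:
        by_var[variable].append((county, value))

    result = {}
    for variable, pairs in by_var.items():
        kept = sorted(pairs, key=lambda p: p[1])  # ascending by value
        removed = []
        while len(kept) > min_n and kurtosis([v for _, v in kept]) > threshold:
            county, _ = kept.pop()  # remove the current largest value
            removed.append(county)
        result[variable] = {
            "kept": [c for c, _ in kept],
            "removed": removed,
            "kurtosis": kurtosis([v for _, v in kept]),
        }
    return result


if __name__ == "__main__":
    # The made-up example data from this post: (variable, county, v_other).
    data = [
        ("v1", "a1", 3), ("v1", "a2", 4), ("v1", "a3", 2), ("v1", "a4", 12), ("v1", "a5", 25),
        ("v3", "a1", 1), ("v3", "a2", 45), ("v3", "a3", 2), ("v3", "a4", 5), ("v3", "a5", 3),
        ("v2", "a1", 18), ("v2", "a2", 2), ("v2", "a3", 40), ("v2", "a4", 4), ("v2", "a5", 5),
    ]
    for variable, info in trim_until_kurtosis_ok(data).items():
        print(variable, info)
```

The result is a dictionary keyed by variable, holding the counties kept, the counties removed (in order of removal) and the final kurtosis. A statistics package may use a bias-corrected or excess kurtosis instead, in which case both the formula and the threshold would need adjusting.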
Clear? Any suggestions appreciated.
variable  county  Kurtosis  v_other  VarOrder
v1        a1      44        3        1
v1        a2      44        4        1
v1        a3      44        2        1
v1        a4      44        12       1
v1        a5      44        25       1
v3        a1      36        1        2
v3        a2      36        45       2
v3        a3      36        2        2
v3        a4      36        5        2
v3        a5      36        3        2
v2        a1      24        18       3
v2        a2      24        2        3
v2        a3      24        40       3
v2        a4      24        4        3
v2        a5      24        5        3
Thanks
Gene