using kurtosis to remove outliers

geneshackman · Posted 09-08-2020 04:56 PM

For those who have been following my posts, here is the complete problem. I've been working out, as we go, how the process would go, and wasn't really sure before.

Here is a data set. In this example, there are three variables, V1, V2, V3. Each variable covers the same set of counties (a1-a5). V_other is the value of V1, V2, V3, by county. "Kurtosis" is the kurtosis for each variable. In this example, V1 has the largest kurtosis (44). V3 has the second largest kurtosis (36) and V2 has the third largest (24). Kurtosis is basically an indicator of outliers. In this example, V1 has the highest kurtosis, and it looks like counties a4 and a5 may be outliers, but if course, I don't know that yet. V3 has the second highest kurtosis, and it looks like county a2 might be an outlier. And for V2, counties a1 and a3 might be outliers. I made up all this data, so the kurtosis isn't really what it would be if I calculated it, but just for this example.

Now I want to eventually get, for V1, V2 and V3, sets of counties that have no outliers, that is, kurtosis <= 3.

So I think the sequence of events would be, for each V1, V2, V3, sort the counties by v_other, remove the county with the largest value, and recalculate the kurtosis. If it's larger than 3, then repeat, until kurtosis is <= 3 If possible, also keep a list of counties that had to be removed.

The complications are that before conducting this analysis, I don't know which variable will have the largest kurtosis, I don't know, within each variable, which counties will be the most outlier, nor which and how many counties will have to be removed for each variable.

Clear? Any suggestions appreciated.

variable	county	Kurtosis	v_other	VarOrder
v1	a1	44	3	1
v1	a2	44	4	1
v1	a3	44	2	1
v1	a4	44	12	1
v1	a5	44	25	1
v3	a1	36	1	2
v3	a2	36	45	2
v3	a3	36	2	2
v3	a4	36	5	2
v3	a5	36	3	2
v2	a1	24	18	3
v2	a2	24	2	3
v2	a3	24	40	3
v2	a4	24	4	3
v2	a5	24	5	3

Thanks

Gene

pink_poodle · Posted 09-08-2020 06:07 PM

Here is one approach using nested macros.

The inside macro, let us refer to it as %kur, would do what you say here in pseudo-code for one variable at a time:

sort the counties by v_other,
remove the county with the largest value,
recalculate the kurtosis.
repeat, until kurtosis is <= 3
keep a list of counties that had to be removed

The second macro is a %driver that iterates over variable list like so (notice that %kur is inside the %driver):

% let list = var1 var2 var3;

%macro driver(list);
  %let n = %sysfunc(countw(&list));
  %do j=1 %to &n;
       %let a = %scan(&list, &j);
               %kur(&a);
   %end;
 %mend;

geneshackman · Posted 09-08-2020 10:12 PM

Thanks, looks good, let me think about this for a bit.

mkeintz · Posted 09-08-2020 09:00 PM

As an operational task, you can, for each var_order:

transcribe the sequence of values into an array, by reading all records from a given var_order group.
generate an initial_kurtosis using the kurtosis function (with the array as an argument), and initialize final_kurtosis to the same value.
Do this outer Loop as long as final_kurtosis>3 and you have more than 4 values:
1. Inner Loop through individual values. You can test the effect of each value on final_kurtosis by dropping it, generating the reduction in kurtosis, putting it back, storing the potential change in another array, and proceeding to the next candidate, generating an array of potential reductions in kurtosis.
2. At the end of the inner loop, find the maximum change, and move the corresponding value to an array of withdrawn values, and update final_kurtosis.
At the end, of #3, re-read each observation in the group, outputting it with the new variables initial_kurtosis and final_kurtosis. And also make a drop_flag (='Y' or 'N') indicating whether that record value was dropped between the initial and final kurtosis.
~~At the end of generating a list of potential effects, find the largest and park it in an array of withdrawn values~~. (redundant)

Here's code using your data (for which only the second group starts out with kurtosis>3):

data have (drop=kurtosis);
  input variable :$2.	county :$2. 	Kurtosis 	var_value 	VarOrder;
datalines;
v1 	a1 	44 	3 	1
v1 	a2 	44 	4 	1
v1 	a3 	44 	2 	1
v1 	a4 	44 	12 	1
v1 	a5 	44 	25 	1
v3 	a1 	36 	1 	2
v3 	a2 	36 	45 	2
v3 	a3 	36 	2 	2
v3 	a4 	36 	5 	2
v3 	a5 	36 	3 	2
v2 	a1 	24 	18 	3
v2 	a2 	24 	2 	3
v2 	a3 	24 	40 	3
v2 	a4 	24 	4 	3
v2 	a5 	24 	5 	3
run;

data want (drop=_:  i );

  /* First pass, read a group and store in array _VALUE */
  do _nx=1 by 1 until (last.varorder);
    set have;
	by varorder;
	array _value{12};
	_value{_nx}=var_value;
  end;

  array _withdraw {12} ;   /* Indicate values to withdraw */
  array _kurt_change {12}; /* Effect on kurtosis of each value */

  /* Now do outer and inner loops to get kurtosis<=3 */
  initial_kurtosis=kurtosis(of _value{*});

  do final_kurtosis=initial_kurtosis by 0 while (final_kurtosis>3 and n(of _value{*})>4);
    do i=1 to _nx;
	  if _value{i}^=. then do;
	    _withdraw{i}=_value{i};
        _value{i}=.;
        _kurt_change{i}=final_kurtosis-kurtosis(of _value{*});
        _value{i}=_withdraw{i};
        _withdraw{i}=.;
      end;
    end;
	/* Find index corresponding to the maximum drop */
    i=whichn(max(of _kurt_change{*}),of _kurt_change{*}); 
    if _kurt_change{i}<0 then leave;

    _withdraw{i}=_value{i};
    _value{i}=. ;
    call missing(of _kurt_change{*});
    final_kurtosis=kurtosis(of _value{*});
  end;

  do i=1 by 1 until (last.varorder);
    set have;
	by varorder;
	if _withdraw{i}^=. then drop_flag='Y';
	else drop_flag='N';
	output;
  end;
  format initial_kurtosis  final_kurtosis  8.2 ;
run;

Notes:

I set each array size to 12. You should use a number that can accommodate the largest group population.
I use N(of _values{*}>4) in the outer loop specification, because the inner loop will reduce the number of elements by 1, and you need at least 4 to calculate the updated kurtosis.
This may not be the way to arrive at a global optimum, since this incrementally removes 1 element at a time. I don't know the structure of the problem well enough to know whether there is some group of simultaneous withdrawals that would be better than taking away one element at a time.
In fact, I don't even know whether taking an observation away always reduces kurtosis, so I have the
```
 if _kurt_change{i}<0 then leave;
```
right after setting i to the index of the maximum drop (but could it be negative?). But if it is known that withdrawing this observation always reduces kurtosis, then this statement can be eliminated.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

geneshackman · Posted 09-08-2020 10:13 PM

Thanks also for this, extensive solution. Let me work on this for a bit, to make sure I can follow.

pink_poodle · Posted 09-10-2020 11:37 AM

@mkeintz, what does this mean:

do final_kurtosis=initial_kurtosis by 0

Thanks!

mkeintz · Posted 09-10-2020 04:51 PM

@pink_poodle wrote:

@mkeintz, what does this mean:
do final_kurtosis=initial_kurtosis by 0
Thanks!

It means nothing (except you have defined an infinite loop).

But this:

do final_kurtosis=initial_kurtosis by 0 while (final_kurtosis>3 and n(of _value{*})>4);

means this:

final_kurtosis=initial_kurtosis;

do while (final_kurtosis>3 and n(of _value{*})>4);

In other words, it's just a way to collapse two statements into one.

I first saw this a while back in a comment from @hashman.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------