BookmarkSubscribeRSS Feed
geneshackman
Pyrite | Level 9

For those who have been following my posts, here is the complete problem. I've been working out, as we go, how the process would go, and wasn't really sure before. 

 

Here is a data set. In this example, there are three variables, V1, V2, V3. Each variable covers the same set of counties (a1-a5). V_other is the value of V1, V2, V3, by county. "Kurtosis" is the kurtosis for each variable. In this example, V1 has the largest kurtosis (44). V3 has the second largest kurtosis (36) and V2 has the third largest (24). Kurtosis is basically an indicator of outliers. In this example, V1 has the highest kurtosis, and it looks like counties a4 and a5 may be outliers, but if course, I don't know that yet. V3 has the second highest kurtosis, and it looks like county a2 might be an outlier. And for V2, counties a1 and a3 might be outliers. I made up all this data, so the kurtosis isn't really what it would be if I calculated it, but just for this example. 

 

Now I want to eventually get, for V1, V2 and V3, sets of counties that have no outliers, that is, kurtosis <= 3.

 

So I think the sequence of events would be, for each V1, V2, V3, sort the counties by v_other, remove the county with the largest value, and recalculate the kurtosis. If it's larger than 3, then repeat, until kurtosis is <= 3  If possible, also keep a list of counties that had to be removed.

 

The complications are that before conducting this analysis, I don't know which variable will have the largest kurtosis, I don't know, within each variable, which counties will be the most outlier, nor which and how many counties will have to be removed for each variable.

 

Clear? Any suggestions appreciated.

 

variable county Kurtosis v_other VarOrder
v1 a1 44 3 1
v1 a2 44 4 1
v1 a3 44 2 1
v1 a4 44 12 1
v1 a5 44 25 1
v3 a1 36 1 2
v3 a2 36 45 2
v3 a3 36 2 2
v3 a4 36 5 2
v3 a5 36 3 2
v2 a1 24 18 3
v2 a2 24 2 3
v2 a3 24 40 3
v2 a4 24 4 3
v2 a5 24 5 3

 

Thanks

 

Gene

 

8 REPLIES 8
pink_poodle
Barite | Level 11

Here is one approach using nested macros.

The inside macro, let us refer to it as %kur, would do what you say here in pseudo-code for one variable at a time:

    •  sort the counties by v_other,
    • remove the county with the largest value,
    • recalculate the kurtosis.
    • repeat, until kurtosis is <= 3 
    • keep a list of counties that had to be removed

The second macro is a %driver that iterates over variable list like so (notice that %kur is inside the %driver):

% let list = var1 var2 var3;

%macro driver(list);
  %let n = %sysfunc(countw(&list));
  %do j=1 %to &n;
       %let a = %scan(&list, &j);
               %kur(&a);
   %end;
 %mend;

 

geneshackman
Pyrite | Level 9
Thanks, looks good, let me think about this for a bit.
mkeintz
PROC Star

As an operational task, you can, for each var_order:

  1. transcribe the sequence of values into an array, by reading all records from a given var_order group.
  2. generate an initial_kurtosis using the kurtosis function (with the array as an argument), and initialize final_kurtosis to the same value.

  3. Do this outer Loop as long as final_kurtosis>3 and you have more than 4 values:
    1. Inner Loop through individual values.  You can test the effect of each value on final_kurtosis by dropping it, generating the reduction in kurtosis, putting it back, storing the potential change in another array, and proceeding to the next candidate, generating an array of potential reductions in kurtosis.
    2. At the end of the inner loop, find the maximum change, and move the corresponding value to an array of withdrawn values, and update final_kurtosis.
  4. At the end, of #3, re-read each observation in the group, outputting it with the new variables initial_kurtosis and final_kurtosis.  And also make a drop_flag (='Y' or 'N') indicating whether that record value was dropped between the initial and final kurtosis.
  5. At the end of generating a list of potential effects, find the largest and park it in an array of withdrawn values.  (redundant)

Here's code using your data (for which only the second group starts out with kurtosis>3): 

 

data have (drop=kurtosis);
  input variable :$2.	county :$2. 	Kurtosis 	var_value 	VarOrder;
datalines;
v1 	a1 	44 	3 	1
v1 	a2 	44 	4 	1
v1 	a3 	44 	2 	1
v1 	a4 	44 	12 	1
v1 	a5 	44 	25 	1
v3 	a1 	36 	1 	2
v3 	a2 	36 	45 	2
v3 	a3 	36 	2 	2
v3 	a4 	36 	5 	2
v3 	a5 	36 	3 	2
v2 	a1 	24 	18 	3
v2 	a2 	24 	2 	3
v2 	a3 	24 	40 	3
v2 	a4 	24 	4 	3
v2 	a5 	24 	5 	3
run;

data want (drop=_:  i );

  /* First pass, read a group and store in array _VALUE */
  do _nx=1 by 1 until (last.varorder);
    set have;
	by varorder;
	array _value{12};
	_value{_nx}=var_value;
  end;

  array _withdraw {12} ;   /* Indicate values to withdraw */
  array _kurt_change {12}; /* Effect on kurtosis of each value */

  /* Now do outer and inner loops to get kurtosis<=3 */
  initial_kurtosis=kurtosis(of _value{*});

  do final_kurtosis=initial_kurtosis by 0 while (final_kurtosis>3 and n(of _value{*})>4);
    do i=1 to _nx;
	  if _value{i}^=. then do;
	    _withdraw{i}=_value{i};
        _value{i}=.;
        _kurt_change{i}=final_kurtosis-kurtosis(of _value{*});
        _value{i}=_withdraw{i};
        _withdraw{i}=.;
      end;
    end;
	/* Find index corresponding to the maximum drop */
    i=whichn(max(of _kurt_change{*}),of _kurt_change{*}); 
    if _kurt_change{i}<0 then leave;

    _withdraw{i}=_value{i};
    _value{i}=. ;
    call missing(of _kurt_change{*});
    final_kurtosis=kurtosis(of _value{*});
  end;

  do i=1 by 1 until (last.varorder);
    set have;
	by varorder;
	if _withdraw{i}^=. then drop_flag='Y';
	else drop_flag='N';
	output;
  end;
  format initial_kurtosis  final_kurtosis  8.2 ;
run;

 

 

Notes:

  1. I set each array size to 12.  You should use a number that can accommodate the largest group population.
  2. I use N(of _values{*}>4) in the outer loop specification, because the inner loop will reduce the number of elements by 1, and you need at least 4 to calculate the updated kurtosis.
  3. This may not be the way to arrive at a global optimum, since this incrementally removes 1 element at a time.  I don't know the structure of the problem well enough to know whether there is some group of simultaneous withdrawals that would be better than taking away one element at a time.
  4. In fact, I don't even know whether taking an observation away always reduces kurtosis, so I have the
     if _kurt_change{i}<0 then leave;
    right after setting i to the index of the maximum drop (but could it be negative?).  But if it is known that withdrawing this observation always reduces kurtosis, then this statement can be eliminated.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
geneshackman
Pyrite | Level 9

Thanks also for this, extensive solution. Let me work on this for a bit, to make sure I can follow.

pink_poodle
Barite | Level 11

@mkeintz, what does this mean:

do final_kurtosis=initial_kurtosis by 0

Thanks!

mkeintz
PROC Star

@pink_poodle wrote:

@mkeintz, what does this mean:

do final_kurtosis=initial_kurtosis by 0

Thanks!


It means nothing (except you have defined an infinite loop). 

 

But this:

do final_kurtosis=initial_kurtosis by 0 while (final_kurtosis>3 and n(of _value{*})>4);

means this:

  final_kurtosis=initial_kurtosis;

  do while (final_kurtosis>3 and n(of _value{*})>4);

In other words, it's just a way to collapse two statements into one.

 

I first saw this a while back in a comment from @hashman.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
hashman
Ammonite | Level 13
It was quite a while back indeed ;).
Ksharp
Super User
Interesting question. maybe @Rick_SAS

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 8 replies
  • 1238 views
  • 4 likes
  • 5 in conversation