Calcite | Level 5

SAS Proc cluster or Fastclus with a monotonically increasing continuous variable

Hi,

I am trying to cluster a monotonically increasing variable into buckets. As an example, take logsalary from the sashelp.baseball dataset:

```sas
DATA baseball;
    SET sashelp.baseball;
    WHERE NOT MISSING(logsalary);
    KEEP logSalary;
RUN;
```

Sort the data from smallest to largest:

```sas
PROC SORT
    DATA = baseball
    OUT = FULL;
    BY logsalary;
RUN;
```

Now, use PROC CLUSTER to create optimal bins for logsalary:

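```sas
/* Ward's minimum-variance clustering on the single variable logsalary.
   CCC and PSEUDO request the cubic clustering criterion and the
   pseudo F and t-squared statistics; PRINT = 25 limits the cluster
   history to the last 25 generations. */
PROC CLUSTER
    DATA = FULL
    OUTTREE = cluster_logsalary
    METHOD = ward
    CCC PSEUDO PRINT = 25;
    VAR logsalary;
RUN;
```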

My issue is that the output does contain bins, but within these bins logsalary is no longer monotonically increasing, which I am treating as an essential criterion for my binning. Is there a way to bin logsalary that minimises within-bin variance but keeps the monotonic ordering?

Diamond | Level 26

Re: SAS Proc cluster or Fastclus with a monotonically increasing continuous variable

It's not clear to me why you think this clustering of a single variable is not monotonic. What in the PROC CLUSTER output indicates that it is not? And what is your definition of monotonic here?

In general, clustering is applied to multiple X variables, and the clustering algorithms have no criterion such as monotonicity that applies across multiple X variables. Perhaps PROC HPBIN will give you what you want. As far as I know it should, since it operates on each variable individually: given the values 10 20 30 40 50 60, it should not bin 10 and 60 together (if that's what you mean by monotonic).

--
Paige Miller
Calcite | Level 5


I need the bins for logsalary to remain monotonically increasing, i.e. the median value of each bin increases from one bin to the next.

PROC HPBIN does not work well with poorly distributed data: you get equally spaced bins, with one bin containing 99% of the data.

Diamond | Level 26


@782822 wrote:

> I need the bins for logsalary to remain monotonically increasing i.e. the median value of each bin increases.

I still have no idea what you see in the PROC CLUSTER output that indicates things are not monotonic. Without further explanation, I consider this statement questionable and, in my opinion, incorrect.
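One way to produce that evidence is to cut the tree and look at the range of logsalary in each cluster. This is only a minimal sketch: NCLUSTERS = 5 is an arbitrary choice, and the data set names bins and bin_ranges and the variables lo, med, hi and n_obs are made up for illustration. It assumes the OUTTREE data set cluster_logsalary from the PROC CLUSTER step in the question.

```sas
/* Cut the Ward tree into a fixed number of clusters.
   NCLUSTERS = 5 is an arbitrary value chosen for illustration. */
PROC TREE DATA = cluster_logsalary NCLUSTERS = 5 OUT = bins NOPRINT;
    COPY logsalary;   /* carry logsalary into the OUT= data set */
RUN;

/* Minimum, median, maximum and count of logsalary per cluster. */
PROC MEANS DATA = bins NOPRINT NWAY;
    CLASS cluster;
    VAR logsalary;
    OUTPUT OUT = bin_ranges MIN = lo MEDIAN = med MAX = hi N = n_obs;
RUN;

/* Order the clusters by their lower bound, not by cluster number. */
PROC SORT DATA = bin_ranges;
    BY lo;
RUN;

PROC PRINT DATA = bin_ranges;
    VAR cluster n_obs lo med hi;
RUN;
```

If each row's hi is below the next row's lo, the bins do not overlap and the medians increase from bin to bin. The values of CLUSTER are only labels, so I would not assume they are numbered in the order of logsalary; sorting by lo is what puts the bins in order.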

> PROC HPBIN does not work well with poorly distributed data and so, you get equally spaced bins with 1 bin containing 99% of the data

I think that the QUANTILE method in PROC HPBIN avoids this problem.
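A minimal sketch of what I mean, assuming the sorted data set FULL from the question. NUMBIN = 5 is arbitrary and the output data set name binned is made up for illustration.

```sas
/* Quantile binning: cutpoints are placed so that each bin holds
   roughly the same number of observations, rather than spanning
   equal-width intervals as the BUCKET method does. */
PROC HPBIN DATA = FULL OUTPUT = binned NUMBIN = 5 QUANTILE;
    INPUT logsalary;
RUN;
```

Because the cutpoints are quantiles of a single variable, the bins are consecutive intervals of logsalary, so no bin should end up holding nearly all of the data.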

--
Paige Miller
Calcite | Level 5


I have also used the quantile method in the past, but I would like an alternative to having an equal number of observations per bucket.

Diamond | Level 26


So, honestly, I can't help because I still don't know what you mean by "monotonic" in this context, and I also don't know what criteria you do want (I know you don't want QUANTILE and you don't want BUCKET, which is not the same as stating what you do want).

--
Paige Miller
Opal | Level 21


It seems you are looking for a way to estimate cutpoints on logsalary that would define homogeneous subsets. If the distribution of logsalary shows multiple modes, you might want to look at PROC FMM to estimate the subcomponents of the logsalary distribution.
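As a rough sketch only, assuming the data set FULL from the question and a mixture of normal components: K = 2 is a guess to be revised after looking at the distribution, and the names fmm_out and component are made up for illustration.

```sas
/* Fit a two-component normal mixture to logsalary (intercept-only model).
   K = 2 is an assumption; adjust it to the number of modes you see. */
PROC FMM DATA = FULL;
    MODEL logsalary = / K = 2;
    /* CLASS= stores the most likely component for each observation. */
    OUTPUT OUT = fmm_out CLASS = component;
RUN;
```

The component assignments in fmm_out can then be summarised by minimum and maximum logsalary to read off approximate cutpoints. Note that mixture components can overlap, so the groups are not guaranteed to be clean intervals.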

PG