Hi,
I am trying to cluster a monotonically increasing variable into buckets. As an example, take logsalary from the sashelp.baseball dataset:
DATA baseball;
SET sashelp.baseball;
WHERE NOT MISSING(logsalary);
KEEP logSalary;
RUN;
Sort the data from smallest to largest:
PROC SORT
DATA = baseball
OUT = FULL;
BY logsalary;
RUN;
Now use PROC CLUSTER to create optimal bins for logsalary:
PROC CLUSTER
DATA = FULL
OUTTREE = cluster_logsalary
METHOD = ward
CCC PSEUDO PRINT = 25;
VAR logsalary;
RUN;
My issue is that the output does create bins, but within these bins logsalary is no longer monotonically increasing, which is an essential criterion for my binning. Is there a way to bin logsalary with minimised variance while keeping the monotonic characteristics?
It's not clear to me why you think this clustering of a single variable is not monotonic. What is the evidence in the PROC CLUSTER output that says it is not monotonic? And what is the definition of monotonic here?
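One way to look for that evidence is to cut the tree into a fixed number of clusters and compare the range of logsalary in each one. The following is only a sketch: it assumes the cluster_logsalary data set from the PROC CLUSTER step above, and both the choice of 5 clusters and the output data set name clus are arbitrary placeholders. If the MIN to MAX ranges do not overlap from one cluster to the next, the clusters are contiguous intervals and their medians are ordered.

```sas
/* Cut the tree produced by PROC CLUSTER into 5 clusters.           */
/* NCLUSTERS=5 and the name "clus" are assumptions for illustration. */
proc tree data=cluster_logsalary nclusters=5 out=clus noprint;
   copy logsalary;   /* carry logsalary into the output data set */
run;

/* Summarize logsalary within each cluster. Overlapping MIN/MAX     */
/* ranges between clusters would indicate non-contiguous bins.      */
proc means data=clus n min median max;
   class cluster;
   var logsalary;
run;
```

Note that the cluster numbers assigned by PROC TREE are just labels, so compare the clusters after ordering them by their minimum or median rather than by cluster number.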
In general, clustering is used on multiple X variables, and the clustering algorithms have no criterion such as monotonicity that applies across multiple X variables. Perhaps PROC HPBIN will give you what you want. As far as I know it should, since it operates on each variable individually: if you have the values 10 20 30 40 50 60, it should not bin 10 and 60 together (if that's what you mean by monotonic).
I need the bins for logsalary to remain monotonically increasing i.e. the median value of each bin increases.
PROC HPBIN does not work well with poorly distributed data and so, you get equally spaced bins with 1 bin containing 99% of the data
@782822 wrote:
I need the bins for logsalary to remain monotonically increasing i.e. the median value of each bin increases.
I still have no idea what you see in the PROC CLUSTER output that indicates things are not monotonic, and so without further explanation, I consider this statement to be questionable and in my opinion, incorrect.
PROC HPBIN does not work well with poorly distributed data and so, you get equally spaced bins with 1 bin containing 99% of the data
I think that the QUANTILE method in PROC HPBIN avoids this problem.
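For reference, a minimal sketch of quantile binning with PROC HPBIN might look like the following. It assumes the sorted data set FULL from the original post; the number of bins (5) and the output data set name binned are placeholders chosen only for illustration. As far as I know the binned variable is written with a BIN_ prefix, but check the output data set if the name differs in your release.

```sas
/* Quantile binning: bins hold roughly equal numbers of observations. */
/* NUMBIN=5 and OUTPUT=binned are assumptions for illustration.        */
proc hpbin data=full numbin=5 quantile output=binned;
   input logsalary;
run;

/* Count the observations that landed in each bin.                    */
/* BIN_logsalary is the expected name of the binned variable.          */
proc freq data=binned;
   tables BIN_logsalary;
run;
```

Because the cutpoints are quantiles of logsalary, a single bin cannot absorb nearly all of the data the way an equal-width bucket can.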
I have also used the quantile method in the past, but I would like an alternative to having an equal number of observations per bucket.
So, honestly, I can't help because I still don't know what you mean by "monotonic" in this context, and I also don't know what criteria you do want (I know you don't want QUANTILE and you don't want BUCKET, which is not the same as stating what you do want).
It seems like you are looking for a way to estimate cutpoints on logsalary that would define homogeneous subsets. If the distribution of logsalary shows multiple modes, you might want to look at PROC FMM to estimate the subcomponents of the logsalary distribution.
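As a rough starting point, a finite mixture of normal components could be fit along these lines. This is a sketch under assumptions: it uses the data set FULL from the original post, treats the components as normal, and lets the procedure consider between 1 and 4 components, a range picked arbitrarily for illustration.

```sas
/* Fit normal mixtures to logsalary with 1 to 4 components.          */
/* KMIN=1 and KMAX=4 are assumptions, not values from the thread.    */
proc fmm data=full;
   model logsalary = / dist=normal kmin=1 kmax=4;
run;
```

The estimated component means and variances can then suggest where one subgroup ends and the next begins, which is one way to pick cutpoints without forcing equal widths or equal counts.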