782822
Calcite | Level 5

Hi,

I am trying to cluster a monotonically increasing variable into buckets. As an example, take logSalary from the sashelp.baseball dataset:

DATA baseball;
	SET sashelp.baseball;
	WHERE NOT MISSING(logsalary);
	KEEP logSalary;
RUN;

 

Sort the data from smallest to largest

PROC SORT
	DATA = baseball
	OUT = FULL;
	BY logsalary;
RUN;

Now, use proc cluster to create optimal bins for logsalary

 

PROC CLUSTER 
	DATA = FULL
	OUTTREE = cluster_logsalary
	METHOD = ward
	CCC PSEUDO PRINT = 25; 
	VAR logsalary; 
RUN;

My issue is that the output generated creates bins, but within these bins logsalary is no longer monotonically increasing, which I am making an essential criterion for my binning. Is there a way to bin logsalary with minimised variance while keeping the monotonic characteristics?

6 REPLIES
PaigeMiller
Diamond | Level 26

It's not clear to me why you think this clustering of a single variable is not monotonic. What is the evidence in the output of PROC CLUSTER that says it is not monotonic? What even is the definition of monotonic here?
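One way to gather that evidence is to cut the tree and compare the range of logsalary in each cluster. The following is only a sketch: it reuses the OUTTREE data set cluster_logsalary from the original post, while the choice of 5 clusters and the names tree_out, cluster_stats, min_ls, med_ls and max_ls are assumptions made for illustration.

```sas
/* Cut the Ward tree into a fixed number of clusters.
   NCLUSTERS = 5 is an arbitrary, illustrative choice. */
PROC TREE DATA = cluster_logsalary NCLUSTERS = 5 OUT = tree_out NOPRINT;
	COPY logsalary;   /* carry the clustered variable into the OUT= data set */
RUN;

/* Summarise logsalary within each cluster. */
PROC MEANS DATA = tree_out NWAY NOPRINT;
	CLASS cluster;
	VAR logsalary;
	OUTPUT OUT = cluster_stats MIN = min_ls MEDIAN = med_ls MAX = max_ls;
RUN;

/* Order the clusters by their median so the ranges can be read top to bottom. */
PROC SORT DATA = cluster_stats;
	BY med_ls;
RUN;

PROC PRINT DATA = cluster_stats;
	VAR cluster _FREQ_ min_ls med_ls max_ls;
RUN;
```

If each row's min_ls is above the previous row's max_ls, the clusters are non-overlapping intervals of logsalary and only the cluster numbers are out of order. In that case the sorted row order can serve as the bin index.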

 

In general, clustering is used on multiple X variables, and the clustering algorithms have no criterion such as monotonicity that applies across multiple X variables. Perhaps PROC HPBIN will give you what you want. As far as I know it should, since it operates on each variable individually: if you have the values 10, 20, 30, 40, 50, 60, it should not bin 10 and 60 together (if that's what you mean by monotonic).

--
Paige Miller
782822
Calcite | Level 5

I need the bins for logsalary to remain monotonically increasing i.e. the median value of each bin increases.

 

PROC HPBIN does not work well with poorly distributed data: you get equally spaced bins, with one bin containing 99% of the data.

PaigeMiller
Diamond | Level 26

@782822 wrote:

I need the bins for logsalary to remain monotonically increasing i.e. the median value of each bin increases.


I still have no idea what you see in the PROC CLUSTER output that indicates things are not monotonic. Without further explanation, I consider this statement questionable and, in my opinion, incorrect.

 

PROC HPBIN does not work well with poorly distributed data: you get equally spaced bins, with one bin containing 99% of the data.

 

I think that the QUANTILE method in PROC HPBIN avoids this problem.
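A minimal sketch of quantile binning, assuming PROC HPBIN is available in your installation. The output data set name binned and the choice of 5 bins are illustrative, not taken from the original post.

```sas
/* Quantile binning: bin boundaries are chosen so that each bin holds
   roughly the same number of observations, instead of equal widths. */
PROC HPBIN DATA = baseball OUTPUT = binned NUMBIN = 5 QUANTILE;
	INPUT logSalary;
RUN;

/* Check how many observations landed in each bin.
   HPBIN names the binned variable BIN_ followed by the input variable name. */
PROC FREQ DATA = binned;
	TABLES BIN_logSalary;
RUN;
```

Dropping the QUANTILE option falls back to equal-width (bucket) binning, which is the behaviour described in the quoted complaint.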

--
Paige Miller
782822
Calcite | Level 5

I have also used quantile in the past, but I would like an alternative to having an equal number of observations per bucket.

PaigeMiller
Diamond | Level 26

So, honestly, I can't help because I still don't know what you mean by "monotonic" in this context, and I also don't know what criteria you do want (I know you don't want QUANTILE and you don't want BUCKET, which is not the same as stating what you do want).

--
Paige Miller
PGStats
Opal | Level 21

Seems like you are looking for a way to estimate cutpoints on logsalary that would define homogeneous subsets. If the distribution of logsalary shows multiple modes, you might want to look at PROC FMM to estimate the subcomponents of the logsalary distribution.
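As a rough sketch of that idea, the step below fits normal mixtures with one to four components and writes out the most likely component for each observation. The range 1 to 4 and the names fmm_out and component are assumptions made for illustration.

```sas
/* Fit finite mixtures of normal distributions to logSalary.
   KMIN/KMAX let the procedure compare models with 1 to 4 components. */
PROC FMM DATA = baseball;
	MODEL logSalary = / KMIN = 1 KMAX = 4;
	/* CLASS= stores the component with the highest posterior probability. */
	OUTPUT OUT = fmm_out CLASS = component;
RUN;
```

Mixture components can overlap, so the assigned components are not guaranteed to form contiguous intervals of logsalary. It is worth checking the minimum and maximum of logSalary per component before treating them as bins.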

PG
