SAS Proc cluster or Fastclus with a monotonically increasing continuou...

Posted 07-01-2019 11:04 AM
(860 views)

Hi,

I am trying to cluster a monotonically increasing variable into buckets. As an example, take logsalary from the sashelp.baseball dataset:

```
/* Keep only the non-missing log salaries from the sample data */
DATA baseball;
    SET sashelp.baseball;
    WHERE NOT MISSING(logsalary);
    KEEP logSalary;
RUN;
```

Sort the data from smallest to largest:

```
/* Sort ascending by logsalary into a new dataset, FULL */
proc sort
    data = baseball
    out = FULL;
    by logsalary;
RUN;
```

Now, use PROC CLUSTER to create optimal bins for logsalary:

```
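/* Ward's minimum-variance clustering on the single variable logsalary;
   the cluster history is written to the OUTTREE= dataset */
PROC CLUSTER
    DATA = FULL
    OUTTREE = cluster_logsalary
    METHOD = ward
    CCC PSEUDO PRINT = 25;
    VAR logsalary;
RUN;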
```

My issue is that the output creates bins, but within these bins logsalary is no longer monotonically increasing, which is an essential criterion for my binning. Is there a way to bin logsalary with minimised variance while keeping the monotonic characteristics?

6 REPLIES

It's not clear to me why you think this clustering of a single variable is not monotonic. What is the evidence in the PROC CLUSTER output that says it is not monotonic? What even is the definition of monotonic here?
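One way to look for such evidence is to cut the tree into a fixed number of clusters and compare the range of logsalary in each. The following is only a sketch under assumptions: `NCLUSTERS=5` is an arbitrary illustrative value, and the dataset names `bins` and `bin_ranges` are hypothetical. It reuses the `cluster_logsalary` OUTTREE= dataset from the question.

```
/* Cut the cluster tree into 5 clusters (5 is an arbitrary, illustrative choice)
   and carry logsalary into the output dataset */
proc tree data=cluster_logsalary nclusters=5 out=bins noprint;
    copy logsalary;
run;

/* Summarise each cluster by its smallest, median, and largest logsalary */
proc means data=bins noprint nway;
    class cluster;
    var logsalary;
    output out=bin_ranges min=bin_min median=bin_median max=bin_max;
run;

/* Order the clusters by value instead of by cluster number */
proc sort data=bin_ranges;
    by bin_min;
run;

proc print data=bin_ranges noobs;
    var cluster _freq_ bin_min bin_median bin_max;
run;
```

If each cluster's `bin_max` is below the next cluster's `bin_min` in this listing, the bins are non-overlapping intervals of logsalary, even if the cluster numbers themselves do not follow the value order.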

In general, clustering is used on multiple X variables, and the clustering algorithms have no monotonicity criterion that applies to multiple X variables. Perhaps PROC HPBIN will give you what you want. As far as I know it should, since it operates on variables individually: if you have the values 10 20 30 40 50 60, it should not bin 10 and 60 together (if that's what you mean by monotonic).

--

Paige Miller

I need the bins for logsalary to remain monotonically increasing, i.e. the median value of each bin increases.

PROC HPBIN does not work well with poorly distributed data, so you get equally spaced bins with 1 bin containing 99% of the data.

@782822 wrote:

> I need the bins for logsalary to remain monotonically increasing, i.e. the median value of each bin increases.

I still have no idea what you see in the PROC CLUSTER output that indicates things are not monotonic. Without further explanation, I consider this statement questionable and, in my opinion, incorrect.

> PROC HPBIN does not work well with poorly distributed data, so you get equally spaced bins with 1 bin containing 99% of the data.

I think that the QUANTILE method in PROC HPBIN avoids this problem.
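A minimal sketch of that suggestion, assuming the sorted `FULL` dataset from the question. The output dataset name `binned` and the choice of `NUMBIN=5` are illustrative assumptions, not values from the thread.

```
/* Quantile binning: each bin gets roughly the same number of observations,
   instead of the equal-width bins produced by the BUCKET method */
proc hpbin data=full output=binned numbin=5 quantile;
    input logsalary;
run;
```

Because the cut points are quantiles of logsalary, every bin covers a contiguous range of values, so no single bin should end up holding nearly all of the data.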

--

Paige Miller

So, honestly, I can't help because I still don't know what you mean by "monotonic" in this context, and I also don't know what criteria you do want (I know you don't want QUANTILE and you don't want BUCKET, which is not the same as stating what you do want).

--

Paige Miller

Seems like you are looking for a way to estimate cutpoints on logsalary that would define homogeneous subsets. If the distribution of logsalary shows multiple modes, you might want to look at **proc fmm** to estimate the subcomponents of the logsalary distribution.
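As a rough sketch of that idea, assuming the `FULL` dataset from the question and a two-component normal mixture (`K=2` is an arbitrary illustrative value; the number of components would have to be chosen from the data):

```
/* Fit a finite mixture of normal distributions to logsalary.
   K=2 is an assumption for illustration only. */
proc fmm data=full;
    model logsalary = / dist=normal k=2;
run;
```

The estimated component means and variances describe the subpopulations, and cut points could then be placed where neighbouring components cross.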

PG
