Cluster on equal size for each group

New Contributor
Posts: 4


Hi all,

I want to cluster a population of 100,183 individuals based on their latitude and longitude.

I used the following code to get 10 clusters:

proc fastclus data=NY_BRONX maxclusters=10 list distance out=cluster_NB outseed=out_NB;

mzb_indiv_id;

longitude latitude;

However, the clusters are not of equal size. The frequencies are: 7861, 12832, 13437, 3948, 12543, 11022, 661, 14572, 15901, 7406. Is there any way to force the clusters to be of equal size?

I really appreciate your help!

Thanks,

Jacky

Super User
Posts: 11,104

Re: Cluster on equal size for each group

Cluster techniques attempt to identify groups of records with common characteristics. Does that mean you want to override the commonality to impose a specific size on groups of records?

If you have other variables to include in your cluster criteria, that might make more sense.

Also, if you copy code directly from SAS, double-check it after pasting into this forum, especially if you are using Internet Explorer, because parts of it can disappear.

New Contributor
Posts: 4

Re: Cluster on equal size for each group

Thanks for your response.

Yes, the target is to keep each cluster the same size for analysis purposes.

I am sorry, are you not able to see the SAS code? I can see it from my side.

proc fastclus data=NY_BRONX maxclusters=10 list distance out=cluster_NB outseed=out_NB;

mzb_indiv_id;

longitude latitude;

run;



Super User
Posts: 11,104

Re: Cluster on equal size for each group

I see

mzb_indiv_id;

longitude latitude;

Those are likely to be VAR, ID or BY variables but can't tell which ones are which.

So do you have any rule other than total size of cluster for moving records from one cluster to the next?

You might change to Proc Cluster and get the distances from the centroids as criteria for choosing records to reassign, but that's going to be an iterative, messy approach.

Perhaps increase the number of clusters (a bunch) and see if you get meaningful groups that total closer to your equal groups.

But consider, suppose you have one block with more than 15,000 people. How would you split that up?

What is the reasoning behind having 10 exactly equal groups? You might consider using the clusters you have and sampling disproportionately with Proc Surveyselect to draw same-size samples within each cluster.
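
The Proc Surveyselect suggestion could look roughly like the sketch below. It assumes the OUT= data set from the original PROC FASTCLUS call (cluster_NB), which carries a CLUSTER variable. The output names cluster_sorted and equal_sample and the seed value are made up for illustration, and 661 is used only because it is the size of the smallest cluster reported in the question.

```sas
/* STRATA processing expects the input sorted by the strata variable */
proc sort data=cluster_NB out=cluster_sorted;
  by cluster;
run;

/* Draw a simple random sample of the same size from every cluster.
   661 = size of the smallest cluster, so every stratum can supply it. */
proc surveyselect data=cluster_sorted out=equal_sample
                  method=srs sampsize=661 seed=12345;
  strata cluster;
run;
```

This keeps the geographic clusters as they are and equalizes only the sample drawn from each one, so most records are left out of the analysis set.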

Super User
Posts: 9,856

Re: Cluster on equal size for each group

Why not compute the distances with the GEODIST() function, then run proc fastclus?

Super User
Posts: 11,104

Re: Cluster on equal size for each group

I believe the OP doesn't know what he wants to measure Geodistance from. That would be the centroids of the clusters identified by the first Fastclus.

New Contributor
Posts: 4

Re: Cluster on equal size for each group

Thanks for your response, I have some questions about geodistance():

1. What are the benefits of geodistance()?

2. Does geodistance() guarantee equal population counts in each cluster? Rural and urban areas have different densities of longitude/latitude points, so how can we guarantee the same count in each cluster after using geodistance()?

Thanks!

Super User
Posts: 11,104

Re: Cluster on equal size for each group

Geodistance would allow you to add one or more variables to each record representing the distance from that record's location to any other location given as latitude and longitude. You could use that distance in some assignment routine, such as a clustering procedure, that might allow you to refine the groups. My suggestion to use Proc Cluster and request the actual distances starts from the same idea. A minor difference is that geodistance takes the curvature of the earth into account, which is more noticeable over longer distances. You would still need to come up with rules for shifting members from one cluster to another, but you would have another measure to work with.
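
As a rough sketch of that idea, the SAS function for this is GEODIST (the thread calls it geodistance). The example assumes that longitude and latitude were the VAR variables in the original PROC FASTCLUS call, so the OUTSEED= data set (out_NB) holds one centroid per CLUSTER in those same two columns. The names dist_to_centroid, c_lat, c_lon and geo_km are made up for illustration.

```sas
proc sql;
  /* Attach each record's assigned centroid and the geodetic
     distance to it, in kilometers ('K'); inputs are in degrees */
  create table dist_to_centroid as
  select a.*,
         b.latitude  as c_lat,
         b.longitude as c_lon,
         geodist(a.latitude, a.longitude,
                 b.latitude, b.longitude, 'K') as geo_km
  from cluster_NB as a
       inner join out_NB as b
       on a.cluster = b.cluster;
quit;
```

The DISTANCE variable that PROC FASTCLUS writes to the OUT= data set is a Euclidean distance in raw degree units, so geo_km gives a second measure for the same record-to-centroid pairs.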

New Contributor
Posts: 4

Re: Cluster on equal size for each group

Yes, I agree with your thoughts. My plan is to

1. get cluster centroid by using proc fastclus

2. reassign records so that each cluster ends up with approximately 10,000 members: a cluster with fewer than 10,000 borrows records from its nearest clusters, and a cluster with more than 10,000 sends records to other clusters. But like you said, it will be an iterative, messy approach, because once records move between clusters the centroids change, and records may then need to be reassigned against the new centroids.
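
One simplified way to sketch the reassignment step is a single greedy pass with a size cap per cluster. This is not the iterative borrow-and-send scheme described in the plan, and it does not recompute centroids. It assumes mzb_indiv_id is a unique ID per record, that longitude and latitude were the VAR variables so that out_NB holds the 10 centroids numbered 1 to 10, and it uses made-up names (cap, pairs, balanced, new_cluster, cand_cluster, geo_km).

```sas
/* Cap per cluster: ceil(100183 / 10) = 10019 */
%let cap = 10019;

proc sql;
  /* Every record paired with every centroid, closest pairs first */
  create table pairs as
  select a.mzb_indiv_id,
         b.cluster as cand_cluster,
         geodist(a.latitude, a.longitude,
                 b.latitude, b.longitude, 'K') as geo_km
  from cluster_NB as a, out_NB as b
  order by geo_km;
quit;

data balanced(keep=mzb_indiv_id new_cluster geo_km);
  set pairs;
  array n[10] _temporary_ (10*0);    /* running size of each cluster */
  if _n_ = 1 then do;
    declare hash done();             /* IDs that are already placed  */
    done.defineKey('mzb_indiv_id');
    done.defineDone();
  end;
  /* Place a record the first time it meets a cluster with room left */
  if done.check() ne 0 and n[cand_cluster] < &cap then do;
    n[cand_cluster] = n[cand_cluster] + 1;
    rc = done.add();
    new_cluster = cand_cluster;
    output;
  end;
run;
```

Because 10 clusters of at most 10,019 can hold all 100,183 records, every record ends up placed, but records processed late may land far from their nearest centroid. Rerunning PROC FASTCLUS seeded from the new groups, or repeating the pass with updated centroids, would be the iterative part mentioned in the plan.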

Thanks!
