himeqiu
Calcite | Level 5


Hi all,

I want to cluster a population of 100,183 people based on their latitude and longitude.

I used the following code to get 10 clusters:

proc fastclus data=NY_BRONX maxclusters=10 list distance out=cluster_NB outseed=out_NB;

mzb_indiv_id;

longitude latitude;

But the clusters are not of equal size. The frequencies are: 7861, 12832, 13437, 3948, 12543, 11022, 661, 14572, 15901, 7406. Is there any way to force the clusters to be the same size?

I really appreciate your help!

Thanks,

Jacky

8 REPLIES
ballardw
Super User

Cluster techniques attempt to identify groups of records with common characteristics. Does that mean you want to override the commonality to impose a specific size on groups of records?

If you have other variables to include in your cluster criteria, that might make more sense.

Also, if you copy code directly from SAS, double-check it after pasting into this forum, especially if you are using Internet Explorer, as parts of the code can disappear.

himeqiu
Calcite | Level 5

Thanks for your response.

Yes, the target is to keep each cluster the same size for analysis purposes.

I am sorry, are you not able to see the SAS code? I can see it from my side.

proc fastclus data=NY_BRONX maxclusters=10 list distance out=cluster_NB outseed=out_NB;

mzb_indiv_id;

longitude latitude;

run;



ballardw
Super User

I see

mzb_indiv_id;

longitude latitude;

Those are likely to be VAR, ID or BY variables, but I can't tell which ones are which.
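For readers following along, one plausible reading of the stripped code is sketched below. It assumes (this is not confirmed anywhere in the thread) that mzb_indiv_id was on an ID statement and that longitude and latitude were on a VAR statement.

```sas
/* Assumed reconstruction: the ID and VAR keywords are a guess, */
/* since the pasted code lost its statement names.              */
proc fastclus data=NY_BRONX maxclusters=10 list distance
              out=cluster_NB outseed=out_NB;
   id mzb_indiv_id;          /* assumed: identifier shown by the LIST option */
   var longitude latitude;   /* assumed: the two clustering variables */
run;
```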

So do you have any rule other than total size of cluster for moving records from one cluster to the next?

You might change to Proc Cluster and get the distances from the centroids as criteria for records to consider reassigning but that's going to be an iterative messy approach.

Perhaps increase the number of clusters (a bunch) and see if you get meaningful groups that total closer to your equal groups.

But consider, suppose you have one block with more than 15,000 people. How would you split that up?

What is the reasoning behind having exactly 10 equal groups? You might consider keeping the clusters you have and sampling disproportionately with Proc Surveyselect to draw same-size samples within each cluster.
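A minimal sketch of that sampling idea, assuming cluster_NB is the OUT= data set from Proc Fastclus (so it carries the CLUSTER variable). The sample size of 661 is simply the size of the smallest cluster reported above, which is the largest equal-size sample every cluster can supply without replacement. The output name sample_NB and the seed value are hypothetical.

```sas
/* STRATA processing expects the input sorted by the strata variable */
proc sort data=cluster_NB;
   by cluster;
run;

/* Draw the same number of records from every cluster */
proc surveyselect data=cluster_NB out=sample_NB
                  method=srs      /* simple random sampling without replacement */
                  sampsize=661    /* size of the smallest cluster in the post */
                  seed=12345;     /* arbitrary seed, for reproducibility only */
   strata cluster;
run;
```

This keeps the geographic clusters as they are and equalizes only the samples taken from them, so most records in the larger clusters are left out of the analysis.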

Ksharp
Super User

Why not get the distance with the GEODISTANCE() function, then run proc fastclus?

ballardw
Super User

I believe the OP doesn't know what he wants to measure Geodistance from. That would be the centroids of the clusters identified by the first Fastclus.

himeqiu
Calcite | Level 5

Thanks for your response, I have some questions about geodistance():

1. What are the benefits of geodistance()?

2. Does geodistance() guarantee equal population counts in each cluster? Rural and urban areas have very different densities of longitude/latitude points, so how can we guarantee that every cluster ends up the same size after using geodistance()?

Thanks!

ballardw
Super User

Geodistance would allow you to add one or more variables to each record that represent the distance from a location of interest to any other location given in latitude and longitude. You could use that distance in some assignment routine, such as a clustering procedure, that might allow you to refine the groups. My suggestion to use Proc Cluster and request the actual distances starts from the same idea. A minor difference is that geodistance takes the curvature of the earth into account, which is more noticeable over longer distances. You would still need to come up with rules for shifting members from one cluster to another, but you would have another measure to work with.
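As a rough sketch of adding such distance variables: the Base SAS function for this, as far as I know, is named GEODIST, taking latitude and longitude of the first point, then of the second, plus an options string. The step below assumes out_NB is the OUTSEED= data set from Proc Fastclus, holding one row per cluster with the centroid in longitude and latitude, and that coordinates are in degrees. The names dist_NB, seed_cluster and dist_km are hypothetical.

```sas
/* Distance from every record to every cluster centroid (10 rows per record) */
proc sql;
   create table dist_NB as
   select a.mzb_indiv_id,
          b.cluster as seed_cluster,
          /* 'K' requests kilometers; inputs are taken as degrees by default */
          geodist(a.latitude, a.longitude,
                  b.latitude, b.longitude, 'K') as dist_km
   from NY_BRONX as a, out_NB as b
   order by a.mzb_indiv_id, dist_km;
quit;
```

This only measures distances. It does nothing by itself to balance the cluster sizes.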

himeqiu
Calcite | Level 5

Yes, I agree with your thoughts. My plan is to:

1. Get the cluster centroids using proc fastclus.

2. Reassign records: a cluster with fewer than 10,000 records borrows records from its nearest clusters, and a cluster with more than 10,000 sends records to other clusters, so that each cluster ends up with approximately 10,000 records. But as you said, it will be a messy, iterative approach: once records move between clusters, the cluster centroids shift, and records may need to be reassigned against the new centroids.
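One simple way to approximate the plan above in a single pass is a greedy, capacity-limited assignment. This is a swapped-in heuristic, not something proposed in the thread: every record-to-centroid pair is sorted by distance, and a record goes to the nearest centroid that still has room. The sketch assumes mzb_indiv_id is unique per record, that out_NB is the OUTSEED= data set with CLUSTER numbered 1 to 10, and it uses plain Euclidean distance in degrees. The names pairs, balanced, new_cluster and cap are hypothetical.

```sas
/* All record-to-centroid pairs, closest pairs first */
proc sql;
   create table pairs as
   select a.mzb_indiv_id,
          b.cluster as seed_cluster,
          sqrt((a.longitude - b.longitude)**2
             + (a.latitude  - b.latitude )**2) as dist
   from NY_BRONX as a, out_NB as b
   order by dist;
quit;

data balanced (keep=mzb_indiv_id new_cluster);
   array n_in[10] _temporary_ (10*0);   /* running size of each cluster */
   retain cap 10019;                    /* ceil(100183 / 10) records per cluster */
   if _n_ = 1 then do;
      declare hash done();              /* ids that already have a cluster */
      done.defineKey('mzb_indiv_id');
      done.defineData('mzb_indiv_id');
      done.defineDone();
   end;
   set pairs;
   /* take the pair only if the record is unassigned and the cluster has room */
   if done.check() ne 0 and n_in[seed_cluster] < cap then do;
      new_cluster = seed_cluster;
      n_in[seed_cluster] = n_in[seed_cluster] + 1;
      done.add();
      output;
   end;
run;
```

With a cap of 10,019, every record is placed and the cluster sizes differ by only a few records. The centroids are not recomputed afterwards, so the iterative refinement described in step 2 would still need to be layered on top.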

Thanks!
