About MikeCarter

MikeCarter · ‎03-17-2017

Sounds good!! This is interesting. Thank you very much Rob, I'll implement both versions. -Mike

MikeCarter · ‎03-17-2017

Awesome!! Yeah, I didn't think about putting the secondary objective this way. It's simple and accurate. Thank you Rob, this was a fun exercise. I'm halfway through the cliques approach, and I think I'll implement both approaches. It's great to exchange ideas. Thank you again for revising your code, -Mike

MikeCarter · ‎03-17-2017

Hi Rob- Thank you very much for sharing your idea. It helps to know the options. But the way it works (maximin approach), I guess Charleston ends up in the wrong cluster. With many such instances in a complex network, we might lose out on opportunities to find clusters with maximized weights. My real data currently has about 400-500 locations, but this might go up in future. As an alternative, I was thinking of using PROC OPTNET. So given the nodes-
1) Calculate the edges based on the 70 mile distance. So, now we have a network where nodes are connected only if they are less than 70 miles apart.
2) Then use PROC OPTNET with CLIQUE to find cliques in the above network.
3) Finally, eliminate common nodes programmatically, starting with the highest weight-- until no more common nodes are found.
I'm not sure if this will work, but I will try to implement it. Thanks for your approach Rob. It's an interesting application of OPTMODEL. -Mike
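For readers following along, here is a minimal Python sketch of steps 2 and 3, not the PROC OPTNET implementation discussed in the thread. It assumes step 1 is already done and takes the edge list implied by the thread's five-city example (the two feasible clusters that share Charleston) together with the node weights quoted there. The function names are hypothetical, and "eliminate common nodes starting with highest weight" is read here as: repeatedly keep the heaviest remaining clique and remove its nodes. This greedy reading is not guaranteed to minimize the number of clusters.

```python
# Node weights quoted in the thread's five-city example.
weights = {"Gassaway": 30, "Quinwood": 40, "Charleston": 55, "Kenova": 25, "Crum": 35}

# Step 1 (assumed already done): pairs less than 70 miles apart.
# These edges are the ones implied by the two feasible clusters named in the thread.
edges = [
    ("Gassaway", "Quinwood"), ("Gassaway", "Charleston"), ("Quinwood", "Charleston"),
    ("Charleston", "Kenova"), ("Charleston", "Crum"), ("Kenova", "Crum"),
]


def build_adjacency(nodes, edges):
    """Return {node: set of neighbours} for an undirected edge list."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def maximal_cliques(adj, allowed):
    """Step 2: Bron-Kerbosch enumeration of maximal cliques among `allowed` nodes."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(allowed), set())
    return found


def greedy_clusters(weights, edges):
    """Step 3 (one reading): keep the heaviest clique, drop its nodes, repeat."""
    adj = build_adjacency(weights, edges)
    remaining = set(weights)
    clusters = []
    while remaining:
        cliques = maximal_cliques(adj, remaining)
        # Ties on total weight are broken by sorted member names (arbitrary choice).
        best = max(cliques, key=lambda c: (sum(weights[n] for n in c), sorted(c)))
        clusters.append(sorted(best))
        remaining -= best
    return clusters


clusters = greedy_clusters(weights, edges)
```

On this example the first clique kept is the one containing Gassaway, Quinwood and Charleston, leaving Kenova and Crum as the second cluster. Enumerating all maximal cliques can become expensive on dense networks, so this is only a sketch for small inputs.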

MikeCarter · ‎03-16-2017

Hi Rob- Here is my sample dataset attached (Excel file). Let me know if you can't access the attachment.
1) So, the goal is to cluster these locations in a way- "To minimize the number of clusters, with a maximum intra-cluster distance of 70 miles". Meaning, the distance between any two nodes in a cluster should not exceed 70 miles.
2) Now, there could be some instances where a location may fall in more than one cluster. In such cases, the tie should be broken using the "Node_Weight" column- to maximize the "Total Sum of Node Weights" of a cluster. One such example is designed here with the first 5 cities- Gassaway, Quinwood, Charleston, Kenova and Crum. In this case [Gassaway, Quinwood, Charleston] and [Charleston, Kenova, Crum]-- both are possible clusters following the 70 miles rule, but Charleston is common to both. In such cases, Charleston should be assigned to the first cluster [Gassaway, Quinwood, Charleston] - since the total node weight of this cluster (from the attached Excel sheet) gets maximized [ 30 (Gassaway) + 40 (Quinwood) + 55 (Charleston) = 125 ] instead of the other cluster [Charleston, Kenova, Crum] where the sum would have been [ 55 (Charleston) + 25 (Kenova) + 35 (Crum) = 115 ]. In other words, I am using "Node_Weight" as a greedy metric to break ties.
Since we are using MILP, I thought this could somehow be baked into the objective function. Let me know your thoughts on this. This will be very helpful, thank you very much Rob, -Mike
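The tie-break arithmetic above can be checked with a few lines of Python. This is only a worked example using the weights and the two candidate clusters quoted in the post. The helper name `cluster_weight` is hypothetical, and nothing here is the MILP formulation itself.

```python
# Node weights as quoted in the post.
weights = {"Gassaway": 30, "Quinwood": 40, "Charleston": 55, "Kenova": 25, "Crum": 35}

# The two feasible clusters that both contain Charleston.
candidates = [
    ["Gassaway", "Quinwood", "Charleston"],
    ["Charleston", "Kenova", "Crum"],
]


def cluster_weight(cluster, weights):
    """Total node weight of one candidate cluster."""
    return sum(weights[city] for city in cluster)


totals = [cluster_weight(c, weights) for c in candidates]

# Charleston goes to whichever candidate has the larger total weight.
winner = candidates[totals.index(max(totals))]
```

Here `totals` is `[125, 115]`, so Charleston stays with Gassaway and Quinwood, matching the assignment described in the post.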

MikeCarter · ‎03-15-2017

Okay, that sounds good too. It's just that some internal policies don't allow me to post actual locations. So, let me create a small but good representative dataset and then I will post it here. Will try to upload it by tomorrow or the day after. Thanks for the help Rob, will work on getting the representative data, -Mike

MikeCarter · ‎03-15-2017

Hi Rob- Sure, the MILP procedure sounds closer and more accurate. Do you mind sharing your email id or any other id, so I can share the dataset? -Thank you, -Mike

MikeCarter · ‎03-15-2017

Hi there Rob, Sorry for the late reply, I perhaps missed your answer coming through.
Data: It may not be possible for me to get the actual data here; but we can mock up any lat long locations.
Solution:
1) I agree with your workaround, this could be a possible solution. Here is the current code-

    proc fastclus data=DATASET_NAME out=clust maxclusters=5;
       var Lat Long;
    run;

How should I use the Radius option here with "miles"? So if I want to give a radius of 35 "miles", what format is needed? My Lat and Long are in normal degrees right now.
2) Secondly- I agree that this option is more restrictive in nature. What could be the possible pitfalls of using such an approach? Do you know any other methods of accomplishing the same?
Appreciate your help, -Thank you, Mike
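As background to the miles-versus-degrees question, the Python sketch below shows how a distance in miles relates to coordinates in degrees. It assumes a spherical Earth with a mean radius of 3958.8 miles, and the function names are hypothetical. It says nothing about how the FASTCLUS Radius option interprets its value. It only illustrates why a single number of degrees cannot stand for 35 miles in both directions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius (spherical approximation)


def miles_per_degree(lat_deg):
    """Return (miles per degree of latitude, miles per degree of longitude)."""
    per_lat = math.pi / 180 * EARTH_RADIUS_MILES
    # Meridians converge toward the poles, so a degree of longitude shrinks
    # with the cosine of the latitude.
    per_lon = per_lat * math.cos(math.radians(lat_deg))
    return per_lat, per_lon


def miles_to_degrees(miles, lat_deg):
    """Convert a distance in miles to (degrees of latitude, degrees of longitude)."""
    per_lat, per_lon = miles_per_degree(lat_deg)
    return miles / per_lat, miles / per_lon


# Hypothetical latitude of 38 degrees north, chosen only for illustration.
dlat, dlon = miles_to_degrees(35, 38.0)
```

A degree of latitude is roughly 69 miles everywhere under this approximation, while a degree of longitude is shorter away from the equator, so `dlon` comes out larger than `dlat`. Treating raw Lat and Long as plain Euclidean coordinates therefore distorts east-west distances.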

MikeCarter · ‎02-14-2017

Hi everyone, I often come across a situation where I have a bunch of different addresses (input data in Lat Long) mapped all over the nation. What I need to do is cluster these locations in a way that allows me to specify the "maximum distance between any two points within a cluster". In other words, specify the maximum intra-cluster distance. For example, to cluster all my individual points in a way that -- the maximum distance between any two points within a cluster is 70 miles. I tried to search all the options online, using Proc Fastclus with the Radius option etc., but found nothing that allows me to specify the intra-cluster distance. Please let me know of any ideas you may have, -Thank you
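To make the constraint concrete, here is a small Python sketch that checks whether a proposed cluster respects a maximum intra-cluster distance. It assumes great-circle (haversine) distance on a sphere with a mean radius of 3958.8 miles, and the function names and sample coordinates are hypothetical. It only verifies a cluster and does not build one.

```python
import math
from itertools import combinations


def haversine_miles(lat1, lon1, lat2, lon2, radius=3958.8):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def max_intra_cluster_distance(points):
    """Largest pairwise distance among (lat, lon) points; 0.0 for fewer than two."""
    return max(
        (haversine_miles(*p, *q) for p, q in combinations(points, 2)),
        default=0.0,
    )


def satisfies_limit(points, limit_miles=70.0):
    """True if no two points in the cluster are farther apart than the limit."""
    return max_intra_cluster_distance(points) <= limit_miles


# Hypothetical coordinates, for illustration only.
tight = [(38.0, -81.0), (38.5, -81.0)]
loose = tight + [(39.5, -81.0)]
ok_tight = satisfies_limit(tight)
ok_loose = satisfies_limit(loose)
```

The first pair is about half a degree of latitude apart, well under 70 miles, so it passes. Adding the third point stretches the cluster to about 1.5 degrees of latitude, which exceeds the limit.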

MikeCarter · ‎07-02-2015

Sure, thanks again. I shall extend this to 3 or 4 columns further. This is really helpful. -Mike

MikeCarter · ‎07-02-2015

Thanks Xia. It's a little complicated to understand at first go, but I will study this further. The ability to extract the max subset, for up to 3 or 4 columns, is what's needed. Thanks again for your time, -Mike

MikeCarter · ‎07-01-2015

@PGStats, Okay, I see what you mean now. Got it. This makes sense. Important to note: the hash method doesn't always guarantee the optimal (max) number of rows, although its result can be feasible. Thanks to you & Xia Keshan again for both your ideas, I'll think further in these directions. -Mike

MikeCarter · ‎06-30-2015

Thank you everyone for replying and reading the question. Xia and PGStats cracked it in two different ways, both are superb techniques. Thanks to all again, -Mike

MikeCarter · ‎06-30-2015

Hi Xia- this worked. Thank you for the neat code. I have never used hash in SAS, so I will do a bit more reading on this technique. But thank you so much for your help. This worked for me. Appreciate it, -Mike

MikeCarter · ‎06-30-2015

Hello PGStats- this is great. Yes, I think it worked. I have never used OPTNET, great application. I think I'll do a bit more research on this procedure to use it more effectively. A quick question- is there a way to apply Linear Assignment if there were 3 or 4 columns? Is it like solving a min weight matching problem on tripartite or quadripartite graphs? Not sure if this is even practical. But eventually, I need to extend this technique to 4 columns-- where individual values in those 4 columns should be different. Does that make sense? Basically- the same output is needed, with the same condition, when applied over 3 or 4 columns. -Mike

MikeCarter · ‎06-29-2015

Sure, as I said- the wanted outcome would be a subset of rows where no value is repeated within an individual column.

City 1   City 2
A1       B1
A2       B2

Or

City 1   City 2
A1       B2
A2       B1

Both these outputs are acceptable, because values in the individual columns (City 1, City 2) are not repeated. To extract both of them would be great. Proc SQL "Select Distinct City1, City2" doesn't necessarily guarantee this, because although it may get us distinct rows, the individual column values can still repeat. But that's the subset I need.
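For two columns, the requirement above can be viewed as a maximum bipartite matching: each row is an edge between a City 1 value and a City 2 value, and the wanted subset uses every value at most once. The Python sketch below uses a standard augmenting-path search. It is neither the OPTNET nor the hash solution from this thread, and the function name and four sample rows are hypothetical.

```python
def max_distinct_rows(rows):
    """Largest subset of (city1, city2) rows with no value repeated in either column."""
    by_left = {}
    for left, right in rows:
        by_left.setdefault(left, []).append(right)

    match_of_right = {}  # city2 value -> city1 value currently paired with it

    def try_assign(left, seen):
        # Try to pair `left`, re-routing earlier pairings when needed.
        for right in by_left[left]:
            if right in seen:
                continue
            seen.add(right)
            if right not in match_of_right or try_assign(match_of_right[right], seen):
                match_of_right[right] = left
                return True
        return False

    for left in by_left:
        try_assign(left, set())

    return sorted((left, right) for right, left in match_of_right.items())


# Hypothetical input containing every combination of the values in the post.
rows = [("A1", "B1"), ("A1", "B2"), ("A2", "B1"), ("A2", "B2")]
subset = max_distinct_rows(rows)
```

The result has two rows with no repeated value in either column. Which of the two acceptable outputs is returned depends on the processing order. The sketch returns only one maximum subset rather than all of them, and it covers the two-column case only.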

Date Last Visited	‎03-18-2017 11:26 PM

K Means Cluster with Specified Intra Cluster Distance

Re: PROC SQL or SAS Distinct Clause
