Efficiently Calculating Mean Distance within Cluster

Reply
Contributor
Posts: 64

Efficiently Calculating Mean Distance within Cluster

DATASET1 has 3 columns:  Latitude, Longitude, GroupID.

DATASET2 has 1 column, GroupID (a distinct list of groups).

 

My challenge is this - for each GroupID, compute the average distance from the center to each of the points and add it to the DATASET2 as a column.

 

Since I already have GroupID, and it wasn't created with SAS - I can't be solving my problem w/ any of the clustering procedures in SAS (that I know of) that are efficient in calculating distances.

 

I'm okay with using eucilidean distance since it's simple. I'm mostly concerned with the most efficient approach. 

 

Here is my idea, please feel free to offer other suggestions if there are better ways:

1. Looping macro - For Each ClusterID, Data step to calculate Center_Lat and Center_Lon, store results in table, join to DATASET1

2. DATASET1 now has 2 more columns: Center_Lat and Center_Lon. Simply do another datastep to compute each Dist_To_Center

3. Summarize this distance for each group

 

Thanks in advance for any suggestions

Valued Guide
Posts: 854

Re: Efficiently Calculating Mean Distance within Cluster

can you give a few lines of data as an example?

Contributor
Posts: 64

Re: Efficiently Calculating Mean Distance within Cluster

Certainly,

 

Latitude Longitude GroupID
36.0980903 -109.0386987 Group1
36.0984013 -109.0385617 Group1
36.0989923 -109.0300557 Group1
36.0976343 -109.0301467 Group1
36.0969993 -109.0333207 Group1
36.0993153 -109.0354437 Group1
36.0997053 -109.0357997 Group1
36.0976233 -109.0287737 Group1
36.1013493 -109.0335057 Group2
36.1023103 -109.0330507 Group2
36.1018293 -109.0341797 Group2
36.0861703 -109.0128447 Group2
36.1056053 -109.0212337 Group2
36.1065093 -109.0217417 Group2
36.1065093 -109.0217417 Group2
36.0983103 -109.0084197 Group2
36.1019023 -109.0176657 Group2
36.0975443 -109.0091287 Group2
36.0928063 -109.0243767 Group2
36.0990893 -109.0535547 Group2
36.0980763 -109.0530487 Group2
36.0962203 -109.0111137 Group2
36.1026333 -109.0195427 Group2
36.1030713 -109.0249407 Group3
36.0900623 -109.0227547 Group3
36.1044493 -109.0112987 Group3
36.1093803 -109.0201137 Group3
36.1085193 -109.0194697 Group3
36.1093893 -109.0194947 Group3
36.0871543 -109.0235497 Group3
36.0880483 -109.0244707 Group3
36.1016473 -109.0191277 Group3
36.0939833 -109.0551347 Group3
36.0936903 -109.0535577 Group3
36.0988083 -109.0595137 Group3
36.0920383 -109.0488857 Group3
36.0916593 -109.0474277 Group3
36.0921533 -109.0462577 Group3
36.1099053 -109.0181187 Group3
36.0988563 -109.0462907 Group3

 

 DATASET2

GroupID
Group1
Group2
Group3
Grand Advisor
Posts: 10,210

Re: Efficiently Calculating Mean Distance within Cluster

Try this.

The data have is to have something to calculate with. Use 'DK' if you want the distance in kilometers instead of the miles this generates.

 

%annomac;

data have;
input Latitude Longitude GroupID  $;
datalines;
36.0980903 -109.0386987 Group1 
36.0984013 -109.0385617 Group1 
36.0989923 -109.0300557 Group1 
36.0976343 -109.0301467 Group1 
36.0969993 -109.0333207 Group1 
36.0993153 -109.0354437 Group1 
36.0997053 -109.0357997 Group1 
36.0976233 -109.0287737 Group1 
36.1013493 -109.0335057 Group2 
36.1023103 -109.0330507 Group2 
36.1018293 -109.0341797 Group2 
36.0861703 -109.0128447 Group2 
36.1056053 -109.0212337 Group2 
36.1065093 -109.0217417 Group2 
36.1065093 -109.0217417 Group2 
36.0983103 -109.0084197 Group2 
36.1019023 -109.0176657 Group2 
36.0975443 -109.0091287 Group2 
36.0928063 -109.0243767 Group2 
36.0990893 -109.0535547 Group2 
36.0980763 -109.0530487 Group2 
36.0962203 -109.0111137 Group2 
36.1026333 -109.0195427 Group2 
36.1030713 -109.0249407 Group3 
36.0900623 -109.0227547 Group3 
36.1044493 -109.0112987 Group3 
36.1093803 -109.0201137 Group3 
36.1085193 -109.0194697 Group3 
36.1093893 -109.0194947 Group3 
36.0871543 -109.0235497 Group3 
36.0880483 -109.0244707 Group3 
36.1016473 -109.0191277 Group3 
36.0939833 -109.0551347 Group3 
36.0936903 -109.0535577 Group3 
36.0988083 -109.0595137 Group3 
36.0920383 -109.0488857 Group3 
36.0916593 -109.0474277 Group3 
36.0921533 -109.0462577 Group3 
36.1099053 -109.0181187 Group3 
36.0988563 -109.0462907 Group3 
;
run;

%centroid(%str(have(rename=(latitude=y longitude=x))),center,Groupid);

proc sql;
   create table want as
   select a.groupid,a.x as center_long, a.y as center_lat,  
          b.longitude, b.latitude, geodist(a.y,a.x,b.latitude,b.longitude,'DM') as distance
          from center as a join have as b on a.groupid=b.groupid;
quit;
Respected Advisor
Posts: 4,606

Re: Efficiently Calculating Mean Distance within Cluster

With lots of data, it might be better to keep it simple:

 

/* Assume dataset HAVE is sorted by GroupID */
proc means data=have noprint;
by GroupID;
var Latitude Longitude;
output out=center(drop=_: ) mean= / autoname;
run;

data want;
sumDist = 0;
nDist = 0;
do until (last.GroupID);
    merge have center; by GroupID;
    sumDist + geodist(Latitude, Longitude, Latitude_Mean, Longitude_Mean);
    nDist + 1;
    end;
Distance_Km_Mean = sumDist/nDist;
keep GroupID Distance_Km_Mean;
run;
PG
Contributor
Posts: 23

Re: Efficiently Calculating Mean Distance within Cluster

To turn it a bit around:

 

What if the groups were not yet defined but depended on the distance to the each other ?

 

Fx, taking the same indata, if we had another variable, x, I would like to calculate the cluster average of variable x based on all observations no further than 250 meters away. This would mean a dynamic and possibly different cluster group for every observation.

How would this be done?

Thanks Smiley Happy

 

data have1;		
infile cards;		
input Latitude     Longitude     x;		
cards;		
36.0980903	-109.0386987	1000
36.0984013	-109.0385617	1500
36.0989923	-109.0300557	2000
36.0976343	-109.0301467	750
36.0969993	-109.0333207	850
36.0993153	-109.0354437	900
36.0997053	-109.0357997	347
36.0976233	-109.0287737	650
36.1013493	-109.0335057	1100
36.1023103	-109.0330507	1200
36.1018293	-109.0341797	952
36.0861703	-109.0128447	652
36.1056053	-109.0212337	311
36.1065093	-109.0217417	1000
36.1065093	-109.0217417	1500
36.0983103	-109.0084197	2000
36.1019023	-109.0176657	750
36.0975443	-109.0091287	850
36.0928063	-109.0243767	900
36.0990893	-109.0535547	347
36.0980763	-109.0530487	650
36.0962203	-109.0111137	1100
36.1026333	-109.0195427	1200
36.1030713	-109.0249407	952
36.0900623	-109.0227547	652
36.1044493	-109.0112987	311
36.1093803	-109.0201137	1000
36.1085193	-109.0194697	1500
36.1093893	-109.0194947	2000
36.0871543	-109.0235497	750
36.0880483	-109.0244707	850
36.1016473	-109.0191277	900
36.0939833	-109.0551347	347
36.0936903	-109.0535577	650
36.0988083	-109.0595137	1100
36.0920383	-109.0488857	1200
36.0916593	-109.0474277	952
36.0921533	-109.0462577	652
36.1099053	-109.0181187	311
36.0988563	-109.0462907	5000
;		
run;		
Respected Advisor
Posts: 4,606

Re: Efficiently Calculating Mean Distance within Cluster

Different problem. Please post as a new question to get more attention.

PG
Grand Advisor
Posts: 17,325

Re: Efficiently Calculating Mean Distance within Cluster

How big is your dataset?

What is the point of dataset2, it doesn't seem to provide any information. 

Valued Guide
Posts: 854

Re: Efficiently Calculating Mean Distance within Cluster

I'm not sure if this is the correct calculation.  Take a look at the solution and see if this works for you.

 



data have1;
infile cards;
input Latitude     Longitude     GroupID$;
cards;
36.0980903     -109.0386987     Group1
36.0984013     -109.0385617     Group1
36.0989923     -109.0300557     Group1
36.0976343     -109.0301467     Group1
36.0969993     -109.0333207     Group1
36.0993153     -109.0354437     Group1
36.0997053     -109.0357997     Group1
36.0976233     -109.0287737     Group1
36.1013493     -109.0335057     Group2
36.1023103     -109.0330507     Group2
36.1018293     -109.0341797     Group2
36.0861703     -109.0128447     Group2
36.1056053     -109.0212337     Group2
36.1065093     -109.0217417     Group2
36.1065093     -109.0217417     Group2
36.0983103     -109.0084197     Group2
36.1019023     -109.0176657     Group2
36.0975443     -109.0091287     Group2
36.0928063     -109.0243767     Group2
36.0990893     -109.0535547     Group2
36.0980763     -109.0530487     Group2
36.0962203     -109.0111137     Group2
36.1026333     -109.0195427     Group2
36.1030713     -109.0249407     Group3
36.0900623     -109.0227547     Group3
36.1044493     -109.0112987     Group3
36.1093803     -109.0201137     Group3
36.1085193     -109.0194697     Group3
36.1093893     -109.0194947     Group3
36.0871543     -109.0235497     Group3
36.0880483     -109.0244707     Group3
36.1016473     -109.0191277     Group3
36.0939833     -109.0551347     Group3
36.0936903     -109.0535577     Group3
36.0988083     -109.0595137     Group3
36.0920383     -109.0488857     Group3
36.0916593     -109.0474277     Group3
36.0921533     -109.0462577     Group3
36.1099053     -109.0181187     Group3
36.0988563     -109.0462907     Group3
;

proc sql;
create table mid as
select distinct groupid,mean(diff_lat) as avg_lat,mean(diff_lon) as avg_lon
from(
select *,center_lat - latitude as diff_lat,center_lon-longitude as diff_lon
from(
select groupid,latitude,longitude, median(max_lat,min_lat) as center_lat, median(max_lon,min_lon) as center_lon
from(
select *,min(latitude) as min_lat,max(latitude) as max_lat,min(longitude) as min_lon,max(longitude) as max_lon
from have1
group by groupid)))
group by groupid;

Ask a Question
Discussion stats
  • 8 replies
  • 362 views
  • 1 like
  • 6 in conversation