I was able to achieve the desired results using a very ugly bit of code that calculates the pairwise distances between accounts and then uses proc modeclus. I am using the "clustering with uniform-kernel density estimates" section from this page as a reference:
https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_modeclus_sect020.htm
I am interested in clusters that are not connected also though since we often receive multiple applications from the same address so these would need to be included in the frequency analysis. Here is the code that I used to create what I wanted:
data have; format latitude 8.5; format longitude 8.5; input Application_id $10. Latitude Longitude; datalines; 1000000999 -75.0000 50.0000 1000000888 -90.0000 30.0000 1000000777 -75.0000 50.0500 1000000666 -75.0000 50.0400 1000000555 -80.0000 35.0000 1000000444 -75.0300 50.0000 ;
/*Transfer data to macro arrays*/ proc sql noprint; select application_id, latitude, longitude into :application_ids separated by ' dist_', :latitudes separated by ' ', :longitudes separated by ' ' from have; ; %let n=&sqlobs ;quit;
/*Fix first observation*/
%let application_ids=%sysfunc(cat(dist_,&application_ids));
/*Calculate pairwise distances between coordinates. Similar to first table in example above.*/
%macro Distance; data distances (type=distance); array dist_vars {&n} &application_ids; %let s=1; %do i=1 %to &n; id_variable=scan("&application_ids.",&i.,' '); %do j=1 %to &s; %if &j.=&s. %then dist_vars{&j}=0;; %if &j ne &s %then %do; %let lat1=%scan(&latitudes,&i,' '); %let lat2=%scan(&latitudes,&j,' '); %let long1=%scan(&longitudes,&i,' '); %let long2=%scan(&longitudes,&j,' '); %put Lat1: &lat1 Long1: &long1 Lat2: &lat2 Long2: &long2; dist_vars{&j}=geodist(input(compress("&lat1"),12.7),input(compress("&long1"),12.7), input(compress("&lat2"),12.7),input(compress("&long2"),12.7),'M'); put dist_vars{&j}; %end; %end; %let j=1; %let s=%eval(&s+1); output; %end; run; %mend;
%Distance;
/*Fill in other half of distance matrix*/
proc transpose out=tran; copy id_variable; run;
data mileages(type=distance); merge distances tran; array var{&n} &application_ids; array col[*] col1-col&n; do i=1 to &n; var[i]=sum(var[i],col[i]); end; drop col1-col&n _name_ i; run;
/*Calculate clusters*/
proc modeclus data=mileages out=clusters all m=1 r=2; id id_variable; run;
/*Count cluster frequencies*/
proc sql; create table cluster_counts as select cluster, count(id_variable) as cnt from clusters group by cluster ;quit;
This gets me what I want but obviously the business of transferring everything to a macro array won't work for larger datasets since the character limit for macro variables is 65,534. Also, I still have to calculate pairwise distances which requires n^2/2 iterations for any calculation I have to make (e.g. geodist()). I wasn't sure if there is a better solution out there. Also, I do not have SAS/OR so I am not sure if this is the proper thread to be posting to. I can move it elsewhere if necessary.
Thank you for all of your time and effort Rob!
Daniel
... View more