Re: Address Clustering using Geocodes

ddemilla · Posted 09-24-2015 02:21 PM

I am currently working on building a fraud detection algorithm that will track the historical number of loan applications we receive from any given local area (say 2-3 mile radius). This will then allow us to set a threshold value to filter out extreme values and detect potential fraud. I already have coordinate values for each of the street addresses and can calculate the distance between any pair of addresses by using the "geodist" function. The most straighforward way of solving this problem is to find the distance from every property to every other property and determine if each pair are within 2 miles of each other. Then I would assign a unique cluster to every network of connected addresses. Addresses that are not within a mile of any other house will receive a cluster of their own. The problem with this approach is that it is combinatorial and I don't want to have to churn through calculations for each pair. What I need is a clustering algorithm (similar to k-means) perhaps, that allows me to fix the radius of the cluster (distance from centroid to nearest neighbors). I have worked with "proc means" and I have brainstormed ideas about how to loop over the proc for different cluster numbers until the minimum threshold radius is reached but I'm not sure how best to implement this. Any ideas about a better approach or a straightforward implementation of k-means would be appreciated.

Thank you for your time and help!

Daniel

RobPratt · Posted 09-25-2015 04:26 PM

How many addresses do you have?

Can you share the data?

Have you considered using PROC FASTCLUS with the RADIUS= option?

ddemilla · Posted 09-28-2015 04:03 PM

This is a great suggestion Rob. However, the "RADIUS=" option only sets a minimum threshold for selecting new seeds. This accurately sets the minimum cluster radius but doesn't limit the maximum radius. So I may be able to have minimum clusters of a 2 mile radius, but some of them may still be the entire state of Florida for example. I need an option to fix the radius (set both minimum and maximum thresholds) at 2 miles. That way, I can analyze frequencies for local areas and determine whether we are seeing an abnormal number of applications from that specific 2 mile radius.

Thank you,

Daniel

RobPratt · Posted 09-28-2015 05:39 PM

OK. How many addresses, and can you share any data?

ddemilla · Posted 09-29-2015 10:52 AM

The current set only has 58, but the end goal is to run the algorithm on our entire portfolio of about 70,000 accounts. I can't share actual data since it includes household information but the data look like this:

Have:

Application_id Latitude Longitude

1000009999 -75.0000 50.0000

1000000888 -100.0000 30.0000

1000000777 -75.0000 50.0500

1000000666 -75.0000 50.0400

1000000555 -80.0000 35.0000

1000000444 -75.0300 50.0000

Want:

Application_id Latitude Longitude Cluster

1000009999 -75.0000 50.0000 1

1000000888 -100.0000 30.0000 2

1000000777 -75.0000 50.0500 1

1000000666 -75.0000 50.0400 1

1000000555 -80.0000 35.0000 3

1000000444 -75.0300 50.0000 1

So then I can count the number of observations in each cluster:

proc sql;

create table cluster_counts as

select

cluster,

count(application_id) as cluster_count

from want

group by

cluster

;quit;

Cluster cluster_count

1 4

2 1

3 1

Like I said, the easiest way to actually calculate the distance between every pair of coordinates. The algorithm would work by determining if an observation is within 2 miles of any other observation, if it isn't then create a new cluster, otherwise determine if the other observation has already been assigned to a cluster, if it has then assign the current cluster, if the matching observation has not been assigned then assign a new one. This way clusters are "chained" so node1 and node3 might actually be 3 miles apart, but they are both within 1.5 miles to node2 so they will be considered part of the same cluster.

The problem with this algorithm is that for each observation, I have to search the entire dataset, calculating distance between nodes, determining whether a cluster has already been assigned, and assign one if it hasn't... then move on to the next row and repeat. This is very computationally expensive and although I haven't written the algorithm, I am assuming that it is intractable for larger datasets (>1000 obs). I could be wrong of course and maybe it would be a good idea to mock up the clunky algorithm and then compare it any alternatives...

RobPratt · Posted 09-29-2015 11:30 PM

What you are describing are the connected components of a graph with a node for each account and an edge between two nodes if their GEODIST is at most 2. You can calculate connected components by using PROC OPTNET (or the network solver in PROC OPTMODEL) in SAS/OR. The algorithm is O(m), where m is the number of edges. Note that Latitude must be between -90 and 90 for GEODIST, so I changed the -100 to -90.

GEODIST:

http://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#n1korpfg2e1...

connected components in PROC OPTNET:

http://support.sas.com/documentation/cdl/en/ornoaug/68159/HTML/default/viewer.htm#ornoaug_optnet_det...

connected components in PROC OPTMODEL:

http://support.sas.com/documentation/cdl/en/ormpug/68156/HTML/default/viewer.htm#ormpug_networksolve...

data have;
   input Application_id Latitude Longitude;
   datalines;
1000009999 -75.0000 50.0000
1000000888 -90.0000 30.0000
1000000777 -75.0000 50.0500
1000000666 -75.0000 50.0400
1000000555 -80.0000 35.0000
1000000444 -75.0300 50.0000
;

proc sql;
   create table edges as
   select 
      a.Application_id as from, 
      b.Application_id as to,
      geodist(a.Latitude,a.Longitude,b.Latitude,b.Longitude,'M') as weight
   from have as a, have as b
   where a.Application_id < b.Application_id
      and calculated weight <= 2;
quit;

proc optnet data_nodes=have data_links=edges out_nodes=out_nodes;
   data_nodes_var node=Application_id;
   concomp;
run;

ddemilla · Posted 09-30-2015 02:42 PM

I was able to achieve the desired results using a very ugly bit of code that calculates the pairwise distances between accounts and then uses proc modeclus. I am using the "clustering with uniform-kernel density estimates" section from this page as a reference:

https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_modeclus_se...

I am interested in clusters that are not connected also though since we often receive multiple applications from the same address so these would need to be included in the frequency analysis. Here is the code that I used to create what I wanted:

data have;
format latitude 8.5;
format longitude 8.5;
input Application_id $10. Latitude Longitude;
datalines;
1000000999 -75.0000 50.0000
1000000888 -90.0000 30.0000
1000000777 -75.0000 50.0500
1000000666 -75.0000 50.0400
1000000555 -80.0000 35.0000
1000000444 -75.0300 50.0000
;

/*Transfer data to macro arrays*/
proc sql noprint;
select application_id, latitude, longitude
into :application_ids separated by ' dist_', :latitudes separated by ' ', :longitudes separated by ' '
from have;
;
%let n=&sqlobs
;quit;

/*Fix first observation*/

%let application_ids=%sysfunc(cat(dist_,&application_ids));

/*Calculate pairwise distances between coordinates. Similar to first table in example above.*/

%macro Distance;
data distances (type=distance);
array dist_vars {&n} &application_ids;
%let s=1;
%do i=1 %to &n;
id_variable=scan("&application_ids.",&i.,' ');
%do j=1 %to &s;
%if &j.=&s. %then dist_vars{&j}=0;;
%if &j ne &s %then %do;
%let lat1=%scan(&latitudes,&i,' ');
%let lat2=%scan(&latitudes,&j,' ');
%let long1=%scan(&longitudes,&i,' ');
%let long2=%scan(&longitudes,&j,' ');
%put Lat1: &lat1 Long1: &long1 Lat2: &lat2 Long2: &long2;
dist_vars{&j}=geodist(input(compress("&lat1"),12.7),input(compress("&long1"),12.7),
input(compress("&lat2"),12.7),input(compress("&long2"),12.7),'M');
put dist_vars{&j};
%end;
%end;
%let j=1;
%let s=%eval(&s+1);
output;
%end;
run;
%mend;

%Distance;

/*Fill in other half of distance matrix*/

proc transpose out=tran;
copy id_variable;
run;

data mileages(type=distance);
merge distances tran;
array var{&n} &application_ids;
array col[*] col1-col&n;
do i=1 to &n;
var[i]=sum(var[i],col[i]);
end;
drop col1-col&n _name_ i;
run;

/*Calculate clusters*/

proc modeclus data=mileages out=clusters all m=1 r=2;
id id_variable;
run;

/*Count cluster frequencies*/

proc sql;
create table cluster_counts as
select
cluster,
count(id_variable) as cnt
from clusters
group by
cluster
;quit;

This gets me what I want but obviously the business of transferring everything to a macro array won't work for larger datasets since the character limit for macro variables is 65,534. Also, I still have to calculate pairwise distances which requires n^2/2 iterations for any calculation I have to make (e.g. geodist()). I wasn't sure if there is a better solution out there. Also, I do not have SAS/OR so I am not sure if this is the proper thread to be posting to. I can move it elsewhere if necessary.

Thank you for all of your time and effort Rob!

Daniel

Address Clustering using Geocodes