BookmarkSubscribeRSS Feed
ddemilla
Fluorite | Level 6

I am currently working on building a fraud detection algorithm that will track the historical number of loan applications we receive from any given local area (say 2-3 mile radius). This will then allow us to set a threshold value to filter out extreme values and detect potential fraud. I already have coordinate values for each of the street addresses and can calculate the distance between any pair of addresses by using the "geodist" function. The most straighforward way of solving this problem is to find the distance from every property to every other property and determine if each pair are within 2 miles of each other. Then I would assign a unique cluster to every network of connected addresses. Addresses that are not within a mile of any other house will receive a cluster of their own. The problem with this approach is that it is combinatorial and I don't want to have to churn through calculations for each pair. What I need is a clustering algorithm (similar to k-means) perhaps, that allows me to fix the radius of the cluster (distance from centroid to nearest neighbors). I have worked with "proc means" and I have brainstormed ideas about how to loop over the proc for different cluster numbers until the minimum threshold radius is reached but I'm not sure how best to implement this. Any ideas about a better approach or a straightforward implementation of k-means would be appreciated.

 

Thank you for your time and help!

Daniel

6 REPLIES 6
RobPratt
SAS Super FREQ

How many addresses do you have?

 

Can you share the data?

 

Have you considered using PROC FASTCLUS with the RADIUS= option?

ddemilla
Fluorite | Level 6

This is a great suggestion Rob. However, the "RADIUS=" option only sets a minimum threshold for selecting new seeds. This accurately sets the minimum cluster radius but doesn't limit the maximum radius. So I may be able to have minimum clusters of a 2 mile radius, but some of them may still be the entire state of Florida for example. I need an option to fix the radius (set both minimum and maximum thresholds) at 2 miles. That way, I can analyze frequencies for local areas and determine whether we are seeing an abnormal number of applications from that specific 2 mile radius. 

 

Thank you,

Daniel

RobPratt
SAS Super FREQ
OK. How many addresses, and can you share any data?
ddemilla
Fluorite | Level 6

The current set only has 58, but the end goal is to run the algorithm on our entire portfolio of about 70,000 accounts. I can't share actual data since it includes household information but the data look like this:

 

Have:

Application_id    Latitude    Longitude

1000009999         -75.0000              50.0000

1000000888         -100.0000            30.0000

1000000777         -75.0000              50.0500

1000000666         -75.0000              50.0400

1000000555         -80.0000             35.0000

1000000444         -75.0300              50.0000

 

Want:

Application_id    Latitude    Longitude                Cluster   

1000009999         -75.0000              50.0000                1

1000000888         -100.0000            30.0000                2  

1000000777         -75.0000              50.0500                1

1000000666         -75.0000              50.0400                1

1000000555         -80.0000             35.0000                 3 

1000000444         -75.0300              50.0000                1

 

So then I can count the number of observations in each cluster:

proc sql;

create table cluster_counts as

select

cluster,

count(application_id) as cluster_count

from want

group by

cluster

;quit;

 

Cluster          cluster_count

1                              4

2                              1

3                              1

 

Like I said, the easiest way to actually calculate the distance between every pair of coordinates. The algorithm would work by determining if an observation is within 2 miles of any other observation, if it isn't then create a new cluster, otherwise determine if the other observation has already been assigned to a cluster, if it has then assign the current cluster, if the matching observation has not been assigned then assign a new one. This way clusters are "chained" so node1 and node3 might actually be 3 miles apart, but they are both within 1.5 miles to node2 so they will be considered part of the same cluster.

 

The problem with this algorithm is that for each observation, I have to search the entire dataset, calculating distance between nodes, determining whether a cluster has already been assigned, and assign one if it hasn't... then move on to the next row and repeat. This is very computationally expensive and although I haven't written the algorithm, I am assuming that it is intractable for larger datasets (>1000 obs). I could be wrong of course and maybe it would be a good idea to mock up the clunky algorithm and then compare it any alternatives...

RobPratt
SAS Super FREQ

What you are describing are the connected components of a graph with a node for each account and an edge between two nodes if their GEODIST is at most 2.  You can calculate connected components by using PROC OPTNET (or the network solver in PROC OPTMODEL) in SAS/OR.  The algorithm is O(m), where m is the number of edges.  Note that Latitude must be between -90 and 90 for GEODIST, so I changed the -100 to -90.

 

GEODIST:

http://support.sas.com/documentation/cdl/en/lefunctionsref/67960/HTML/default/viewer.htm#n1korpfg2e1...

 

connected components in PROC OPTNET:

http://support.sas.com/documentation/cdl/en/ornoaug/68159/HTML/default/viewer.htm#ornoaug_optnet_det...

 

connected components in PROC OPTMODEL:

http://support.sas.com/documentation/cdl/en/ormpug/68156/HTML/default/viewer.htm#ormpug_networksolve...

 

data have;
   input Application_id Latitude Longitude;
   datalines;
1000009999 -75.0000 50.0000
1000000888 -90.0000 30.0000
1000000777 -75.0000 50.0500
1000000666 -75.0000 50.0400
1000000555 -80.0000 35.0000
1000000444 -75.0300 50.0000
;

proc sql;
   create table edges as
   select 
      a.Application_id as from, 
      b.Application_id as to,
      geodist(a.Latitude,a.Longitude,b.Latitude,b.Longitude,'M') as weight
   from have as a, have as b
   where a.Application_id < b.Application_id
      and calculated weight <= 2;
quit;

proc optnet data_nodes=have data_links=edges out_nodes=out_nodes;
   data_nodes_var node=Application_id;
   concomp;
run;
ddemilla
Fluorite | Level 6

I was able to achieve the desired results using a very ugly bit of code that calculates the pairwise distances between accounts and then uses proc modeclus. I am using the "clustering with uniform-kernel density estimates" section from this page as a reference:

 

https://support.sas.com/documentation/cdl/en/statug/63347/HTML/default/viewer.htm#statug_modeclus_se...

 

I am interested in clusters that are not connected also though since we often receive multiple applications from the same address so these would need to be included in the frequency analysis. Here is the code that I used to create what I wanted:

 

data have;
format latitude 8.5;
format longitude 8.5;
input Application_id $10. Latitude Longitude;
datalines;
1000000999 -75.0000 50.0000
1000000888 -90.0000 30.0000
1000000777 -75.0000 50.0500
1000000666 -75.0000 50.0400
1000000555 -80.0000 35.0000
1000000444 -75.0300 50.0000
;

 

/*Transfer data to macro arrays*/
proc sql noprint;
select application_id, latitude, longitude
into :application_ids separated by ' dist_', :latitudes separated by ' ', :longitudes separated by ' '
from have;
;
%let n=&sqlobs
;quit;

 

/*Fix first observation*/

%let application_ids=%sysfunc(cat(dist_,&application_ids));

 

/*Calculate pairwise distances between coordinates. Similar to first table in example above.*/

%macro Distance;
data distances (type=distance);
array dist_vars {&n} &application_ids;
%let s=1;
%do i=1 %to &n;
id_variable=scan("&application_ids.",&i.,' ');
%do j=1 %to &s;
%if &j.=&s. %then dist_vars{&j}=0;;
%if &j ne &s %then %do;
%let lat1=%scan(&latitudes,&i,' ');
%let lat2=%scan(&latitudes,&j,' ');
%let long1=%scan(&longitudes,&i,' ');
%let long2=%scan(&longitudes,&j,' ');
%put Lat1: &lat1 Long1: &long1 Lat2: &lat2 Long2: &long2;
dist_vars{&j}=geodist(input(compress("&lat1"),12.7),input(compress("&long1"),12.7),
input(compress("&lat2"),12.7),input(compress("&long2"),12.7),'M');
put dist_vars{&j};
%end;
%end;
%let j=1;
%let s=%eval(&s+1);
output;
%end;
run;
%mend;

%Distance;

 

/*Fill in other half of distance matrix*/

proc transpose out=tran;
copy id_variable;
run;

 

data mileages(type=distance);
merge distances tran;
array var{&n} &application_ids;
array col[*] col1-col&n;
do i=1 to &n;
var[i]=sum(var[i],col[i]);
end;
drop col1-col&n _name_ i;
run;

 

/*Calculate clusters*/

proc modeclus data=mileages out=clusters all m=1 r=2;
id id_variable;
run;

 

/*Count cluster frequencies*/

proc sql;
create table cluster_counts as
select
cluster,
count(id_variable) as cnt
from clusters
group by
cluster
;quit;

 

This gets me what I want but obviously the business of transferring everything to a macro array won't work for larger datasets since the character limit for macro variables is 65,534. Also, I still have to calculate pairwise distances which requires n^2/2 iterations for any calculation I have to make (e.g. geodist()). I wasn't sure if there is a better solution out there. Also, I do not have SAS/OR so I am not sure if this is the proper thread to be posting to. I can move it elsewhere if necessary.

 

Thank you for all of your time and effort Rob!

 

Daniel 

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

Multiple Linear Regression in SAS

Learn how to run multiple linear regression models with and without interactions, presented by SAS user Alex Chaplin.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 5364 views
  • 2 likes
  • 2 in conversation