New Contributor
Posts: 2

# Efficient minimum distance

Hi,

I have two datasets, one of users and one of agents, both datasets have longitudes and latitudes. I want to find the closest agent to each user. I know I could just merge the two datasets together, find the distance between every user and each agent and select the shortest from each list but this seems like a very inefficient method to me, I have tried it with a sample from my datasets (100 agents and 100k users) but this takes a long time and a lot of processing power so I don't really want to do it with full dataset.

To merge the data I was using proc SQL and  I am not doing any sort of fancy distance calculation, just the Euclidian distance.

Is there anything I can do to pre process my data to speed the process up. Can anybody help me with this? I have put and example of what my datasets might look like below.

User_IDUser_XUser_Y
U50032587974156
U60429785296354
U97234568275391
AgentIDAgent_xAgent_y
A10015363232133
A10023566512345
A10036345385464
Super Contributor
Posts: 578

## Re: Efficient minimum distance

I think there is no way around generating a cartesian product of each of the sets.  We sometimes put a maximum distance so we don't have to handle extraneous values, but you run the risk of having some records not matched.

proc sql;

create table tmp as

select

t1.user_id,

t2.agentid,

geodist(t1.user_x, t1.user_y, t2.agent_x, t2.agent_y) as distance

from

users t1, agents t2

where calculated distance < 25;

quit;

proc sort in=tmp out=tmp;

by user_id desending distance;

run;

proc sort in=tmp out=want nodupkey;

by user_id;

run;

Posts: 3,167

## Re: Efficient minimum distance

Concurred with , No silver bullet. Brute force of Cartesian product is the only force you can rely on according to your description.  However, in term of performance, many have found that Hash() may have an edge over Proc SQL.

In the following code, smaller table ( presumably Agent) is loaded into Hash(), and you are using Geodist() to get the distance. I have kept all the intermediary variables for your reference, drop or keep them as you will. Try it and let us know if you find the performance is manageable.

data want;

if _n_=1 then do;

if 0 then set agent;

declare hash agt(dataset:'agent');

agt.definekey(all:'y');

agt.definedone();

declare hiter hagt('agt');

end;

set user;

distance=constant('big');

do rc=hagt.first() by 0 while (rc=0);

_dis=geodist(User_X, User_Y, Agent_x, Agent_y);

if distance >=_dis then do;

distance=_dis;

Min_agent=agentid;

Min_x=Agent_x;

Min_y=Agent_y;

end;

rc=hagt.next();

end;

run;

Haikuo

New Contributor
Posts: 2