04-16-2013 04:58 AM
I have two datasets, one of users and one of agents, both datasets have longitudes and latitudes. I want to find the closest agent to each user. I know I could just merge the two datasets together, find the distance between every user and each agent and select the shortest from each list but this seems like a very inefficient method to me, I have tried it with a sample from my datasets (100 agents and 100k users) but this takes a long time and a lot of processing power so I don't really want to do it with full dataset.
To merge the data I was using proc SQL and I am not doing any sort of fancy distance calculation, just the Euclidian distance.
Is there anything I can do to pre process my data to speed the process up. Can anybody help me with this? I have put and example of what my datasets might look like below.
04-16-2013 08:01 AM
I think there is no way around generating a cartesian product of each of the sets. We sometimes put a maximum distance so we don't have to handle extraneous values, but you run the risk of having some records not matched.
create table tmp as
geodist(t1.user_x, t1.user_y, t2.agent_x, t2.agent_y) as distance
users t1, agents t2
where calculated distance < 25;
proc sort in=tmp out=tmp;
by user_id desending distance;
proc sort in=tmp out=want nodupkey;
04-16-2013 10:19 AM
Concurred with , No silver bullet. Brute force of Cartesian product is the only force you can rely on according to your description. However, in term of performance, many have found that Hash() may have an edge over Proc SQL.
In the following code, smaller table ( presumably Agent) is loaded into Hash(), and you are using Geodist() to get the distance. I have kept all the intermediary variables for your reference, drop or keep them as you will. Try it and let us know if you find the performance is manageable.
if _n_=1 then do;
if 0 then set agent;
declare hash agt(dataset:'agent');
declare hiter hagt('agt');
do rc=hagt.first() by 0 while (rc=0);
_dis=geodist(User_X, User_Y, Agent_x, Agent_y);
if distance >=_dis then do;
05-10-2013 04:04 AM
Thank you both for your help. I used PROC SQL to do it in the end because the hash merge process confused me a bit. It is probably something I should look into and learn how to do.