I have a file with about 50,000 records and three variables: ID, latitude, and longitude.
I want to calculate how many people are located within 10 km of each person. The following code works, but is slow. It seems to be CPU bound rather than I/O bound (i.e. it runs one core at 100% on a Windows Server 2008 machine, SAS 9.3 on x64). Any suggestions for how to optimize?
proc sql;
  create table dist as
  select id1, n(id2) as N10
  from test (rename=(latitude=lat1 longitude=long1 id=id1)),
       test (rename=(latitude=lat2 longitude=long2 id=id2))
  where geodist(lat1,long1,lat2,long2) < 10
  group by id1;
quit;
You can try this, but I am not sure whether it will be faster.
NOT TESTED
proc sql;
  create table want as
  select a.id as id,
         (select count(*) from test as b
          where geodist(a.latitude,a.longitude,b.latitude,b.longitude) < 10) as n10
  from test as a;
quit;
Ksharp
I think it's hard to optimize this one and keep it as one SQL query.
If you can, try to split your data into (overlapping) smaller chunks, based on geographical position.
Then the Cartesian joins will have less overhead (see the sketch below).
Linus
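For illustration, a hedged sketch of that chunking idea (untested; the band count, band width, and 0.1-degree overlap, roughly 11 km, are illustrative choices, not from the thread):

%macro count_by_band(bands=18, width=10, overlap=0.1);
  %do b=1 %to &bands;
    %let lo=%sysevalf(-90 + (&b-1)*&width);
    %let hi=%sysevalf(-90 + &b*&width);
    proc sql;
      create table band&b as
      select id1, n(id2) as N10
      from test (rename=(latitude=lat1 longitude=long1 id=id1)),
           test (rename=(latitude=lat2 longitude=long2 id=id2))
      /* each person belongs to exactly one band ... */
      where lat1 >= &lo and lat1 < &hi
      /* ... but candidate neighbours may sit in the overlap zone */
        and lat2 >= %sysevalf(&lo-&overlap) and lat2 < %sysevalf(&hi+&overlap)
        and geodist(lat1,long1,lat2,long2) < 10
      group by id1;
    quit;
  %end;
  data dist_banded;            /* stack the per-band counts */
    set band1-band&bands;
  run;
%mend count_by_band;
%count_by_band()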
Yes, see LinusH's suggestion above.
I think this might be quicker.
proc sql;
  create table dist as
  select test1.id,
         count(distinct test2.id) as some_number
  from (select * from test) test1
       cross join
       (select * from test) test2
  where geodist(test1.latitude, test1.longitude,
                test2.latitude, test2.longitude) <= 10
  group by test1.id;
quit;
Partition your data.
A simple criterion would be to ignore any pair whose latitude or longitude coordinates are more than 10 km apart.
That would reduce the number of GEODIST calls.
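For example, a hedged sketch of that pre-filter grafted onto the original query (untested; 0.1 degrees of latitude is at least about 11 km everywhere, so the cheap test cannot drop a true neighbour, while a longitude bound would also help but must widen with latitude):

proc sql;
  create table dist as
  select id1, n(id2) as N10
  from test (rename=(latitude=lat1 longitude=long1 id=id1)),
       test (rename=(latitude=lat2 longitude=long2 id=id2))
  where abs(lat1-lat2) < 0.1                  /* cheap latitude bound */
    and geodist(lat1,long1,lat2,long2) < 10   /* exact distance check */
  group by id1;
quit;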
Partitioning is a good solution.
Otherwise (or rather, in addition), to spark other ideas, on my PC:
- the first SQL takes 16 s,
- the second takes 8 s (it removes obvious latitude mismatches; you might gain more by also working on longitude segments),
- the data step takes 2 s.
data TEST;        /* generate a 2,500-point random sample for benchmarking */
  keep ID LATITUDE LONGITUDE;
  do I=1 to 50;
    do J=1 to 50;
      ID       =I*1e6+J;
      LATITUDE =180*ranuni(0)-90;     /* uniform over the globe */
      LONGITUDE=360*ranuni(0)-180;
      output;
    end;
  end;
run;
/* First benchmark: plain Cartesian join */
proc sql;
  create table DIST1 as
  select ID1, n(ID2) as N10
  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),
       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))
  where 0 < geodist(LA1,LO1,LA2,LO2) < 100
  group by ID1;
quit;
/* Second benchmark: cheap latitude filter (0.91 degrees is roughly 100 km) before GEODIST */
proc sql;
  create table DIST2 as
  select ID1, n(ID2) as N10
  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),
       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))
  where 0 < abs(LA1-LA2) < .91
    and 0 < geodist(LA1,LO1,LA2,LO2) < 100
  group by ID1;
quit;
/* Third benchmark: data step looping over TEST twice with the OPEN/FETCH functions */
data DIST3;
  if 0 then set TEST(keep=ID rename=(ID=ID1)) nobs=NOBS;  /* defines ID1 and NOBS */
  DSID1=open('TEST');
  DSID2=open('TEST');
  do I=1 to NOBS;
    RC =fetch(DSID1);
    ID1=getvarn(DSID1,1);
    LA1=getvarn(DSID1,2);
    LO1=getvarn(DSID1,3);
    N10=0;
    do J=1 to NOBS;
      RC =fetch(DSID2);
      ID2=getvarn(DSID2,1);
      LA2=getvarn(DSID2,2);
      LO2=getvarn(DSID2,3);
      if 0 < abs(LA1-LA2) < .91 then
        if 0 < geodist(LA1,LO1,LA2,LO2) < 100 then N10+1;
    end;
    if N10 then output;
    RC=rewind(DSID2);     /* restart the inner pass */
  end;
  RC=close(DSID1);
  RC=close(DSID2);
  keep ID1 N10;
  stop;
run;
The data step seems to run a bit faster, maybe 15% faster, if I load TEST into memory using the SASFILE statement.
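For reference, a minimal sketch of that SASFILE usage (assuming TEST lives in the WORK library):

sasfile WORK.TEST load;   /* pin the data set in memory */
/* ... run the DIST3 data step here ... */
sasfile WORK.TEST close;  /* release it */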
In case you are wondering why the OP hasn't responded: my original query was four years ago, and I don't have the data handy now to check your strategies. However, please don't let that stop you!
Oops, I didn't see that this thread had been resurrected...
In case this issue arises again for someone, there are probably ways to speed up the code further, e.g. by using CALL SET instead of GETVARN(), or by using arrays.
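For illustration, a hedged sketch of the CALL SET variant (untested; CALL SET links each opened data set's variables to same-named DATA step variables, so every FETCH fills them without GETVARN calls, and the renames inside OPEN keep the two passes from clashing):

data DIST4;
  length ID1 LA1 LO1 ID2 LA2 LO2 8;    /* host variables for CALL SET to link to */
  DSID1=open('TEST(rename=(ID=ID1 LATITUDE=LA1 LONGITUDE=LO1))');
  DSID2=open('TEST(rename=(ID=ID2 LATITUDE=LA2 LONGITUDE=LO2))');
  call set(DSID1);
  call set(DSID2);
  NOBS=attrn(DSID1,'NOBS');
  do I=1 to NOBS;
    RC=fetch(DSID1);                   /* fills ID1, LA1, LO1 directly */
    N10=0;
    do J=1 to NOBS;
      RC=fetch(DSID2);                 /* fills ID2, LA2, LO2 directly */
      if 0 < abs(LA1-LA2) < .91 then
        if 0 < geodist(LA1,LO1,LA2,LO2) < 100 then N10+1;
    end;
    if N10 then output;
    RC=rewind(DSID2);
  end;
  RC=close(DSID1);
  RC=close(DSID2);
  keep ID1 N10;
  stop;
run;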
RIP old thread...
Sorry, I couldn't resist and I had to try!
CALL SET brings significant benefits, but the last benchmark, using arrays, is 50 (fifty!) times faster than the basic SQL Cartesian product.
The full test with 50,000 observations takes less than a minute (this was run on a slightly faster PC).
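The thread doesn't show the array code itself, but a hedged sketch of that approach might look like this (untested; the 50,000-element bounds assume the full data set):

data DIST5;
  array IDS[50000] _temporary_;
  array LAS[50000] _temporary_;
  array LOS[50000] _temporary_;
  do I=1 to NOBS;                      /* load all coordinates into memory once */
    set TEST nobs=NOBS;
    IDS[I]=ID; LAS[I]=LATITUDE; LOS[I]=LONGITUDE;
  end;
  do I=1 to NOBS;                      /* pairwise comparison entirely in memory */
    N10=0;
    do J=1 to NOBS;
      if 0 < abs(LAS[I]-LAS[J]) < .91 then
        if 0 < geodist(LAS[I],LOS[I],LAS[J],LOS[J]) < 100 then N10+1;
    end;
    ID1=IDS[I];
    if N10 then output;
  end;
  keep ID1 N10;
  stop;
run;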
For more reading about SAS programming for speed, see my book:
High-Performance SAS Coding: Christian Graffeuille: 9781512397499: Amazon.com: Books
High-Performance SAS Coding: Polychrome: Christian Graffeuille: 9781514362310: Amazon.com: Books
Very impressive. I guess this could also be done in PROC IML with similar performance. Maybe the code would be even neater?
I can't see how the code could be any neater ...
Please accept my most humble apologies. Your code is indeed very neat. However, IML would not need the code setting up the arrays. (But I'm too lazy to write it.)
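Since the thread leaves it unwritten, here is a hedged sketch of what a PROC IML version might look like (untested; it assumes GEODIST accepts vector arguments elementwise, as Base SAS functions generally do in IML):

proc iml;
use TEST;
read all var {ID LATITUDE LONGITUDE};
close TEST;
n = nrow(ID);
N10 = j(n, 1, 0);
do i = 1 to n;
   /* distances from point i to every point, in one call */
   d = geodist(LATITUDE[i], LONGITUDE[i], LATITUDE, LONGITUDE);
   N10[i] = sum(d > 0 & d < 100);
end;
ID1 = ID;
create DIST_IML var {ID1 N10};
append;
close DIST_IML;
quit;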
No need to apologise, my comment was tongue in cheek.
I had to try a couple more things, now on the full data.
1. Sorting by latitude and stopping the matching attempts as soon as the latitude gap grows too large saves 10 seconds (see the sketch after this list).
2. Partitioning divides run times further: here, creating 6 partitions divides the time by 5, down to 10 seconds.
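A hedged sketch of the sorted early exit in point 1 (untested; it reuses the array approach and the 0.91-degree bound for 100 km):

proc sort data=TEST out=TESTS;
  by LATITUDE;
run;

data DIST6;
  array IDS[50000] _temporary_;
  array LAS[50000] _temporary_;
  array LOS[50000] _temporary_;
  do I=1 to NOBS;                      /* load the sorted coordinates into memory */
    set TESTS nobs=NOBS;
    IDS[I]=ID; LAS[I]=LATITUDE; LOS[I]=LONGITUDE;
  end;
  do I=1 to NOBS;
    N10=0;
    do J=I+1 to NOBS;                  /* scan forward until the gap is too wide */
      if LAS[J]-LAS[I] >= .91 then leave;
      if 0 < geodist(LAS[I],LOS[I],LAS[J],LOS[J]) < 100 then N10+1;
    end;
    do J=I-1 to 1 by -1;               /* scan backward likewise */
      if LAS[I]-LAS[J] >= .91 then leave;
      if 0 < geodist(LAS[I],LOS[I],LAS[J],LOS[J]) < 100 then N10+1;
    end;
    ID1=IDS[I];
    if N10 then output;
  end;
  keep ID1 N10;
  stop;
run;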
I tried PROC DS2, as this process is very suitable for parallelisation, but PROC DS2 seems to only process observations in parallel, not partitions of a DO loop. So it is not suitable here, unless someone better at PROC DS2 than me can show us how it's done.