BruceBrad
Lapis Lazuli | Level 10

I have a file with the variables ID, latitude and longitude, and about 50,000 records.

I want to calculate how many people are located within 10 km of each person. The following code works, but is slow. It seems to be CPU-bound rather than I/O-bound (i.e. it runs one core at 100% on a Windows Server 2008 machine, SAS 9.3 on x64). Any suggestions for how to optimize?

proc sql;
  create table dist as
  select id1, n(id2) as N10
  from test (rename=(latitude=lat1 longitude=long1 id=id1)),
       test (rename=(latitude=lat2 longitude=long2 id=id2))
  where geodist(lat1, long1, lat2, long2) < 10
  group by id1;
quit;

14 REPLIES
Ksharp
Super User

You can try this, but I am not sure whether it will be faster.

NOT TESTED

proc sql;
  create table want as
  select a.id as id,
         (select count(*) from test as b
          where geodist(a.latitude, a.longitude, b.latitude, b.longitude) < 10) as n10
  from test as a;
quit;

Ksharp

LinusH
Tourmaline | Level 20

I think it's hard to optimize this one and keep it as a single SQL query.

If you can, try splitting your data into (overlapping) smaller chunks based on geographical position.

The Cartesian joins within each chunk will then carry much less overhead.
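A minimal sketch of that idea (untested; the 0.1-degree band width is an assumption, since 10 km is roughly 0.09 degrees of latitude):

data left(keep=id latitude longitude band);
  set test;
  band = floor(latitude / 0.1);              /* each point's own band */
run;

data right(keep=id latitude longitude band);
  set test;
  /* the lookup copy also lands in the two neighbouring bands (the overlap),
     so no pair straddling a band boundary is lost */
  do band = floor(latitude / 0.1) - 1 to floor(latitude / 0.1) + 1;
    output;
  end;
run;

proc sql;
  create table dist as
  select a.id as id1, count(b.id) as N10
  from left as a, right as b
  where a.band = b.band                      /* equi-join, no full Cartesian */
    and geodist(a.latitude, a.longitude, b.latitude, b.longitude) < 10
  group by a.id;
quit;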

Linus

Data never sleeps
Howles
Quartz | Level 8

Yes. See

LinusH wrote:

I think it's hard to optimize this one and keep it as a single SQL query.

If you can, try splitting your data into (overlapping) smaller chunks based on geographical position.

The Cartesian joins within each chunk will then carry much less overhead.

Linus

trubee
Calcite | Level 5

I think this might be quicker.

proc sql;
  create table dist as
  select test1.id,
         count(distinct test2.id) as some_number
  from test as test1
  cross join
       test as test2
  where geodist(test1.latitude, test1.longitude,
                test2.latitude, test2.longitude) <= 10
  group by test1.id;
quit;

Peter_C
Rhodochrosite | Level 12

Partition your data.

A simple criterion would be to ignore any pair whose X or Y coordinates are more than 10 km apart; that would reduce the number of GEODIST calls.
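A sketch of that filter bolted onto the original query (untested): 0.09 degrees of latitude is roughly 10 km, and the longitude cutoff widens by 1/cos(latitude) because meridians converge. It misbehaves near the poles and across the date line, but for typical data the two cheap tests screen out almost every pair before GEODIST runs.

proc sql;
  create table dist as
  select id1, n(id2) as N10
  from test (rename=(latitude=lat1 longitude=long1 id=id1)),
       test (rename=(latitude=lat2 longitude=long2 id=id2))
  where abs(lat1 - lat2) < 0.09                                  /* cheap Y test */
    and abs(long1 - long2) < 0.09 / cos(lat1*constant('pi')/180) /* cheap X test */
    and geodist(lat1, long1, lat2, long2) < 10                   /* exact test last */
  group by id1;
quit;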

ChrisNZ
Tourmaline | Level 20

Partitioning is a good solution.

Otherwise (or rather, in addition), to jolt other ideas, on my PC:

- the first SQL takes 16 s,

- the second takes 8 s (it removes obvious latitude mismatches first; you might gain more by also working on longitude segments),

- the data step takes 2 s.

data TEST;
  keep ID LATITUDE LONGITUDE;
  do I=1 to 50;
    do J=1 to 50;
      ID=I*1e6+J;
      LATITUDE =180*ranuni(0)-90;
      LONGITUDE=360*ranuni(0)-180;
      output;
    end;
  end;
run;

proc sql;
  create table DIST1 as
  select ID1, n(ID2) as N10
  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),
       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))
  where 0 < geodist(LA1,LO1,LA2,LO2) < 100
  group by ID1;
quit;

proc sql;
  create table DIST2 as
  select ID1, n(ID2) as N10
  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),
       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))
  where 0 < abs(LA1-LA2) < .91
    and 0 < geodist(LA1,LO1,LA2,LO2) < 100
  group by ID1;
quit;

data DIST3;
  /* define ID1 and capture the observation count at compile time */
  if 0 then set TEST(keep=ID rename=(ID=ID1)) nobs=NOBS;
  DSID1=open('TEST');
  DSID2=open('TEST');
  do I=1 to NOBS;
    RC =fetch(DSID1);
    ID1=getvarn(DSID1,1);
    LA1=getvarn(DSID1,2);
    LO1=getvarn(DSID1,3);
    N10=0;
    /* inner pass: compare observation I against every observation */
    do J=1 to NOBS;
      RC =fetch(DSID2);
      ID2=getvarn(DSID2,1);
      LA2=getvarn(DSID2,2);
      LO2=getvarn(DSID2,3);
      /* cheap latitude test first; GEODIST only when it passes */
      if 0 < abs(LA1-LA2) < .91 then
        if 0 < geodist(LA1,LO1,LA2,LO2) < 100 then N10+1;
    end;
    if N10 then output;
    RC=rewind(DSID2);   /* restart the inner scan for the next I */
  end;
  RC=close(DSID1);
  RC=close(DSID2);
  keep ID1 N10;
  stop;
run;

ChrisNZ
Tourmaline | Level 20

The data step seems to run about 15% faster if I load TEST into memory using the SASFILE statement.

BruceBrad
Lapis Lazuli | Level 10

In case you are wondering why the OP hasn't responded: my original query was 4 years ago, and I don't have the data handy now to check your strategies. However, please don't let that stop you. :)

ChrisNZ
Tourmaline | Level 20

Oops, I didn't see that this thread had been resurrected... :)

In case this issue arises again for someone: there are probably ways to speed up the code further, e.g. by using CALL SET instead of GETVARN(), or by using arrays.
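For the record, a sketch of the CALL SET variant (untested): CALL SET binds the opened data set's variables to DATA step variables of the same name, so each FETCH fills them directly and the three GETVARN calls per row disappear from the inner loop. The RENAME= options on OPEN keep the two handles' variables apart.

data dist3b;
  if 0 then set test(keep=id rename=(id=id1)) nobs=nobs;
  dsid1 = open('test(rename=(id=id1 latitude=la1 longitude=lo1))');
  dsid2 = open('test(rename=(id=id2 latitude=la2 longitude=lo2))');
  call set(dsid1);                 /* FETCH now fills ID1, LA1, LO1 */
  call set(dsid2);                 /* FETCH now fills LA2, LO2 */
  do i = 1 to nobs;
    rc = fetch(dsid1);
    n10 = 0;
    do j = 1 to nobs;
      rc = fetch(dsid2);
      if 0 < abs(la1-la2) < .91 then
        if 0 < geodist(la1,lo1,la2,lo2) < 100 then n10 + 1;
    end;
    if n10 then output;
    rc = rewind(dsid2);            /* restart the inner scan */
  end;
  rc = close(dsid1);
  rc = close(dsid2);
  keep id1 n10;
  stop;
run;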

RIP old thread...

ChrisNZ
Tourmaline | Level 20

Sorry, I couldn't resist and had to try!

CALL SET brings significant benefits, but the last benchmark, using arrays, is 50 (fifty!) times faster than the basic SQL Cartesian product.

The full test with 50,000 observations takes less than a minute (this was run on a slightly faster PC).
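The array code was not posted, but presumably it looks something like this (a guess, untested). The whole table is loaded into temporary arrays first, so the double loop touches only memory, with no function-call or I/O overhead per comparison:

data dist4;
  array ids[50000] _temporary_;    /* sized for the 50,000-row test */
  array la [50000] _temporary_;
  array lo [50000] _temporary_;
  /* pass 1: load the whole data set into memory */
  do i = 1 to nobs;
    set test nobs=nobs point=i;
    ids[i] = id;
    la[i]  = latitude;
    lo[i]  = longitude;
  end;
  /* pass 2: pure in-memory double loop, cheap latitude test first */
  do i = 1 to nobs;
    n10 = 0;
    do j = 1 to nobs;
      if 0 < abs(la[i]-la[j]) < .91 then
        if 0 < geodist(la[i],lo[i],la[j],lo[j]) < 100 then n10 + 1;
    end;
    if n10 then do;
      id1 = ids[i];
      output;
    end;
  end;
  keep id1 n10;
  stop;
run;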

For more reading about SAS programming for speed, see my books:

High-Performance SAS Coding: Christian Graffeuille: 9781512397499: Amazon.com: Books

High-Performance SAS Coding: Polychrome: Christian Graffeuille: 9781514362310: Amazon.com: Books

:)

BruceBrad
Lapis Lazuli | Level 10

Very impressive. I guess this could also be done in PROC IML with similar performance. Maybe the code would be even neater?

ChrisNZ
Tourmaline | Level 20

I can't see how the code could be any neater... ;)

BruceBrad
Lapis Lazuli | Level 10

Please accept my most humble apologies. Your code is indeed very neat. :) However, IML would not need the code setting up the arrays. (But I'm too lazy to write it.)

ChrisNZ
Tourmaline | Level 20

No need to apologise, my comment was tongue in cheek. :)

I had to try a couple more things, now on the full data:

1. Sorting by latitude and stopping the matching attempts as soon as the values get too high saves 10 seconds (sketched below).

2. Partitioning divides the times further: creating 6 partitions here divides the times by 5, down to 10 s.

I tried PROC DS2, as this process is very suitable for parallelisation, but PROC DS2 seems to only process observations in parallel, not DO-loop partitions, so it is not suitable here, unless someone better at PROC DS2 than me can show us how it's done.
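In case anyone wants to try the sort-and-stop idea from point 1, a sketch (untested): with the data sorted by latitude, the inner scan only needs to walk forward from I until the latitude gap exceeds the cutoff, and each qualifying pair is found once and credited to both points.

proc sort data=test out=test_sorted;
  by latitude;
run;

data dist5;
  array ids[50000] _temporary_;
  array la [50000] _temporary_;
  array lo [50000] _temporary_;
  array cnt[50000] _temporary_ (50000*0);
  do i = 1 to nobs;
    set test_sorted nobs=nobs point=i;
    ids[i] = id;
    la[i]  = latitude;
    lo[i]  = longitude;
  end;
  do i = 1 to nobs - 1;
    do j = i + 1 to nobs;
      if la[j] - la[i] >= .91 then leave;   /* sorted: no later J can match */
      if 0 < geodist(la[i],lo[i],la[j],lo[j]) < 100 then do;
        cnt[i] = cnt[i] + 1;                /* count the pair once, credit both */
        cnt[j] = cnt[j] + 1;
      end;
    end;
  end;
  do i = 1 to nobs;
    if cnt[i] then do;
      id1 = ids[i];
      N10 = cnt[i];
      output;
    end;
  end;
  keep id1 N10;
  stop;
run;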
