DATA Step, Macro, Functions and more

SQL Cartesian product optimization

Reply
Frequent Contributor
Posts: 129

SQL Cartesian product optimization

I have a file with variables, ID, latitude and longitude, and with about 50,000 records.

I want to calculate how many people are located within 10km of each person. The following code works, but is slow. It seems to be CPU rather than I/O bound (ie it is running one core at 100% on a windows server 2008 machine). (Running SAS 9.3 on x64). Any suggestions for how to optimize?

proc sql;

create table dist as

   select id1, n(id2) as N10

     from test (rename=(latitude=lat1 longitude=long1 id=id1)),

          test (rename=(latitude=lat2 longitude=long2 id=id2))

     where geodist(lat1,long1,lat2,long2)<10

     group by id1 ;

quit;

Super User
Posts: 10,035

SQL Cartesian product optimization

Posted in reply to BruceBrad

You can try it. But I am not sure whether this code will be faster.

NOT TESTED

proc sql;
create table want as
select a.id as id,(select count(*) from test as b where geodist(a.lat,a.long,b.lat,b.long) < 10) as n10
 from test as a;quit;

Ksharp

Super User
Posts: 5,430

SQL Cartesian product optimization

Posted in reply to BruceBrad

I think it's hard to optimize this one and keep it as one sql.

If you can, try to split your data into (overlapping) smaller chunks of data, based on geographical position.

Then doing cartesian joins will give smaller overhead.

Linus

Data never sleeps
Regular Contributor
Posts: 184

Re: SQL Cartesian product optimization

Yes. See

LinusH wrote:

I think it's hard to optimize this one and keep it as one sql.

If you can, try to split your data into (overlapping) smaller chunks of data, based on geographical position.

Then doing cartesian joins will give smaller overhead.

Linus

New User
Posts: 1

Re: SQL Cartesian product optimization

Posted in reply to BruceBrad

I think this might be quicker.

proc sql;

    create table dist as

        select test1.id,

        count(distinct test2.id) as some_number,

        from

        (select * from test) test1

        cross join

        (select * from test) test2

        where geodist(test1.latitude, test1.longitude,

                    test2.latitude, test2.) <= 10

        group by test1.id

        ;

    quit;

    run;

Valued Guide
Posts: 2,177

Re: SQL Cartesian product optimization

Posted in reply to BruceBrad

Partition your data

the simple criteria would be to ignore any pair where the X coords  or Y coords are more than 10k apart

that would reduce the geodist calls

PROC Star
Posts: 1,760

Re: SQL Cartesian product optimization

Posted in reply to BruceBrad

Partitioning is a good solution.

Otherwise (or rather, also) to jolt other ideas:

- the first SQL takes 16s,

- the second takes 8s (remove obvious latitude mismatches, you might gain more by working on longitude segments)

- the data step takes 2s

on my PC.

data TEST;

  keep ID LATITUDE LONGITUDE;

  do I=1 to 50;

    do J=1 to 50;

      ID =I*1e6+J;

      LATITUDE =180*ranuni(0)-90;

      LONGITUDE=360*ranuni(0)-180;

      output;

    end;

  end;

run;

proc sql;

  create table DIST1 as

  select ID1, n( ID2) as N10

  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),

       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))

  where 0< geodist(LA1,LO1,LA2,LO2)<100

  group by ID1 ;

quit;

proc sql;

  create table DIST2 as

  select ID1, n( ID2) as N10

  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),

       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))

  where 0 < abs(LA1 -LA2 ) < .91

    and 0< geodist(LA1,LO1,LA2,LO2)<100

  group by ID1 ;

quit;

data DIST3;

   if 0 then set TEST(keep=ID rename=(ID=ID1)) nobs=NOBS;

  DSID1=open('TEST');         

  DSID2=open('TEST');         

   do I=1 to NOBS;

    RC =fetch  (DSID1);

    ID1 =getvarn(DSID1,1);

    LA1 =getvarn(DSID1,2);

    LO1 =getvarn(DSID1,3);

    N10=0;

do J=1 to NOBS;

      RC =fetch  (DSID2);

      ID2 =getvarn(DSID2,1);

      LA2 =getvarn(DSID2,2);

      LO2 =getvarn(DSID2,3);

      if 0 < abs(LA1 -LA2 ) < .91 then if 0< geodist(LA1,LO1,LA2,LO2)<100 then  N10+1;

    end;

    if N10 then output;

    RC =rewind(DSID2);

  end;

RC =close(DSID1);

RC =close(DSID2);

  keep ID1 N10;

  stop;

run;

PROC Star
Posts: 1,760

Re: SQL Cartesian product optimization

The data step seems to run a bit faster if I load TEST in memory using the SASFILE statement. Maybe 15% faster.

Frequent Contributor
Posts: 129

Re: SQL Cartesian product optimization

In case you are wondering why the OP hasn't responded - my original query was 4 years ago, and I don't have the data handy now to check your strategies. However, please don't let that stop you :smileygrin:

PROC Star
Posts: 1,760

Re: SQL Cartesian product optimization

Posted in reply to BruceBrad

Oops, I didn't see that this thread had been resurrected.. Smiley Happy

In case this issue arises again for someone, there are probably ways to speed up the code further, for eg by using call set instead of getvarn(), or by using arrays.

RIP old thread...

PROC Star
Posts: 1,760

Re: SQL Cartesian product optimization

Posted in reply to BruceBrad

Sorry, I couldn't resist and I had to try!

call set brings significant benefits, but the last benchmark, using arrays, is  50 ( fifty ! ) times faster than the basic SQL cartesian product.

The full test with 50,000 observations takes less than a minute (This was run on a slightly faster PC).

For more reading about SAS programming for speed, see my book:

High-Performance SAS Coding: Christian Graffeuille: 9781512397499: Amazon.com: Books

High-Performance SAS Coding: Polychrome: Christian Graffeuille: 9781514362310: Amazon.com: Books

Smiley Happy

Frequent Contributor
Posts: 129

Re: SQL Cartesian product optimization

Very impressive. I guess this could also be done in proc iml with a similar performance. Maybe the code would be even neater?

PROC Star
Posts: 1,760

Re: SQL Cartesian product optimization

Posted in reply to BruceBrad

I can't see how the code could be any neater ...  Smiley Wink

Frequent Contributor
Posts: 129

Re: SQL Cartesian product optimization

Please accept my most humble apologies. Your code is indeed very neat Smiley Happy.  However, IML would not need the code setting up the arrays. (But I'm too lazy to write it).

PROC Star
Posts: 1,760

Re: SQL Cartesian product optimization

Posted in reply to BruceBrad

No need to apologise, my comment was tongue in cheek. Smiley Happy

I had to try a a couple more things, now on the full data.

1- 10 seconds saving when sorting by latitude and stopping the matching attempts as soon as values are too high.

2- Partitioning divides times further. Here, times are divided by 5 by creating 6 partitions; down to 10s.

I tried proc DS2 as this process is very suitable for parallelisation, but proc DS2 seems to only process observations in parallel, not do loop partitions. So not suitable here, unless someone better at proc ds2 than me can show us how it's done.

Ask a Question
Discussion stats
  • 14 replies
  • 1238 views
  • 2 likes
  • 7 in conversation