I have a file with about 50,000 records and three variables: ID, latitude, and longitude.
I want to calculate how many people are located within 10 km of each person. The following code works, but is slow. It seems to be CPU bound rather than I/O bound (i.e. it runs one core at 100% on a Windows Server 2008 machine, SAS 9.3 on x64). Any suggestions for how to optimize?
proc sql;
  create table dist as
  select id1, n(id2) as N10
  from test (rename=(latitude=lat1 longitude=long1 id=id1)),
       test (rename=(latitude=lat2 longitude=long2 id=id2))
  where geodist(lat1,long1,lat2,long2) < 10
  group by id1;
quit;
You can try this, but I am not sure whether it will be faster.
NOT TESTED
proc sql;
  create table want as
  select a.id as id,
         (select count(*) from test as b
          where geodist(a.latitude,a.longitude,b.latitude,b.longitude) < 10) as n10
  from test as a;
quit;
Ksharp
I think it's hard to optimize this one and keep it as one SQL query.
If you can, try to split your data into (overlapping) smaller chunks, based on geographical position.
Then the Cartesian joins will have less overhead (see the sketch below).
Linus
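For illustration, a hedged sketch of that chunking idea (untested; the band count, band width, and 0.1-degree overlap, roughly 11 km, are illustrative choices, not from the thread):

%macro count_by_band(bands=18, width=10, overlap=0.1);
  %do b=1 %to &bands;
    %let lo=%sysevalf(-90 + (&b-1)*&width);
    %let hi=%sysevalf(-90 + &b*&width);
    proc sql;
      create table band&b as
      select id1, n(id2) as N10
      from test (rename=(latitude=lat1 longitude=long1 id=id1)),
           test (rename=(latitude=lat2 longitude=long2 id=id2))
      /* each person belongs to exactly one band ... */
      where lat1 >= &lo and lat1 < &hi
      /* ... but candidate neighbours may sit in the overlap zone */
        and lat2 >= %sysevalf(&lo-&overlap) and lat2 < %sysevalf(&hi+&overlap)
        and geodist(lat1,long1,lat2,long2) < 10
      group by id1;
    quit;
  %end;
  data dist_banded;            /* stack the per-band counts */
    set band1-band&bands;
  run;
%mend count_by_band;
%count_by_band()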
Yes, see LinusH's suggestion above.
I think this might be quicker.
proc sql;
  create table dist as
  select test1.id,
         count(distinct test2.id) as some_number
  from (select * from test) test1
       cross join
       (select * from test) test2
  where geodist(test1.latitude, test1.longitude,
                test2.latitude, test2.longitude) <= 10
  group by test1.id;
quit;
Partition your data.
A simple criterion would be to ignore any pair whose latitude or longitude coordinates are more than 10 km apart.
That would reduce the number of GEODIST calls.
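For example, a hedged sketch of that pre-filter grafted onto the original query (untested; 0.1 degrees of latitude is at least about 11 km everywhere, so the cheap test cannot drop a true neighbour, while a longitude bound would also help but must widen with latitude):

proc sql;
  create table dist as
  select id1, n(id2) as N10
  from test (rename=(latitude=lat1 longitude=long1 id=id1)),
       test (rename=(latitude=lat2 longitude=long2 id=id2))
  where abs(lat1-lat2) < 0.1                  /* cheap latitude bound */
    and geodist(lat1,long1,lat2,long2) < 10   /* exact distance check */
  group by id1;
quit;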
Partitioning is a good solution.
Otherwise (or rather, in addition), to spark other ideas, on my PC:
- the first SQL takes 16 s,
- the second takes 8 s (it removes obvious latitude mismatches; you might gain more by also working on longitude segments),
- the data step takes 2 s.
data TEST;        /* generate a 2,500-point random sample for benchmarking */
  keep ID LATITUDE LONGITUDE;
  do I=1 to 50;
    do J=1 to 50;
      ID       =I*1e6+J;
      LATITUDE =180*ranuni(0)-90;     /* uniform over the globe */
      LONGITUDE=360*ranuni(0)-180;
      output;
    end;
  end;
run;
/* First benchmark: plain Cartesian join */
proc sql;
  create table DIST1 as
  select ID1, n(ID2) as N10
  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),
       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))
  where 0 < geodist(LA1,LO1,LA2,LO2) < 100
  group by ID1;
quit;
/* Second benchmark: cheap latitude filter (0.91 degrees is roughly 100 km) before GEODIST */
proc sql;
  create table DIST2 as
  select ID1, n(ID2) as N10
  from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),
       TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))
  where 0 < abs(LA1-LA2) < .91
    and 0 < geodist(LA1,LO1,LA2,LO2) < 100
  group by ID1;
quit;
/* Third benchmark: data step looping over TEST twice with the OPEN/FETCH functions */
data DIST3;
  if 0 then set TEST(keep=ID rename=(ID=ID1)) nobs=NOBS;  /* defines ID1 and NOBS */
  DSID1=open('TEST');
  DSID2=open('TEST');
  do I=1 to NOBS;
    RC =fetch(DSID1);
    ID1=getvarn(DSID1,1);
    LA1=getvarn(DSID1,2);
    LO1=getvarn(DSID1,3);
    N10=0;
    do J=1 to NOBS;
      RC =fetch(DSID2);
      ID2=getvarn(DSID2,1);
      LA2=getvarn(DSID2,2);
      LO2=getvarn(DSID2,3);
      if 0 < abs(LA1-LA2) < .91 then
        if 0 < geodist(LA1,LO1,LA2,LO2) < 100 then N10+1;
    end;
    if N10 then output;
    RC=rewind(DSID2);     /* restart the inner pass */
  end;
  RC=close(DSID1);
  RC=close(DSID2);
  keep ID1 N10;
  stop;
run;
The data step seems to run a bit faster, maybe 15% faster, if I load TEST into memory using the SASFILE statement.
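For reference, a minimal sketch of that SASFILE usage (assuming TEST lives in the WORK library):

sasfile WORK.TEST load;   /* pin the data set in memory */
/* ... run the DIST3 data step here ... */
sasfile WORK.TEST close;  /* release it */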
In case you are wondering why the OP hasn't responded: my original query was four years ago, and I don't have the data handy now to check your strategies. However, please don't let that stop you!
Oops, I didn't see that this thread had been resurrected...
In case this issue arises again for someone, there are probably ways to speed up the code further, e.g. by using CALL SET instead of GETVARN(), or by using arrays.
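For illustration, a hedged sketch of the CALL SET variant (untested; CALL SET links each opened data set's variables to same-named DATA step variables, so every FETCH fills them without GETVARN calls, and the renames inside OPEN keep the two passes from clashing):

data DIST4;
  length ID1 LA1 LO1 ID2 LA2 LO2 8;    /* host variables for CALL SET to link to */
  DSID1=open('TEST(rename=(ID=ID1 LATITUDE=LA1 LONGITUDE=LO1))');
  DSID2=open('TEST(rename=(ID=ID2 LATITUDE=LA2 LONGITUDE=LO2))');
  call set(DSID1);
  call set(DSID2);
  NOBS=attrn(DSID1,'NOBS');
  do I=1 to NOBS;
    RC=fetch(DSID1);                   /* fills ID1, LA1, LO1 directly */
    N10=0;
    do J=1 to NOBS;
      RC=fetch(DSID2);                 /* fills ID2, LA2, LO2 directly */
      if 0 < abs(LA1-LA2) < .91 then
        if 0 < geodist(LA1,LO1,LA2,LO2) < 100 then N10+1;
    end;
    if N10 then output;
    RC=rewind(DSID2);
  end;
  RC=close(DSID1);
  RC=close(DSID2);
  keep ID1 N10;
  stop;
run;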
RIP old thread...
Sorry, I couldn't resist and I had to try!
CALL SET brings significant benefits, but the last benchmark, using arrays, is 50 (fifty!) times faster than the basic SQL Cartesian product.
The full test with 50,000 observations takes less than a minute (this was run on a slightly faster PC).
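The thread doesn't show the array code itself, but a hedged sketch of that approach might look like this (untested; the 50,000-element bounds assume the full data set):

data DIST5;
  array IDS[50000] _temporary_;
  array LAS[50000] _temporary_;
  array LOS[50000] _temporary_;
  do I=1 to NOBS;                      /* load all coordinates into memory once */
    set TEST nobs=NOBS;
    IDS[I]=ID; LAS[I]=LATITUDE; LOS[I]=LONGITUDE;
  end;
  do I=1 to NOBS;                      /* pairwise comparison entirely in memory */
    N10=0;
    do J=1 to NOBS;
      if 0 < abs(LAS[I]-LAS[J]) < .91 then
        if 0 < geodist(LAS[I],LOS[I],LAS[J],LOS[J]) < 100 then N10+1;
    end;
    ID1=IDS[I];
    if N10 then output;
  end;
  keep ID1 N10;
  stop;
run;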
For more reading about SAS programming for speed, see my book:
High-Performance SAS Coding: Christian Graffeuille: 9781512397499: Amazon.com: Books
High-Performance SAS Coding: Polychrome: Christian Graffeuille: 9781514362310: Amazon.com: Books
Very impressive. I guess this could also be done in PROC IML with similar performance. Maybe the code would be even neater?
I can't see how the code could be any neater ...
Please accept my most humble apologies. Your code is indeed very neat. However, IML would not need the code setting up the arrays. (But I'm too lazy to write it.)
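Since the thread leaves it unwritten, here is a hedged sketch of what a PROC IML version might look like (untested; it assumes GEODIST accepts vector arguments elementwise, as Base SAS functions generally do in IML):

proc iml;
use TEST;
read all var {ID LATITUDE LONGITUDE};
close TEST;
n = nrow(ID);
N10 = j(n, 1, 0);
do i = 1 to n;
   /* distances from point i to every point, in one call */
   d = geodist(LATITUDE[i], LONGITUDE[i], LATITUDE, LONGITUDE);
   N10[i] = sum(d > 0 & d < 100);
end;
ID1 = ID;
create DIST_IML var {ID1 N10};
append;
close DIST_IML;
quit;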
No need to apologise, my comment was tongue in cheek.
I had to try a couple more things, now on the full data.
1. Sorting by latitude and stopping the matching attempts as soon as the latitude gap grows too large saves 10 seconds (see the sketch after this list).
2. Partitioning divides run times further: here, creating 6 partitions divides the time by 5, down to 10 seconds.
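A hedged sketch of the sorted early exit in point 1 (untested; it reuses the array approach and the 0.91-degree bound for 100 km):

proc sort data=TEST out=TESTS;
  by LATITUDE;
run;

data DIST6;
  array IDS[50000] _temporary_;
  array LAS[50000] _temporary_;
  array LOS[50000] _temporary_;
  do I=1 to NOBS;                      /* load the sorted coordinates into memory */
    set TESTS nobs=NOBS;
    IDS[I]=ID; LAS[I]=LATITUDE; LOS[I]=LONGITUDE;
  end;
  do I=1 to NOBS;
    N10=0;
    do J=I+1 to NOBS;                  /* scan forward until the gap is too wide */
      if LAS[J]-LAS[I] >= .91 then leave;
      if 0 < geodist(LAS[I],LOS[I],LAS[J],LOS[J]) < 100 then N10+1;
    end;
    do J=I-1 to 1 by -1;               /* scan backward likewise */
      if LAS[I]-LAS[J] >= .91 then leave;
      if 0 < geodist(LAS[I],LOS[I],LAS[J],LOS[J]) < 100 then N10+1;
    end;
    ID1=IDS[I];
    if N10 then output;
  end;
  keep ID1 N10;
  stop;
run;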
I tried PROC DS2, as this process is very suitable for parallelisation, but PROC DS2 seems to only process observations in parallel, not partitions of a DO loop. So it is not suitable here, unless someone better at PROC DS2 than me can show us how it's done.