## SQL Cartesian product optimization

Regular Contributor
Posts: 151

# SQL Cartesian product optimization

I have a file with variables, ID, latitude and longitude, and with about 50,000 records.

I want to calculate how many people are located within 10km of each person. The following code works, but is slow. It seems to be CPU rather than I/O bound (ie it is running one core at 100% on a windows server 2008 machine). (Running SAS 9.3 on x64). Any suggestions for how to optimize?

proc sql;

create table dist as

select id1, n(id2) as N10

from test (rename=(latitude=lat1 longitude=long1 id=id1)),

test (rename=(latitude=lat2 longitude=long2 id=id2))

where geodist(lat1,long1,lat2,long2)<10

group by id1 ;

quit;

Super User
Posts: 10,778

## SQL Cartesian product optimization

You can try it. But I am not sure whether this code will be faster.

NOT TESTED

```proc sql;
create table want as
select a.id as id,(select count(*) from test as b where geodist(a.lat,a.long,b.lat,b.long) < 10) as n10
from test as a;quit;

```

Ksharp

Super User
Posts: 5,879

## SQL Cartesian product optimization

I think it's hard to optimize this one and keep it as one sql.

If you can, try to split your data into (overlapping) smaller chunks of data, based on geographical position.

Then doing cartesian joins will give smaller overhead.

Linus

Data never sleeps
Regular Contributor
Posts: 184

## Re: SQL Cartesian product optimization

Yes. See

`LinusH wrote:I think it's hard to optimize this one and keep it as one sql.If you can, try to split your data into (overlapping) smaller chunks of data, based on geographical position.Then doing cartesian joins will give smaller overhead.Linus`
New User
Posts: 1

## Re: SQL Cartesian product optimization

I think this might be quicker.

proc sql;

create table dist as

select test1.id,

count(distinct test2.id) as some_number,

from

(select * from test) test1

cross join

(select * from test) test2

where geodist(test1.latitude, test1.longitude,

test2.latitude, test2.) <= 10

group by test1.id

;

quit;

run;

Valued Guide
Posts: 2,191

## Re: SQL Cartesian product optimization

the simple criteria would be to ignore any pair where the X coords  or Y coords are more than 10k apart

that would reduce the geodist calls

PROC Star
Posts: 2,363

## Re: SQL Cartesian product optimization

Partitioning is a good solution.

Otherwise (or rather, also) to jolt other ideas:

- the first SQL takes 16s,

- the second takes 8s (remove obvious latitude mismatches, you might gain more by working on longitude segments)

- the data step takes 2s

on my PC.

data TEST;

keep ID LATITUDE LONGITUDE;

do I=1 to 50;

do J=1 to 50;

ID =I*1e6+J;

LATITUDE =180*ranuni(0)-90;

LONGITUDE=360*ranuni(0)-180;

output;

end;

end;

run;

proc sql;

create table DIST1 as

select ID1, n( ID2) as N10

from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),

TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))

where 0< geodist(LA1,LO1,LA2,LO2)<100

group by ID1 ;

quit;

proc sql;

create table DIST2 as

select ID1, n( ID2) as N10

from TEST (rename=(LATITUDE=LA1 LONGITUDE=LO1 ID=ID1)),

TEST (rename=(LATITUDE=LA2 LONGITUDE=LO2 ID=ID2))

where 0 < abs(LA1 -LA2 ) < .91

and 0< geodist(LA1,LO1,LA2,LO2)<100

group by ID1 ;

quit;

data DIST3;

if 0 then set TEST(keep=ID rename=(ID=ID1)) nobs=NOBS;

DSID1=open('TEST');

DSID2=open('TEST');

do I=1 to NOBS;

RC =fetch  (DSID1);

ID1 =getvarn(DSID1,1);

LA1 =getvarn(DSID1,2);

LO1 =getvarn(DSID1,3);

N10=0;

do J=1 to NOBS;

RC =fetch  (DSID2);

ID2 =getvarn(DSID2,1);

LA2 =getvarn(DSID2,2);

LO2 =getvarn(DSID2,3);

if 0 < abs(LA1 -LA2 ) < .91 then if 0< geodist(LA1,LO1,LA2,LO2)<100 then  N10+1;

end;

if N10 then output;

RC =rewind(DSID2);

end;

RC =close(DSID1);

RC =close(DSID2);

keep ID1 N10;

stop;

run;

PROC Star
Posts: 2,363

## Re: SQL Cartesian product optimization

The data step seems to run a bit faster if I load TEST in memory using the SASFILE statement. Maybe 15% faster.

Regular Contributor
Posts: 151

## Re: SQL Cartesian product optimization

In case you are wondering why the OP hasn't responded - my original query was 4 years ago, and I don't have the data handy now to check your strategies. However, please don't let that stop you :smileygrin:

PROC Star
Posts: 2,363

## Re: SQL Cartesian product optimization

Oops, I didn't see that this thread had been resurrected..

In case this issue arises again for someone, there are probably ways to speed up the code further, for eg by using call set instead of getvarn(), or by using arrays.

PROC Star
Posts: 2,363

## Re: SQL Cartesian product optimization

Sorry, I couldn't resist and I had to try!

call set brings significant benefits, but the last benchmark, using arrays, is  50 ( fifty ! ) times faster than the basic SQL cartesian product.

The full test with 50,000 observations takes less than a minute (This was run on a slightly faster PC).

For more reading about SAS programming for speed, see my book:

High-Performance SAS Coding: Christian Graffeuille: 9781512397499: Amazon.com: Books

High-Performance SAS Coding: Polychrome: Christian Graffeuille: 9781514362310: Amazon.com: Books

Regular Contributor
Posts: 151

## Re: SQL Cartesian product optimization

Very impressive. I guess this could also be done in proc iml with a similar performance. Maybe the code would be even neater?

PROC Star
Posts: 2,363

## Re: SQL Cartesian product optimization

I can't see how the code could be any neater ...

Regular Contributor
Posts: 151

## Re: SQL Cartesian product optimization

Please accept my most humble apologies. Your code is indeed very neat .  However, IML would not need the code setting up the arrays. (But I'm too lazy to write it).

PROC Star
Posts: 2,363

## Re: SQL Cartesian product optimization

No need to apologise, my comment was tongue in cheek.

I had to try a a couple more things, now on the full data.

1- 10 seconds saving when sorting by latitude and stopping the matching attempts as soon as values are too high.

2- Partitioning divides times further. Here, times are divided by 5 by creating 6 partitions; down to 10s.

I tried proc DS2 as this process is very suitable for parallelisation, but proc DS2 seems to only process observations in parallel, not do loop partitions. So not suitable here, unless someone better at proc ds2 than me can show us how it's done.

Discussion stats
• 14 replies
• 1512 views
• 2 likes
• 7 in conversation