Solved: How to speed up proc sql when joining a large datset to a small datase...

metuzalem · Posted 02-27-2018 02:51 AM

I am running the following sql-step:

proc sql;

create table id_map as select

a.securityId

,b.name

,b.companyId

from eq_securityIds as a

left join FOUNDATION_SECURITY as b on a.securityId=b.securityId

;

quit;

where the dataset eq_securityId consists of ~16000 observations (here denoted a small dataset) and the dataset FOUNDATION_SECURITY consist of ~60 000 000 observations (here denoted a large dataset).

As of now the runtime to execute this command is about 10min, is there anything I can do to speed up the sql-step? e.g, is it in general faster to join a small dataset on to a large dataset instead of me joining the large dataset onto the small dataset? could some kind of indexation of the large dataset help me speeding up the computations?

I noted that the command ran a lot faster (~3min) when I excluded the variable b.name in the same sql-step, is this a general result or was it just a fluke?

I run SAS 9.4 (in enterprise guide 7.1)

LinusH · Posted 02-27-2018 05:00 AM

Indexing the large table is a good idea, especially if it's used in other use cases.

This would might trigger the hash join method in SQL, which is quite efficient (use _method PROC SQL option).

Data never sleeps

View solution in original post

Kurt_Bremser · Posted 02-27-2018 03:02 AM

Keep both datasets sorted by securityid, and use a data step merge for the selection of your subset.

Alternatively, create a format from eq_securityIds where all contained securityid keys get a label of 'yes' and the OTHER value a 'no'. Then you can use the format in a subsetting if in a data step.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

LinusH · Posted 02-27-2018 05:00 AM

Indexing the large table is a good idea, especially if it's used in other use cases.

This would might trigger the hash join method in SQL, which is quite efficient (use _method PROC SQL option).

Data never sleeps

metuzalem · Posted 02-27-2018 08:46 AM

Thx!

by running the command

proc sql;

create index securityId on FOUNDATION_SECURITY;

quit;

the sql-step is executed a lot faster. It takes about 2min to create the index but since I am running similar join-queries further down the in the program with the large dataset then this helped a lot. Thx!

Ksharp · Posted 02-27-2018 07:31 AM

1) As @LinusH said , make a index on the large table is a good choice.

create index securityid on BigTable;

2)Hash Table

How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

Re: How to speed up proc sql when joining a large datset to a small dataset?

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!