Help using Base SAS procedures

Fastest way to execute

Accepted Solution Solved
Reply
Contributor
Posts: 27
Accepted Solution

Fastest way to execute

I frequently get requests for records we store in a large sas file. The table has approx 3 million records and 500 fields (about 10gb). The records I need I import into in a small table of approx 2000-3000 records, with only one field. Using PROC SQL is unacceptably slow, this would take probably a fraction of a second on a sql server. Instead SAS takes 12-15 minutes. How can I speed this up?

proc sql;

create table results as

select a.*

from

large_table a

left join

small_table b

on a.record_id = b.record_id

where b.record_id ne '';

quit;

If I have a smaller number of records, generally I'll just put the records in a where condition and load the file into memory (I have 16gb ram) via SASFILE, if there are only a couple hundred records, I get results in maybe 1-2 seconds.

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in ('0000000001','0000000002');

quit;

This method works great!! But there's a limit to how many records I can put in the where statement. I'd like to find a method that works as fast, using a selection criteria based on the contents of a small sas table.

Thanks


Accepted Solutions
Solution
‎03-24-2015 10:35 AM
Respected Advisor
Posts: 3,156

Re: Fastest way to execute

1. If your second solution worked well for you, then why not stick to it and using a subquery to accommodate your small table:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

2. You could try Hash table by putting the small table in the Hash:

data want;

if _n_=1 then do;

declare hash h(dataset:'small_table');

h.definekey('record_id');

h.definedone();

end;

set large_table;

if h.check()=0;

run;

3. If both of your tables are presorted by 'record_id', merge can also be an option:

data want;

merge small(in=s) big(in=b);

by record_id;

if a and b;

run;

BTW, I am puzzled by your first solution, isn't it to be 'right join' in this case?

Haikuo

View solution in original post


All Replies
Valued Guide
Posts: 860

Re: Fastest way to execute

If the second code works quickly I am guessing that the large table is sorted/indexed by ID, for the first piece of code make sure the small table is also sorted/indexed by ID.  You should put the small table first in your code, the first table you reference will be put into your memory.  Use the smaller indexed table first and it might improve performance.

Contributor tlk
Contributor
Posts: 54

Re: Fastest way to execute

Posted in reply to Steelers_In_DC

Why use a left join, an inner join would do it without having to put a where statement.

Contributor
Posts: 27

Re: Fastest way to execute

Posted in reply to Steelers_In_DC

Indexing by ID and putting the small table first reduces time from 20 minutes (I looked back at the log for actual results, I waaay underestimated earlier. Really 20 minutes!!!) to 12 minutes. I'm testing some of the other ideas.

Solution
‎03-24-2015 10:35 AM
Respected Advisor
Posts: 3,156

Re: Fastest way to execute

1. If your second solution worked well for you, then why not stick to it and using a subquery to accommodate your small table:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

2. You could try Hash table by putting the small table in the Hash:

data want;

if _n_=1 then do;

declare hash h(dataset:'small_table');

h.definekey('record_id');

h.definedone();

end;

set large_table;

if h.check()=0;

run;

3. If both of your tables are presorted by 'record_id', merge can also be an option:

data want;

merge small(in=s) big(in=b);

by record_id;

if a and b;

run;

BTW, I am puzzled by your first solution, isn't it to be 'right join' in this case?

Haikuo

Super User
Posts: 5,516

Re: Fastest way to execute

Another approach if the second method worked for you ...

A single macro variable is plenty large enough to hold 3000 quoted IDs.  Why not automate that process:

proc sql noprint;

   select "'" || strip(record_id) || "'" into : macro_var separated by ',' from smaller_table;

quit;

Then use the macro variable:

proc sql;

create table results as

select *

from large_table

where record_id in (&macro_var);

quit;

It still has to be tested for timing, since searching through a list of 3,000 will take longer than searching through a list of 3.  But any solution based on a single table is likely to run faster.

Good luck.

Contributor
Posts: 27

Re: Fastest way to execute

OK, we have a winner Hai.kuo, answer #1. I should have thought to use a subquery, I guess I didn't know how it would execute with sasfile. The following (once sasfile loads) executes in 4.1 seconds:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 6 replies
  • 297 views
  • 3 likes
  • 5 in conversation