Solved: Re: Fastest way to execute

rwnj · Posted 03-24-2015 10:13 AM

I frequently get requests for records we store in a large sas file. The table has approx 3 million records and 500 fields (about 10gb). The records I need I import into in a small table of approx 2000-3000 records, with only one field. Using PROC SQL is unacceptably slow, this would take probably a fraction of a second on a sql server. Instead SAS takes 12-15 minutes. How can I speed this up?

proc sql;

create table results as

select a.*

from

large_table a

left join

small_table b

on a.record_id = b.record_id

where b.record_id ne '';

quit;

If I have a smaller number of records, generally I'll just put the records in a where condition and load the file into memory (I have 16gb ram) via SASFILE, if there are only a couple hundred records, I get results in maybe 1-2 seconds.

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in ('0000000001','0000000002');

quit;

This method works great!! But there's a limit to how many records I can put in the where statement. I'd like to find a method that works as fast, using a selection criteria based on the contents of a small sas table.

Thanks

Haikuo · Posted 03-24-2015 10:35 AM

1. If your second solution worked well for you, then why not stick to it and using a subquery to accommodate your small table:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

2. You could try Hash table by putting the small table in the Hash:

data want;

if _n_=1 then do;

declare hash h(dataset:'small_table');

h.definekey('record_id');

h.definedone();

end;

set large_table;

if h.check()=0;

run;

3. If both of your tables are presorted by 'record_id', merge can also be an option:

data want;

merge small(in=s) big(in=b);

by record_id;

if a and b;

run;

BTW, I am puzzled by your first solution, isn't it to be 'right join' in this case?

Haikuo

View solution in original post

Steelers_In_DC · Posted 03-24-2015 10:19 AM

If the second code works quickly I am guessing that the large table is sorted/indexed by ID, for the first piece of code make sure the small table is also sorted/indexed by ID. You should put the small table first in your code, the first table you reference will be put into your memory. Use the smaller indexed table first and it might improve performance.

tlk · Posted 03-24-2015 10:25 AM

Why use a left join, an inner join would do it without having to put a where statement.

rwnj · Posted 03-24-2015 10:43 AM

Indexing by ID and putting the small table first reduces time from 20 minutes (I looked back at the log for actual results, I waaay underestimated earlier. Really 20 minutes!!!) to 12 minutes. I'm testing some of the other ideas.

Haikuo · Posted 03-24-2015 10:35 AM

1. If your second solution worked well for you, then why not stick to it and using a subquery to accommodate your small table:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

2. You could try Hash table by putting the small table in the Hash:

data want;

if _n_=1 then do;

declare hash h(dataset:'small_table');

h.definekey('record_id');

h.definedone();

end;

set large_table;

if h.check()=0;

run;

3. If both of your tables are presorted by 'record_id', merge can also be an option:

data want;

merge small(in=s) big(in=b);

by record_id;

if a and b;

run;

BTW, I am puzzled by your first solution, isn't it to be 'right join' in this case?

Haikuo

Astounding · Posted 03-24-2015 10:48 AM

Another approach if the second method worked for you ...

A single macro variable is plenty large enough to hold 3000 quoted IDs. Why not automate that process:

proc sql noprint;

select "'" || strip(record_id) || "'" into : macro_var separated by ',' from smaller_table;

quit;

Then use the macro variable:

proc sql;

create table results as

select *

from large_table

where record_id in (&macro_var);

quit;

It still has to be tested for timing, since searching through a list of 3,000 will take longer than searching through a list of 3. But any solution based on a single table is likely to run faster.

Good luck.

rwnj · Posted 03-24-2015 10:49 AM

OK, we have a winner Hai.kuo, answer #1. I should have thought to use a subquery, I guess I didn't know how it would execute with sasfile. The following (once sasfile loads) executes in 4.1 seconds:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

Registration is open

SAS Training: Just a Click Away