BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
rwnj
Calcite | Level 5

I frequently get requests for records we store in a large sas file. The table has approx 3 million records and 500 fields (about 10gb). The records I need I import into in a small table of approx 2000-3000 records, with only one field. Using PROC SQL is unacceptably slow, this would take probably a fraction of a second on a sql server. Instead SAS takes 12-15 minutes. How can I speed this up?

proc sql;

create table results as

select a.*

from

large_table a

left join

small_table b

on a.record_id = b.record_id

where b.record_id ne '';

quit;

If I have a smaller number of records, generally I'll just put the records in a where condition and load the file into memory (I have 16gb ram) via SASFILE, if there are only a couple hundred records, I get results in maybe 1-2 seconds.

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in ('0000000001','0000000002');

quit;

This method works great!! But there's a limit to how many records I can put in the where statement. I'd like to find a method that works as fast, using a selection criteria based on the contents of a small sas table.

Thanks

1 ACCEPTED SOLUTION

Accepted Solutions
Haikuo
Onyx | Level 15

1. If your second solution worked well for you, then why not stick to it and using a subquery to accommodate your small table:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

2. You could try Hash table by putting the small table in the Hash:

data want;

if _n_=1 then do;

declare hash h(dataset:'small_table');

h.definekey('record_id');

h.definedone();

end;

set large_table;

if h.check()=0;

run;

3. If both of your tables are presorted by 'record_id', merge can also be an option:

data want;

merge small(in=s) big(in=b);

by record_id;

if a and b;

run;

BTW, I am puzzled by your first solution, isn't it to be 'right join' in this case?

Haikuo

View solution in original post

6 REPLIES 6
Steelers_In_DC
Barite | Level 11

If the second code works quickly I am guessing that the large table is sorted/indexed by ID, for the first piece of code make sure the small table is also sorted/indexed by ID.  You should put the small table first in your code, the first table you reference will be put into your memory.  Use the smaller indexed table first and it might improve performance.

tlk
Quartz | Level 8 tlk
Quartz | Level 8

Why use a left join, an inner join would do it without having to put a where statement.

rwnj
Calcite | Level 5

Indexing by ID and putting the small table first reduces time from 20 minutes (I looked back at the log for actual results, I waaay underestimated earlier. Really 20 minutes!!!) to 12 minutes. I'm testing some of the other ideas.

Haikuo
Onyx | Level 15

1. If your second solution worked well for you, then why not stick to it and using a subquery to accommodate your small table:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

2. You could try Hash table by putting the small table in the Hash:

data want;

if _n_=1 then do;

declare hash h(dataset:'small_table');

h.definekey('record_id');

h.definedone();

end;

set large_table;

if h.check()=0;

run;

3. If both of your tables are presorted by 'record_id', merge can also be an option:

data want;

merge small(in=s) big(in=b);

by record_id;

if a and b;

run;

BTW, I am puzzled by your first solution, isn't it to be 'right join' in this case?

Haikuo

Astounding
PROC Star

Another approach if the second method worked for you ...

A single macro variable is plenty large enough to hold 3000 quoted IDs.  Why not automate that process:

proc sql noprint;

   select "'" || strip(record_id) || "'" into : macro_var separated by ',' from smaller_table;

quit;

Then use the macro variable:

proc sql;

create table results as

select *

from large_table

where record_id in (&macro_var);

quit;

It still has to be tested for timing, since searching through a list of 3,000 will take longer than searching through a list of 3.  But any solution based on a single table is likely to run faster.

Good luck.

rwnj
Calcite | Level 5

OK, we have a winner Hai.kuo, answer #1. I should have thought to use a subquery, I guess I didn't know how it would execute with sasfile. The following (once sasfile loads) executes in 4.1 seconds:

sasfile.large_table load;

proc sql;

create table results as

select *

from large_table

where record_id in (select record_id from small_table);

quit;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 6 replies
  • 1281 views
  • 3 likes
  • 5 in conversation