I have a list of of about 20,000 ids (length $20.). There is a SQL Server database table on a server somewhere with about 2 billion records. I need to get all the records from that table for my list of ids. I work on a 64-bit PC and connect to the database with an OLEDB libname A. I tried the following code:
proc sql;
create table NEW as select * from id_list i, A.bigTable t where i.id = t.id;
quit;
This took forever to run and failed because my server connection timed out at 8 hours. So I added a variable to id_list (obsnum=_n_;) and modified the syntax to:
create table NEW as select * from id_list (where=(1 le obsnum le 1000)) i, A.bigTable t where i.id = t.id;
This took 12 seconds (real time) to run. So I wrote a macro to do this and ran it 20 times, giving it a new thousand-id range each time.
Then I thought, why not try 2000 ids at a time, instead of 1000. 30 minutes later, this query was not complete so I stopped it.
My question is: what accounts for the huge increase in time when I try to get records for more than 1000 ids at a time? Is there a better method to go against a very large table to find records?
How you connect may be an issue but one of the first things is only select the variables you need.
The selection may be quicker as pass-through as well.
Is it possible that you have missing IDs after first 1000 IDs?
Can you create a table in the SQL database to store the list?
Otherwise you might be better off using the SAS dataset to generate WHERE clauses against the large table.
Here is an example. You might have to adjust if your ID variable is character or if it is too long to fit 1000 values into a single macro variable.
proc sql noprint ;
select distinct id into :idlist1 separated by ',' from id_list (firstobs=1 obs=1000) ;
select distinct id into :idlist2 separated by ',' from id_list (firstobs=1001 obs=2000) ;
create table NEW as select * from A.BIGTABLE
where id in (&idlist1,&idlist2)
;
quit;
You'd better create a INDEX for that prime key variable for a large table .and try to use PASS-Through SQL ,let SQL Server to do such heavy work.
Is in your libname pointing to the SQL Server option "MULTI_DATASRC_OPT=IN_CLAUSE" defined?
Use below option in your code to see in the SAS log what code actually gets sent to the dbms
OPTIONS SASTRACE=',,,d' SASTRACELOC=SASLOG NOSTSUFFIX;
Is the code executed for 1000 and 2000 obs at a time the same? Just thinking that may-be you're exceeding with 2000 id's some threshold where SAS needs to generate different SQL code.
My guess is that you are moving the whole source table to SAS for the join.
Try the DBKEY= ds option for the cross platform join.
If that doesn't work for you, consider to use a temporary table in SQLS so you are sure that the whole join will take place where the data is.
As said, be sure that the id column is indexed.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.