BookmarkSubscribeRSS Feed
saspert
Pyrite | Level 9

Hi,

Does anyone have any experience on running Proc sql on very large databases (say healthcare claims DB having 500m-1b records)? Any references to papers will also work.

And any advantage if the DB is teradata?

SAS is at 9.1.3

I am familiar with hash joins (used it once maybe) but am not a technical person to understand the theory behind it. Any guidance would be appreciated.

Thanks,
saspert.

14 REPLIES 14
FriedEgg
SAS Employee

I would consider tables of this size to be small.  I frequently interact with tables in SAS that are multiple terrabytes in size.  Tens or billions of rows and thousands of columns wide.  Your question is too broad to answer more plainly than: Proc SQL should have no problem working with your data, whether it is stored in sas, teradata or any number of other sources.

saspert
Pyrite | Level 9

Thank you - just thinking aloud - wouldnt the processing time take up to several hours for proc sql? I know hash join does the same thing in less time.

FriedEgg
SAS Employee

Could you provide a more specific example to what you are asking.  Depending on the complexity of the join and other conditions in a query, just because you are dealing with large tables doesn't mean a join will take a substantial amount of time, especially with the proper hardware to support such data.

Earlier today for example, I ran a query that joins together 12 sas tables of about 600million records each and it completed in about 5 hours.  For my scheduling, this is more than fast enough, and has a lot of room for improvement if needed.

It is all relative to what you are trying to accomplish...

FriedEgg
SAS Employee

Here is another thought specifically pertaining to hash joins and proc sql, in case I have previously missed what you are trying to ask.

proc sql with use a hash join method for inner joins between tables.  you can see this by adding the _method option to your proc sql statement.  However it will not use this method for other types of joins.  You can manually produce a hash join method using the hash object available to data steps.  I do not know how data being stored in an outside rdbms would effect this.

try running your query with the _method option and seeing what the explain plain says it is doing.

proc sql _method;

* your query here;

quit;

if you see 'sqxjhsh' then it is using a hash join. if you see 'sqxjm' then it is using a join-merge.

Patrick
Opal | Level 21

The typical approach in dealing with large tables in a DB would be to push execution to the database, reduce the volumes as much as possible and only then transfer the data to SAS for further execution (if this is needed at all).

If you use PROC SQL then the SAS/ACCESS engine will try to convert your SAS SQL syntax to DB SQL syntax and send as much as possible to the DB. Look up options SASTRACE and SASTRACELOC as turning on these options will show you in the log how much of your SAS SQL could be sent to the DB.

I wouldn't use a data step and SAS hash objects if you can avoid it as this means that most of the data needs to be transfered to SAS (a where clause in the set statement will still be sent to the DB and subsetting happens there).

You can also use pass-through SQL (there is even an option in EG4.2 which converts wizard generated SAS SQL into DB SQL in a pass-through block). Pass through SQL means basically that you define within PROC SQL syntax in the DB flavor - and this code is sent 1:1 to the DB.

One of the advantages of doing this is that you can give this code as-is to a DBA for optimisation - or even get a DBA to provide you with optimised code which you then just copy-paste into your SAS program.

And about Teradata: Thinking about all this parallel processing and magic going on in Teradata you should take advantage of it (meaning: have processing in Teradata).

HTH

Patrick

AndrewZ
Quartz | Level 8

As others have mentioned, the details are important, so here is an anecdote.  I have a PROC SQL query with two tables with about ~2m rows running on SAS 9.1.3 on a laptop, and I can reduce the time from 4-6 minutes to ~20 seconds (i.e., >90%) by removing DISTINCT from COUNT().  I get the same answer faster using PROC SORT NODUPKEY before PROC SQL.

Ksharp
Super User

That is just the advantage of using SAS, SQL is not suited for very large table. SQL is designed based on Cartesian Product which will generate incredible the number of join with increasing number of obs.

The only thing promoting speed is to make a index, but it is also spend lots of time to build index.

So For your situation (very large table), recommend to use data step or Hash Table to complete your task.

Ksharp

FriedEgg
SAS Employee

Indexing does no always provide an advantage.  In some of the larger table I deal with creating indexs actually hurt performance.  There isn't necessarily a rule of thumb to knowing, more of trial and error.

Also the vast majority of databases that I can think of do have hash joins, merge joins, and cartesian products joins available to them and as I mentioned earlier the default join mechanic in proc sql, for an inner join, is the hash method.

Ksharp
Super User

I know index is not always a good choice, especially for only querying a little obs from large table which

is not as fast as consecutive enter as SAS's set statement.

But SAS is smart enough to decide whether to use index automatically.

Using SAS code is faster than SQL because the principle they are designed is different.

So you will see at some case SQL is faster and at another some case SAS is faster especially for large table.

For some operation on large table, SQL  almost can not but SAS can ,that is the advantage of SAS than SQL.

If only consider join table, SAS is also faster than SQL for very large table.

Ksharp

saspert
Pyrite | Level 9

Thanks everyone for your opinions. I guess I will have to come up with a very specific scenario rather than a vague one. Hopefully, the assignment with the large tables will come my way to work on - I can elaborate more at that point.

Regards,
saspert.

Peter_C
Rhodochrosite | Level 12

be aware that teradata offers a unique set of optimisation options for which SAS is not always the best training ground. I expect that is why most teradata sql programmers need formal training before being "let loose" on full production volumes. Ask if there any teradata sql training courses available to you.

Pritish
Quartz | Level 8

Ksharp,

Based on what you said, Index are not always useful. Can you suggest me good way for finding Null values from a column? I am using SQL statement in SAS right now. Do you think, using SAS statements will provide me the result faster?

Ksharp
Super User

Hi.

If the result (i.e. finding Null values from a column) you need only occupy ten percent or less in a whole large table,

then using index(directly accessing a table) will be faster than SAS statement SET which is continuous accessing table .

So when you query a very large table and the result is greater than 10% of this large table, SAS will not use index (a way to directly accss table) and use a continuous way to access this table (just like SET statement). So whatever you use SQL or SET statement ,they are almost the same fast.

Ksharp

Message was edited by: xia keshanSome Errors found.

saspert
Pyrite | Level 9

Hi,

I have a better idea on the requirement now. The client want to base Information Maps on these very large DB (1-2 tables have 13b records and other tables have 100-800m records). Now, I am aware that Information Maps is not suited for these conditions but I would like to know the community's thoughts.

Thanks,

saspert

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 14 replies
  • 4691 views
  • 0 likes
  • 7 in conversation