10-14-2011 05:40 PM
Does anyone have any experience on running Proc sql on very large databases (say healthcare claims DB having 500m-1b records)? Any references to papers will also work.
And any advantage if the DB is teradata?
SAS is at 9.1.3
I am familiar with hash joins (used it once maybe) but am not a technical person to understand the theory behind it. Any guidance would be appreciated.
10-14-2011 05:48 PM
I would consider tables of this size to be small. I frequently interact with tables in SAS that are multiple terrabytes in size. Tens or billions of rows and thousands of columns wide. Your question is too broad to answer more plainly than: Proc SQL should have no problem working with your data, whether it is stored in sas, teradata or any number of other sources.
10-14-2011 05:53 PM
Thank you - just thinking aloud - wouldnt the processing time take up to several hours for proc sql? I know hash join does the same thing in less time.
10-14-2011 06:04 PM
Could you provide a more specific example to what you are asking. Depending on the complexity of the join and other conditions in a query, just because you are dealing with large tables doesn't mean a join will take a substantial amount of time, especially with the proper hardware to support such data.
Earlier today for example, I ran a query that joins together 12 sas tables of about 600million records each and it completed in about 5 hours. For my scheduling, this is more than fast enough, and has a lot of room for improvement if needed.
It is all relative to what you are trying to accomplish...
10-14-2011 06:15 PM
Here is another thought specifically pertaining to hash joins and proc sql, in case I have previously missed what you are trying to ask.
proc sql with use a hash join method for inner joins between tables. you can see this by adding the _method option to your proc sql statement. However it will not use this method for other types of joins. You can manually produce a hash join method using the hash object available to data steps. I do not know how data being stored in an outside rdbms would effect this.
try running your query with the _method option and seeing what the explain plain says it is doing.
proc sql _method;
* your query here;
if you see 'sqxjhsh' then it is using a hash join. if you see 'sqxjm' then it is using a join-merge.
10-15-2011 10:13 PM
The typical approach in dealing with large tables in a DB would be to push execution to the database, reduce the volumes as much as possible and only then transfer the data to SAS for further execution (if this is needed at all).
If you use PROC SQL then the SAS/ACCESS engine will try to convert your SAS SQL syntax to DB SQL syntax and send as much as possible to the DB. Look up options SASTRACE and SASTRACELOC as turning on these options will show you in the log how much of your SAS SQL could be sent to the DB.
I wouldn't use a data step and SAS hash objects if you can avoid it as this means that most of the data needs to be transfered to SAS (a where clause in the set statement will still be sent to the DB and subsetting happens there).
You can also use pass-through SQL (there is even an option in EG4.2 which converts wizard generated SAS SQL into DB SQL in a pass-through block). Pass through SQL means basically that you define within PROC SQL syntax in the DB flavor - and this code is sent 1:1 to the DB.
One of the advantages of doing this is that you can give this code as-is to a DBA for optimisation - or even get a DBA to provide you with optimised code which you then just copy-paste into your SAS program.
And about Teradata: Thinking about all this parallel processing and magic going on in Teradata you should take advantage of it (meaning: have processing in Teradata).
10-17-2011 07:15 PM
As others have mentioned, the details are important, so here is an anecdote. I have a PROC SQL query with two tables with about ~2m rows running on SAS 9.1.3 on a laptop, and I can reduce the time from 4-6 minutes to ~20 seconds (i.e., >90%) by removing DISTINCT from COUNT(). I get the same answer faster using PROC SORT NODUPKEY before PROC SQL.
10-17-2011 10:32 PM
That is just the advantage of using SAS, SQL is not suited for very large table. SQL is designed based on Cartesian Product which will generate incredible the number of join with increasing number of obs.
The only thing promoting speed is to make a index, but it is also spend lots of time to build index.
So For your situation (very large table), recommend to use data step or Hash Table to complete your task.
10-17-2011 10:52 PM
Indexing does no always provide an advantage. In some of the larger table I deal with creating indexs actually hurt performance. There isn't necessarily a rule of thumb to knowing, more of trial and error.
Also the vast majority of databases that I can think of do have hash joins, merge joins, and cartesian products joins available to them and as I mentioned earlier the default join mechanic in proc sql, for an inner join, is the hash method.
10-18-2011 02:57 AM
I know index is not always a good choice, especially for only querying a little obs from large table which
is not as fast as consecutive enter as SAS's set statement.
But SAS is smart enough to decide whether to use index automatically.
Using SAS code is faster than SQL because the principle they are designed is different.
So you will see at some case SQL is faster and at another some case SAS is faster especially for large table.
For some operation on large table, SQL almost can not but SAS can ,that is the advantage of SAS than SQL.
If only consider join table, SAS is also faster than SQL for very large table.
10-18-2011 12:06 PM
Thanks everyone for your opinions. I guess I will have to come up with a very specific scenario rather than a vague one. Hopefully, the assignment with the large tables will come my way to work on - I can elaborate more at that point.
10-20-2011 08:50 AM
be aware that teradata offers a unique set of optimisation options for which SAS is not always the best training ground. I expect that is why most teradata sql programmers need formal training before being "let loose" on full production volumes. Ask if there any teradata sql training courses available to you.
10-26-2011 01:11 PM
Based on what you said, Index are not always useful. Can you suggest me good way for finding Null values from a column? I am using SQL statement in SAS right now. Do you think, using SAS statements will provide me the result faster?
10-27-2011 06:20 AM
If the result (i.e. finding Null values from a column) you need only occupy ten percent or less in a whole large table,
then using index(directly accessing a table) will be faster than SAS statement SET which is continuous accessing table .
So when you query a very large table and the result is greater than 10% of this large table, SAS will not use index (a way to directly accss table) and use a continuous way to access this table (just like SET statement). So whatever you use SQL or SET statement ,they are almost the same fast.
Message was edited by: xia keshanSome Errors found.
11-03-2011 05:19 PM
I have a better idea on the requirement now. The client want to base Information Maps on these very large DB (1-2 tables have 13b records and other tables have 100-800m records). Now, I am aware that Information Maps is not suited for these conditions but I would like to know the community's thoughts.