06-27-2012 08:42 AM
I work with SAS DI Studio 4.2. and I am currently reading through some high level papers on SAS Performance.
These are talking about increasing SQL Join performance by ordering columns the right way - so, what IS the right way to order columns? Let's say there are 2 tables each containing 80 columns and I need to join them by 3 columns #2, #44 and #78. Do I change #2, #44 and #78 to columns #1,#2,#3 before joining?
Another question is about "in job parallelization" - let's say there is 1 Job containing 1 source table, 2 Extracts based on the source table, 1 Append that appends the results of the 2 Extracts and a Table Loader that writes the data into another table.
How would I
- make the 2 Extracts work in parallel via DI Studio
- make this work without blocking the source table
Thanks a lot,
06-27-2012 09:47 AM
Not sure but I would expect this depends on the DBMS. A lot of them maintain table statistics and will anyway optimise your SQL. I don't know if it makes a difference but may be ordering a where clause/join condition along an index or having the column with the fewest distinct values first could help a SQL optimiser to get things right.
Parallelizing JOBS is normally done via scheduler (eg. LSF). If you want to parallelize tasks within a single job then you would use a loop transformation - in your case with the inner job doing the extracts and then the outer job loading the result into target(s). Reading a source twice doesn't lock the table (only writing to as SAS table locks the table if not using SAS/Share). Certain procedures are also able to "multithread" - this is kind of parallel processing within a procedure.
In my experience when working with large volume data the biggest impact on performance has normally:
- reduce volumes as early as possible
- minimise passes through data
- minimise data exchange between SAS and DBMS (and reduce data volumes before exchanging).
- minimise disk I/O and do as much as possible in memory
- allocate enough memory to a job - avoid paging (=more disk I/O).
- investigate and tweak DB settings (eg. the insertbuffer as part of a DB library definition).
06-27-2012 09:53 AM
Thanks a lot for your quick reply!
I guess I have been a bit unspecific here when referring to SQL Joins. I was thinking about joins of SAS Datasets and not Joins of data in DBMS tables. Maybe there are some differences to consider?
06-27-2012 10:04 AM
The one thing which comes to mind is that SAS can only use one index at a time - so you want to make sure that it uses the "best" one.
proc sql FEEDBACK;... will write you in the log how SAS re-shuffels your query.
I've also made the experience that it's sometimes really worth to re-formulate a join condition - especially if there are OR's in it.