Re: Performance Performance!

thomash123 · Posted 06-27-2012 08:42 AM

Hi all,

I work with SAS DI Studio 4.2. and I am currently reading through some high level papers on SAS Performance.

These are talking about increasing SQL Join performance by ordering columns the right way - so, what IS the right way to order columns? Let's say there are 2 tables each containing 80 columns and I need to join them by 3 columns #2, #44 and #78. Do I change #2, #44 and #78 to columns #1,#2,#3 before joining?

Another question is about "in job parallelization" - let's say there is 1 Job containing 1 source table, 2 Extracts based on the source table, 1 Append that appends the results of the 2 Extracts and a Table Loader that writes the data into another table.

How would I

- make the 2 Extracts work in parallel via DI Studio

- make this work without blocking the source table

Thanks a lot,

th

Patrick · Posted 06-27-2012 09:47 AM

Ordering columns:

Not sure but I would expect this depends on the DBMS. A lot of them maintain table statistics and will anyway optimise your SQL. I don't know if it makes a difference but may be ordering a where clause/join condition along an index or having the column with the fewest distinct values first could help a SQL optimiser to get things right.

Parallelizing JOBS is normally done via scheduler (eg. LSF). If you want to parallelize tasks within a single job then you would use a loop transformation - in your case with the inner job doing the extracts and then the outer job loading the result into target(s). Reading a source twice doesn't lock the table (only writing to as SAS table locks the table if not using SAS/Share). Certain procedures are also able to "multithread" - this is kind of parallel processing within a procedure.

In my experience when working with large volume data the biggest impact on performance has normally:

- reduce volumes as early as possible

- minimise passes through data

- minimise data exchange between SAS and DBMS (and reduce data volumes before exchanging).

- minimise disk I/O and do as much as possible in memory

- allocate enough memory to a job - avoid paging (=more disk I/O).

- investigate and tweak DB settings (eg. the insertbuffer as part of a DB library definition).

thomash123 · Posted 06-27-2012 09:53 AM

Thanks a lot for your quick reply!

I guess I have been a bit unspecific here when referring to SQL Joins. I was thinking about joins of SAS Datasets and not Joins of data in DBMS tables. Maybe there are some differences to consider?

th

Patrick · Posted 06-27-2012 10:04 AM

The one thing which comes to mind is that SAS can only use one index at a time - so you want to make sure that it uses the "best" one.

proc sql FEEDBACK;... will write you in the log how SAS re-shuffels your query.

I've also made the experience that it's sometimes really worth to re-formulate a join condition - especially if there are OR's in it.

Performance Performance!