Edoedoedo
Pyrite | Level 9

Hi,

I'm writing a program which outputs (at the end) one huge dataset on customers, with 350 columns and 20 million rows.

During its execution, this program creates:

- a "mother table", let's say T1, with ID unique primary key and some fields. These IDs are the whole universe of customers IDs (20millions); so this table is like 10 columns and 20millions rows;

- several "child tables", let's say T2....T30, with ID unique primary key and some fields each. Every child table has only a subset of the IDs of T1, varying e.g. from 1% to 99% of the whole universe; I mean that for example T2 can be like 10 columns and 10.000 rows or 10 columns and 19million rows. This is not predictable. Anyway, the primary key ID is always unique and always included in table T1.

At the end, these tables must be joined to generate one output dataset: T1 is the "base table", and every T2...T30 is LEFT JOINed to T1 on the unique primary key ID.

So:

proc sql;
    CREATE TABLE FINAL AS
    SELECT
        T1.ID,
        T1.FIELD1,
        ...
        T1.FIELD10,
        T2.FIELD1,
        ...
        T2.FIELD10,
        ...
        T30.FIELD1,
        ...
        T30.FIELD10
    FROM T1
        LEFT JOIN T2 ON T1.ID = T2.ID
        ...
        LEFT JOIN T30 ON T1.ID = T30.ID
    ;
quit;

This works and produces a 15 GB table. However, this final proc sql takes 5 hours! And it creates a temporary sas7butl utility file in the WORK directory bigger than 500 GB!

Some details:

  • ID is a numeric field with 9 digits
  • all the left joins are one-to-one (ID is a unique primary key)
  • every dataset T1, T2, ..., T30 is already sorted by ID (before being used in the final proc sql)
  • every dataset T1, T2, ..., T30 is indexed on ID
  • the _method option on proc sql shows this:

NOTE: SQL execution methods chosen are:
      sqxcrta
          sqxfil
              sqxjm
                  sqxsrc( T30 )
                  sqxsort
                      sqxjm
                          sqxsrc( T29 )
                          sqxsort
                              sqxjm
                                  .....
                                  sqxsort
                                      sqxjm
                                          sqxsrc( T2 )
                                          sqxsrc( T1 )

The question is: why does it appear to be so slow and so resource-consuming? From the _method tree it looks like SQL performs a sort (sqxsort) before every merge join (sqxjm), even though all the inputs are already sorted by ID. How could it be optimized? Are there any "hints" you may suggest?



Thank you very much.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User


View solution in original post

7 REPLIES
RW9
Diamond | Level 26

The SQL engine in SAS is apparently quite slow with large amounts of data; others can explain why. Why not just use a DATA step merge:

data final;
     /* one-to-one merge; every table must be sorted by ID */
     merge t1 t2 t3 /* ... */ t30;
     by id;
run;

Is there a reason you need to use SQL? Also, in your example the fields from the second table will be dropped, as they have the same names as in T1:

SELECT
    T1.ID,
    T1.FIELD1,
    ...
    T1.FIELD10,
    T2.FIELD1,

E.g. FIELD1 comes from both T1 and T2.
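
The same trap exists with a DATA step merge, by the way: same-named variables from later datasets silently overwrite the earlier ones. A minimal sketch of one way around it, using rename= dataset options (the t2_/t30_ prefixes are just an illustrative naming scheme, not anything from the original program):

data final;
     merge t1
           t2  (rename=(field1=t2_field1  field10=t2_field10))
           /* ... same pattern for t3-t29 ... */
           t30 (rename=(field1=t30_field1 field10=t30_field10));
     by id;
run;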

Another point: is it a good idea to have a dataset with 350 columns? Maybe there are better layouts - a normalised view, or breaking the dataset up.

Kurt_Bremser
Super User

PROC SQL throws everything from the input data into a big utility file that almost literally includes the oft-named kitchen sink, and works from there. This causes a very large number of random disk accesses that slow down the whole process. If several users do that at the same time, the WORK area can become so congested that the server slows to a virtual halt.

A combination of SORTs and DATA steps works mostly in a sequential fashion and lets the storage operate much more efficiently. If you can also make sure that MERGEs use physically separate libraries for input and output, you'll spread the load and increase performance even more.
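
For illustration, the separate-libraries idea could look roughly like this (the /disk1 and /disk2 paths are placeholders for mount points on physically different storage):

/* hypothetical paths on physically separate storage */
libname indata  "/disk1/project";
libname outdata "/disk2/project";

data outdata.final;
     merge indata.t1 indata.t2 /* ... */ indata.t30;
     by id;
run;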

At one point I compared a sequence of data steps and proc sorts, which was given to us by a SAS consultant, with a PROC SQL I had written to perform the same logical task. My SQL was about 20 lines, his code was 2 (A4) pages. Mine took 4 hours, his 20 minutes. Since then I only use SQL when I really need it (cartesian product).

KachiM
Rhodochrosite | Level 12

If you want the job done in the least run time and are willing to write code, try hash objects. Still, key indexing may be the best method, provided you have enough memory to hold a big array.

Check this code :

data _null_;
     /* a quick test: does a temporary array this big fit in memory? */
     array m[999999999] _temporary_;
run;

The nine 9s are the maximum of your ID. If your ID has a smaller maximum, replace the nines with that value and run the code above. If you are lucky, the compilation will go through (meaning the array fits in memory).
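
For the hash route, a minimal sketch for a single child table (field names borrowed from the example above; you would repeat the pattern for T3...T30, and every table loaded this way has to fit in memory):

data final;
     if _n_ = 1 then do;
          if 0 then set t2(keep=id field1 field10); /* get t2's variable attributes into the PDV */
          declare hash h2(dataset:'t2');            /* load t2 into memory */
          h2.definekey('id');
          h2.definedata('field1','field10');        /* fields to copy on a match */
          h2.definedone();
     end;
     set t1;                                        /* read the mother table once, sequentially */
     /* no match in t2: clear its fields so values don't carry over from the previous row */
     if h2.find() ne 0 then call missing(field1, field10);
run;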

Kurt_Bremser
Super User

My first answer was a reaction just to the title of your post.

After now going through its content, I can only emphasize what I already said:

Sort all tables by ID (with PROC SORT) and do one merge in a data step to create your "monster" table. You will be impressed by the performance.
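
Spelled out, that pattern is roughly the following sketch (%sort_all is just a hypothetical helper macro; the sorts can be skipped for tables already in ID order):

/* sort every table by ID, then do one sequential merge */
%macro sort_all;
     %do i = 1 %to 30;
          proc sort data=t&i;
               by id;
          run;
     %end;
%mend sort_all;
%sort_all

data final;
     merge t1 t2 t3 /* ... */ t30;  /* one-to-one merge on the unique key */
     by id;
run;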

Ksharp
Super User

You don't have a WHERE clause in your SQL to filter observations. But a better way is to use a hash table.

Edoedoedo
Pyrite | Level 9

I AM impressed. Really.

I rewrote that proc sql as a data step merge as you suggested, and the very same output (20 million x 350), which took 5 hours and 500 GB of temp space before, is now done in 10 minutes without any temp space used.

10 minutes! Unbelievable!

Thank you guys for all your support.

P.S.: so proc sql really sucks... it's a pity, because unlike true SAS code, SQL is well understood among DBMS-minded people.

RW9
Diamond | Level 26

Just to clarify: proc sql does not suck. It is very useful and has a variety of uses. The simple fact is that the underlying concept of SAS is not a relational database, and SQL was built to deal with relational databases. Again, it's a case of using something in a way it was not intended. In your case you are dealing with large volumes of data, so you should be looking at data warehouses/marts and at processes and tools designed for such tasks.

