Help using Base SAS procedures

Why is proc sql so slow in joins?

Accepted Solution Solved
Reply
Contributor
Posts: 48
Accepted Solution

Why is proc sql so slow in joins?

Hi,

I'm writing a program which outputs (at the end) one huge dataset on customers with 350 columns and 20millions rows.

During its execution, this program creates:

- a "mother table", let's say T1, with ID unique primary key and some fields. These IDs are the whole universe of customers IDs (20millions); so this table is like 10 columns and 20millions rows;

- several "child tables", let's say T2....T30, with ID unique primary key and some fields each. Every child table has only a subset of the IDs of T1, varying e.g. from 1% to 99% of the whole universe; I mean that for example T2 can be like 10 columns and 10.000 rows or 10 columns and 19million rows. This is not predictable. Anyway, the primary key ID is always unique and always included in table T1.

At the end, these tables must be joined to generate one output dataset: T1 is the "base table", and every T2...T30 is LEFT JOINED with T1 on their unique primary key ID.

So:

proc sql;

CREATE TABLE FINAL AS

SELECT

T1.ID,

T1.FIELD1,

...

T1.FIELD10,

T2.FIELD1,

...

T2.FIELD10,

...

...

T30.FIELD1,

...

T30.FIELD10

FROM

T1

LEFT JOIN T2 ON T1.ID = T2.ID

...

LEFT JOIN T30 ON T1.ID = T30.ID

;

quit;

This works and produces a 15GB table. However, this final proc sql lasts 5 hours! And it creates a temporary sas7butl file in the work directory bigger than 500GB!

Things to say:

  • ID is a numeric field with 9 digits
  • all the left joins are one-to-one (ID is unique primary key)
  • every dataset T1,T2,...,T30 is already sorted by ID (before being used in the final proc sql)
  • every dataset T1,T2,...,T30 is indexed on ID
  • using option "_method" on proc sql it shows this:

NOTE: SQL execution methods chosen are:

      sqxcrta

          sqxfil

              sqxjm

                  sqxsrc( T30 )

                  sqxsort

                      sqxjm

                          sqxsrc( T29 )

                          sqxsort

                              sqxjm

                                  .....

                                  sqxsort

                                      sqxjm

                                          sqxsrc( T2 )

                                          sqxsrc( T1 )

Question is: why it appears to be so slow and so resource-consuming? How could it be optimized? Are there any "hints" you may suggest?



Thank you very much.


Accepted Solutions
Solution
‎06-09-2015 10:29 AM
Super User
Posts: 7,854

Re: Why is proc sql so slow in joins?

PROC SQL throws everything from the input data into a big utility file that almost literally includes the oft-named kitchen sink and works from there. Due to this, it causes a very big number of random disk accesses that slow down the whole process. If several users do that at the same time, the WORK area can become so congested that the server slows down to a virtual halt.

A combination of SORTs and DATA steps works mostly in sequential fashion and allows the storage to work more optimally. If you also can make sure that MERGEs use physically separate libraries for input and output, you'll spread the load and increase performance even more.

At one point I compared a sequence of data steps and proc sorts, which was given to us by a SAS consultant, with a PROC SQL I had written to perform the same logical task. My SQL was about 20 lines, his code was 2 (A4) pages. Mine took 4 hours, his 20 minutes. Since then I only use SQL when I really need it (cartesian product).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers

View solution in original post


All Replies
Super User
Super User
Posts: 7,988

Re: Why is proc sql so slow in joins?

The SQL engine in SAS is apparently quite slow with large amounts of data, others can explain why.  Why not just use a datastep merge:

data final;

     merge t1 t2 t3...;

     by id;

run;

Is there a reason you need to use SQL?  Also, in your example the fields from the second table will be dropped as they are the same name as in T1:

  1. SELECT 
  2. T1.ID, 
  3. T1.FIELD1, 
  4. ... 
  5. T1.FIELD10, 
  6. T2.FIELD1, 

E.g. FIELD comes from T1 and T2.

Another point, is it a good idea to have dataset with 350 columns?  Maybe there are better layouts - normalised view, or breaking the dataset up.

Solution
‎06-09-2015 10:29 AM
Super User
Posts: 7,854

Re: Why is proc sql so slow in joins?

PROC SQL throws everything from the input data into a big utility file that almost literally includes the oft-named kitchen sink and works from there. Due to this, it causes a very big number of random disk accesses that slow down the whole process. If several users do that at the same time, the WORK area can become so congested that the server slows down to a virtual halt.

A combination of SORTs and DATA steps works mostly in sequential fashion and allows the storage to work more optimally. If you also can make sure that MERGEs use physically separate libraries for input and output, you'll spread the load and increase performance even more.

At one point I compared a sequence of data steps and proc sorts, which was given to us by a SAS consultant, with a PROC SQL I had written to perform the same logical task. My SQL was about 20 lines, his code was 2 (A4) pages. Mine took 4 hours, his 20 minutes. Since then I only use SQL when I really need it (cartesian product).

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super Contributor
Posts: 298

Re: Why is proc sql so slow in joins?

If you want the job to be done in the least run-time and willing to write code, try hash objects. Still key indexing must be the best method provided you have memory to hold a big array.

Check this code :

data _null_;

     array m[999999999] _temporary_;

run;

The 9 nines are the maximum of your ID.

If your ID has a lesser maximum, replace the nines by that, and run the above code. If you are lucky, the compilation will go through.

Super User
Posts: 7,854

Re: Why is proc sql so slow in joins?

My first answer was given just to the title of your post.

After now going through it's content, I can only emphasize what I already said:

Sort all tables by ID (with PROC SORT) and do one merge in a data step to create your "monster" table. You will be impressed by the performance.

---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Super User
Posts: 10,044

Re: Why is proc sql so slow in joins?

You don't add WHERE clause into your SQL to filter obs  . But better way is using Hash Table.

Contributor
Posts: 48

Re: Why is proc sql so slow in joins?

I AM impressed. Really.

I rewrote such a proc sql with a data merge step as you suggested, and the very same output (20millions x 350) which took 5 hours and 500GB of temp space now is done in 10 minutes and without any temp space used.

10 minutes! Unbelievable!

Thank you guys for all your support.

P.S.: so proc sql really sucks... it's a pity because unlike sas true code, sql is well-understood among dbms fond people.

Super User
Super User
Posts: 7,988

Re: Why is proc sql so slow in joins?

Just to clarify, proc sql does not suck.  It is very useful, and has a variety of uses.  The simple fact is that the underlying concept of SAS is not a relational database and SQL was built to deal with relational databases.  Again, its using something in a way it was not intended.  In your case you are dealing with large volumes of data, so you should be looking at datawarehouses/marts and processes and tools designed for such tasks.

🔒 This topic is solved and locked.

Need further help from the community? Please ask a new question.

Discussion stats
  • 7 replies
  • 4284 views
  • 6 likes
  • 5 in conversation