Edoedoedo
Pyrite | Level 9

Hi,

I'm writing a program which outputs (at the end) one huge dataset on customers, with 350 columns and 20 million rows.

During its execution, this program creates:

- a "mother table", let's say T1, with ID unique primary key and some fields. These IDs are the whole universe of customers IDs (20millions); so this table is like 10 columns and 20millions rows;

- several "child tables", let's say T2....T30, with ID unique primary key and some fields each. Every child table has only a subset of the IDs of T1, varying e.g. from 1% to 99% of the whole universe; I mean that for example T2 can be like 10 columns and 10.000 rows or 10 columns and 19million rows. This is not predictable. Anyway, the primary key ID is always unique and always included in table T1.

At the end, these tables must be joined to generate one output dataset: T1 is the "base table", and every T2...T30 is LEFT JOINed to T1 on the unique primary key ID.

So:

proc sql;
    CREATE TABLE FINAL AS
    SELECT
        T1.ID,
        T1.FIELD1,
        ...
        T1.FIELD10,
        T2.FIELD1,
        ...
        T2.FIELD10,
        ...
        T30.FIELD1,
        ...
        T30.FIELD10
    FROM T1
        LEFT JOIN T2 ON T1.ID = T2.ID
        ...
        LEFT JOIN T30 ON T1.ID = T30.ID
    ;
quit;

This works and produces a 15 GB table. However, this final proc sql takes 5 hours! And it creates a temporary sas7butl utility file in the WORK directory bigger than 500 GB!

Some details:

  • ID is a numeric field with 9 digits
  • all the left joins are one-to-one (ID is a unique primary key)
  • every dataset T1, T2, ..., T30 is already sorted by ID (before being used in the final proc sql)
  • every dataset T1, T2, ..., T30 is indexed on ID
  • the _method option on proc sql shows this:

NOTE: SQL execution methods chosen are:
      sqxcrta
          sqxfil
              sqxjm
                  sqxsrc( T30 )
                  sqxsort
                      sqxjm
                          sqxsrc( T29 )
                          sqxsort
                              sqxjm
                                  .....
                                  sqxsort
                                      sqxjm
                                          sqxsrc( T2 )
                                          sqxsrc( T1 )

The question is: why does it appear to be so slow and so resource-consuming? From the _method tree it looks like SQL performs a sort (sqxsort) before every merge join (sqxjm), even though all the inputs are already sorted by ID. How could it be optimized? Are there any "hints" you may suggest?



Thank you very much.

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User


View solution in original post

7 REPLIES
RW9
Diamond | Level 26

The SQL engine in SAS is apparently quite slow with large amounts of data; others can explain why. Why not just use a DATA step merge:

data final;
     /* one-to-one merge; every table must be sorted by ID */
     merge t1 t2 t3 /* ... */ t30;
     by id;
run;

Is there a reason you need to use SQL? Also, in your example the fields from the second table will be dropped, as they have the same names as in T1:

SELECT
    T1.ID,
    T1.FIELD1,
    ...
    T1.FIELD10,
    T2.FIELD1,

E.g. FIELD1 comes from both T1 and T2.
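
The same trap exists with a DATA step merge, by the way: same-named variables from later datasets silently overwrite the earlier ones. A minimal sketch of one way around it, using rename= dataset options (the t2_/t30_ prefixes are just an illustrative naming scheme, not anything from the original program):

data final;
     merge t1
           t2  (rename=(field1=t2_field1  field10=t2_field10))
           /* ... same pattern for t3-t29 ... */
           t30 (rename=(field1=t30_field1 field10=t30_field10));
     by id;
run;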

Another point: is it a good idea to have a dataset with 350 columns? Maybe there are better layouts - a normalised view, or breaking the dataset up.

Kurt_Bremser
Super User

PROC SQL throws everything from the input data into a big utility file that almost literally includes the oft-named kitchen sink, and works from there. This causes a very large number of random disk accesses that slow down the whole process. If several users do that at the same time, the WORK area can become so congested that the server slows to a virtual halt.

A combination of SORTs and DATA steps works mostly in a sequential fashion and lets the storage operate much more efficiently. If you can also make sure that MERGEs use physically separate libraries for input and output, you'll spread the load and increase performance even more.
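
For illustration, the separate-libraries idea could look roughly like this (the /disk1 and /disk2 paths are placeholders for mount points on physically different storage):

/* hypothetical paths on physically separate storage */
libname indata  "/disk1/project";
libname outdata "/disk2/project";

data outdata.final;
     merge indata.t1 indata.t2 /* ... */ indata.t30;
     by id;
run;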

At one point I compared a sequence of data steps and proc sorts, which was given to us by a SAS consultant, with a PROC SQL I had written to perform the same logical task. My SQL was about 20 lines, his code was 2 (A4) pages. Mine took 4 hours, his 20 minutes. Since then I only use SQL when I really need it (cartesian product).

KachiM
Rhodochrosite | Level 12

If you want the job done in the least run time and are willing to write code, try hash objects. Still, key indexing may be the best method, provided you have enough memory to hold a big array.

Check this code :

data _null_;
     /* a quick test: does a temporary array this big fit in memory? */
     array m[999999999] _temporary_;
run;

The nine 9s are the maximum of your ID. If your ID has a smaller maximum, replace the nines with that value and run the code above. If you are lucky, the compilation will go through (meaning the array fits in memory).
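
For the hash route, a minimal sketch for a single child table (field names borrowed from the example above; you would repeat the pattern for T3...T30, and every table loaded this way has to fit in memory):

data final;
     if _n_ = 1 then do;
          if 0 then set t2(keep=id field1 field10); /* get t2's variable attributes into the PDV */
          declare hash h2(dataset:'t2');            /* load t2 into memory */
          h2.definekey('id');
          h2.definedata('field1','field10');        /* fields to copy on a match */
          h2.definedone();
     end;
     set t1;                                        /* read the mother table once, sequentially */
     /* no match in t2: clear its fields so values don't carry over from the previous row */
     if h2.find() ne 0 then call missing(field1, field10);
run;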

Kurt_Bremser
Super User

My first answer was a reaction just to the title of your post.

After now going through its content, I can only emphasize what I already said:

Sort all tables by ID (with PROC SORT) and do one merge in a data step to create your "monster" table. You will be impressed by the performance.
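
Spelled out, that pattern is roughly the following sketch (%sort_all is just a hypothetical helper macro; the sorts can be skipped for tables already in ID order):

/* sort every table by ID, then do one sequential merge */
%macro sort_all;
     %do i = 1 %to 30;
          proc sort data=t&i;
               by id;
          run;
     %end;
%mend sort_all;
%sort_all

data final;
     merge t1 t2 t3 /* ... */ t30;  /* one-to-one merge on the unique key */
     by id;
run;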

Ksharp
Super User

You don't have a WHERE clause in your SQL to filter observations. But a better way is to use a hash table.

Edoedoedo
Pyrite | Level 9

I AM impressed. Really.

I rewrote that proc sql as a data step merge as you suggested, and the very same output (20 million x 350), which took 5 hours and 500 GB of temp space before, is now done in 10 minutes without any temp space used.

10 minutes! Unbelievable!

Thank you guys for all your support.

P.S.: so proc sql really sucks... it's a pity, because unlike true SAS code, SQL is well understood among DBMS-minded people.

RW9
Diamond | Level 26

Just to clarify: proc sql does not suck. It is very useful and has a variety of uses. The simple fact is that the underlying concept of SAS is not a relational database, and SQL was built to deal with relational databases. Again, it's a case of using something in a way it was not intended. In your case you are dealing with large volumes of data, so you should be looking at data warehouses/marts and at processes and tools designed for such tasks.

