Hi, I'm writing a program which outputs (at the end) one huge customer dataset with 350 columns and 20 million rows. During its execution, this program creates:
- a "mother table", say T1, with a unique primary key ID and some fields. These IDs are the whole universe of customer IDs (20 million), so this table is roughly 10 columns by 20 million rows;
- several "child tables", say T2...T30, each with a unique primary key ID and some fields. Every child table contains only a subset of the IDs in T1, varying from, e.g., 1% to 99% of the whole universe; for example, T2 might be 10 columns by 10,000 rows, or 10 columns by 19 million rows. This is not predictable. In any case, the primary key ID is always unique and always present in T1.
At the end, these tables must be joined to generate one output dataset: T1 is the "base table", and every T2...T30 is LEFT JOINed to T1 on the unique primary key ID. So:
proc sql;
CREATE TABLE FINAL AS
SELECT
T1.ID,
T1.FIELD1,
...
T1.FIELD10,
T2.FIELD1,
...
T2.FIELD10,
...
...
T30.FIELD1,
...
T30.FIELD10
FROM
T1
LEFT JOIN T2 ON T1.ID = T2.ID
...
LEFT JOIN T30 ON T1.ID = T30.ID
;
quit;
This works and produces a 15 GB table. However, this final PROC SQL takes 5 hours, and it creates a temporary sas7butl utility file in the WORK directory bigger than 500 GB! Some things to note:
- ID is a numeric field with 9 digits;
- all the left joins are one-to-one (ID is a unique primary key);
- every dataset T1, T2, ..., T30 is already sorted by ID before entering the final PROC SQL;
- every dataset T1, T2, ..., T30 is indexed on ID.
Using the "_method" option on PROC SQL, it shows this:
NOTE: SQL execution methods chosen are:
sqxcrta
sqxfil
sqxjm
sqxsrc( T30 )
sqxsort
sqxjm
sqxsrc( T29 )
sqxsort
sqxjm
.....
sqxsort
sqxjm
sqxsrc( T2 )
sqxsrc( T1 )
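For context, since every dataset is already sorted by ID and every join is one-to-one with T1 covering the whole ID universe, the same output could in principle be built with a single DATA step MERGE instead of PROC SQL, reading each table once in sequence with no re-sorting and no utility file. This is only a sketch: it assumes the real column names are distinct across tables (as the SELECT list above implies), and the dataset names T2-T30 are taken literally from the example.

```sas
/* One sequential pass over all 30 pre-sorted tables: no re-sorting,
   no WORK utility file. IN=a keeps the full T1 universe, mimicking
   the LEFT JOINs. Assumes column names are distinct across tables. */
data FINAL;
  merge T1 (in=a)
        T2 - T30;     /* numbered range list for T2 through T30 */
  by ID;
  if a;               /* keep only IDs present in the base table T1 */
run;
```

Since every child table's IDs are a subset of T1, the "if a;" filter is technically redundant here, but it makes the LEFT JOIN semantics explicit.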
The question is: why does it appear to be so slow and so resource-consuming? How could it be optimized? Are there any "hints" you can suggest? Thank you very much.