topic Re: data step in SAS Programming

data step

kds — Mon, 10 Feb 2014 13:26:59 GMT

Hi SAS Experts,

I have a data step:

data x_final(sortedby=c1 c2 c3 c4 index=(d1 c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17));
set x_updated;
by c1 c2 c3 c4;
run;

This data step takes around 3 hrs if x_updated has around 20 million records. If x_updated has around 0.25 million records, then data step takes only 5 minutes to complete.
My question to experts is that ->
Is there any way by which I can reduce execution hours even when source dataset has greater than or equal to 20 million records?

Thanks

Re: data step

LinusH — Mon, 10 Feb 2014 13:40:48 GMT

It often comes down to how you are setting up your data model. No clues are gven here about user requirements, search patterns, update frequency etc.

Physically, do you really need so many indexes? Do they all have at least more that 10 distinct values with normal distribution?

Second, if x_final (specified as work table here, but it can't be, right?) is in a SAS Base engine, it will perform much better if you apply the indexes after the table creation.

Consider to use a SPDE libname instead, it will update the indexes in parallel, using multi-threading.

Do you need to recreate the whole table each time? Consider a append/insert approach.

Re: data step

Astounding — Mon, 10 Feb 2014 14:24:43 GMT

Well, it's always possible that your actual DATA step is more complex than the one you have shown here. But looking at just this code, you don't really need a DATA step at all. The observations and data values in X_FINAL are identical to X_UPDATED. You could just use X_UPDATED and build your indexes (hate that word!) using a different tool that doesn't need to read in each observation individually the way a DATA step does.

Looking at the number of variables for which you are creating an index, it concerns me that the programs that use the indexes might be less efficient as well. But you haven't posted any of that code so it's difficult to judge.