topic Re: working with large data sets in SAS Procedures

working with large data sets

skipper — Mon, 30 Apr 2012 13:31:33 GMT

HI, I have code to compute 'N' means, in my code N=200.

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

%macro compute;

data data_set;

set data_set;

%do n=1 %to &cx -1 ;

r&n = (X - &mx)*(lag&n(X) - &mx) ;

%end;

run;

%mend;

%compute

PROC MEANS DATA=data_set mean;

OUTPUT OUT=want;

run;

**************************************

BUT, in my real data, N=5,000,000 !!

(and the above code becomes to slow with N>20,000.)

How can this be done on such large 'N' ?

Re: working with large data sets

Doc_Duke — Mon, 30 Apr 2012 13:56:35 GMT

Skipper,

Tell us what you are trying to do.

This looks like you are creating "N" new columns in data_set.

%do n=1 %to &cx -1 ;

r&n = (X - &mx)*(lag&n(X) - &mx) ;

%end;

If my reading of your code is correct, SAS will not create a dataset with 5,000,000 columns, so this approach won't work.

You will probably be better off creating a dataset with lots of short rows and grouping somehow, but without knowing more about the goal, it is hard to help.

Doc Muhlbaier

Duke

Re: working with large data sets

skipper — Mon, 30 Apr 2012 16:49:58 GMT

Trying to compute the 'autocorrelations' for time series data set, 5 million observations of a single variable X.

R(T) = SUM ( X - mean(X) ) * ( LagT(X) - Mean(X) ) / (N - T)

must be computed and summed for each lag times T=1,2,3,4,.....,N. hence the 5million columns

I dont want to create 5,000,000 columns, its my poor coding knowledge of a better way.

but i need the compute the above equation R(1), R(2), .... R(5,000,000).

So all i really need is just a two column data set with variables 'X' and 'R(T)' with 5 million observations.

Any ideas how this could be done?

Re: working with large data sets

Reeza — Mon, 30 Apr 2012 17:11:33 GMT

Do you have SAS ETS?

Can you use proc timeseries and then deal with the output from there rather than a datastep and macro code?

Re: working with large data sets

skipper — Mon, 30 Apr 2012 17:33:38 GMT

thanks it seems

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

data data_set;

set data_Set;

obsN = _N_;

run;

proc timeseries data=data_set

out=out

outcorr=timedomain;

id obsN interval=day;

var X;

run;

works well, but it only gives the first 20 rows for the outcorr=timedomain whereas I need all of the rows.

I need it to find R(1), R(2), ... R(200),

Is there some way to specifiy how many rows the 'outcorr=<>' creats in the dataset.

Re: working with large data sets

Astounding — Mon, 30 Apr 2012 17:36:21 GMT

skipper,

First, you have to realize you will need some powerful hardware. This problem requires trillions of computations. Assuming you have room to store the results, this approach creates a data set with trillions of observations but only 2 variables: T and the contribution to R(T). From that point, the data will need to get summed up.

Compute your macro variables as before (&MX and &CX). Then take each relevant pair of observations:

data want (keep=T RT);

set have (keep=x);
if _n_ > 1 then do i = 1 to _n_ - 1;

set have (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

I think I got the formulas right. Naturally, you'll want to test this on a small data set first!

Good luck.

Re: working with large data sets

Astounding — Mon, 30 Apr 2012 17:40:58 GMT

An important afterthought ...

If all you need is the final sum, you don't have to store the trillions of observations. Instead:

data want (keep=T RT) / view=want;

The rest of the DATA step is the same, then

proc means data=want sum;

var RT;

Or, add an OUTPUT statement to the PROC MEANS to save the result in a data set.

But this might not be applicable if you need to save the separate R(T) for each value of T.

Re: working with large data sets

skipper — Mon, 30 Apr 2012 17:48:25 GMT

looks interesting,... but

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want (keep=T RT);

set data_set (keep=x);

if _n_ > 1 then do i = 1 to _n_ - 1;

set data_set (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

creates a table with 19,900 observations!..

it should only have 200 observations, vector X has size 200, so it should be T = 1, 2, 3, 4, .... 200 and R(1), R(2), ... R(200).

Re: working with large data sets

Astounding — Mon, 30 Apr 2012 17:53:10 GMT

Yes, that sounds about right. This is getting the contribution to R(T) for each pair of observations. You still need to sum it up. For your small data set, you could use:

proc means data=want sum;

var RT;

class T;

run;

With larger numbers of T values, it won't be so easy.

Re: working with large data sets

skipper — Mon, 30 Apr 2012 17:56:36 GMT

I dont understand, like

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want2 (keep=T RT) / view=want2;

set data_set (keep=x);

if _n_ > 1 then do i = 1 to _n_ - 1;

set data_set (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

proc means data=want2 sum;

var RT;

produces only 123 rows? it should be R(1) to R(200)?

Re: working with large data sets

Astounding — Mon, 30 Apr 2012 18:01:27 GMT

With 200 records, wouldn't you expect 199 values for T?

But this PROC MEANS should generate only 1 row, with the sum of RT across all values of T.

Re: working with large data sets

skipper — Mon, 30 Apr 2012 18:04:28 GMT

seems to be working, need to run some more checks

Re: working with large data sets

skipper — Mon, 30 Apr 2012 18:21:32 GMT

so this code

data data_set (drop = i);

do i = 1 to 20000;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want (keep=T RT) / view=want;

set data_set (keep=x);

if _n_ > 1 then do i = 1 to _n_ - 1;

set data_set (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

proc means data=want sum;

OUTPUT OUT=want3;

var RT;

class T;

run;

data want4;

set want3;

where _STAT_="MEAN" and T > 0;

keep T RT;

run;

gives me what I want, but it took 6 minutes to compute, and thats only for 20,000 observations, which is not large enough I need at least 2 million.

Re: working with large data sets

Astounding — Mon, 30 Apr 2012 18:35:54 GMT

It will take a while no matter how you do it. Remember how 200 observations morphed into 19,900? You are getting similar expansion for the 20,000 records. The LOG indicates how many observations are being read in by PROC MEANS. As the number of observations increases, PROC MEANS will eventually fail because there will be too many values of T.

Note that you should modify your final PROC MEANS to eliminate the need for a subsequent DATA step:

proc means data=want NWAY;

var RT;

class T;

output out=want3 (keep=T RT) sum=;

run;

But that is not the time-consuming part. The number of calculations increases roughly proportional to the square of the number of observations. Like I said, you will end up in the trillions by the time you are done. I don't know if sorting that many observations would be possible ... if so, it would allow PROC MEANS to switch from a CLASS to a BY statement, which would allow PROC MEANS to run. How long it would take is anybody's guess. But sorting the trillions of records would certainly take longer than running PROC MEANS.

Re: working with large data sets

Astounding — Mon, 30 Apr 2012 18:52:52 GMT

This program ran without error for me:

data _null_;

array t {5000000};

x=1;

run;

It took about 2 minutes, but it ran. That gives you an alternative way to sum up the pieces from the view named WANT:

data final;

array t_totals {4999999};

set want end=done;

t_totals{t} + rt;

if done then do T=1 to 4999999;

RT = t_totals{T};

output;

end;

keep T RT;

run;

This will still crank for a while since it has to process trillions of records. But if the top DATA step works, the bottom one will work as well.

Good luck!

Re: working with large data sets

Astounding — Mon, 30 Apr 2012 19:13:44 GMT

OK, final afterthought.

It should be possible to combine all the processing into a single DATA step (if the array will fit in memory). Defining a temporary array will speed things up:

data want;

array t_totals {4999999} _temporary_;

set have (keep=x) end=done;

if _n_ > 1 then do i = 1 to _n_ - 1;

set have (keep=x rename=(x = lagged_x)) point=i;

T = _n_ - i;

t_totals{T} + (x - &mx) * (lagged_x - &mx) / (&cx - T);

end;

if done;

do T = 1 to 4999999;

RT = t_totals{T};

output;

end;

keep T RT;

run;

OK, I'm pretty sure I'm done here and can't speed it up beyond that.