Solved: Re: working with large data sets

skipper · Posted 04-30-2012 09:31 AM

HI, I have code to compute 'N' means, in my code N=200.

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

%macro compute;

data data_set;

set data_set;

%do n=1 %to &cx -1 ;

r&n = (X - &mx)*(lag&n(X) - &mx) ;

%end;

run;

%mend;

%compute

PROC MEANS DATA=data_set mean;

OUTPUT OUT=want;

run;

**************************************

BUT, in my real data, N=5,000,000 !!

(and the above code becomes to slow with N>20,000.)

How can this be done on such large 'N' ?

Astounding · Posted 04-30-2012 02:52 PM

This program ran without error for me:

data _null_;

array t {5000000};

x=1;

run;

It took about 2 minutes, but it ran. That gives you an alternative way to sum up the pieces from the view named WANT:

data final;

array t_totals {4999999};

set want end=done;

t_totals{t} + rt;

if done then do T=1 to 4999999;

RT = t_totals{T};

output;

end;

keep T RT;

run;

This will still crank for a while since it has to process trillions of records. But if the top DATA step works, the bottom one will work as well.

Good luck!

View solution in original post

Doc_Duke · Posted 04-30-2012 09:56 AM

Skipper,

Tell us what you are trying to do.

This looks like you are creating "N" new columns in data_set.

%do n=1 %to &cx -1 ;

r&n = (X - &mx)*(lag&n(X) - &mx) ;

%end;

If my reading of your code is correct, SAS will not create a dataset with 5,000,000 columns, so this approach won't work.

You will probably be better off creating a dataset with lots of short rows and grouping somehow, but without knowing more about the goal, it is hard to help.

Doc Muhlbaier

Duke

skipper · Posted 04-30-2012 12:49 PM

Trying to compute the 'autocorrelations' for time series data set, 5 million observations of a single variable X.

R(T) = SUM ( X - mean(X) ) * ( LagT(X) - Mean(X) ) / (N - T)

must be computed and summed for each lag times T=1,2,3,4,.....,N. hence the 5million columns

I dont want to create 5,000,000 columns, its my poor coding knowledge of a better way.

but i need the compute the above equation R(1), R(2), .... R(5,000,000).

So all i really need is just a two column data set with variables 'X' and 'R(T)' with 5 million observations.

Any ideas how this could be done?

Reeza · Posted 04-30-2012 01:11 PM

Do you have SAS ETS?

Can you use proc timeseries and then deal with the output from there rather than a datastep and macro code?

skipper · Posted 04-30-2012 01:33 PM

thanks it seems

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

data data_set;

set data_Set;

obsN = _N_;

run;

proc timeseries data=data_set

out=out

outcorr=timedomain;

id obsN interval=day;

var X;

run;

works well, but it only gives the first 20 rows for the outcorr=timedomain whereas I need all of the rows.

I need it to find R(1), R(2), ... R(200),

Is there some way to specifiy how many rows the 'outcorr=<>' creats in the dataset.

Astounding · Posted 04-30-2012 01:36 PM

skipper,

First, you have to realize you will need some powerful hardware. This problem requires trillions of computations. Assuming you have room to store the results, this approach creates a data set with trillions of observations but only 2 variables: T and the contribution to R(T). From that point, the data will need to get summed up.

Compute your macro variables as before (&MX and &CX). Then take each relevant pair of observations:

data want (keep=T RT);

set have (keep=x);
if _n_ > 1 then do i = 1 to _n_ - 1;

set have (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

I think I got the formulas right. Naturally, you'll want to test this on a small data set first!

Good luck.

Astounding · Posted 04-30-2012 01:40 PM

An important afterthought ...

If all you need is the final sum, you don't have to store the trillions of observations. Instead:

data want (keep=T RT) / view=want;

The rest of the DATA step is the same, then

proc means data=want sum;

var RT;

Or, add an OUTPUT statement to the PROC MEANS to save the result in a data set.

But this might not be applicable if you need to save the separate R(T) for each value of T.

skipper · Posted 04-30-2012 01:56 PM

I dont understand, like

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want2 (keep=T RT) / view=want2;

set data_set (keep=x);

if _n_ > 1 then do i = 1 to _n_ - 1;

set data_set (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

proc means data=want2 sum;

var RT;

produces only 123 rows? it should be R(1) to R(200)?

Astounding · Posted 04-30-2012 02:01 PM

With 200 records, wouldn't you expect 199 values for T?

But this PROC MEANS should generate only 1 row, with the sum of RT across all values of T.

skipper · Posted 04-30-2012 01:48 PM

looks interesting,... but

data data_set (drop = i);

do i = 1 to 200;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want (keep=T RT);

set data_set (keep=x);

if _n_ > 1 then do i = 1 to _n_ - 1;

set data_set (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

creates a table with 19,900 observations!..

it should only have 200 observations, vector X has size 200, so it should be T = 1, 2, 3, 4, .... 200 and R(1), R(2), ... R(200).

Astounding · Posted 04-30-2012 01:53 PM

Yes, that sounds about right. This is getting the contribution to R(T) for each pair of observations. You still need to sum it up. For your small data set, you could use:

proc means data=want sum;

var RT;

class T;

run;

With larger numbers of T values, it won't be so easy.

skipper · Posted 04-30-2012 02:21 PM

so this code

data data_set (drop = i);

do i = 1 to 20000;

X = rand('NORMAL',0,1);

output;

end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want (keep=T RT) / view=want;

set data_set (keep=x);

if _n_ > 1 then do i = 1 to _n_ - 1;

set data_set (rename=(x = lagged_x)) point=i;

T = _n_ - i;

RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

output;

end;

run;

proc means data=want sum;

OUTPUT OUT=want3;

var RT;

class T;

run;

data want4;

set want3;

where _STAT_="MEAN" and T > 0;

keep T RT;

run;

gives me what I want, but it took 6 minutes to compute, and thats only for 20,000 observations, which is not large enough I need at least 2 million.

Astounding · Posted 04-30-2012 02:35 PM

It will take a while no matter how you do it. Remember how 200 observations morphed into 19,900? You are getting similar expansion for the 20,000 records. The LOG indicates how many observations are being read in by PROC MEANS. As the number of observations increases, PROC MEANS will eventually fail because there will be too many values of T.

Note that you should modify your final PROC MEANS to eliminate the need for a subsequent DATA step:

proc means data=want NWAY;

var RT;

class T;

output out=want3 (keep=T RT) sum=;

run;

But that is not the time-consuming part. The number of calculations increases roughly proportional to the square of the number of observations. Like I said, you will end up in the trillions by the time you are done. I don't know if sorting that many observations would be possible ... if so, it would allow PROC MEANS to switch from a CLASS to a BY statement, which would allow PROC MEANS to run. How long it would take is anybody's guess. But sorting the trillions of records would certainly take longer than running PROC MEANS.

Astounding · Posted 04-30-2012 02:52 PM

This program ran without error for me:

data _null_;

array t {5000000};

x=1;

run;

It took about 2 minutes, but it ran. That gives you an alternative way to sum up the pieces from the view named WANT:

data final;

array t_totals {4999999};

set want end=done;

t_totals{t} + rt;

if done then do T=1 to 4999999;

RT = t_totals{T};

output;

end;

keep T RT;

run;

This will still crank for a while since it has to process trillions of records. But if the top DATA step works, the bottom one will work as well.

Good luck!

Astounding · Posted 04-30-2012 03:13 PM

OK, final afterthought.

It should be possible to combine all the processing into a single DATA step (if the array will fit in memory). Defining a temporary array will speed things up:

data want;

array t_totals {4999999} _temporary_;

set have (keep=x) end=done;

if _n_ > 1 then do i = 1 to _n_ - 1;

set have (keep=x rename=(x = lagged_x)) point=i;

T = _n_ - i;

t_totals{T} + (x - &mx) * (lagged_x - &mx) / (&cx - T);

end;

if done;

do T = 1 to 4999999;

RT = t_totals{T};

output;

end;

keep T RT;

run;

OK, I'm pretty sure I'm done here and can't speed it up beyond that.

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!