HI, I have code to compute 'N' means, in my code N=200.
data data_set (drop = i);
do i = 1 to 200;
X = rand('NORMAL',0,1);
output;
end;
run;
proc sql;
select mean(x) into :mx from data_set;
select count(x) into :cx from data_set;
quit;
%macro compute;
data data_set;
set data_set;
%do n=1 %to &cx -1 ;
r&n = (X - &mx)*(lag&n(X) - &mx) ;
%end;
run;
%mend;
%compute
PROC MEANS DATA=data_set mean;
OUTPUT OUT=want;
run;
**************************************
BUT, in my real data, N=5,000,000 !!
(and the above code becomes to slow with N>20,000.)
How can this be done on such large 'N' ?
This program ran without error for me:
data _null_;
array t {5000000};
x=1;
run;
It took about 2 minutes, but it ran. That gives you an alternative way to sum up the pieces from the view named WANT:
data final;
array t_totals {4999999};
set want end=done;
t_totals{t} + rt;
if done then do T=1 to 4999999;
RT = t_totals{T};
output;
end;
keep T RT;
run;
This will still crank for a while since it has to process trillions of records. But if the top DATA step works, the bottom one will work as well.
Good luck!
Skipper,
Tell us what you are trying to do.
This looks like you are creating "N" new columns in data_set.
%do n=1 %to &cx -1 ;
r&n = (X - &mx)*(lag&n(X) - &mx) ;
%end;
If my reading of your code is correct, SAS will not create a dataset with 5,000,000 columns, so this approach won't work.
You will probably be better off creating a dataset with lots of short rows and grouping somehow, but without knowing more about the goal, it is hard to help.
Doc Muhlbaier
Duke
Trying to compute the 'autocorrelations' for time series data set, 5 million observations of a single variable X.
R(T) = SUM ( X - mean(X) ) * ( LagT(X) - Mean(X) ) / (N - T)
must be computed and summed for each lag times T=1,2,3,4,.....,N. hence the 5million columns
I dont want to create 5,000,000 columns, its my poor coding knowledge of a better way.
but i need the compute the above equation R(1), R(2), .... R(5,000,000).
So all i really need is just a two column data set with variables 'X' and 'R(T)' with 5 million observations.
Any ideas how this could be done?
Do you have SAS ETS?
Can you use proc timeseries and then deal with the output from there rather than a datastep and macro code?
thanks it seems
data data_set (drop = i);
do i = 1 to 200;
X = rand('NORMAL',0,1);
output;
end;
run;
data data_set;
set data_Set;
obsN = _N_;
run;
proc timeseries data=data_set
out=out
outcorr=timedomain;
id obsN interval=day;
var X;
run;
works well, but it only gives the first 20 rows for the outcorr=timedomain whereas I need all of the rows.
I need it to find R(1), R(2), ... R(200),
Is there some way to specifiy how many rows the 'outcorr=<>' creats in the dataset.
skipper,
First, you have to realize you will need some powerful hardware. This problem requires trillions of computations. Assuming you have room to store the results, this approach creates a data set with trillions of observations but only 2 variables: T and the contribution to R(T). From that point, the data will need to get summed up.
Compute your macro variables as before (&MX and &CX). Then take each relevant pair of observations:
data want (keep=T RT);
set have (keep=x);
if _n_ > 1 then do i = 1 to _n_ - 1;
set have (rename=(x = lagged_x)) point=i;
T = _n_ - i;
RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);
output;
end;
run;
I think I got the formulas right. Naturally, you'll want to test this on a small data set first!
Good luck.
An important afterthought ...
If all you need is the final sum, you don't have to store the trillions of observations. Instead:
data want (keep=T RT) / view=want;
The rest of the DATA step is the same, then
proc means data=want sum;
var RT;
Or, add an OUTPUT statement to the PROC MEANS to save the result in a data set.
But this might not be applicable if you need to save the separate R(T) for each value of T.
I dont understand, like
data data_set (drop = i);
do i = 1 to 200;
X = rand('NORMAL',0,1);
output;
end;
run;
proc sql;
select mean(x) into :mx from data_set;
select count(x) into :cx from data_set;
quit;
data want2 (keep=T RT) / view=want2;
set data_set (keep=x);
if _n_ > 1 then do i = 1 to _n_ - 1;
set data_set (rename=(x = lagged_x)) point=i;
T = _n_ - i;
RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);
output;
end;
run;
proc means data=want2 sum;
var RT;
produces only 123 rows? it should be R(1) to R(200)?
With 200 records, wouldn't you expect 199 values for T?
But this PROC MEANS should generate only 1 row, with the sum of RT across all values of T.
looks interesting,... but
data data_set (drop = i);
do i = 1 to 200;
X = rand('NORMAL',0,1);
output;
end;
run;
proc sql;
select mean(x) into :mx from data_set;
select count(x) into :cx from data_set;
quit;
data want (keep=T RT);
set data_set (keep=x);
if _n_ > 1 then do i = 1 to _n_ - 1;
set data_set (rename=(x = lagged_x)) point=i;
T = _n_ - i;
RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);
output;
end;
run;
creates a table with 19,900 observations!..
it should only have 200 observations, vector X has size 200, so it should be T = 1, 2, 3, 4, .... 200 and R(1), R(2), ... R(200).
Yes, that sounds about right. This is getting the contribution to R(T) for each pair of observations. You still need to sum it up. For your small data set, you could use:
proc means data=want sum;
var RT;
class T;
run;
With larger numbers of T values, it won't be so easy.
so this code
data data_set (drop = i);
do i = 1 to 20000;
X = rand('NORMAL',0,1);
output;
end;
run;
proc sql;
select mean(x) into :mx from data_set;
select count(x) into :cx from data_set;
quit;
data want (keep=T RT) / view=want;
set data_set (keep=x);
if _n_ > 1 then do i = 1 to _n_ - 1;
set data_set (rename=(x = lagged_x)) point=i;
T = _n_ - i;
RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);
output;
end;
run;
proc means data=want sum;
OUTPUT OUT=want3;
var RT;
class T;
run;
data want4;
set want3;
where _STAT_="MEAN" and T > 0;
keep T RT;
run;
gives me what I want, but it took 6 minutes to compute, and thats only for 20,000 observations, which is not large enough I need at least 2 million.
It will take a while no matter how you do it. Remember how 200 observations morphed into 19,900? You are getting similar expansion for the 20,000 records. The LOG indicates how many observations are being read in by PROC MEANS. As the number of observations increases, PROC MEANS will eventually fail because there will be too many values of T.
Note that you should modify your final PROC MEANS to eliminate the need for a subsequent DATA step:
proc means data=want NWAY;
var RT;
class T;
output out=want3 (keep=T RT) sum=;
run;
But that is not the time-consuming part. The number of calculations increases roughly proportional to the square of the number of observations. Like I said, you will end up in the trillions by the time you are done. I don't know if sorting that many observations would be possible ... if so, it would allow PROC MEANS to switch from a CLASS to a BY statement, which would allow PROC MEANS to run. How long it would take is anybody's guess. But sorting the trillions of records would certainly take longer than running PROC MEANS.
This program ran without error for me:
data _null_;
array t {5000000};
x=1;
run;
It took about 2 minutes, but it ran. That gives you an alternative way to sum up the pieces from the view named WANT:
data final;
array t_totals {4999999};
set want end=done;
t_totals{t} + rt;
if done then do T=1 to 4999999;
RT = t_totals{T};
output;
end;
keep T RT;
run;
This will still crank for a while since it has to process trillions of records. But if the top DATA step works, the bottom one will work as well.
Good luck!
OK, final afterthought.
It should be possible to combine all the processing into a single DATA step (if the array will fit in memory). Defining a temporary array will speed things up:
data want;
array t_totals {4999999} _temporary_;
set have (keep=x) end=done;
if _n_ > 1 then do i = 1 to _n_ - 1;
set have (keep=x rename=(x = lagged_x)) point=i;
T = _n_ - i;
t_totals{T} + (x - &mx) * (lagged_x - &mx) / (&cx - T);
end;
if done;
do T = 1 to 4999999;
RT = t_totals{T};
output;
end;
keep T RT;
run;
OK, I'm pretty sure I'm done here and can't speed it up beyond that.
Don't miss out on SAS Innovate - Register now for the FREE Livestream!
Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.