BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
skipper
Calcite | Level 5

HI, I have code to compute 'N' means, in my code N=200.

data data_set (drop = i);

   do i = 1 to 200;

      X = rand('NORMAL',0,1);

      output;

   end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

%macro compute;

data data_set;

  set data_set;

  %do n=1 %to &cx -1 ;

  r&n = (X - &mx)*(lag&n(X) - &mx) ;

  %end;

run;

%mend;

%compute

PROC MEANS DATA=data_set mean;

OUTPUT OUT=want;

run;

**************************************

BUT, in my real data, N=5,000,000 !!

(and the above code becomes to slow with N>20,000.)

How can this be done on such large 'N' ?

1 ACCEPTED SOLUTION

Accepted Solutions
Astounding
PROC Star

This program ran without error for me:

data _null_;

array t {5000000};

x=1;

run;

It took about 2 minutes, but it ran.  That gives you an alternative way to sum up the pieces from the view named WANT:

data final;

   array t_totals {4999999};

   set want end=done;

   t_totals{t} + rt;

   if done then do T=1 to 4999999;

      RT = t_totals{T};

      output;

   end;

   keep T RT;

run;

This will still crank for a while since it has to process trillions of records.  But if the top DATA step works, the bottom one will work as well.

Good luck!

View solution in original post

15 REPLIES 15
Doc_Duke
Rhodochrosite | Level 12

Skipper,

Tell us what you are trying to do.

This looks like you are creating "N" new columns in data_set.

%do n=1 %to &cx -1 ;

  r&n = (X - &mx)*(lag&n(X) - &mx) ;

  %end;

If my reading of your code is correct, SAS will not create a dataset with 5,000,000 columns, so this approach won't work.

You will probably be better off creating a dataset with lots of short rows and grouping somehow, but without knowing more about the goal, it is hard to help.

Doc Muhlbaier

Duke

skipper
Calcite | Level 5

Trying to compute the 'autocorrelations' for time series data set, 5 million observations of a single variable X.

R(T) = SUM ( X - mean(X) ) * ( LagT(X) - Mean(X) ) / (N - T)

must be computed and summed for each lag times T=1,2,3,4,.....,N. hence the 5million columns

I dont want to create 5,000,000 columns, its my poor coding knowledge of a better way.

but i need the compute the above equation R(1), R(2), .... R(5,000,000).

So all i really need is just a two column data set with variables 'X' and  'R(T)' with 5 million observations.

Any ideas how this could be done?

Reeza
Super User

Do you have SAS ETS?

Can you use proc timeseries and then deal with the output from there rather than a datastep and macro code?

skipper
Calcite | Level 5

thanks it seems

data data_set (drop = i);

   do i = 1 to 200;

      X = rand('NORMAL',0,1);

      output;

   end;

run;

data data_set;

set data_Set;

obsN = _N_;

run;

proc timeseries data=data_set

                   out=out

                   outcorr=timedomain;

      id obsN interval=day;

      var X;

run;

works well, but it only gives the first 20 rows for the outcorr=timedomain whereas I need all of the rows.

I need it to find R(1), R(2), ... R(200),

Is there some way to specifiy how many rows the 'outcorr=<>' creats in the dataset.

Astounding
PROC Star

skipper,

First, you have to realize you will need some powerful hardware.  This problem requires trillions of computations.  Assuming you have room to store the results, this approach creates a data set with trillions of observations but only 2 variables:  T and the contribution to R(T).  From that point, the data will need to get summed up.

Compute your macro variables as before (&MX and &CX).  Then take each relevant pair of observations:

data want (keep=T RT);

   set have (keep=x);
   if _n_ > 1 then do i = 1 to _n_ - 1;

     set have (rename=(x = lagged_x)) point=i;

     T = _n_ - i;

     RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

     output;

   end;

run;

I think I got the formulas right.  Naturally, you'll want to test this on a small data set first!

Good luck.

Astounding
PROC Star

An important afterthought ...

If all you need is the final sum, you don't have to store the trillions of observations.  Instead:

data want (keep=T RT) / view=want;

The rest of the DATA step is the same, then

proc means data=want sum;

   var RT;

Or, add an OUTPUT statement to the PROC MEANS to save the result in a data set.

But this might not be applicable if you need to save the separate R(T) for each value of T.

skipper
Calcite | Level 5

I dont understand, like

data data_set (drop = i);

   do i = 1 to 200;

      X = rand('NORMAL',0,1);

      output;

   end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want2 (keep=T RT) / view=want2;

   set data_set (keep=x);

   if _n_ > 1 then do i = 1 to _n_ - 1;

     set data_set (rename=(x = lagged_x)) point=i;

     T = _n_ - i;

     RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

     output;

   end;

run;

proc means data=want2 sum;

   var RT;

produces only 123 rows? it should be R(1) to R(200)?

Astounding
PROC Star

With 200 records, wouldn't you expect 199 values for T?

But this PROC MEANS should generate only 1 row, with the sum of RT across all values of T.

skipper
Calcite | Level 5

looks interesting,... but

data data_set (drop = i);

   do i = 1 to 200;

      X = rand('NORMAL',0,1);

      output;

   end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want (keep=T RT);

   set data_set (keep=x);

   if _n_ > 1 then do i = 1 to _n_ - 1;

     set data_set (rename=(x = lagged_x)) point=i;

     T = _n_ - i;

     RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

     output;

   end;

run;

creates a table with 19,900 observations!..

it should only have 200 observations, vector X has size 200, so it should be T = 1, 2, 3, 4, .... 200 and R(1), R(2), ... R(200).

Astounding
PROC Star

Yes, that sounds about right.  This is getting the contribution to R(T) for each pair of observations.  You still need to sum it up.  For your small data set, you could use:

proc means data=want sum;

   var RT;

   class T;

run;

With larger numbers of T values, it won't be so easy.

skipper
Calcite | Level 5

so this code

data data_set (drop = i);

   do i = 1 to 20000;

      X = rand('NORMAL',0,1);

      output;

   end;

run;

proc sql;

select mean(x) into :mx from data_set;

select count(x) into :cx from data_set;

quit;

data want (keep=T RT) / view=want;

   set data_set (keep=x);

   if _n_ > 1 then do i = 1 to _n_ - 1;

     set data_set (rename=(x = lagged_x)) point=i;

     T = _n_ - i;

     RT = (x - &mx) * (lagged_x - &mx) / (&cx - T);

     output;

   end;

run;

proc means data=want sum;

OUTPUT OUT=want3;

   var RT;

   class T;

run;

data want4;

set want3;

where _STAT_="MEAN" and T > 0;

keep T RT;

run;

gives me what I want, but it took 6 minutes to compute, and thats only for 20,000 observations, which is not large enough I need at least 2 million.

Astounding
PROC Star

It will take a while no matter how you do it.  Remember how 200 observations morphed into 19,900?  You are getting similar expansion for the 20,000 records.  The LOG indicates how many observations are being read in by PROC MEANS.  As the number of observations increases, PROC MEANS will eventually fail because there will be too many values of T.

Note that you should modify your final PROC MEANS to eliminate the need for a subsequent DATA step:

proc means data=want NWAY;

   var RT;

   class T;

   output out=want3 (keep=T RT) sum=;

run;

But that is not the time-consuming part.  The number of calculations increases roughly proportional to the square of the number of observations.  Like I said, you will end up in the trillions by the time you are done.  I don't know if sorting that many observations would be possible ... if so, it would allow PROC MEANS to switch from a CLASS to a BY statement, which would allow PROC MEANS to run.  How long it would take is anybody's guess.  But sorting the trillions of records would certainly take longer than running PROC MEANS.

Astounding
PROC Star

This program ran without error for me:

data _null_;

array t {5000000};

x=1;

run;

It took about 2 minutes, but it ran.  That gives you an alternative way to sum up the pieces from the view named WANT:

data final;

   array t_totals {4999999};

   set want end=done;

   t_totals{t} + rt;

   if done then do T=1 to 4999999;

      RT = t_totals{T};

      output;

   end;

   keep T RT;

run;

This will still crank for a while since it has to process trillions of records.  But if the top DATA step works, the bottom one will work as well.

Good luck!

Astounding
PROC Star

OK, final afterthought.

It should be possible to combine all the processing into a single DATA step (if the array will fit in memory).  Defining a temporary array will speed things up:

data want;

   array t_totals {4999999} _temporary_;

   set have (keep=x) end=done;

   if _n_ > 1 then do i = 1 to _n_ - 1;

     set have (keep=x rename=(x = lagged_x)) point=i;

     T = _n_ - i;

     t_totals{T} + (x - &mx) * (lagged_x - &mx) / (&cx - T);

   end;

   if done;

   do T = 1 to 4999999;

       RT = t_totals{T};

       output;

   end;

   keep T RT;

run;

OK, I'm pretty sure I'm done here and can't speed it up beyond that.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 15 replies
  • 3748 views
  • 6 likes
  • 4 in conversation