How do I calculate a difference in cumulative sums between observations?

AB85 · Posted 01-13-2019 11:56 AM

I recently ran a simulation study that calculated a correlation coefficient between two variables across 500 replications. Now I'm trying to figure out how many replications were needed to achieve convergence, where the difference in the correlation coefficient between one replication and the next is less than or equal to .01. To do this, I added a variable to my dataset that calculates a cumulative Pearson correlation coefficient (called cum_r). Now I'd like to add another variable that calculates the difference between the cumulative correlation coefficient of the current observation and that of the prior observation. This is the code that I have so far, which is working:

data results_cumcorr;
set results;
cum_y + y;
cum_ysq + y**2;
cum_x + x;
cum_xsq + x**2;
cum_cov + y*x;
cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));
run;

(I'm using SAS v. 9.4.)

Please let me know if you have any ideas on how to do this.

Thanks!

PaigeMiller · Posted 01-13-2019 12:26 PM

You need a DO loop in the data step, and you need to store the value of the correlation coefficient at each iteration so you can compare it to the value at the next iteration.

But I'd need to see a portion of the data set RESULTS, and I'd need to have a better explanation of what you consider a replication in this data set before I could write an example.

--
Paige Miller

AB85 · Posted 01-13-2019 01:13 PM

Hi Paige,

Thanks for your help with this. Each replication is a row/observation of data. Yes, I do store the value of the correlation coefficient. That's the cum_r variable. Here's an example of what my data look like. These are the first 10 obs:

rep	y	cum_y	cum_ysq	x	cum_x	cum_xsq	cum_cov	cum_r
1	0.14039	0.14039	0.01971	0.19183	0.19183	0.0368	0.02693	.
2	0.01013	0.15052	0.01981	0.13868	0.33051	0.05603	0.02834	1
3	0.1423	0.29282	0.04006	0.20102	0.53153	0.09644	0.05694	0.99226
4	0.06015	0.35296	0.04368	0.33751	0.86904	0.21035	0.07724	0.03378
5	0.08183	0.43479	0.05037	0.17983	1.04887	0.24269	0.09196	0.04427
6	0.17327	0.60807	0.0804	0.2296	1.27847	0.29541	0.13174	0.10459
7	0.23369	0.84176	0.13501	0.30227	1.58074	0.38678	0.20238	0.38728
8	0.16508	1.00684	0.16226	0.32945	1.91019	0.49531	0.25676	0.43812
9	0.16022	1.16706	0.18793	0.19941	2.1096	0.53507	0.28871	0.3932
10	0.13209	1.29915	0.20538	0.22111	2.3307	0.58396	0.31792	0.39166

I'd essentially like another variable, diff_r, that is the absolute difference between the cumulative r in the current row and the cumulative r in the prior row. So for the first 10 rows it would look like:

rep	y	cum_y	cum_ysq	x	cum_x	cum_xsq	cum_cov	cum_r	diff_r
1	0.14039	0.14039	0.01971	0.19183	0.19183	0.0368	0.02693	.	.
2	0.01013	0.15052	0.01981	0.13868	0.33051	0.05603	0.02834	1	.
3	0.1423	0.29282	0.04006	0.20102	0.53153	0.09644	0.05694	0.99226	0.00774
4	0.06015	0.35296	0.04368	0.33751	0.86904	0.21035	0.07724	0.03378	0.95848
5	0.08183	0.43479	0.05037	0.17983	1.04887	0.24269	0.09196	0.04427	0.01049
6	0.17327	0.60807	0.0804	0.2296	1.27847	0.29541	0.13174	0.10459	0.06032
7	0.23369	0.84176	0.13501	0.30227	1.58074	0.38678	0.20238	0.38728	0.28269
8	0.16508	1.00684	0.16226	0.32945	1.91019	0.49531	0.25676	0.43812	0.05084
9	0.16022	1.16706	0.18793	0.19941	2.1096	0.53507	0.28871	0.3932	0.04492
10	0.13209	1.29915	0.20538	0.22111	2.3307	0.58396	0.31792	0.39166	0.00154

Thanks again!

mkeintz · Posted 01-13-2019 01:18 PM

I guess data set RESULTS has 500 observations, and you want all the observations written to RESULTS_CUMCORR up to and including the first one that is within .01 of the CUM_R of its predecessor. You can use the DIF function to compare one instance of cum_r to its predecessor.

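data results_cumcorr;
  set results;
  cum_y + y;
  cum_ysq + y**2;
  cum_x + x;
  cum_xsq + x**2;
  cum_cov + y*x;
  cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));

  /* absolute change from the previous observation; DIF is called on every row */
  diff_r = abs(dif(cum_r));

  output;
  /* stop at the first non-missing change that is within the tolerance */
  if not missing(diff_r) and diff_r<=.01 then stop;
run;

The DIF function is defined as dif(x)=x-lag(x). The very first observation has a missing lag(cum_r) value, and in your data cum_r itself is missing there, so dif(cum_r) is missing for the first two observations. A missing value is "less than" all valid numeric values, so on its own it would satisfy "<=.01". That's why the stop criterion is "if not missing(diff_r) and diff_r<=.01" rather than a bare comparison.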

Edit: changed "dif(cum_r)" to abs(dif(cum_r)).
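
As a minimal sketch of how DIF behaves, the steps below read the ten cum_r values from the sample rows posted above (typed in by hand, so treat them as an assumption about the real data) and compute the absolute row-to-row change. The data set names cumr_demo and diff_demo and the flag within_tol are hypothetical and used only for illustration.

```sas
/* Hypothetical demo data: the ten cum_r values from the sample rows above */
data cumr_demo;
  input rep cum_r;
  datalines;
1 .
2 1
3 0.99226
4 0.03378
5 0.04427
6 0.10459
7 0.38728
8 0.43812
9 0.3932
10 0.39166
;
run;

data diff_demo;
  set cumr_demo;
  /* dif(cum_r) = cum_r - lag(cum_r); called unconditionally on every row */
  diff_r = abs(dif(cum_r));
  /* 1 when the change is non-missing and within the .01 tolerance, else 0 */
  within_tol = (not missing(diff_r) and diff_r <= .01);
run;

proc print data=diff_demo noobs;
run;
```

In this sketch diff_r is missing for reps 1 and 2, because each of those differences involves a missing cum_r. That is the reason the flag tests for a missing value explicitly instead of relying on the comparison alone.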

AB85 · Posted 01-13-2019 01:24 PM

I really like this idea! I will try it. Thanks!

AB85 · Posted 01-13-2019 02:06 PM

Hi Mkeintz,

I'm getting closer. I did not know about the dif function, so that is very helpful! I needed to adapt your code, because I was simplifying my data for the purpose of my question, and I actually have 24 correlations. Here is my code now:

data results_cumcorr;
  set results;
  cum_y + y;
  cum_ysq + y**2;
  array x x01 - x24;
  array cum_x cum_x01 - cum_x24;
  array cum_xsq cum_xsq01 - cum_xsq24;
  array cum_cov cov_01 - cov_24;
  array cum_r cumr_01 - cumr_24;
  array con con_01 - con_24;
  do over x;
    cum_x + x;
    cum_xsq + x**2;
    cum_cov + y*x;
    cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));
    if _n_>1 and dif(cum_r)<=.01 then con=1; else con=0;
  end;
run;

With this new line of code, I have 24 new variables that show me whether convergence has been met (1) or not (0). I thought about doing a proc freq to figure out when convergence has been met for each variable, but the problem is that the first rep for which con=1 is not necessarily the point of convergence. For example, in this instance, it appeared that convergence had been met, but then the change in correlation went above .01 for rep 203. If I had just asked for it to stop when con=1, then it would not have caught that instance. (And I'd have the same issue with doing a proc freq.)

rep	con
200	1
201	1
202	1
203	0
204	1

Is there a way for me to ask SAS to output the last rep for which con=0?

Thank you!
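
One possible way to get the last rep with con=0 for each of the 24 correlations is sketched below. It is only a sketch under assumptions: it reads the RESULTS_CUMCORR data set with rep and cumr_01-cumr_24 as created above, and the names last_nonconv, r, prev_r and last_nc_01-last_nc_24 are made up for illustration. Because each DIF or LAG call in the code keeps a single queue, calling dif(cum_r) inside a loop over 24 variables feeds all 24 values into that one queue, so the sketch keeps the previous row's values in a temporary array instead and takes the absolute difference.

```sas
/* Sketch only: assumes RESULTS_CUMCORR contains rep and cumr_01-cumr_24.
   LAST_NONCONV and last_nc_01-last_nc_24 are hypothetical names. */
data last_nonconv (keep=last_nc_01-last_nc_24);
  set results_cumcorr end=eof;
  array r       {24} cumr_01-cumr_24;        /* cumulative correlations          */
  array prev_r  {24} _temporary_;            /* previous row's values (retained) */
  array last_nc {24} last_nc_01-last_nc_24;  /* last rep with con=0, per variable */
  retain last_nc_01-last_nc_24;

  do i = 1 to 24;
    /* absolute change from the previous row for the same variable */
    diff_r = abs(r{i} - prev_r{i});
    /* a missing or too-large change counts as "not yet converged" */
    if missing(diff_r) or diff_r > .01 then last_nc{i} = rep;
    prev_r{i} = r{i};
  end;

  /* write a single row once the last observation has been processed */
  if eof then output;
run;
```

The output has one row. last_nc_01 holds the last rep at which the first correlation changed by more than .01 (or had a missing change), so convergence in the sense used here starts at the following rep.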
