I recently ran a simulation study that calculated a correlation coefficient between two variables across 500 replications. Now I'm trying to figure out how many replications were needed to achieve convergence, where the difference in the correlation coefficient between one replication and the next is less than or equal to .01. To do this, I added a variable to my dataset that calculates a cumulative Pearson correlation coefficient (called cum_r). Now I'd like to add another variable that calculates the difference between the cumulative correlation coefficient of the current observation and the cumulative correlation coefficient of the prior observation. This is the code that I have so far, which is working:
data results_cumcorr;
set results;
cum_y + y;
cum_ysq + y**2;
cum_x + x;
cum_xsq + x**2;
cum_cov + y*x;
cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));
run;
(I'm using SAS v. 9.4.)
Please let me know if you have any ideas on how to do this.
Thanks!
You need a DO loop in the data step, and you need to store the value of the correlation coefficient at each iteration so you can compare it to the value at the next iteration.
But I'd need to see a portion of the data set RESULTS, and I'd need to have a better explanation of what you consider a replication in this data set before I could write an example.
Hi Paige,
Thanks for your help with this. Each replication is a row/observation of data. Yes, I do store the value of the correlation coefficient. That's the cum_r variable. Here's an example of what my data look like. These are the first 10 obs:
rep | y | cum_y | cum_ysq | x | cum_x | cum_xsq | cum_cov | cum_r |
1 | 0.14039 | 0.14039 | 0.01971 | 0.19183 | 0.19183 | 0.0368 | 0.02693 | . |
2 | 0.01013 | 0.15052 | 0.01981 | 0.13868 | 0.33051 | 0.05603 | 0.02834 | 1 |
3 | 0.1423 | 0.29282 | 0.04006 | 0.20102 | 0.53153 | 0.09644 | 0.05694 | 0.99226 |
4 | 0.06015 | 0.35296 | 0.04368 | 0.33751 | 0.86904 | 0.21035 | 0.07724 | 0.03378 |
5 | 0.08183 | 0.43479 | 0.05037 | 0.17983 | 1.04887 | 0.24269 | 0.09196 | 0.04427 |
6 | 0.17327 | 0.60807 | 0.0804 | 0.2296 | 1.27847 | 0.29541 | 0.13174 | 0.10459 |
7 | 0.23369 | 0.84176 | 0.13501 | 0.30227 | 1.58074 | 0.38678 | 0.20238 | 0.38728 |
8 | 0.16508 | 1.00684 | 0.16226 | 0.32945 | 1.91019 | 0.49531 | 0.25676 | 0.43812 |
9 | 0.16022 | 1.16706 | 0.18793 | 0.19941 | 2.1096 | 0.53507 | 0.28871 | 0.3932 |
10 | 0.13209 | 1.29915 | 0.20538 | 0.22111 | 2.3307 | 0.58396 | 0.31792 | 0.39166 |
I'd essentially like another variable, dif_r, that is the absolute difference between the cumulative r in the current row and the cumulative r in the prior row. So for the first 10 rows it would look like:
rep | y | cum_y | cum_ysq | x | cum_x | cum_xsq | cum_cov | cum_r | dif_r |
1 | 0.14039 | 0.14039 | 0.01971 | 0.19183 | 0.19183 | 0.0368 | 0.02693 | . | . |
2 | 0.01013 | 0.15052 | 0.01981 | 0.13868 | 0.33051 | 0.05603 | 0.02834 | 1 | . |
3 | 0.1423 | 0.29282 | 0.04006 | 0.20102 | 0.53153 | 0.09644 | 0.05694 | 0.99226 | 0.00774 |
4 | 0.06015 | 0.35296 | 0.04368 | 0.33751 | 0.86904 | 0.21035 | 0.07724 | 0.03378 | 0.95848 |
5 | 0.08183 | 0.43479 | 0.05037 | 0.17983 | 1.04887 | 0.24269 | 0.09196 | 0.04427 | 0.01049 |
6 | 0.17327 | 0.60807 | 0.0804 | 0.2296 | 1.27847 | 0.29541 | 0.13174 | 0.10459 | 0.06032 |
7 | 0.23369 | 0.84176 | 0.13501 | 0.30227 | 1.58074 | 0.38678 | 0.20238 | 0.38728 | 0.28269 |
8 | 0.16508 | 1.00684 | 0.16226 | 0.32945 | 1.91019 | 0.49531 | 0.25676 | 0.43812 | 0.05084 |
9 | 0.16022 | 1.16706 | 0.18793 | 0.19941 | 2.1096 | 0.53507 | 0.28871 | 0.3932 | 0.04492 |
10 | 0.13209 | 1.29915 | 0.20538 | 0.22111 | 2.3307 | 0.58396 | 0.31792 | 0.39166 | 0.00154 |
Thanks again!
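A side note on the first two rows of cum_r, in case it helps when reading the table. The expression in the data step is the usual computational form of Pearson's r after n reps. The notation r_n below is only for illustration and does not appear in the posted code:

```latex
r_n = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}
           {\sqrt{\Bigl(n\sum_{i=1}^{n} x_i^2 - \bigl(\sum_{i=1}^{n} x_i\bigr)^2\Bigr)
                  \Bigl(n\sum_{i=1}^{n} y_i^2 - \bigl(\sum_{i=1}^{n} y_i\bigr)^2\Bigr)}}

% n = 1: numerator x_1 y_1 - x_1 y_1 = 0 and denominator 0, so r_1 is undefined.
% n = 2: the expression reduces to
r_2 = \frac{(x_1 - x_2)(y_1 - y_2)}{\lvert x_1 - x_2\rvert \, \lvert y_1 - y_2\rvert} = \pm 1
```

So cum_r is missing for rep 1 and is always 1 or -1 for rep 2 (here 1, since both x and y drop from rep 1 to rep 2). Neither row says anything about convergence, and the difference between rep 2 and rep 1 is missing as well, which matters for any test applied to the earliest reps.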
I guess data set RESULTS has 500 observations, and you want all the observations written to RESULTS_CUMCORR up to and including the first one that is within .01 of the CUM_R of its predecessor. You can use the DIF function to compare one instance of cum_r to its predecessor.
data results_cumcorr;
set results;
cum_y + y;
cum_ysq + y**2;
cum_x + x;
cum_xsq + x**2;
cum_cov + y*x;
cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));
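dif_r = abs(dif(cum_r));
output;
if not missing(dif_r) and dif_r<=.01 then stop;
run;
The DIF function is defined as dif(x)=x-lag(x). A missing value compares as "less than" all valid numeric values, so a missing difference would satisfy "<=.01" and has to be excluded explicitly. The difference is missing for the first observation, which has no lag(cum_r) yet, and also for the second, because cum_r itself is missing for the first observation (the formula divides by zero there). That's why the stop criterion is "if not missing(dif_r) and dif_r<=.01". DIF is called unconditionally on every observation so that its lag queue stays in step with the data.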
Edit: changed "dif(cum_r)" to abs(dif(cum_r)).
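The step above stops writing at the first converged row. If you'd rather keep all 500 rows and add the difference as its own column (the dif_r variable asked for earlier), something along these lines might do it. This is only a sketch that I have not run against your data; it assumes RESULTS contains y and x as in the posted code, and the helper variable denom is my own addition:

```sas
data results_cumcorr;
  set results;
  cum_y + y;
  cum_ysq + y**2;
  cum_x + x;
  cum_xsq + x**2;
  cum_cov + y*x;
  /* denom is a hypothetical helper: it is zero for rep 1, so guarding it
     leaves cum_r missing there without a division-by-zero note in the log */
  denom = ((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2);
  if denom > 0 then cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(denom);
  /* call DIF on every row so it always compares with the previous rep */
  dif_r = abs(dif(cum_r));
  drop denom;
run;
```

With no OUTPUT or STOP statement, every row is written, and dif_r is missing for the first two reps because cum_r is missing for rep 1.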
I really like this idea! I will try it. Thanks!
Hi Mkeintz,
I'm getting closer. I did not know about the dif function, so that is very helpful! I needed to adapt your code, because I was simplifying my data for the purpose of my question, and I actually have 24 correlations. Here is my code now:
data results_cumcorr;
set results;
cum_y + y;
cum_ysq + y**2;
array x x01 - x24;
array cum_x cum_x01 - cum_x24;
array cum_xsq cum_xsq01 - cum_xsq24;
array cum_cov cov_01 - cov_24;
array cum_r cumr_01 -cumr_24;
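array con con_01 - con_24;
array prev_r prev_r01 - prev_r24;
retain prev_r01 - prev_r24;
do over x;
cum_x + x;
cum_xsq + x**2;
cum_cov + y*x;
cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));
/* DIF keeps one queue per call in the code, so inside the loop it would compare neighboring array elements rather than consecutive reps; a retained copy of the prior rep's value is used instead */
if . < abs(cum_r-prev_r) <= .01 then con=1; else con=0;
prev_r = cum_r;
end;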
run;
With this new line of code, I have 24 new variables that show me whether convergence has been met (1) or not (0). I thought about running a proc freq to figure out when convergence was met for each variable, but the problem is that convergence doesn't necessarily happen at the first rep for which con=1. In the example below, it appeared that convergence had been met, but then the change in correlation went above .01 for rep 203. If I had just asked SAS to stop when con=1, it would not have caught that instance. (And I'd have the same issue with a proc freq.)
rep | con |
200 | 1 |
201 | 1 |
202 | 1 |
203 | 0 |
204 | 1 |
Is there a way for me to ask SAS to output the last rep for which con=0?
Thank you!
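One possible way to get the last rep with con=0 for each of the 24 variables is a second pass over RESULTS_CUMCORR that remembers the most recent rep where each flag was 0 and writes a single row at the end of the file. This is a sketch under assumptions, not tested code: it assumes RESULTS_CUMCORR contains rep and con_01 - con_24, and the names last_nonconv, last0_01 - last0_24 and i are made up for illustration.

```sas
data last_nonconv;
  set results_cumcorr end=eof;
  array con {24} con_01 - con_24;
  /* last0_xx (hypothetical names) hold the last rep with con_xx = 0 */
  array last0 {24} last0_01 - last0_24;
  retain last0_01 - last0_24;
  do i = 1 to 24;
    /* overwrite on every non-converged rep, so the final value is the last one */
    if con{i} = 0 then last0{i} = rep;
  end;
  /* write one summary row after the final rep has been read */
  if eof then output;
  keep last0_01 - last0_24;
run;
```

For each variable, the rep after last0_xx is then the first one from which the change stayed within .01 through rep 500. A value of 500 would mean the flag was still 0 at the final rep, and a missing value would mean con was never 0.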