AB85
Fluorite | Level 6

I recently ran a simulation study that calculated a correlation coefficient between two variables across 500 replications. Now I'm trying to figure out how many replications were needed to achieve convergence, where the difference in the correlation coefficient between one replication and the next is less than or equal to .01. To do this, I added a variable to my dataset that calculates a cumulative Pearson correlation coefficient (called cum_r). Now I'd like to add another variable that calculates the difference between the cumulative correlation coefficient of the current observation and the cumulative correlation coefficient of the prior observation. This is the code that I have so far, which is working:

 

data results_cumcorr;
set results;
cum_y + y;
cum_ysq + y**2;
cum_x + x;
cum_xsq + x**2;
cum_cov + y*x;
cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));
run;

(I'm using SAS v. 9.4.)

 

Please let me know if you have any ideas on how to do this.

 

Thanks!

PaigeMiller
Diamond | Level 26

You need a DO loop in the data step, and you need to store the value of the correlation coefficient at each iteration so you can compare it to the value at the next iteration.

 

But I'd need to see a portion of the data set RESULTS, and I'd need to have a better explanation of what you consider a replication in this data set before I could write an example.

--
Paige Miller
AB85
Fluorite | Level 6

Hi Paige,

 

Thanks for your help with this. Each replication is a row/observation of data. Yes, I do store the value of the correlation coefficient. That's the cum_r variable. Here's an example of what my data look like. These are the first 10 obs:

 

rep  y        cum_y    cum_ysq  x        cum_x    cum_xsq  cum_cov  cum_r
1    0.14039  0.14039  0.01971  0.19183  0.19183  0.0368   0.02693  .
2    0.01013  0.15052  0.01981  0.13868  0.33051  0.05603  0.02834  1
3    0.1423   0.29282  0.04006  0.20102  0.53153  0.09644  0.05694  0.99226
4    0.06015  0.35296  0.04368  0.33751  0.86904  0.21035  0.07724  0.03378
5    0.08183  0.43479  0.05037  0.17983  1.04887  0.24269  0.09196  0.04427
6    0.17327  0.60807  0.0804   0.2296   1.27847  0.29541  0.13174  0.10459
7    0.23369  0.84176  0.13501  0.30227  1.58074  0.38678  0.20238  0.38728
8    0.16508  1.00684  0.16226  0.32945  1.91019  0.49531  0.25676  0.43812
9    0.16022  1.16706  0.18793  0.19941  2.1096   0.53507  0.28871  0.3932
10   0.13209  1.29915  0.20538  0.22111  2.3307   0.58396  0.31792  0.39166

 

I'd essentially like another variable, diff_r, that is the absolute difference between the cumulative r in the current row and the cumulative r in the prior row. So for the first 10 rows it would look like:

 

rep  y        cum_y    cum_ysq  x        cum_x    cum_xsq  cum_cov  cum_r    diff_r
1    0.14039  0.14039  0.01971  0.19183  0.19183  0.0368   0.02693  .        .
2    0.01013  0.15052  0.01981  0.13868  0.33051  0.05603  0.02834  1        .
3    0.1423   0.29282  0.04006  0.20102  0.53153  0.09644  0.05694  0.99226  0.00774
4    0.06015  0.35296  0.04368  0.33751  0.86904  0.21035  0.07724  0.03378  0.95848
5    0.08183  0.43479  0.05037  0.17983  1.04887  0.24269  0.09196  0.04427  0.01049
6    0.17327  0.60807  0.0804   0.2296   1.27847  0.29541  0.13174  0.10459  0.06032
7    0.23369  0.84176  0.13501  0.30227  1.58074  0.38678  0.20238  0.38728  0.28269
8    0.16508  1.00684  0.16226  0.32945  1.91019  0.49531  0.25676  0.43812  0.05084
9    0.16022  1.16706  0.18793  0.19941  2.1096   0.53507  0.28871  0.3932   0.04492
10   0.13209  1.29915  0.20538  0.22111  2.3307   0.58396  0.31792  0.39166  0.00154
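
For reference, a minimal sketch of one way the diff_r column above could be produced, assuming RESULTS holds one row per replication with variables y and x. The denom variable and the guard on it are additions for illustration only (the denominator is 0 on the first row, which is why cum_r is missing there); the rest follows the code already posted.

```sas
/* Sketch only: add diff_r, the absolute change in cum_r from the prior row. */
data results_cumcorr;
  set results;
  cum_y   + y;
  cum_ysq + y**2;
  cum_x   + x;
  cum_xsq + x**2;
  cum_cov + y*x;

  /* denom is a hypothetical helper: it is 0 on row 1, so cum_r stays missing there */
  denom = ((_n_*cum_xsq) - cum_x**2) * ((_n_*cum_ysq) - cum_y**2);
  if denom > 0 then cum_r = ((_n_*cum_cov) - (cum_x*cum_y)) / sqrt(denom);
  else cum_r = .;

  /* Call DIF on every row (not inside an IF) so each call sees the prior row's value */
  diff_r = abs(dif(cum_r));
  drop denom;
run;
```

With the ten rows shown, diff_r should come out missing for rows 1 and 2, since the prior cum_r is missing in both cases, which matches the table.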

 

Thanks again!

mkeintz
PROC Star

I guess data set RESULTS has 500 observations, and you want all the observations written to RESULTS_CUMCORR up to and including the first one that is within .01 of the CUM_R of its predecessor.  You can use the DIF function to compare one instance of cum_r to its predecessor.

 

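data results_cumcorr;
  set results;
  cum_y + y;
  cum_ysq + y**2;
  cum_x + x;
  cum_xsq + x**2;
  cum_cov + y*x;
  cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));

  output;
  d = dif(cum_r);  /* call DIF on every row so it always sees the prior row */
  if not missing(d) and abs(d)<=.01 then stop;
  drop d;
run;

 

 

The DIF function is defined as dif(x)=x-lag(x).  Since the very first observation has a missing lag(cum_r) value, its value of dif(cum_r) would also be missing, which is "less than" all valid numeric values and so would satisfy "<=.01".  In your sample, cum_r itself is missing on the first observation, so dif(cum_r) is missing on the second observation as well.  That's why the stop criterion is "if not missing(d) and abs(d)<=.01" rather than a bare comparison.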

 

Edit: changed "dif(cum_r)" to abs(dif(cum_r)).
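
A small sketch of how DIF and missing values interact in a comparison, using a made-up three-row data set. The names dif_demo, v, d, naive and safe are illustrative only and do not come from the thread.

```sas
/* Sketch only: why a missing DIF result needs an explicit check. */
data dif_demo;
  input v;
  d     = dif(v);                              /* v - lag(v); missing on row 1 */
  naive = (abs(d) <= .01);                     /* row 1: 1, because missing compares below any number */
  safe  = (not missing(d) and abs(d) <= .01);  /* row 1: 0 */
  datalines;
0.50
0.80
0.805
;
run;

proc print data=dif_demo;
run;
```

On row 2 the change is 0.3, so both flags are 0; on row 3 the change is about 0.005, so both flags are 1. Only the first row differs between the two versions.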

AB85
Fluorite | Level 6

I really like this idea! I will try it. Thanks!

AB85
Fluorite | Level 6

Hi Mkeintz,

 

I'm getting closer. I did not know about the dif function, so that is very helpful! I needed to adapt your code, because I was simplifying my data for the purpose of my question, and I actually have 24 correlations. Here is my code now:

 

data results_cumcorr;
  set results;
  cum_y + y;
  cum_ysq + y**2;
  array x x01 - x24;
  array cum_x cum_x01 - cum_x24;
  array cum_xsq cum_xsq01 - cum_xsq24;
  array cum_cov cov_01 - cov_24;
  array cum_r cumr_01 - cumr_24;
  array con con_01 - con_24;
  do over x;
    cum_x + x;
    cum_xsq + x**2;
    cum_cov + y*x;
    cum_r = ((_n_*cum_cov)-(cum_x*cum_y))/sqrt(((_n_*cum_xsq)-cum_x**2)*((_n_*cum_ysq)-cum_y**2));
    if _n_>1 and dif(cum_r)<=.01 then con=1;
    else con=0;
  end;
run;

With this new line of code, I have 24 new variables that show me whether convergence has been met (1) or not (0). I thought about doing a proc freq to figure out when convergence has been met for each variable, but the problem is that a variable hasn't necessarily converged at the first rep for which con=1. For example, in this instance, it appeared that convergence had been met, but then the change in correlation went above .01 at rep 203. If I had just asked for it to stop when con=1, then it would not have caught that instance. (And I'd have the same issue with doing a proc freq.)

 

rep  con
200  1
201  1
202  1
203  0
204  1
 

Is there a way for me to ask SAS to output the last rep for which con=0?
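
One possible sketch for this, under some assumptions. As far as I understand the LAG/DIF documentation, each occurrence of DIF in the code keeps a single queue, so a dif(cum_r) call inside a loop over 24 array elements may end up comparing against the previous element rather than the previous row. The sketch sidesteps that by keeping the prior row's correlations in a temporary array, which is retained across rows. It also uses the absolute difference, and treats a missing correlation as "not converged". The names last_nonconv, k, last_rep0 and the temporary arrays are hypothetical, and _n_ is assumed to equal the rep number.

```sas
/* Sketch only: for each of the 24 x variables, find the last row where the    */
/* cumulative correlation changed by more than .01 (or could not be computed). */
data last_nonconv (keep=k last_rep0);
  set results end=last_obs;
  array xs{24}     x01-x24;
  array cx{24}     _temporary_;  /* cumulative sum of x            */
  array cxsq{24}   _temporary_;  /* cumulative sum of x squared    */
  array ccov{24}   _temporary_;  /* cumulative sum of y*x          */
  array prev_r{24} _temporary_;  /* correlation from the prior row */
  array last0{24}  _temporary_;  /* last row that failed the test  */

  cum_y   + y;
  cum_ysq + y**2;

  do k = 1 to 24;
    cx{k}   = sum(cx{k},   xs{k});
    cxsq{k} = sum(cxsq{k}, xs{k}**2);
    ccov{k} = sum(ccov{k}, y*xs{k});

    denom = (_n_*cxsq{k} - cx{k}**2) * (_n_*cum_ysq - cum_y**2);
    if denom > 0 then r = (_n_*ccov{k} - cx{k}*cum_y) / sqrt(denom);
    else r = .;

    /* Record this row if the change is too large or either value is missing */
    if missing(r) or missing(prev_r{k}) or abs(r - prev_r{k}) > .01
      then last0{k} = _n_;
    prev_r{k} = r;
  end;

  /* After the final row, write one line per x variable */
  if last_obs then do k = 1 to 24;
    last_rep0 = last0{k};
    output;
  end;
run;
```

The output would have 24 rows, one per x variable, and last_rep0 + 1 would then be the first rep from which the change stays at or below .01 through the end of the data.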

 

Thank you!
