BookmarkSubscribeRSS Feed
Wafflecakes
Calcite | Level 5

 Hi,

 

Suppose I have the following dataset

 

id year measure1 measure2 

1 2000 0.41 -

1 2001 0.19 0.50

1 2002 0.51 0.75

1 2003 0.91 0.29

2 2000 0.69 - 

2 2001 0.40 0.69

2 2002 0.69 0.29

2 2003 0.79 0.39

How would I do a correlation between consecutive years (grouped by id)? e.g. Correlation between measure1 in 2001 with measure1 in 2000, measure1 in 2002 with measure1 in 2001, etc.

 

Thank you,

 

14 REPLIES 14
SASKiwi
PROC Star

@Wafflecakes  - What is your definition of "correlation"?

Wafflecakes
Calcite | Level 5
Thank you for your question SASKiwi. Pearson's correlation as both measure1 and measure2 are continuous.
Reeza
Super User
You can look at autocorrelation which is typically used for time series analysis. Note the formula is slightly different than regular correlation if you're trying to verify answers or if you don't have SAS/ETS.
PGStats
Opal | Level 21


If you have access to SAS/ETS, you could use proc autoreg, as in:

 


proc autoreg data=have;
class id;
model measure1 = id / nlag=1;
run;
PG
Wafflecakes
Calcite | Level 5

Thank you PGStats. It seems to me that the autoreg procedure is for running a linear regression model for time series data as opposed to correlation? Can you elaborate?

 

 

PGStats
Opal | Level 21

Yes, autoreg performs regressions. But it relaxes the requirement that errors (residuals) be independent. Instead, the errors may be serially correlated. Part of the method involves estimating the correlation that exists between consecutive observations. This is the measure you are looking for.

 

Specifying the model as measure1 = id gets the procedure to remove the mean from each id series to get the residuals.

PG
Wafflecakes
Calcite | Level 5

Screen Shot 2020-02-11 at 10.02.25 PM.png

 

Unfortunately I get the above error message when I try to use proc autoreg in SAS version 9.4. Is there a workaround to this? Perhaps another pearson's correlation method? 

Wafflecakes
Calcite | Level 5

I have tried proc corr as an alternative to this approach, but it did not give me the intended result. 

 

Essentially, my data is structured as follows...I created x2000, x2001, x2002, and x2003 to equal x in the corresponding years.

 

id year x x2000 x2001 x2002 x2003

1 2000 0.59 0.59 NA NA NA

1 2001 0.69 NA 0.69 NA NA

1 2002 0.19 NA NA 0.19 NA

1 2003 0.39 NA NA NA 0.39

 

I used the following code to calculate correlations between consecutive years with this code:

proc corr data = dataset; 

var x2000 x2001;

run; 

 

However, when I run the code, I do not get the pearson's correlations between years x2000 and x2001. 

 

Does anybody have any recommendations?

Reeza
Super User

Your data isn't structured to be analyzed in that form. The NA's alone mean character unless you're doing that solely for the forum. 

But I can only see what you post so I have to go off what you've shown. 

If you look at PROC CORR and the examples, which have the full code you can see how your data should be structured for correlations. 

 


@Wafflecakes wrote:

I have tried proc corr as an alternative to this approach, but it did not give me the intended result. 

 

Essentially, my data is structured as follows...I created x2000, x2001, x2002, and x2003 to equal x in the corresponding years.

 

id year x x2000 x2001 x2002 x2003

1 2000 0.59 0.59 NA NA NA

1 2001 0.69 NA 0.69 NA NA

1 2002 0.19 NA NA 0.19 NA

1 2003 0.39 NA NA NA 0.39

 

I used the following code to calculate correlations between consecutive years with this code:

proc corr data = dataset; 

var x2000 x2001;

run; 

 

However, when I run the code, I do not get the pearson's correlations between years x2000 and x2001. 

 

Does anybody have any recommendations?


 

Wafflecakes
Calcite | Level 5

Thank you, Reeza. Correct - the NA was done on purpose to represent missing data. May you suggest an example? I do not see any that would match my case and do not understand why proc corr would not work with the way I have structured my data. 

Reeza
Super User
It will not work because SAS removes rows that have missing values from the calculation which leaves you with a single observation to calculate the correlation which isn't possible mathematically.
Wafflecakes
Calcite | Level 5

Thank you, Reeza, for your recommendation. 

 

I coded it as follows: 

 

proc timeseries data = test out = out outcorr = timedomain;

by id; 

corr /nlag = 1 

var measure1; 

run; 

 

I am not quite sure what the purpose of outcorr is and whether I have specified the nlag correctly (I am trying to do a correlation between the value of measure1 in each year with the corresponding previous year). 

 

In the timedomain dataset that is outputted, which specific variable actually tells me what the correlation is? 

 

Wafflecakes
Calcite | Level 5

Hello, 

Following up on this. I wonder if anyone has any thoughts about how to calculate these correlations?

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 14 replies
  • 2131 views
  • 6 likes
  • 4 in conversation