Calculating the correlation in a long data set


09-07-2016 09:41 AM

When analyzing longitudinal data, we usually have our datasets in long format, as is required for, say, PROC GENMOD or PROC GLIMMIX. A major component of longitudinal data analysis is understanding the correlation structure of the data you are trying to model. To that end, it is often necessary and informative to compare the working correlation matrix (generated by, say, the repeated statement in PROC GENMOD) with the "observed" correlation matrix.

However, to my knowledge, the only way to calculate correlations in SAS is with a WIDE dataset, using PROC CORR (or manually with a DATA step). While this is fairly simple, it still seems less than ideal (not to mention inefficient) to have to create a transposed version of my analysis dataset just for this one calculation, while everything else (models and plotting functions especially) use the long dataset. And this transposition may not be completely simple if we have a more complicated data structure (say, a hierarchical/nested one).

Is there any straightforward way of calculating the correlation for a single variable across different levels of a class variable that does not involve creating a wide version of the dataset for PROC CORR?
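For reference, the transpose-then-correlate workaround I'm describing might look like the following sketch, where the dataset and variable names (long, id, time, y) are hypothetical:

```
/* Hypothetical long dataset: one row per subject (id) per visit (time) */
proc sort data=long out=long_sorted;
   by id time;
run;

/* Reshape to wide: one row per subject, one column per visit */
proc transpose data=long_sorted out=wide prefix=y_t;
   by id;
   id time;
   var y;
run;

/* "Observed" correlation matrix across visits */
proc corr data=wide;
   var y_t:;
run;
```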


09-07-2016 10:32 AM

In general, I don't think so. A long data set is typically used when you have levels of categorical variables that are unequal in size. You might have n1 females and n2 males. The first female value is not related to the first male value. It doesn't make sense to combine them into an "observation" that has variables "Y_Female" and "Y_Male." If n1 is not equal to n2, you can't even transpose the data into a rectangular two-column structure.

A correlation between variables requires that you have well-defined observations. An observation is a tuple (x1, x2, ..., xp), and those observations are (theoretically) a random draw from a joint distribution. The assumptions for long data are much different.


09-07-2016 12:28 PM - edited 09-07-2016 12:29 PM

I don't think it is true that a long data set is "typically used when you have levels of categorical variables that are unequal in size" (and I don't think your male/female example is relevant to the case of longitudinal analysis either). As mentioned in my post, a long dataset is REQUIRED for any longitudinal modeling, regardless of issues related to sample size. For example, you may have N subjects, each with K repeated measurements, and you want to find the correlation between measurements across the K repetitions (a common and necessary step in any longitudinal analysis).

I understand there are issues related to the GENERALIZABILITY of calculating correlations with a long dataset, but not with the possibility of it per se. In fact, I don't even think unequal sample sizes per bin are an insurmountable issue, at least from a computational standpoint (though of course, as with anything, you have to be careful to understand what assumptions you are making in such an instance). After all, if it were such a huge computational problem, it wouldn't be possible for PROC GENMOD to estimate the working correlation structure under the hood, because it also must leverage a long dataset.


09-07-2016 11:40 PM

You could avoid the creation of a wide dataset with a data view.

```
/* Example data in long format */
data long;
   set sashelp.class;
   var = "Weight"; value = weight; output;
   var = "Height"; value = height; output;
   var = "Age";    value = age;    output;
   keep name var value;
run;

/* Get the wide variable names */
proc sql;
   select unique var
      into :vars separated by " "
      from long;
quit;

/* Create a view that transposes the data on the fly */
data wide / view=wide;
   do until(last.name);
      set long;
      by name notsorted;
      array _v &vars;
      do i = 1 to dim(_v);
         if vname(_v{i}) = var then do;
            _v{i} = value;
            leave;
         end;
      end;
   end;
   keep &vars;
run;

/* Call CORR, which reads from the wide view */
proc corr data=wide outp=myCorr;
run;
```

PG


09-12-2016 01:20 PM

Hi Ryan,

Is there anything untoward about fitting the data with different error structures, and looking at information criteria as a method of selecting the optimal available structure? Of course, certain structures may be off limits due to the nature of the longitudinal variable (an autoregressive structure should only be used for equally spaced data). If the correlation structures in GENMOD are too limited, consider porting to GLIMMIX, and using a RANDOM statement with the residual option.
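A sketch of that selection loop (dataset and variable names hypothetical): fit the same GEE under several TYPE= structures and compare the QIC values that GENMOD reports for GEE fits:

```
/* Compare working correlation structures by QIC (smaller is better).
   Dataset and variable names (long, id, time, y) are hypothetical. */
proc genmod data=long;
   class id time;
   model y = time;
   repeated subject=id / within=time type=ind;
run;

proc genmod data=long;
   class id time;
   model y = time;
   repeated subject=id / within=time type=cs;
run;

proc genmod data=long;
   class id time;
   model y = time;
   repeated subject=id / within=time type=ar(1);
run;
```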

Steve Denham


09-12-2016 01:20 PM - edited 09-12-2016 01:23 PM

I hit post too many times, and it looks like I can't delete the duplicate post, so I am editing it as an apology.

Steve Denham


09-12-2016 01:26 PM

Hi Steve,

No, there is nothing untoward about that process, though I would argue it's not directly relevant to my question. In practice, this is often how we model longitudinal data: using QIC or a similar metric (in the case of GEEs) to decide on the error structure that provides the best fit to the data. However, I would argue that this is a rather unprincipled approach in certain situations (for example, for small or unbalanced datasets, when the asymptotic consistency of the sandwich estimator isn't guaranteed to hold, in which case changing the error structure may lead to radically different interpretations), and that fitting ANY model without an a priori idea of what the empirical structure of your data looks like is a potentially dangerous approach.

In any case, the problems I address in the OP are easily resolved through a simple transposition of the data and a call to PROC CORR. I guess I just find it mildly irritating that there isn't an option in, say, PROC GENMOD to output the empirical correlation matrix to compare side-by-side with the working correlation matrix, or some similar automated method for evaluating model assumptions.
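As a sketch of the comparison I have in mind (dataset and variable names hypothetical): the working correlation matrix can be printed with the CORRW option, while the "observed" matrix still has to come from a separate transpose-and-PROC CORR step:

```
/* Working correlation from a GEE; CORRW prints it.
   Dataset and variable names (long, id, time, y) are hypothetical. */
proc genmod data=long;
   class id time;
   model y = time;
   repeated subject=id / within=time type=un corrw;
run;
```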

Thanks,

Ryan Simmons


09-12-2016 01:41 PM

Hi Ryan,

Excellent point, and it shouldn't be so difficult to get within the PROC.

That's the good thing about GLIMMIX. Using the GCORR option on the RANDOM statement gives a model-dependent estimate. If you fit an unstructured matrix and no fixed effects (and it converges), the working and the empirical correlations should (within roundoff) be the same. (I think.)

I worry that this is conjecture and would love to see it confirmed or disproved.

Steve Denham


09-12-2016 02:52 PM

Hi Steve,

That's an interesting approach, but I am having some problems actually reproducing it. I believe the trouble relates to correctly specifying the random effects vis-a-vis G-side/R-side (the terminology and notation SAS uses is different from how I learned it in my theory classes, so I always have a hard time with GLIMMIX!). I tried the following three specifications:

```
PROC GLIMMIX data=st_analysis(where=(band="alpha"));
   class time dukeid;
   model log_RP = ;
   random dukeid / type=un gcorr;
run;

PROC GLIMMIX data=st_analysis(where=(band="alpha"));
   class time dukeid;
   model log_RP = ;
   random intercept / subject=dukeid type=un gcorr;
run;

PROC GLIMMIX data=st_analysis(where=(band="alpha"));
   class time dukeid;
   model log_RP = ;
   random residual / subject=dukeid type=un gcorr;
run;
```

The first one gives me errors in the optimization routine.

The second one (which, I believe, is specifying G-side random effects) gives me an estimated G correlation matrix with only one element (an Intercept with a value equal to 1.000).

The third one (which should be specifying R-side random effects) gives me a covariance matrix (oddly, it doesn't output a correlation matrix, despite the documentation assuring that the GCORR option outputs the former) that I can't reconcile with the one I create using PROC CORR (with the COV option). For example, an element of the GLIMMIX-output covariance matrix is equal to 2.428 (s.e. 0.971), but the largest element in the PROC CORR output is 1.58 (and that procedure doesn't seem to give precision estimates, as far as I can tell).

Perhaps this approach is also sensitive to small sample size and imbalance due to missingness (21.8% of the 225 possible observations are missing). There is another issue with this data: it empirically displays what I would call "inverse autocorrelation", in that measurements further apart have higher correlations than measurements closer together, due to a V-shaped trajectory in individual measurements over time. Possibly this unorthodox structure, combined with the aforementioned issues, simply makes this type of estimation unreliable?
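For what it's worth, the manual comparison I'm making transposes the same data (names as in the code above; the exact PROC TRANSPOSE options here are a sketch):

```
/* Wide version of the same data for the "observed" matrix */
proc sort data=st_analysis out=alpha;
   by dukeid time;
   where band = "alpha";
run;

proc transpose data=alpha out=wide prefix=t;
   by dukeid;
   id time;
   var log_RP;
run;

/* Observed covariance and correlation across time points */
proc corr data=wide cov;
   var t:;
run;
```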


09-12-2016 02:52 PM

For clarification, what correlation is desired by the OP if the mixed model is formulated as

Y = X*beta + Z*gamma + epsilon ?


09-12-2016 02:59 PM

I admit that my OP was motivated more by the case of a GEE than a mixed model; I was thinking in terms of the working correlation matrix in a GEE, which has a different specification. However, the same logic applies to mixed models, in which case the interest would (I believe) be in the R-side covariance matrix associated with epsilon; that is, the correlations between each individual's measurements over time.
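If it helps, one way to display those R-side correlations in GLIMMIX is the VCORR option on the RANDOM statement, which prints the marginal correlation matrix implied by the fitted model for one subject; this is a sketch with hypothetical names (long, id, time, y):

```
/* R-side (residual) unstructured covariance across time points;
   VCORR prints the implied correlation matrix for one subject */
proc glimmix data=long;
   class id time;
   model y = time;
   random time / subject=id type=un residual vcorr;
run;
```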


09-08-2016 07:16 AM

Sorry, I can't follow all of your points. Could you use a nested effect to get it? model y = x z(group=1);


09-12-2016 01:27 PM

Hi Xia,

I'm not sure I understand the relevance of nested effects to my question. Would you mind elaborating?

Thanks,

Ryan Simmons