I'm looking for some help specifying a model using PROC MIXED.
My dataset consists of individuals (variable = 'id') from 13 populations (variable = "pop"). I am attempting to model varation in 3 different response variables: y1, y2, and y3. I was able to rearrange my dataset into the 'stacked' format, with a single line for each response trait, coded by a new categorical variable called "var".
An example of my dataset:
data tall; input pop$ id$ var$ y; cards; AF 1 y1 1.11186 AF 1 y2 1.10318 AF 1 y3 1.03751 AF 2 y1 1.05796 AF 2 y2 1.07508 AF 2 y3 1.0011 AF 3 y1 1.16453 AF 3 y2 1.02108 AF 3 y3 0.93194 AF 4 y1 1.09352 AF 4 y2 0.83194 AF 4 y3 0.99382 AF 5 y1 1.12513 AF 5 y2 1.14387 AF 5 y3 1.01929 ...
The goal is to estimate the variance-covariance matrix for the population term ('pop'), so I've included it in the RANDOM statement. But I'm having trouble understanding how to utilize the RANDOM and REPEATED commands in this context. I have no fixed terms/covariates in the model. This is what I've been trying so far...
proc mixed data=tall method=reml covtest asycov;
class var pop;
model y= /noint notest;
random pop /type=UN;
repeated var /subject=id type=VC;
Thanks in advance for any suggestions.
moved to analytics community...
You might want to move this topic to the analytics community.
Try the following to see if the results make sense:
proc mixed data=tall method=reml covtest asycov;
class var pop;
model y=var /noint;
random pop;;
repeated var /subject=id type=unr;
Here I would consider the variable to be repeated measures on the subject, with an unstructured latent correlation matrix. I would continue to consider population as a random effect, but not try to model a structure between populations. I would think that the populations represent a random sample from possible populations, and would lead to a single variance component.
Steve Denham
Thanks for the suggestion Steve. However, when I ran the command you suggested it returned the following error:
NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive definite estimated R matrix for Subject 1.
And I think I understand what you're saying about between-population variance. But the main goal of the analysis is to estimate the variance-covariance matrix of the 3 response variables across the populations.
i.e.,
y1 y2 y3 y1 variance(y1) covariance(y1,y2) covariance(y1,y3) y2 covariance(y2,y1) variance(y2) covariance(y2,y3) y3 covariance(y3,y1) covariance(y3,y2) variance(y3)
Any suggestions on how to generate this would be greatly appreciated.
Well, once we get the infinite likelihood problem solved, the variance-covariance estimates should be included in the output.
Is there any chance that the Id variable is repeated across populations? This is a common cause of non-unique records. If so, replace the subject=id in the REPREATED statement with subject=id*pop.
Steve Denham
@SteveDenham, following your lead (which looks like a good direction to me), what do you think about something like this:
proc mixed data=tall method=reml covtest asycov; class var pop id; model y=var ; random var / subject=pop type=unr; repeated var / subject=id(pop) type=unr; run;
It's a bit manic.
Does the covariance structure of the entire model make sense? I'm not sure.
I added id to the class statement which I think will help with the estimation problem.
I dropped noint from the model statement but doubt that matters.
Would there be any hope of convergence with only 13 levels of pop (edit: and possibly few levels of id within pop) and so many covariance parameter estimates?
@sld, I don't much like that RANDOM statement. You are right about too many parameters (78) to be estimated. Simpler, I think, to view the populations as a random sample from possible populations, and estimate a single variance component (At least to start.)
I think the next logical step might be to look at this with each variable (of y1, y2 and y3) having its own estimate for variability, so how about:
random intercept / subject=pop group=var;
This would help guarantee that the diagonal in @kpboh's requested matrix had different values.
Steve Denham
Steve,
Thanks for the follow up. I appreciate your help working through this.
And the ids are indeed unique, both within AND across populations.
On a related note--I presume 'id' should be input as a categorical variable, no? I ask because when I submitted the code you suggested above (with "subject=id") it returned:
ERROR: Variable id should be either numeric or in CLASSES list.
When 'id' is added to the class list, it returns the infinite likelihood error...
Shoot, I hoped it was as easy as duplicate records.
In the data sample you provided, is Subject 1 the complete set:
AF 1 y1 1.11186
AF 1 y2 1.10318
AF 1 y3 1.03751
or is there more someplace else? What I am starting to be afraid of is some sort of linear dependency.
If this is typical for all subjects (3 observations, each indexed by a y variable), it really isn't a lot different from a split-plot in time as far as analysis goes. You might consider recoding y1, y2, y3 to integer values, and sorting with a nodupkey option to see if the output dataset has the same number of observations.
See the SGF paper by Kiernan et al. http://support.sas.com/resources/papers/proceedings12/332-2012.pdf. There is a specific section on page 10 that deals with this NOTE. It only appears when there is some sort of duplicate record, or misspecification of the subject so that it leads to duplicate records.
Steve Denham
Hi Steve @SteveDenham and @kpboh,
Thoughts:
(1) Regarding the number of parameters: 78 seems too high. I was thinking it would be 6 for the first random, 6 for the second random, and another 3 for the var means. But 15 is still a lot, and maybe I'm not counting correctly. I thought about taking this problem for a spin with simulated data, but I haven't had the time.
(2) kpboh, as Steve suggests, you can check the data structure. I use the TABULATE procedure a lot for this task:
proc tabulate data=tall; class pop id var; table pop*id, var; run;
I think you would hope to see "1" in each cell.
(3) kpboh, in your original message, your example dataset was "pops", but the dataset used in GLIMMIX was "tall". Are these two datasets equivalent?
(4) kpboh, it's not clear whether you ran a model with both
class var pop id;
and
repeated var /subject=id(pop) type=unr;
and still had lack of convergence. You might have, it's just not clear.
(5) If my model suggestion is not feasible and if you want covariances at the pop level, could you average the y1, y2, and y3 values over the ids within each pop? Do you have the same number of ids in each pop? (Does it even make sense to average over ids, in context or statistically?) How else can we get at covariances at the pop level....? (There's the rub.)
proc mixed data=tall_MEANS method=reml covtest asycov; class var pop; model y=var ; repeated var / subject=pop type=unr; run;
Susan
Dear @sld,
Thanks very much for the suggestions. Following are my responses:
2) The TABULATE procedure is new to me and very handy indeed. Thanks! And yes--it shows a "1" in all cells of the dataset.
(3) Sorry for the confusion. The original dataset was "pops" but I had converted it to the "tall" format (i.e., 3 entries per individual). Looks like I made an error in the copy/paste. I corrected the dataset name in the original post.
(4) Apologies for not reporting back, but yes--I ran the model you had suggested:
proc mixed data=tall method=reml covtest asycov;
class var pop id;
model y=var;
random var/subject=pop type=unr;
repeated var /subject=id(pop) type=unr;
run;
quit;
Unfortunately, this returned the "infinite likelihood" error mentioned above.
(5) And yes, it had occurred to me that I could simply generate populations means for the three response variable, then simply estimate the variance-covariance matrix from those means (and indeed, this is common practice for this particular type of analysis). Others have shown similar results doing this versus the random effect variance component approach... My hope was to have both to be able to compare.
And to flesh out the dataset a bit more:
1. there are no missing data
2. the dataset is unbalanced--the number of 'ids' per population ranges from n=10 to n=85 (mean is 39.3 individuals per populations)
Finally, I should mention that I had been working off of this document which discusses multivariate analysis using PROC MIXED. Unfortunately, none of the examples therein included both a RANDOM and a REPEATED statement.
Thanks again,
kpboh
(2) The documentation for MIXED | Details | Computational Issues | Convergence Problems includes this statement:
"If PROC MIXED stops because of an infinite likelihood, recheck your model to make sure that no observations from the same subject are producing identical rows in or and that you have enough data to estimate the particular covariance structure you have selected."
Your TABULATE results show that the problem is not due to coding (observations from the same subject producing identical rows in or ), which means that it is probably due to not enough data. Adjusting parameters that control the estimation process (covered in the documentation and the Kiernan et al. paper that @SteveDenham recommended) might help, or not.
So....what next? Other than give up. I'd follow Steve's lead and try simpler covariance structures, building up to more complex, and see how close you can get to what you want.
A split-plot-in-time-like model that specifies 3 sources of variance (in addition to residual variance) with default covariance structures
/*RandomStmt1*/ random intercept / subject=pop;
/*RandomStmt2*/ random intercept / subject=id(pop);
/*RandomStmt3*/ random var / subject=pop;
Try 1: RandomStmt1 and RandomStmt2 (i.e., drop RandomStmt3)
Try 2: RandomStmt1 and RandomStmt2 and RandomStmt3
Try 3: Steve's suggestion: RandomStmt1 with group=var
Try 4: RandomStmt1 with type=un or unr or fa0(3) (see http://comp.soft-sys.sas.narkive.com/RqRMBFgu/difference-between-un-vs-unr-type-option-in-random-sta...)
Try 5: RandomStmt2 with default covariance structure, and RandomStmt3 with type=un or unr or fa0(3)
Try other combinations that you think are appropriate, depending on whether simpler Trys converge.
Clearly, I haven't attempted these, and so if one looks wrong, it might be wrong.
None of these Trys are doing anything fancy with residual variance. Should they? You don't appear to be interested in covariance structure for residual variance--your focus is on covariance structure for pop variance--and maybe you can ignore residual variance to some extent, although I'm not sure that's statistically valid.
On the number of parameters topic, I disagree. I'm happier when I have a data set to test, so I dummied one up (very badly; if I had time, I'd pull down my copy of Rick Wicklin's Simulating Data with SAS and figure out how to simulate data with specified covariance parameters, to check against estimates). [Edit: For the model I posed earlier, the number of random effects parameters is 12; see the attached code example.] I'm thinking my suggested model (or a variant) would work--if it would converge with kpboh's data, which is clearly an unresolved problem. See what you think about the code in the attached file.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.
Find more tutorials on the SAS Users YouTube channel.