BookmarkSubscribeRSS Feed
kpboh
Calcite | Level 5

I'm looking for some help specifying a model using PROC MIXED.

 

My dataset consists of individuals (variable = 'id') from 13 populations (variable = "pop"). I am attempting to model varation in 3 different response variables: y1, y2, and y3. I was able to rearrange my dataset into the 'stacked' format, with a single line for each response trait, coded by a new categorical variable called "var".

 

An example of my dataset:

 

data tall;
input pop$ id$ var$ y;
cards;
AF	1	y1	1.11186
AF	1	y2	1.10318
AF	1	y3	1.03751
AF	2	y1	1.05796
AF	2	y2	1.07508
AF	2	y3	1.0011
AF	3	y1	1.16453
AF	3	y2	1.02108
AF	3	y3	0.93194
AF	4	y1	1.09352
AF	4	y2	0.83194
AF	4	y3	0.99382
AF	5	y1	1.12513
AF	5	y2	1.14387
AF	5	y3	1.01929
...

The goal is to estimate the variance-covariance matrix for the population term ('pop'), so I've included it in the RANDOM statement. But I'm having trouble understanding how to utilize the RANDOM and REPEATED commands in this context. I have no fixed terms/covariates in the model. This is what I've been trying so far...

 

proc mixed data=tall method=reml covtest asycov;
class var pop;
model y= /noint notest;
random pop /type=UN; 
repeated var /subject=id type=VC;

Thanks in advance for any suggestions.

 

 

15 REPLIES 15
kpboh
Calcite | Level 5

moved to analytics community...

mkeintz
PROC Star

You might want to move this topic to the analytics community.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
SteveDenham
Jade | Level 19

Try the following to see if the results make sense:

 

proc mixed data=tall method=reml covtest asycov;
class var pop;
model y=var /noint;
random pop;; 
repeated var /subject=id type=unr;

Here I would consider the variable to be repeated measures on the subject, with an unstructured latent correlation matrix.  I would continue to consider population as a random effect, but not try to model a structure between populations.  I would think that the populations represent a random sample from possible populations, and would lead to a single variance component.

 

Steve Denham

 

 

 

kpboh
Calcite | Level 5

Thanks for the suggestion Steve. However, when I ran the command you suggested it returned the following error:

 

NOTE: An infinite likelihood is assumed in iteration 0 because of a nonpositive definite estimated R matrix for Subject 1.

And I think I understand what you're saying about between-population variance. But the main goal of the analysis is to estimate the variance-covariance matrix of the 3 response variables across the populations.

 

i.e.,

 


y1 y2 y3 y1 variance(y1) covariance(y1,y2) covariance(y1,y3) y2 covariance(y2,y1) variance(y2) covariance(y2,y3) y3 covariance(y3,y1) covariance(y3,y2) variance(y3)

 Any suggestions on how to generate this would be greatly appreciated. 

SteveDenham
Jade | Level 19

Well, once we get the infinite likelihood problem solved, the variance-covariance estimates should be included in the output.

 

Is there any chance that the Id variable is repeated across populations?  This is a common cause of non-unique records.  If so, replace the subject=id in the REPREATED statement with subject=id*pop.  

 

Steve Denham

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

@SteveDenham, following your lead (which looks like a good direction to me), what do you think about something like this:

proc mixed data=tall method=reml covtest asycov;
  class var pop id;
  model y=var ;
  random var / subject=pop type=unr;
  repeated var / subject=id(pop) type=unr;
  run;

It's a bit manic. 

Does the covariance structure of the entire model make sense? I'm not sure.

 

I added id to the class statement which I think will help with the estimation problem.

I dropped noint from the model statement but doubt that matters.

Would there be any hope of convergence with only 13 levels of pop (edit: and possibly few levels of id within pop) and so many covariance parameter estimates?

 

SteveDenham
Jade | Level 19

@sld, I don't much like that RANDOM statement.  You are right about too many parameters (78) to be estimated.  Simpler, I think, to view the populations as a random sample from possible populations, and estimate a single variance component (At least to start.)

 

I think the next logical step might be to look at this with each variable (of y1, y2 and y3) having its own estimate for variability, so how about:

 

random intercept / subject=pop group=var;

 

This would help guarantee that the diagonal in @kpboh's requested matrix had different values.

 

Steve Denham

kpboh
Calcite | Level 5

Steve,

Thanks for the follow up. I appreciate your help working through this.

 

And the ids are indeed unique, both within AND across populations.

On a related note--I presume 'id' should be input as a categorical variable, no? I ask because when I submitted the code you suggested above (with "subject=id") it returned:

 

ERROR: Variable id should be either numeric or in CLASSES list.

When 'id' is added to the class list, it returns the infinite likelihood error...

SteveDenham
Jade | Level 19

Shoot, I hoped it was as easy as duplicate records.

 

In the data sample you provided, is Subject 1 the complete set:

 

AF	1	y1	1.11186
AF	1	y2	1.10318
AF	1	y3	1.03751

or is there more someplace else?  What I am starting to be afraid of is some sort of linear dependency.

 

If this is typical for all subjects (3 observations, each indexed by a y variable), it really isn't a lot different from a split-plot in time as far as analysis goes.  You might consider recoding y1, y2, y3 to integer values, and sorting with a nodupkey option to see if the output dataset has the same number of observations.  

 

See the SGF paper by Kiernan et al.  http://support.sas.com/resources/papers/proceedings12/332-2012.pdf.  There is a specific section on page 10 that deals with this NOTE. It only appears when there is some sort of duplicate record, or misspecification of the subject so that it leads to duplicate records.

 

Steve Denham

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

Hi Steve @SteveDenham and @kpboh,

 

Thoughts:

 

(1) Regarding the number of parameters: 78 seems too high. I was thinking it would be 6 for the first random, 6 for the second random, and another 3 for the var means. But 15 is still a lot, and maybe I'm not counting correctly. I thought about taking this problem for a spin with simulated data, but I haven't had the time.

 

(2) kpboh, as Steve suggests, you can check the data structure. I use the TABULATE procedure a lot for this task:

 

proc tabulate data=tall;
  class pop id var;
  table pop*id, var;
  run;

I think you would hope to see "1" in each cell. 

 

 

(3) kpboh, in your original message, your example dataset was "pops", but the dataset used in GLIMMIX was "tall". Are these two datasets equivalent?

 

(4) kpboh, it's not clear whether you ran a model with both

class var pop id;

and

repeated var /subject=id(pop) type=unr;

and still had lack of convergence. You might have, it's just not clear.

 

(5) If my model suggestion is not feasible and if you want covariances at the pop level, could you average the y1, y2, and y3 values over the ids within each pop? Do you have the same number of ids in each pop? (Does it even make sense to average over ids, in context or statistically?) How else can we get at covariances at the pop level....? (There's the rub.)

proc mixed data=tall_MEANS method=reml covtest asycov;
  class var pop;
  model y=var ;
  repeated var / subject=pop type=unr;
  run;

 

Susan

 

kpboh
Calcite | Level 5

Dear @sld,

 

Thanks very much for the suggestions. Following are my responses:

 

2) The TABULATE procedure is new to me and very handy indeed. Thanks! And yes--it shows a "1" in all cells of the dataset.

 

(3) Sorry for the confusion. The original dataset was "pops" but I had converted it to the "tall" format (i.e., 3 entries per individual). Looks like I made an error in the copy/paste. I corrected the dataset name in the original post.

 

(4) Apologies for not reporting back, but yes--I ran the model you had suggested:

 

proc mixed data=tall method=reml covtest asycov;
class var pop id;
model y=var;
random var/subject=pop type=unr;
repeated var /subject=id(pop) type=unr;
run;
quit;

Unfortunately, this returned the "infinite likelihood" error mentioned above.

 

(5) And yes, it had occurred to me that I could simply generate populations means for the three response variable, then simply estimate the variance-covariance matrix from those means (and indeed, this is common practice for this particular type of analysis). Others have shown similar results doing this versus the random effect variance component approach... My hope was to have both to be able to compare. 

 

And to flesh out the dataset a bit more:

1. there are no missing data

2. the dataset is unbalanced--the number of 'ids' per population ranges from n=10 to n=85 (mean is 39.3 individuals per populations)

 

Finally, I should mention that I had been working off of this document which discusses multivariate analysis using PROC MIXED. Unfortunately, none of the examples therein included both a RANDOM and a REPEATED statement.

 

Thanks again,

kpboh


 

 

 



 

 

sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

(2) The documentation for MIXED | Details | Computational Issues | Convergence Problems includes this statement:

 

"If PROC MIXED stops because of an infinite likelihood, recheck your model to make sure that no observations from the same subject are producing identical rows in $\mb{R}$ or $\mb{V}$ and that you have enough data to estimate the particular covariance structure you have selected."

 

Your TABULATE results show that the problem is not due to coding (observations from the same subject producing identical rows in $\mb{R}$ or $\mb{V}$), which means that it is probably due to not enough data. Adjusting parameters that control the estimation process (covered in the documentation and the Kiernan et al. paper that @SteveDenham recommended) might help, or not. 

 

So....what next? Other than give up. I'd follow Steve's lead and try simpler covariance structures, building up to more complex, and see how close you can get to what you want.

 

A split-plot-in-time-like model that specifies 3 sources of variance (in addition to residual variance) with default covariance structures

/*RandomStmt1*/ random intercept / subject=pop;
/*RandomStmt2*/ random intercept / subject=id(pop);
/*RandomStmt3*/ random var / subject=pop;

Try 1: RandomStmt1 and RandomStmt2 (i.e., drop RandomStmt3)

Try 2: RandomStmt1 and RandomStmt2 and RandomStmt3

Try 3: Steve's suggestion: RandomStmt1 with group=var

Try 4: RandomStmt1 with type=un or unr or fa0(3) (see http://comp.soft-sys.sas.narkive.com/RqRMBFgu/difference-between-un-vs-unr-type-option-in-random-sta...)

Try 5: RandomStmt2 with default covariance structure, and RandomStmt3 with type=un or unr or fa0(3)

Try other combinations that you think are appropriate, depending on whether simpler Trys converge.

 

Clearly, I haven't attempted these, and so if one looks wrong, it might be wrong.

 

None of these Trys are doing anything fancy with residual variance. Should they? You don't appear to be interested in covariance structure for residual variance--your focus is on covariance structure for pop variance--and maybe you can ignore residual variance to some extent, although I'm not sure that's statistically valid.

 

 

SteveDenham
Jade | Level 19
The 78 comes from 13*12/2 for an unstructured matrix.
sld
Rhodochrosite | Level 12 sld
Rhodochrosite | Level 12

On the number of parameters topic, I disagree. I'm happier when I have a data set to test, so I dummied one up (very badly; if I had time, I'd pull down my copy of Rick Wicklin's Simulating Data with SAS and figure out how to simulate data with specified covariance parameters, to check against estimates). [Edit: For the model I posed earlier, the number of random effects parameters is 12; see the attached code example.] I'm thinking my suggested model (or a variant) would work--if it would converge with kpboh's data, which is clearly an unresolved problem. See what you think about the code in the attached file.

 

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 15 replies
  • 2499 views
  • 0 likes
  • 4 in conversation