Solved: Re: Proc mixed for unbalanced data set

gabmv · Posted 06-11-2014 03:57 PM

Hello all,

First of all, I am pretty new to mixed models, and I need help understanding what I am doing while analyzing an experiment. Thanks in advance.

My data set is the following:

130 groups (subspecies), in which 4 random samples within each group are analyzed in each of 2 consecutive years (the 4 samples in the following year are also chosen at random, not the same ones as the previous year), for 8 samples per group in total. These samples are analyzed to get the concentration of 2 components, A and B. The data set is unbalanced: data on the concentration of A and/or B may be missing for some samples.

Thus, random var=(sample within the group) ; fixed var=(group), (component), (year) ; y=(concentration)

Then, I want to analyze which groups have a greater concentration of A and B and whether the year has a significant effect. I also want to see if there is a significant correlation between the concentrations of A and B in general (not for every single group).

So I don't really know where to start other than building my data set in long form and trying to run a PROC MIXED.

I am relatively new to SAS and mixed models.

Thank you very much guys,

Gabriel,

Undergrad Student in ChemE

SteveDenham · Posted 06-12-2014 11:57 AM

Hi Gabriel,

It looks like a good start. I would recommend getting a copy of SAS for Mixed Models, 2nd ed. by Littell et al., and looking through the examples there. Unbalanced data of the kind you are talking about should not be a problem. Two questions: are the components A and B correlated in some manner, or are they independent? And what kind of distribution do the concentrations follow? For many biological analytes, the data are lognormally distributed rather than normally distributed.

I have a tentative model in mind, but answers to those questions would help a lot.
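In case it helps with the second question, here is a rough sketch of one way to look at the distribution before settling on a model. It assumes a long-form data set named yourdata with variables component and concentration (placeholder names, not taken from your data), and the log transform is only an option to consider if the raw values look right-skewed.

```sas
/* Histograms and normal Q-Q plots of the raw concentrations, one panel per component */
proc univariate data=yourdata;
  class component;
  var concentration;
  histogram concentration / normal;
  qqplot concentration / normal(mu=est sigma=est);
run;

/* Optional: log-transformed response for comparison (defined only for positive values) */
data yourdata_log;
  set yourdata;
  if concentration > 0 then log_conc = log(concentration);
run;
```

Strictly speaking, it is the residuals from the fitted model that matter, so treat these plots as a first look only.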

Steve Denham

gabmv · Posted 06-12-2014 12:24 PM

Hello Steve,

Thank you very much for your answer,

The concentrations of the 2 components are not correlated; they are independent. The concentration data are normally distributed.

Gabriel,

SteveDenham · Posted 06-12-2014 02:38 PM

This will be simpler than I feared.

Try:

proc mixed data=yourdata;

by component;

class year group;

model concentration=group year group*year;

lsmeans group year group*year/diff adjdfe=row adjust=simulate(seed=1);

run;

In this case the residual error is due to the samples within a group-year combination, and does not need to be specified in a random statement.
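One practical point about the BY statement: PROC MIXED expects the input to be sorted by the BY variable. A minimal sketch, using the same placeholder names as the code above (yourdata, component):

```sas
/* Sort by the BY variable first so PROC MIXED can process one component at a time */
proc sort data=yourdata;
  by component;
run;
```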

Steve Denham

gabmv · Posted 06-12-2014 03:36 PM

Thank you Steve,

Your program definitely is helpful.

I am still a little bit confused about why the sample within the group should not be specified in a random statement. I guess my initial statement was not exactly clear. For each of the 130 groups, I have between 20 and 50 subjects, and I randomly chose only 4 per year. I guess calling the chosen subjects "samples" was confusing.

Thank you for your very helpful input!

Gabriel

SteveDenham · Posted 06-12-2014 03:43 PM

I'll try to answer the question by saying that, as I see it, you have 4 measurements of component A for each group-by-year cell. There is no other design factor involved. If I am missing something, then I will incorporate it as I get filled in.

What does concern me a little is the use of only 4, rather than the entire dataset at each time point. PROC MIXED can easily handle more data.

Steve Denham

gabmv · Posted 06-12-2014 03:54 PM

To clarify my statement:

I have 130 different accessions (each can be seen as a subspecies or something similar) that I want to evaluate to find which ones maximize the concentrations of components A and B. To do so, I randomly selected 4 subjects per year per accession and got the concentrations of components A and B for every sample.

There are only 2 time points: 2 different years. It was too time-consuming to have more than 8 samples in total per accession.

That gives 2080 concentration values: 1040 for A and 1040 for B, i.e. 130*4 = 520 per component per year.

My data set looks like this (2080 x 5):

year / accession / subject / component / concentration

Hopefully that clarifies things a bit. I realize that my initial post was really confusing.

SteveDenham · Posted 06-13-2014 08:41 AM

So accession is a factor. Now comes the question--is it random (the 130 represent some sort of sample from an entire universe of accessions, and you wish to make inferences about that universe) or fixed (you wish to make inferences about those 130 specific accessions)? For the first case, try:

proc mixed data=yourdata;

by component;

class year accession;

model concentration=year;

random accession year*accession;

lsmeans year/diff adjdfe=row adjust=simulate(seed=1);

run;

For the second case:

proc mixed data=yourdata;

by component;

class year accession;

model concentration=year accession year*accession;

lsmeans year accession year*accession/diff adjdfe=row adjust=simulate(seed=1);

run;

Steve Denham

Message was edited by: Steve Denham

Message was edited AGAIN by: Steve Denham

gabmv · Posted 06-13-2014 08:52 AM

Thank you SO much,

The 130 accessions represent the entire collection I want to evaluate!

Gabriel

gabmv · Posted 06-13-2014 10:01 AM

Just for clarification,

In the second code, what does "class" refer to in the lsmeans statement?

Also, I still don't see why "subject" would not go in a random statement (the 8 subjects evaluated per accession).

You are very helpful, thank you

Gabriel

SteveDenham · Posted 06-13-2014 10:07 AM

Cut and paste error explains the 'class' in the lsmeans statement. I've gone back and edited the post.

You can put subject in as a random statement, but the results will be exactly the same as not including it. The reason is that it only indexes the lowest level of observation--it is just another name for the residual error in this design, and you don't have to specify it.

Steve Denham

gabmv · Posted 06-13-2014 10:23 AM

Okay, I see, neglecting the random statement makes a lot of sense.

So, I end up having:

data FeZn;

infile '/folders/myfolders/FeZn.csv' dlm=',' firstobs=2;

input Sample Accession Year comp $ conc;

run;

proc mixed data=WORK.FEZN ;

by comp;

class year accession;

model conc=class year accession class*year class*accession year*accession class*year*accession;

lsmeans year accession class*year class*accession year*accession class*year*accession/diff adjdfe=row adjust=simulate(seed=1);

run;

I do not understand why class is present in either the model or the lsmeans statement.

So I did:

proc mixed data=WORK.FEZN ;

by comp;

class year accession;

model conc=year accession year*accession;

lsmeans year accession year*accession/diff adjdfe=row adjust=simulate(seed=1);

run;

What is the major difference here?

SteveDenham · Posted 06-13-2014 10:48 AM

I think the major difference is that I invented a variable that doesn't exist anywhere in the design, and I apologize. This last code that you present is now in agreement with the final edited code I have upstream.

Now suppose the components ARE correlated in some way. You could model that situation as follows:

proc mixed data=WORK.FEZN ;

class year accession comp sample;

model conc=year|accession|comp;

repeated comp/subject=sample type=unr;

lsmeans year|accession|comp/diff adjdfe=row adjust=simulate(seed=1);

run;

The type=unr output will give the correlation between the two components, which I assume are iron and zinc concentrations. Under this model, you do not assume a priori that the two are uncorrelated, which may or may not be the case given hydrological stressors in agronomic or ecological experiments, for example. This does require, however, that both concentrations are obtained from the same subject.
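A quick way to check that last requirement is to count how many components were actually measured on each sample. This is only a sketch: it uses the variable names from the DATA step earlier in the thread (Sample, Accession, Year, comp, conc), and pair_check and n_comp are names made up for illustration.

```sas
/* Count the components with a non-missing concentration for each sample */
proc sql;
  create table work.pair_check as
  select Year, Accession, Sample,
         count(distinct comp) as n_comp
  from work.FeZn
  where conc is not missing
  group by Year, Accession, Sample;
quit;

/* n_comp = 2: both components measured; n_comp = 1: only one of them */
proc freq data=work.pair_check;
  tables n_comp;
run;
```

Samples with n_comp = 1 still contribute to the fit, but they carry no information about the correlation. If the Sample numbers are reused from one accession or year to the next, the subject= effect would also need to identify each sample uniquely, for example by nesting it within accession and year.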

Steve Denham

gabmv · Posted 06-13-2014 10:52 AM

Thank you Steve for all your help, everything works flawlessly now!