jsac
Calcite | Level 5
I am trying to run a simple linear regression, but I have two observations for each individual in my sample (one collected on each of two non-consecutive days). I recognize that one option is to take the mean of the two observations and fit the model with PROC REG. However, I was hoping to pool my data to increase my sample size and then correct for the fact that two observations came from each individual. I understand that PROC MIXED is an option here, but I am unclear on how to approach this. What I have so far is:

proc mixed data=new;
class id;
model serumHg = fishintake/solution;
repeated /subject=id;
run;

Any help would be very much appreciated
Dale
Pyrite | Level 9
Is the number of days between the two observations the same for every subject? If so, you could use code that is only slightly modified from what you show:

proc mixed data=new;
class id;
model serumHg = fishintake/solution;
repeated /subject=id type=cs;
run;

An alternative specification for the MIXED procedure, which gives the same fit as long as the estimated within-subject covariance is not negative, is

proc mixed data=new;
class id;
model serumHg = fishintake/solution;
random intercept /subject=id;
run;


Both of the above models assume that the residual variance is the same for each of the two measures. If you believe that is not a tenable assumption then you could use the code:

proc mixed data=new;
class id;
model serumHg = fishintake/solution;
repeated /subject=id type=un;
run;
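
One practical point on the unstructured fit: when no repeated effect is named, PROC MIXED assigns rows to positions in the covariance matrix by their order within each subject, so the first and second measurements need to appear in the same order for everyone. The following is a minimal sketch that makes the ordering explicit. It assumes the data set has a numeric measurement-date variable, here called time; the data set name sorted and the variable visit are hypothetical names introduced only for illustration.

```sas
/* Sketch only: time, sorted and visit are assumed/hypothetical names. */
proc sort data=new out=sorted;
  by id time;
run;

/* Number the measurements 1, 2 within each subject. */
data sorted;
  set sorted;
  by id;
  if first.id then visit = 0;
  visit + 1;   /* sum statement: visit is retained across rows */
run;

proc mixed data=sorted;
  class id visit;
  model serumHg = fishintake / solution;
  /* Naming visit as the repeated effect ties each row to its position in UN. */
  repeated visit / subject=id type=un;
run;
```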


As mentioned previously, the above models are appropriate if the number of days between observations is consistent from one subject to the next. If that is not the case, then you might need to employ a spatial covariance structure. (Note that time is the fourth dimension, so spatial structures are appropriate for modeling observations which are more or less distant in time.)

Let me make one more comment. You really do not gain in degrees of freedom when using the individual observations as compared with using the subject means. Using the individual observations can be important if there is some complexity to the residual variance structure like when there is a different amount of time between observations. Using the individual observations could also be important if you have period-specific predictors to incorporate into your model. Using the mixed model would also be indicated if you are really interested in understanding components of variance.

From the limited description which you have provided, it is my guess that the model in which you average the two responses per subject and regress those on the (single) predictor variable would be just as good for your needs as the mixed model. But that assumption is based on a guess about how your experiment is conducted based on limited information.
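
For reference, the subject-means approach described above could be sketched as follows. This is only an illustration under assumptions: the data set name subject_means is hypothetical, and fishintake is averaged along with serumHg, which leaves it unchanged if it is constant within a subject.

```sas
/* Sketch only: subject_means is a hypothetical data set name. */
/* One row per subject holding the mean of each variable.      */
proc means data=new noprint nway;
  class id;
  var serumHg fishintake;
  output out=subject_means mean=;   /* mean= keeps the original variable names */
run;

/* Ordinary regression on the per-subject means. */
proc reg data=subject_means;
  model serumHg = fishintake;
run;
quit;
```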
jsac
Calcite | Level 5
Thank you so much for your insight, it was really helpful. I think I will now seriously consider taking the mean of my 2 observations - but just to clarify, my two days of data were collected 3-10 days apart, therefore not consistent from one subject to the next, so in this case you recommend a spatial covariance structure?
Dale
Pyrite | Level 9
Whether 3 days or 10 days produce a difference in the covariance structure of the subject-specific values probably depends on a lot of considerations that I don't have knowledge of. From your model, I see that your predictor variable is fishintake. You appear to be modeling serum mercury in fish based on the amount of food that they have consumed - or the serum mercury of an animal which feeds on fish such as river otters.

How much mercury is taken up and expressed in serum probably depends on fish (or river otter) age. If you are studying juveniles, then a difference of 3 days compared to a difference of 10 days could make a substantial difference. But this is just speculation on my part. You should investigate alternative models starting with the compound symmetry model specified previously (alternatively, the random effects model). For a spatial model, you could use code as follows:

proc mixed data=new;
class id;
model serumHg = fishintake/solution;
repeated /subject=id type=sp(pow)(time);
run;

where time is measurement date. The compound symmetry and spatial covariance models are not nested, so you cannot formally test which is better using a likelihood ratio test. However, I would note that the covariance structure of the compound symmetry model can be expressed as

Cov(R1, R2) = | V        V*rho |
              | V*rho    V     |

while the spatial covariance structure can be expressed as:

Cov(R1, R2) = | V                 V*(rho**d{12}) |
              | V*(rho**d{12})    V              |


where d{12} is the number of days between the first and second measurements. The two models are identical except that the spatial model raises rho to the power of the distance between measurements, and that distance is a known quantity, not a parameter to estimate. Both models therefore estimate the same two covariance parameters (V and rho), so whichever has the smaller value of -2LL would be the preferred model.
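
One way to line up the -2LL values for the two structures is to capture the fit statistics table from each run with ODS OUTPUT and print the results together. This is a sketch, not a definitive recipe: the data set names fit_cs, fit_pow and fit_compare and the variable structure are hypothetical, and time is assumed to be a numeric measurement date as described above.

```sas
/* Sketch only: fit_cs, fit_pow, fit_compare and structure are hypothetical names. */
ods output FitStatistics=fit_cs;
proc mixed data=new;
  class id;
  model serumHg = fishintake / solution;
  repeated / subject=id type=cs;
run;

ods output FitStatistics=fit_pow;
proc mixed data=new;
  class id;
  model serumHg = fishintake / solution;
  repeated / subject=id type=sp(pow)(time);
run;

/* Stack the two fit tables and label which structure each row came from. */
data fit_compare;
  length structure $8;
  set fit_cs(in=from_cs) fit_pow;
  if from_cs then structure = 'CS';
  else structure = 'SP(POW)';
run;

proc print data=fit_compare noobs;
  var structure Descr Value;
run;
```

Both runs use the same fixed effects and the default REML estimation, so the -2 Res Log Likelihood rows are directly comparable.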

There are other spatial covariance structures which you could employ as an alternative to the spatial power model specified above. See the REPEATED statement syntax for the MIXED procedure for other spatial covariance structures. Again, for the spatial covariance structures which you might employ (sp(exp), sp(gau), sp(lin), sp(linl), sp(sph)), there will not be a likelihood ratio test that allows selection of the best model. Model selection may be based on established literature on the subject or on which model produces the smallest value for -2LL.
jsac
Calcite | Level 5
Thanks very much for your help,

I think I know where I can go from here!
JuanVte
Calcite | Level 5
If you only have two observations per subject and these are more or less equidistant, I would simplify the problem by either adjusting for the baseline value or analysing the difference from baseline:

proc glm data=new;
model serumHg_second_measurement = serumHG_baseline fishintake /solution;
run;

or

proc glm data=new;
model serumHG_difference = fishintake /solution;
run;

where serumHG_difference = final - baseline
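
Both GLM calls above need one row per subject, whereas the original data set has two. The following is a minimal sketch of that reshaping under assumptions: time is taken to be a numeric measurement date, the data set names sorted and wide are hypothetical, and the fishintake value kept is the one recorded with the second measurement.

```sas
/* Sketch only: time, sorted and wide are assumed/hypothetical names. */
proc sort data=new out=sorted;
  by id time;
run;

data wide;
  set sorted;
  by id;
  retain serumHg_baseline;
  /* The first row per subject supplies the baseline value. */
  if first.id then serumHg_baseline = serumHg;
  /* The second row supplies the follow-up value and the difference. */
  if last.id and not first.id then do;
    serumHg_second_measurement = serumHg;
    serumHG_difference = serumHg - serumHg_baseline;
    output;
  end;
  keep id fishintake serumHg_baseline serumHg_second_measurement serumHG_difference;
run;
```

With this layout, the two models above would be run with data=wide.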

Regards,
Juanvte.
