Solved: OLS repeated-measures regression with fixed effect for subject and clu...

confooseddesi89 · Posted 09-22-2022 05:19 PM

Hello,

I am working on repeated-measures, long-format data in which each subject (identified by user_ID) has a value (outcome) at several time points. Avg_Num is the time point indicator and represents a three-day average around daylight saving time (DST). Avg_Num = 0 indicates the three-day average that begins with the DST date for Fall 2021, that is, Nov 7 to Nov 9. The average for Nov 10 to Nov 12 has Avg_Num = 1, Nov 4 to Nov 6 has Avg_Num = −1, Nov 1 to Nov 3 has Avg_Num = −2, etc. I created a dummy-coded variable for each Avg_Num: Neg_4_FA_21 = 1 means that Avg_Num = −4 (the 4th three-day average before Nov 7; Oct 26-28) for that row; Neg_3_FA_21 = 1 means that Avg_Num = −3 (the 3rd three-day average before Nov 7; Oct 29-31); Zero_FA_21 = 1 means that Avg_Num = 0 (Nov 7-9), etc.

See sample data below:

user_id	Avg_Num	outcome	Neg_4_FA_21	Neg_3_FA_21	Neg_2_FA_21	Zero_FA_21	Pos_1_FA_21	Pos_2_FA_21	Pos_3_FA_21
300	-4	478.333	1	0	0	0	0	0	0
300	-3	461.333	0	1	0	0	0	0	0
300	-2	389.667	0	0	1	0	0	0	0
300	-1	458	0	0	0	0	0	0	0
300	0	499	0	0	0	1	0	0	0
300	1	384.667	0	0	0	0	1	0	0
300	2	460.667	0	0	0	0	0	1	0
300	3	381.667	0	0	0	0	0	0	1
304	-4	96.5	1	0	0	0	0	0	0
304	-3	130.5	0	1	0	0	0	0	0
304	-2	285	0	0	1	0	0	0	0
304	-1	85	0	0	0	0	0	0	0
304	0	125.5	0	0	0	1	0	0	0
304	2	383.667	0	0	0	0	0	1	0
304	3	574.5	0	0	0	0	0	0	1
305	-4	232.333	1	0	0	0	0	0	0
305	-3	516	0	1	0	0	0	0	0
305	-2	493	0	0	1	0	0	0	0
305	-1	405.667	0	0	0	0	0	0	0
305	0	441.333	0	0	0	1	0	0	0

The goal is to examine whether the outcome at each three-day average differs from the average at the reference point, which is the three days before Nov 7 (Nov 4-6), where Avg_Num = −1 and Neg_1_FA_21 = 1. Therefore, the effects in my model would be Neg_4_FA_21, Neg_3_FA_21, Neg_2_FA_21, Zero_FA_21, Pos_1_FA_21, Pos_2_FA_21, and Pos_3_FA_21 (skipping the reference dummy, Neg_1_FA_21).

Based on discussions with my colleagues, I would like to run an ordinary least squares regression with a fixed effect for the subject (user_ID) and clustered standard errors to represent the repeated-measures nature of the data. Here is the code I've come up with based on some Googling:

proc glm;
  absorb user_ID;
  model outcome = Neg_4_FA_21 Neg_3_FA_21 Neg_2_FA_21 Zero_FA_21
                  Pos_1_FA_21 Pos_2_FA_21 Pos_3_FA_21;
  repeated user_ID / type=un;
run; quit;

Could anyone please provide me guidance on whether this code is correct? For example, do I need the "repeated" statement if I have the "absorb" statement? I'm more familiar with PROC REG and MIXED than with GLM, and I'm not sure if this is the best way to examine the change in my outcome around Nov 7. In fact, I'd much rather do the analyses in PROC MIXED, but I don't know how to do a fixed effect for subject in PROC MIXED, only a random effect. Any help is appreciated.

SteveDenham · Posted 09-23-2022 09:22 AM

I would do this in MIXED as well. It should be easy enough, and I don't see any reason to create all of the dummy variables - the CLASS statement should do that for you. Where things get complicated is the request to do this using OLS, whereas MIXED is a likelihood-based procedure. You could specify method=mivque0 to prevent iterative calculation of the likelihood. Don't try something like method=type3, as that applies only to variance component models with no subject effects and no repeated statement. However, I would prefer the default REML method to the OLS method.

If you are interested solely in the marginal response, look at PROC GEE as an alternative. Search this forum for posts by @StatDave for help on that approach.
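
A minimal sketch of what the GEE route might look like, assuming a long-format data set called dst_long (a hypothetical name; the thread never names the data set) and a normal outcome. With an independence working correlation the point estimates match OLS, and the standard errors PROC GEE reports by default are the empirical (sandwich) ones, clustered on the SUBJECT= effect. Note that this sketch does not include a fixed effect for each subject.

```sas
/* Sketch only: dst_long is a hypothetical data set name */
proc gee data=dst_long;
   class user_ID Avg_Num;
   model outcome = Avg_Num / dist=normal link=identity;
   /* independence working correlation; the default standard errors
      are empirical (sandwich) estimates clustered on user_ID */
   repeated subject=user_ID / corr=ind;
run;
```

The working correlation could be changed (for example to an autoregressive structure), in which case the point estimates would no longer equal the OLS ones.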

SteveDenham

confooseddesi89 · Posted 09-23-2022 12:16 PM

Hello,

Thanks for this reply. I much prefer PROC MIXED. Could you take a look at the following code? Also, is the "fixed effect" for subject represented in the "repeated" line of my code?

proc mixed NOCLPRINT NOITPRINT;
class user_ID Avg_Num (ref='-1');
model outcome = Avg_Num / SOLUTION;
repeated / subject=user_ID type=ar(1);
run;

Thank you!

SteveDenham · Posted 09-23-2022 01:43 PM

I don't think that will do exactly what you want, but without some data to play with I can't be sure. For instance, given your design I would try:

proc mixed NOCLPRINT NOITPRINT;
class user_ID Avg_Num; /* (ref='-1'); -> removed as a midrange reference point can create problems when you start creating LSMEANS */
model outcome = Avg_Num / SOLUTION;
repeated Avg_Num/ subject=user_ID type=ar(1); /* not critical, but it makes it apparent immediately what the repeated factor is */
run;

In this case user_ID is treated as a fixed effect, so any inference refers only to the users included in your data. If you wish to extend inference to all possible users, you will need a statement like:

RANDOM intercept/subject=user_ID;

However, convergence or even estimation may require more data than you have available.
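
To make the two options concrete, here is a hedged sketch of each, again assuming a hypothetical data set name dst_long. The first variant places the RANDOM statement above in context. The second is one possible reading of the original request (a literal fixed effect per subject with clustered standard errors); it is not code posted in this thread: user_ID is added to the MODEL statement, and the EMPIRICAL option on the PROC MIXED statement requests sandwich standard errors computed over the SUBJECT= blocks.

```sas
/* Sketch only: dst_long is a hypothetical data set name */

/* Variant 1: random intercept per subject plus AR(1) residuals */
proc mixed data=dst_long noclprint noitprint;
   class user_ID Avg_Num;
   model outcome = Avg_Num / solution;
   random intercept / subject=user_ID;
   repeated Avg_Num / subject=user_ID type=ar(1);
run;

/* Variant 2 (assumption, not from the thread): a fixed effect for each
   subject via user_ID in MODEL, with EMPIRICAL sandwich standard errors
   clustered on user_ID */
proc mixed data=dst_long empirical noclprint noitprint;
   class user_ID Avg_Num;
   model outcome = user_ID Avg_Num / solution;
   repeated Avg_Num / subject=user_ID type=vc;
run;
```

With many subjects, the SOLUTION table in the second variant will list one row per user_ID, so it may be more convenient to read the Avg_Num comparisons from LSMEANS or LSMESTIMATE output instead.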

SteveDenham

confooseddesi89 · Posted 09-23-2022 02:08 PM

Okay, it looks like the changes you made were (1) removing the -1 reference point (I added it back because I need this as the reference), and (2) adding "Avg_Num" after "repeated."

Could you explain to me what (2) does? With this addition, according to the output for PROC MIXED on my full dataset, the estimates are lower, the standard errors are higher, and the p-values are higher. If you absolutely need some sample data to assist, I could provide that, but for privacy reasons I would prefer not to.

Thanks again.

SteveDenham · Posted 09-23-2022 02:29 PM

I am pretty sure the two are connected. By setting -1 as the reference, the matrix is set up so that the last entry (set to zero) is the reference. This disturbs the ordinal nature of your data, as the estimate of the correlation used in AR(1) is dependent on the ordering and not on an index. As a result, rho is estimated from values at -4, -3, -2, 0, 1, 2, 3, -1, rather than the ordinal values from -4 to 3. That will increase the standard error in this case, and cause a small change in the point estimates.
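
For reference, the AR(1) structure makes the correlation between two residuals from the same subject depend only on how many positions apart they are in the ordering of the repeated effect. In the sketch below, j and k are positions in that ordering, not the Avg_Num values themselves (the symbols are notation introduced here, not taken from the output):

```latex
\operatorname{Corr}(e_{ij}, e_{ik}) = \rho^{|j-k|},
\qquad
R_i = \sigma^2
\begin{pmatrix}
1      & \rho   & \rho^2 & \cdots \\
\rho   & 1      & \rho   & \cdots \\
\rho^2 & \rho   & 1      & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
```

With the levels ordered -4, -3, -2, 0, 1, 2, 3, -1, timepoints -2 and 0 sit in adjacent positions (correlation \rho), while -1 and 0 are four positions apart (\rho^4) and -1 and -2 are five positions apart (\rho^5), which does not match the calendar.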

If you wish to compare the other timepoints to -1, there are better ways than setting that group as a reference and examining the solution vector, and those preserve the ordinal nature of your data. You can compare the LSmeans at the various other timepoints to that timepoint by using an LSMESTIMATE statement, like this:

proc mixed NOCLPRINT NOITPRINT;
class user_ID Avg_Num; /* (ref='-1'); -> removed as a midrange reference point can create problems when you start creating LSMEANS */
model outcome = Avg_Num / SOLUTION;
repeated Avg_Num / subject=user_ID type=ar(1); /* not critical, but it makes it apparent immediately what the repeated factor is */
lsmeans Avg_Num / cl;
/* Coefficients are assumed to follow the sorted Avg_Num levels: -4 -3 -2 -1 0 1 2 3.
   As written, each row estimates the LSmean at -1 minus the LSmean at the other timepoint;
   reverse the signs in a row to get the other timepoint minus -1. */
lsmestimate Avg_Num '-4 v -1' -1  0  0  1  0  0  0  0,
                    '-3 v -1'  0 -1  0  1  0  0  0  0,
                    '-2 v -1'  0  0 -1  1  0  0  0  0,
                    '0 v -1'   0  0  0  1 -1  0  0  0,
                    '1 v -1'   0  0  0  1  0 -1  0  0,
                    '2 v -1'   0  0  0  1  0  0 -1  0,
                    '3 v -1'   0  0  0  1  0  0  0 -1 / cl /* add in any multiple comparisons method if you want */;
run;
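
As a possible alternative to writing out the coefficients by hand, the same comparisons against Avg_Num = -1 might be requested with the DIFF=CONTROL option of the LSMEANS statement. This is a sketch under assumptions: dst_long is a hypothetical data set name, and the Dunnett adjustment is only one example of a multiple comparisons method.

```sas
/* Sketch only: dst_long is a hypothetical data set name */
proc mixed data=dst_long noclprint noitprint;
   class user_ID Avg_Num;
   model outcome = Avg_Num / solution;
   repeated Avg_Num / subject=user_ID type=ar(1);
   /* each difference is (other timepoint) minus (Avg_Num = -1);
      ADJUST=DUNNETT applies a many-to-one multiplicity adjustment */
   lsmeans Avg_Num / diff=control('-1') cl adjust=dunnett;
run;
```

Because the class levels keep their natural order here, the AR(1) structure still sees the timepoints in calendar order.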

SteveDenham

confooseddesi89 · Posted 09-23-2022 03:03 PM

Interesting that removing the reference had such an effect on the estimates.

Thanks so much for your help.
