BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
RHASAS2020
Calcite | Level 5

Hi all. I've been lurking the forums for a while as I've been working various projects in SAS and I've come to an issue I can't quite find an answer to. I'm conducting an interrupted time series study using multiple subjects (students) with a binary outcome variable measured each year. I've been going back and forth between using 1) proc autoreg, where I cannot include school-level random effects and 2) using proc glimmix, where I can use random effects but am having trouble figuring out how to incorporate the autoregressive correlation structure. I would ideally be able to incorporate both school-level random effects and the autoregressive correlation structure and am seeking input on how to accomplish this.

 

Below are my example codes for each:

proc autoreg data=data;
model outcome = treat year time_since_treat  /
method=ml nlag=1 backstep dwprob loglikl;
output p=pvar r=rvar;
run;

and

proc glimmix data=data ;
model outcome(event='1')= treat year time_since_treat
/ solution dist=binary link=logit;
random intercept / subject=school;
COVTEST ZEROG;
COVTEST/WALD;
run;

Much appreciated for any insights. 

 

1 ACCEPTED SOLUTION

Accepted Solutions
StatDave
SAS Super FREQ
Some considerations on whether to use the GEE model - it requires a large number of clusters (schools, I assume in this case) of correlated observations and it allows you to make inferences on the population level rather than for making predictions at the individual level. Time and memory needs will increase with the number of observations in a cluster. The 0-3 response sounds like it is ordinal, so using the default link (CLOGIT, not GLOGIT) might be more appropriate as it will treat the response as ordinal. If you want to assume that the correlations within a cluster diminish over time in an autoregressive way, then you could specify the TYPE=AR option in the REPEATED statement. If you are going to do that, then you need to also specify the WITHIN= option with a variable that order the observations in the clusters - perhaps your YEAR variable - but if the years are different for different clusters, then you might want to make a variable that simply orders the observations in a common way with values 1, 2, 3, ... . Otherwise all unique years could result in the clusters being very large and make the model infeasible. Since the GEE method is robust to misspecification of the correlation structure, you could also consider using the simple independence (TYPE=IND, the default) or exchangeable (TYPE=EXCH) structure and then you don't even need to specify WITHIN=. I don't understand the need for aggregation and a FREQ statement - if schools are the clusters and your data contain multiple observations in each school with each observation representing a single unit (such as person), then you wouldn't need a FREQ statement.

View solution in original post

10 REPLIES 10
Ksharp
Super User
If I was right, the OUTCOME variable of PROC AUTOREG can not be binary variable, it should be continuous .
if you want "incorporate the autoregressive correlation structure", code should like :

random intercept / subject=school;
random year/ subject=school type=ar(1) residual ;

@SteveDenham could give you better code.
SteveDenham
Jade | Level 19

You need to watch out for type=AR(1) or type=ARH(1) for interrupted time spacing as these structures assume equal spacing of the observations. A possible substitute would be a spatial power structure ( SP(POW)(t), where t is a continuous variable that is equal to the time at which an observation is made). As t approaches equal spacing, this structure approaches the AR(1). Other possibilities for the variance/covariance structure would be to try PSPLINE which fits a penalized B spline to a G side random effect or RSMOOTH which fits a radial smoother, again to a G side random effect. Depending on the amount of data and the number of timepoints you have, you might be able to fit an R side UNSTRUCTURED or CHOLESKY structure.  Without knowing what your data layout is, I hesitate to suggest the latter two, and would recommend the SP(POW)(t) structure, unless the interruption is huge compared to the intervals where you have observations..

 

SteveDenham

RHASAS2020
Calcite | Level 5

Thank you for these responses, @Ksharp and @SteveDenham . To give a little bit more information about my data, my time series outcome is looking at enrollment in a specific program (i.e. 0 for not enrolled and 1 for enrolled) among the total population of a school district. The outcome is measured once per year. I have other covariates included in my model as well. Knowing this, does this change the solution?

 

I have tried the proposed ar(1) solution and the output doesn't include estimates for the covariates... it stops after listing parameters under the "Dimensions" heading. My current code is as follows:

 

proc glimmix data=data NOCLPRINT NOITPRINT ;
model outcome(event='1') = treat year time_since_treat x1 x2 x3
/ solution dist=binary link=logit;
random intercept / subject=school;
random year / subject=school type=ar(1) residual;
COVTEST ZEROG;
COVTEST/WALD;
run;
SteveDenham
Jade | Level 19

When GLIMMIX stops (or maybe even doesn't start iterating), are there ERRORs, WARNINGs, or NOTEs in the log that indicate what might be happening? There may also be messages in the output file that can help diagnose the issues.  Let us know what these might be.

 

SteveDenham

RHASAS2020
Calcite | Level 5

Yes, the message "WARNING: Obtaining minimum variance quadratic unbiased estimates as starting values for the covariance parameters failed" appears in the log. 

 

That said, as I'm thinking through this on a more conceptual level, I'm wondering if conventional autocorrelation even needs to be accounted for, as each time period has completely different subjects from each other... 

SteveDenham
Jade | Level 19

Since subject is completely nested in time period, you can't fit an R side covariance structure. This explains the inability to calculate starting values that are consistent estimators. So what you have is a factorial design, and I don't see any reason to include random or random residual effects associated with time. You probably should retain the random effect of school. What happens in BGLIMM with that approach?

 

SteveDenham. 

RHASAS2020
Calcite | Level 5

Thank you Steve - my original approach works when I drop time effects, and the model also works in PROC BGLIMM. 

 

This follow-up question may be a better fit for a new thread, so please let me so if that is the case and I will mark your post as the accepted solution. After speaking with a colleague, it sounds like we are going to shift to a multinomial logit design as it better theoretically fits the question - i.e. with outcomes of 0-3 for no participation in any program/participation in related programs, rather than assessing each independently. When I do this (code below), the model fails to run because it cannot be fit in a reasonable amount of time, even when I aggregate schools up to the district level for random effects. The version of SAS I am using does not have multinomial modeling built into proc bglimm. Any suggestions? 

 

proc glimmix data=data NOCLPRINT NOITPRINT ;
class school;
model program = treat year time_since_treat x1 x2 x3
/ solution dist=multinomial link=glogit;
random intercept / subject=school;
run;
SteveDenham
Jade | Level 19

A generalized estimating equation (PROC GEE) may be something you could try. What happens if you try this:

 

proc gee data=data NOCLPRINT NOITPRINT ;
class school program;
model program = treat year time_since_treat x1 x2 x3
/ solution dist=multinomial link=glogit;
repeated subject=school/ within= <need a variable indexing the order of measures within school>
run;

This would require some aggregation to get a single observation for each level of program within school. At this point @StatDave may have some suggestions regarding a couple issues that have me stumped.  For instance, what is the within subjects ordering variable, and will a FREQ statement work (in case of multiple observations of program within school)?

 

SteveDenham

StatDave
SAS Super FREQ
Some considerations on whether to use the GEE model - it requires a large number of clusters (schools, I assume in this case) of correlated observations and it allows you to make inferences on the population level rather than for making predictions at the individual level. Time and memory needs will increase with the number of observations in a cluster. The 0-3 response sounds like it is ordinal, so using the default link (CLOGIT, not GLOGIT) might be more appropriate as it will treat the response as ordinal. If you want to assume that the correlations within a cluster diminish over time in an autoregressive way, then you could specify the TYPE=AR option in the REPEATED statement. If you are going to do that, then you need to also specify the WITHIN= option with a variable that order the observations in the clusters - perhaps your YEAR variable - but if the years are different for different clusters, then you might want to make a variable that simply orders the observations in a common way with values 1, 2, 3, ... . Otherwise all unique years could result in the clusters being very large and make the model infeasible. Since the GEE method is robust to misspecification of the correlation structure, you could also consider using the simple independence (TYPE=IND, the default) or exchangeable (TYPE=EXCH) structure and then you don't even need to specify WITHIN=. I don't understand the need for aggregation and a FREQ statement - if schools are the clusters and your data contain multiple observations in each school with each observation representing a single unit (such as person), then you wouldn't need a FREQ statement.
jiltao
SAS Super FREQ

The error message indicated some convergence issues for your model.

Your model might not be appropriate for your data, but from your description I am not sure how your data look like. Is it possible to display data for some of your schools?

Thanks,

Jill

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 10 replies
  • 1519 views
  • 1 like
  • 5 in conversation