BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
BlueNose
Quartz | Level 8

Hello all,

I am considering my options for analyzing a single arm multi-center study. I have a small pilot study with only one treatment group, without a control. There are two main variables of interest:

1. A "semi-continuous" one, theoretically was suppose to get discrete values between 1-10. In practice, all values apart from two are either 0,1,2.

2. A binary variable: success or failure.

There is no one independent variable of main interest since it's a single arm study, however there are a few stratifying variables like gender, and severity of the problem.

Every patient gave 1 or 2 samples. Some people had the problem in two places in their body and were treated twice. So I have a repeated measures with missing values. From a brief look it seems like the correlation within a subject is around 0.85, since a majority of patients who had some value in the first sample, had the exact same value in the second.

I wanted to ask how would you approach this kind of analysis, in particular the binary outcome. If I didn't have the repeated measures, I would probably calculate a point estimate and the exact Clopper-Pearson CI using PROC FREQ. It makes sense.

But I do have the repeated measures, and as secondary analysis I have the stratification by 2 or 3 variables. If I want to add more complications, I can also treat the center as a random effect. I don't have to, it can be a fixed effect too.

So how would you deal with it ? Should I use PROC GENMOD for GEE ? Or is it PROC GLIMMIX for GLMM ? I am confused, not sure which way to turn. I want to keep the simplicity as possible, but don't want to make big mistakes that will lead to false estimates. The primary target should be to get an estimate of the success rate, with a CI. Or in other words, to get an adjusted estimate taking into account the repeated measures.

Thank you !

1 ACCEPTED SOLUTION

Accepted Solutions
SteveDenham
Jade | Level 19

With X1 as a class variable, the intercept estimate is the effect at the reference level, so I may have confused the issue earlier.  One trick I have used is to construct a dummy continuous variable that is identically one for all cases.  For reference call this D1.  Then try fitting (WARNING:untested code):

proc glimmix data=....method=quad;

class SubjID; /* Note that the response variable should not be included in the class statement in GLIMMIX */

model Y=D1/dist=binary solution;

random intercept/subject=subjid;

nloptions tech=nrridg;

run;

Now all differences should be captured in the intercept, with the covariance matrix accommodating the pairing when it occurs.

View solution in original post

13 REPLIES 13
SteveDenham
Jade | Level 19

In some sense, you can treat this as a multi-arm study.  The fixed effects define the arms, rather than serving as stratifying variables.  Suppose you had only sex as a fixed effect.  You would be testing whether males and females differed on the dependent variables.  That (at least to me) is a perfectly logical analysis of this sort of data.  What you consider as secondary analyses become the only analyses that have any comparisons.

Now either GENMOD or GLIMMIX can be used, depending on your decision regarding center as either a fixed (GENMOD) or random (GLIMMIX) effect.  That decision comes down to whether you wish to infer to future studies at the centers measured (fixed effect) or to the universe of possible centers (random effect).

I would fit a model using GLIMMIX.  The estimate of interest will be the intercept of the model, so the parameterization chosen will make a huge difference.  If you choose a reference parameterization for the stratifying variables, then the intercept will be the proportion for the reference categories.

Steve Denham

BlueNose
Quartz | Level 8

Thank you Steve for helping, I appreciate it.

I tend to think that the center should be a random effect, and thus I agree with you that GLIMMIX could be a better option than GENMOD (however I will try both to see any differences).

There is one thing in your answer I still don't fully understand. My main interest is the proportion of success with the matching CI, with and without stratification by the several factors. For example, if I had 28 successes out of 30 samples, my estimator would be 93.3% and I could calculate a CI. The difference here is that the 30 (or other number) is not 30 but let's say 55 since some patients (most of them) gave two samples. You mentioned the intercept, will it estimate this proportion ?

Something else that bothers me with GLIMMIX, is that last time I used it I read that it's better to enter the data in the format of y/n rather than in the format I would use for let's say logistic regression (0/1). Is it true ? Should I create a row per person with a variable of the number of successes out of 1 or 2 samples ? It makes no sense to me...

To summarize, my main interests and problems are:

* I want the general success proportion with a CI

* Most subjects gave 2 samples, a minority gave 1

* There is a very high success rate (nearly 100%) - thus I am very interested mainly in the lower limit of the CI

* There is a very high correlation, i.e., except for 2 patients, in all others if one sample was a success, so was the other

I am trying to think of a code, will this do the work or am I far from it ?

proc glimmix data=.... method=quad

     class Y_success X1_gender X2_?? Z1_center subjectID;

     model Y_success = X1_gender X2_??? / dist=binary link=glogit;

     random int / subject=subjectID;

     random int / subject=center;

run;

should I use the residual command instead of random ?

Thank you in advance !

SteveDenham
Jade | Level 19

Well except for that Z1_ on the center variable in the class statement, I think you are getting there.  I would stay away from the residual option on the random statement for binomial distributions.  You may want to add an NLOPTIONS statement such as:

nloptions tech=nrridg;

as the ridged Newton-Raphson tends to be more forgiving when fitting a binary distribution.  And why the generalized logit link?  If you have only two values, the canonical logit link should suffice.

Steve Denham

BlueNose
Quartz | Level 8

sorry for the big delay. I ran the procedure and something is not right. I'll start with some more details. I have a binary outcome, with 14 values of 0 and 23 of 1. These measurements are not independent, most of them, however not all, come as pairs, meaning that most subjects contributed 2 observations, a minority contributed one. In most cases, a vast majority, if one observation within a subject was 0, so was the other, and if it was 1, so was the other. This happened in vast majority of cases, so the expected correlation within a subject is expected to be around 0.9.

Since this is a one arm study, I took your advice and chose a fixed variable of secondary interest, I will call it X1 (Y is my outcome), and I ran this code:

proc glimmix data=... method = quad / laplace;

  class SubjectID X1 Y;

  model Y = X1 / dist = binary link = logit solution;

  random int / subject=SubjectID;

  nloptions tech=nrridg;

run;

I used once the quad method, once the laplace method and once with none of them, i.e. the Residual PL (I forgot to type the method part so it happened).

The next table summarize the results, which I find weird:

MethodGener / Pearson. Chi-Square / DFCovariance Parameter Estimates SubjectID intercept (s.e)Fixed effect intercept estimate (s.e)-2 Res Log Pseudo-LikelihoodGeneralized / Pearson Chi-Square
Residual PL 0.543.24 (2.21)-0.128 (0.67)17019.5
Quad0.0860.59 (94.8)-0.5651 (2.44)4.73
Laplace0.06589.91 (91)-8.95 (3.09)3.12.17

According to the results, something is definitely wrong!

Again, my primary interest is the proportion of observations with Y=1, taking into account the fact that these observations are not independent. I also with to estimate the within subject correlation if possible. The fixed effect, which here was not significant for X1 (as expected) is less important.

In general, how do I estimate the proportion I want using the intercept like you suggested ? And probably even before that, what is wrong with my model (according to the results - the answer must be everything).

Thanks !!

SteveDenham
Jade | Level 19

With X1 as a class variable, the intercept estimate is the effect at the reference level, so I may have confused the issue earlier.  One trick I have used is to construct a dummy continuous variable that is identically one for all cases.  For reference call this D1.  Then try fitting (WARNING:untested code):

proc glimmix data=....method=quad;

class SubjID; /* Note that the response variable should not be included in the class statement in GLIMMIX */

model Y=D1/dist=binary solution;

random intercept/subject=subjid;

nloptions tech=nrridg;

run;

Now all differences should be captured in the intercept, with the covariance matrix accommodating the pairing when it occurs.

BlueNose
Quartz | Level 8

Thank you Steve. I have to admit, your idea of creating a "continuous dummy" is creative, I like it.

I ran the code, and this is the output. I am not sure how to check if it worked, i.e, how to get the probability / proportion I look for. If I use descriptive statistics of Y, the value 1 appears in 62.16% of observations and 0 in 37.84%. I guess that I look for a number not too far from this, only adjusted for the repeated measures.

How do I get it from -2.24, and what does 35.39 mean in this context, I mean, how to I transfer this number to the scale of my problem ? And one more thing, do I remember correctly that Pearson Chi Square / DF SHOULD be below 1 ?

Thank you VERY much.

Capture.JPG

SteveDenham
Jade | Level 19

The estimate is on the logit scale, so plugging into the back transformation of the logit should give a value on the proportion scale.  I get exp(-2.2429)/(1+exp(-2.2429)) = 0.095--not quite what we would expect.  However, a rough 95% confidence interval is 0.3% to 79.4%, which does include your raw proportion value.

That 35.3971 is a HUGE variance component given what is going on, and is the cause of the low ratio value.  What I would call at this point is that you really need more data, which probably isn't going to happen.  The other thing to try is to estimate the marginal, rather than conditional, estimate.  For that, try (again UNTESTED code):

proc glimmix data=....;

class SubjID index; /* The variable index would indicate the first or second observation for the subject*/

model Y=D1/dist=binary solution;

*random intercept/subject=subjid; /* Removed to make this an R side, marginal model */

random index/residual subject=subjid type=chol;

nloptions tech=nrridg;

run;

Steve Denham

Message was edited by: Steve Denham

BlueNose
Quartz | Level 8

Oh, so the estimates are just like in the logistic regression case, I see. Are you doing the same calculations for the CI limits ?

How did you know that 35.39 is a huge variance component, what did you compare it to ?

I ran your code and it didn't converge. So I have replaced the variance structure to VC and it did run, giving me an estimate of -1.54 (still not close enough). I tried CS and UN, did not converge.

You mentioned the word marginal. I have a beginners question, what is the difference between GEE and what you did ?

I tried GEE after what you said, I ran this code:

proc genmod data=...;

     class subjectID;

     model Y = D1 / dist = bin;

     repeated subject = SubjectID / corr=unstr corrw;

run;

I got an estimate of 0.6553. If the logit function is the same here, it gives me a proportion of 0.658, much closer to what I look for. I tried the same with corr=cs and got the exact estimate. The Exchangeable Working Correlation was 0.972, which is slightly higher than I expected, but I did expect a high correlation around 0.9. When I did corr=ind I got an estimate of 0.4964, which gives a proportion of 0.62.

It looks like it is working somehow good, which correlation structure should I choose. And out of curiosity, why the marginal model needs less data than the conditional to run properly ? How should I know if to choose GEE or GLMM ?

SteveDenham
Jade | Level 19

/Dodges important question at the end to answer other stuff first

How did I know that 35.39 is a huge variance component?

     I compared it to the estimate, and it was more than an order of magnitude larger.  Just a simple comparison, not a rigorous statistical comparison.

For the types that did not converge, was it that the iteration history reached the maximum (20 iterations) but the likelihood function was stabilizing, or was the likelihood function jumping around?  If the former is the case, try adding and NLOPTIONS statement:

nloptions maxiter=200;

GEE is a marginal estimation, and GENMOD is an excellent approach to doing GEE.  GLIMMIX is much more suited to the conditional estimation (GLMM), which is better in my opinion, as it treats the subjects as a random sample from the population to which you want to apply inference.  Inference space is the key here, with GEE's applying to a narrower inference space (measurements taken from the same subjects repeatedly, so that your data is one instance of the possible outcomes on the specified subjects) and GLMM's applying to a broader inference space (measurements taken from a random sample from a larger population, repeatedly, so that your data is one instance of the possible outcomes on all possible subjects).  For the GLMM, the standard errors are going to be larger, and the estimates show less bias due to regression to the mean.

Steve Denham

BlueNose
Quartz | Level 8

You were right, when I increased the number of iterations it worked, this is the output, getting closer, but not close enough, unless, it is modeling P(Y=0) instead of P(Y=1), is it possible ?

What you said is very interesting, about conditional vs. marginal. Is there a good reference where I can read deeper about the differences between these two approaches ?

Capture.JPG

SteveDenham
Jade | Level 19

Yes. The default in GLIMMIX is to model the probability of the first category, which would be zero in this case. Try:

model Y(ref='1')=D1/dist=binary solution;

and see what occurs.

Steve Denham

BlueNose
Quartz | Level 8

I ran a few models. First of all, GLIMMX, when I added Y(ref='1'), nothing has changed, but oddly enough when I did Y(ref='0') I got numbers that makes sense...

I got that the estimator of p is e^0.6575 / 1+e^0.6575 = 0.658 with a CI of [0.433,0.8309]

I also ran a GEE with GENMOD, using the corr=ind I got an estimate of 0.621 with CI [0.406,0.7979]. Using corr=cs I got 0.658 with CI [0.449,0.8196] and using corr=un I got similar results to corr=cs. The correlation was 0.972

I also found a chapter in the book 'statistical methods for rates and proportions' of Fleiss, talking about a one sample with correlated data. He proposed calculating the point estimate as if there wasn't a correlation and proposed a method for calculating an adjusted CI using an adjusted variance while calculating and taking into account the intraclass correlation. I got a non adjusted estimate of 0.621 with CI of [0.415,0.827]. The intraclass correlation was 0.915.

If I would have ignored the correlation entirely, I would get p=0.621 with CI of [0.44,0.775] using Clopper-Pearson or [0.461,0.7594] using Wald score CI.

Now my final question for this thread is, how am I suppose to choose my point and interval estimators based on all this data ?

🙂

Thank you !

P.S - My codes:

proc glimmix data=...;

  class SubjectID index; /* The variable index would indicate the first or second observation for the subject*/

  model Y (ref='0') = D1 / dist=binary solution cl;

  random index/residual subject=SubjectID type=chol;

  nloptions tech=nrridg;

  nloptions maxiter=200;

run;

proc genmod data=... descend;

      class SubjectID;

      model Y=D1 / dist=bin;

      repeated subject=SubjectID / corr=cs corrw;

run;

SteveDenham
Jade | Level 19

Well, I wouldn't ignore the correlation entirely Smiley Happy, since it seems to be >0.9, no matter how you calculate it.

So, with that, I would select the covariance structure based on the minimum corrected AIC value amongst the candidate covariance structures.

Steve Denham

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is ANOVA?

ANOVA, or Analysis Of Variance, is used to compare the averages or means of two or more populations to better understand how they differ. Watch this tutorial for more.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 13 replies
  • 4979 views
  • 6 likes
  • 2 in conversation