Kojema Tracker
https://communities.sas.com/kntur85557/tracker
Wed, 13 Nov 2024 20:55:23 GMT

High standard errors on estimates of p in logistic regression using categorical predictors
https://communities.sas.com/t5/Statistical-Procedures/High-standard-errors-on-estimates-of-p-in-logistic-regression/m-p/510656#M26147
<P>Hello,</P>
<P>I am performing logistic regression in PROC GLIMMIX (binomial distribution, logit link, SAS version 9.4) using categorical predictors and am coming across a strange phenomenon:</P>
<P>If a certain level of a categorical predictor comprises mostly or all zero values of the response variable (my data are represented by 0s and 1s; I am modelling the probability of occurrence of 1s), then when I compute LSMEANS estimates of the probability of occurrence of 1 (p), the standard error and the breadth of the 95% CI for the estimate of p at that level are high; if all values at that level are zero, the SE is very high and the 95% CI ranges from 0 to 1. As a result, pairwise post-hoc comparisons involving this level are rarely significant (or never, if all values are 0), even though I might expect them to be.</P>
<P>I find this odd because I thought that standard error for a binomially distributed variable becomes smaller when, for a given n, p approaches 0 (or 1).</P>
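A way to see why the two intuitions can disagree: the shrinking standard error applies to the sample proportion on the probability scale, while logistic regression estimates (and reports SEs for) the log-odds. A minimal Python sketch of the two large-sample formulas, with a made-up n and success counts, illustrates how the logit-scale SE grows as p approaches 0:

```python
import math

# For k successes out of n, the sample proportion is p_hat = k/n.
# Probability scale: SE(p_hat) = sqrt(p_hat*(1-p_hat)/n) shrinks as p_hat -> 0.
# Logit scale (what logistic regression estimates): the large-sample
# SE of logit(p_hat) is 1/sqrt(n*p_hat*(1-p_hat)), which blows up as p_hat -> 0.

def se_prob(k, n):
    p = k / n
    return math.sqrt(p * (1 - p) / n)

def se_logit(k, n):
    p = k / n
    return 1.0 / math.sqrt(n * p * (1 - p))

n = 20
for k in (10, 5, 1):
    print(k, round(se_prob(k, n), 3), round(se_logit(k, n), 3))

# With k = 0 the logit MLE itself is -infinity (quasi-complete separation),
# so the SE is undefined; software reports a huge SE and a CI spanning (0, 1).
```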
<P> </P>
<P>I have posted some example data and code for a simple example with one independent categorical variable (Cat1).</P>
<P>For level "RE", all values but one of the response variable "Y1" are 0. For the response variable "Y2", all values for level "RE" are 0. If you run the code below, you can see that the SE for RE is high in the former case, and very high (with a 95% CI for p ranging from 0 to 1) in the latter case.</P>
<P> </P>
<P>Note: I am using GLIMMIX because I will ultimately include random factors in this model, but I did a bit of experimenting in PROC GENMOD and PROC LOGISTIC and the same thing more or less occurs.</P>
<P> </P>
<P>Can anyone explain what causes this phenomenon, or, in case it is an error in my coding, what I may be doing wrong?</P>
<P> </P>
<P>Thanks for any help! </P>
<P> </P>
<P>*case where Y1 has only one non-zero value in Cat1="RE";</P>
<P> </P>
<P>PROC GLIMMIX data=Fake_Data2;<BR /> CLASS Cat1;<BR /> MODEL Y1 = Cat1 / DIST=bin LINK=logit SOLUTION;<BR /> LSMEANS Cat1 / CL ADJUST=tukey ILINK;<BR /> RUN;</P>
<P> </P>
<P>*case where Y2 has all zero values in Cat1="RE";</P>
<P> </P>
<P>PROC GLIMMIX data=Fake_Data2;<BR /> CLASS Cat1;<BR /> MODEL Y2 = Cat1 / DIST=bin LINK=logit SOLUTION;<BR /> LSMEANS Cat1 / CL ADJUST=tukey ILINK;<BR /> RUN;</P>
Tue, 06 Nov 2018 01:01:20 GMT (Kojema)

How to code proc mixed with repeated statements for blocking and repeated measures in time?
https://communities.sas.com/t5/Statistical-Procedures/How-to-code-proc-mixed-with-repeated-statements-for-blocking-and/m-p/340444#M17918
<P>Hello,</P>
<P>I am performing analyses on water quality data. I want to test whether there are differences in average TP concentrations between 5 sites, and whether these differences depend on the year of sampling.</P>
<P>I have three years of data: 12 samples were taken between March and October in 2014, 11 samples were taken between March and October in 2015, and 9 samples were taken between March and October in 2016.</P>
<P>For every sampling date measurements were taken at all 5 sites. I have attached some “fake” data so you can see how it is structured.</P>
<P>I want to "pair" samples by date to account for variability caused by weather etc. on a given date. I also want to "correct for" temporal autocorrelation: samples taken close together in time are more similar than those taken further apart. I am assuming samples taken in a given year are independent of samples taken in other years. The intervals between measurements are not equal, so in principle I am unable to use variance-covariance structures such as AR(1), TOEP, ARMA(1,1), and ANTE(1).</P>
<P>I have attempted the above in PROC MIXED with the code at the bottom of the post. To block by day I have included a REPEATED statement with Date(Year) as the subject. Analyses without accounting for temporal autocorrelation showed an unstructured variance-covariance structure to be the best fit. I am allowing different var-cov matrices for each year with the group=year option. To account for temporal (residual) autocorrelation I added another REPEATED statement, with Site as the subject, that models the structure of temporal autocorrelation using the sp(exp) option and the variable "Day_nmbr", a continuous variable representing the day as a number between 1 and 365. Note that I also have "Date" as a class variable for the purpose of blocking, above. Again, I used the group=year option to allow a different autocorrelation structure in each year. Note: I haven't explored the best residual autocorrelation model yet, but sp(exp) is a likely fit.</P>
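As a sketch of the residual correlation that TYPE=sp(exp)(Day_nmbr) implies, here is a minimal Python illustration; the day numbers and range parameter rho below are hypothetical, not taken from the data:

```python
import math

# Spatial-power (exponential) correlation for unequally spaced times:
# corr(e_i, e_j) = exp(-|d_i - d_j| / rho).
# Unlike AR(1), this is well defined for arbitrary spacing between dates.

def sp_exp_corr(days, rho):
    return [[math.exp(-abs(di - dj) / rho) for dj in days] for di in days]

days = [70, 84, 112, 140, 200]   # hypothetical unequally spaced sampling days
R = sp_exp_corr(days, rho=30.0)
for row in R:
    print([round(v, 3) for v in row])
```

Samples closer together in time get a higher correlation, which is exactly the behaviour described above.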
<P>When I run this code with either one of the repeated statements it works, but when I run it with both statements SAS only uses the second statement (I get this message: WARNING: Only the last REPEATED statement is used). I am presuming (or perhaps hoping!) that I have just coded something incorrectly. Perhaps there is a way to combine the repeated statements? Can anyone help me think of a way to code that will allow me to do what I have described?</P>
<P>Thanks for any help!!</P>
<P> </P>
<P>Code:</P>
<P> </P>
<P><STRONG>Proc</STRONG> <STRONG>mixed</STRONG> data=sites;</P>
<P>class Date Year Site;</P>
<P>model TP = Site Year Site*Year / solution ddfm=kenwardroger outp= pred_data;</P>
<P>Repeated / subject= Date(Year) type=un group=year r rcorr;</P>
<P>Repeated / subject= Site type=sp(exp)(Day_nmbr) group=year r rcorr;</P>
<P><STRONG>run</STRONG>;</P>
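For intuition, the covariance the two REPEATED statements are jointly aiming at is, within a year, roughly the direct (Kronecker) product of a between-site covariance matrix and a temporal correlation matrix over the sampling days. A minimal Python sketch, where the covariance values, days, and range parameter are all hypothetical:

```python
import math

def kron(A, B):
    # Direct (Kronecker) product of two matrices given as nested lists:
    # element (i*len(B)+k, j*len(B[0])+l) equals A[i][j] * B[k][l].
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

site_cov = [[1.0, 0.4],
            [0.4, 1.5]]          # hypothetical 2x2 "UN" between-site covariance
days = [70, 84, 140]             # hypothetical unequally spaced sampling days
rho = 30.0                       # hypothetical sp(exp) range parameter
time_corr = [[math.exp(-abs(di - dj) / rho) for dj in days] for di in days]

V_year = kron(site_cov, time_corr)   # 6x6 within-year covariance
print(len(V_year), "x", len(V_year[0]))
```

PROC MIXED allows only one REPEATED statement, but it does offer direct-product covariance types of this form (e.g. TYPE=UN@AR(1)); to my knowledge there is no direct-product version with a spatial-power component, so this is an illustration of the target structure rather than working SAS syntax.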
<P> </P>
<P>I am using SAS 9.3</P>
Mon, 13 Mar 2017 15:01:05 GMT (Kojema)

Re: Proc Glimmix: RM ANCOVA
https://communities.sas.com/t5/Statistical-Procedures/Proc-Glimmix-RM-ANCOVA/m-p/103521#M5481
<HTML><HEAD></HEAD><BODY><P>Thanks for this Steve,</P><P>Good food for thought.</P></BODY></HTML>
Thu, 21 Mar 2013 15:39:51 GMT (Kojema)

Re: Proc Glimmix: RM ANCOVA
https://communities.sas.com/t5/Statistical-Procedures/Proc-Glimmix-RM-ANCOVA/m-p/103519#M5479
<HTML><HEAD></HEAD><BODY><P>Hi, just thought I'd update the progress of this analysis, and add on a discussion topic.</P><P></P><P>I ended up not using month as a variable, and used "season" instead. It was too onerous to interpret differences between individual months, and seasonal differences are what I was really interested in on the time side of things. As a nice side effect, this made my model run much more smoothly. This also allowed me to take out period of sampling as a repeated measure, as I figured seasonal differences would be great enough that temporal autocorrelation between seasons wouldn't be a major issue.</P><P></P><P>I also made another decision that could be controversial... if anyone could lend me some advice that would be excellent:</P><P></P><P>I originally had "stream" as a random block, as multiple measurements were taken per stream and I wanted to account for unexplained within-catchment variability. However, none of the estimated coefficients for individual streams were significantly different from zero, and the "covtest", or test of covariance parameters, had a p-value of 1, indicating no significant difference from independence. Thus I decided to remove stream as a random block. In the analysis of total nitrogen, this gave equivalent AIC values, an identical model (nonsignificant terms were removed sequentially), and almost identical LSMEANS etc. For the analysis of total phosphorus, however, removing "stream" as a random block resulted in a much smaller AIC (160 units lower!), indicating a more parsimonious model, but the final model ended up being different, as were the estimates of coefficients. Importantly, without the random block there were significant differences between watersheds with different agricultural intensities, and with the random block there were not.
Not that I am trying to bias my analysis toward the results I want, but I am concerned that leaving in a variable that is apparently nonsignificant is clouding my ability to detect the differences I am interested in. I am thus somewhat torn about whether to remove the random block or not, as I've always been taught to remove non-significant variables, but stream was a design factor (Steve, I know you said to keep the design factors!). However, maybe this is telling me that when things like agricultural intensity, ecoregion, runoff, season, etc. are known, nutrient export variation between watersheds is pretty well accounted for...</P><P></P><P>Pre-thanks for any more comments, and sorry for rambling!</P><P></P><P>Madison</P></BODY></HTML>
Fri, 15 Mar 2013 21:42:07 GMT (Kojema)

Re: Assessing fit: Deviance and scaled deviance
https://communities.sas.com/t5/Statistical-Procedures/Assessing-fit-Deviance-and-scaled-deviance/m-p/77662#M3699
<HTML><HEAD></HEAD><BODY><P>Of course! That makes a lot of sense.</P><P></P><P>So then does the scale parameter essentially add another term to the assumed relationship between the variance and mean of the distribution?</P><P></P><P>Thanks John!</P></BODY></HTML>
Fri, 15 Mar 2013 20:34:52 GMT (Kojema)

Re: Assessing fit: Deviance and scaled deviance
https://communities.sas.com/t5/Statistical-Procedures/Assessing-fit-Deviance-and-scaled-deviance/m-p/77660#M3697
<HTML><HEAD></HEAD><BODY><P>Thanks for your reply John,</P><P></P><P>How interesting, the deviance/df using the normal distribution is indeed equal to the residual variance. Perhaps this is just a mathematical fact; it would be interesting to compare the calculation of the two (but I'm not that ambitious!! haha)</P></BODY></HTML>
Fri, 15 Mar 2013 20:23:01 GMT (Kojema)

Re: Assessing fit: Deviance and scaled deviance
https://communities.sas.com/t5/Statistical-Procedures/Assessing-fit-Deviance-and-scaled-deviance/m-p/77659#M3696
<HTML><HEAD></HEAD><BODY><P>Thanks for the reply, Steve.</P><P></P><P>Glad to hear that what I've done seems reasonable!</P><P></P><P>The interaction terms I mentioned refer to select interactions between variables that I was interested in. The full model didn't include all possible interactions, but the nonsignificant interaction terms were removed sequentially using backwards elimination (most variables and interactions were retained, however). The dev/df ratio did not change much at all when eliminating these terms. I imagine that there are indeed other important explanatory variables "out there" that were not considered in the analysis, which could explain problems with fit.</P><P></P><P>Madison</P></BODY></HTML>
Fri, 15 Mar 2013 20:18:29 GMT (Kojema)

Assessing fit: Deviance and scaled deviance
https://communities.sas.com/t5/Statistical-Procedures/Assessing-fit-Deviance-and-scaled-deviance/m-p/77656#M3693
<HTML><HEAD></HEAD><BODY><P>Hello,</P><P></P><P>I was wondering if anyone could clarify the use of deviance and/or scaled deviance to assess model fit (in PROC GENMOD).</P><P></P><P>I understand that a deviance/df value that is much greater or smaller than 1 could indicate over- or underdispersion of the response variable (or model misspecification). This could lead to incorrect estimation of the standard errors of parameters, and thus misinterpretation of their statistical significance. However, it is also my understanding that with the use of certain error distributions, such as the gamma or negative binomial, a scale parameter is estimated (I am assuming to model this over/underdispersion?). In these cases, is this parameter used to correct for the effect of over/underdispersion on error estimates, i.e. does SAS's output give estimates of parameter SEs that are corrected for over/underdispersion? How does this translate to affect "LR Statistics For Type 3 Analysis", if it does at all?</P><P></P><P>Along similar lines, if the scale parameter is used to explicitly model dispersion, I presume scaled deviance/df can be used to assess model fit rather than deviance/df?</P><P></P><P>I am asking these questions in the context of the following example, which left me unsure of the appropriate "next step":</P><P></P><P>Variables:</P><P>- Highly right-skewed continuous response variable that becomes normally distributed when log transformed</P><P>- 1 continuous independent variable, a couple of categorical independent variables, select interaction terms</P><P>Modelling approaches:</P><P>- General linear model approach with the log-transformed response variable had dev/df=3.7 (which I thought was odd given the normally distributed response variable, and the fact that the residuals were also normally distributed)</P><P>- Generalized linear model approach with log link function and gamma error distribution had dev/df=2.9, scaled dev/df=1.3, scale=0.45</P><P></P><P>The reported dev/df value made me question the validity of the chosen error distribution and link function, but I couldn't think of more appropriate ones given the circumstances. I am not sure where to "go" next; I hope this doesn't warrant nonparametric analysis!!!! However, if the scaled dev/df value is the appropriate term to use to assess model fit, I think I am okay...</P><P></P><P>Any clarification on any of the above questions/musings would be great!!</P><P></P><P>Thanks!</P><P></P><P>Madison</P></BODY></HTML>
Thu, 14 Mar 2013 20:01:32 GMT (Kojema)

Re: Proc Glimmix: RM ANCOVA
https://communities.sas.com/t5/Statistical-Procedures/Proc-Glimmix-RM-ANCOVA/m-p/103517#M5477
<HTML><HEAD></HEAD><BODY><P>Thanks Steve!</P><P></P><P>You were right, things ran a lot more smoothly using G-side specifications, and I was able to try out other covariance structures.</P><P></P><P>On a related note, the test of covariance parameters ended up being nonsignificant. This was a different result from when I used R-side covariance structures, but when I removed the Random Month statement (leaving a significant "random intercept / subject=stream(AgInt)" effect), it didn't change the interpretations of the fixed effects, so I am presuming I shouldn't really worry about this discrepancy.</P><P></P><P>I also took your recommendation re: fitting, and used the Laplace method (quad gave the same results). Would you be able to recommend a denominator df adjustment? I used the "betwithin" option but can't say that I would have any good reason for choosing it over another method!!</P><P></P><P>Thanks</P><P></P><P>Madison</P></BODY></HTML>
Mon, 11 Feb 2013 19:02:06 GMT (Kojema)

Proc Glimmix: RM ANCOVA
https://communities.sas.com/t5/Statistical-Procedures/Proc-Glimmix-RM-ANCOVA/m-p/103515#M5475
<HTML><HEAD></HEAD><BODY><P>Hello, I am performing an analysis that is kind of new to me, and is more complicated than any examples I can find online. I am a little bit over my head, and am experiencing convergence issues when I include certain interaction terms... I was wondering if anyone could answer some questions...</P><P></P><P>DESCRIPTION</P><P></P><P>- Goal is to test the difference in nitrogen export ("TN_Exp") from watersheds with different agricultural intensities ("AgInt", 3 levels)</P><P>- Measured monthly (March through Oct) exports of nitrogen in 19 watersheds, for 12 years</P><P>- Also wanted to test the effect of monthly precipitation ("RunoffDepth", covariate) on export</P><P>- And ecological area ("EcoA", 4 levels, but not fully crossed with AgInt, so I have nested EcoA within AgInt: "EcoA(AgInt)")</P><P>- And whether there was a linear trend in the 12 years ("Year", covariate)</P><P>- And whether there was a seasonal effect ("Month", 8 levels)</P><P>- As this is an RM design, I included an R-side random Month term with Stream(AgInt) as subject. I didn't necessarily expect temporal dependence between years, aside from any linear trend; however, see questions...</P><P>- TN_Exp is a right-skewed continuous variable, thus I used a gamma distribution with a log link function</P><P>- Stream(AgInt) was included as a random effect, as I did not care about specific differences between streams, but wanted to control for unmeasured characteristics at the watershed level</P><P></P><P>CODE</P><P></P><P>Proc glimmix data= ExpCoeff;</P><P>Class AgInt Month Stream EcoA;</P><P>Model TN_Exp = Month AgInt EcoA(AgInt)</P><P> RunoffDepth Year</P><P> AgInt*Month</P><P> RunoffDepth*AgInt RunoffDepth*Month Year*AgInt Year*Month</P><P> RunoffDepth*Year</P><P>/ dist=gamma link=log solution ddfm=kenwardroger;</P><P>Random Stream(AgInt) / solution;</P><P>Random Month / Subject= Stream(AgInt) Type= cs residual;</P><P>covtest 'indep' indep;</P><P>nloptions maxiter=2000 tech= newrap;</P><P>Run;</P><P></P><P>QUESTIONS</P><P>1) In the "solution for random effects", all of the p-values for the estimated coefficients of each stream are &gt;0.05 (in fact, the lowest is 0.41). Does this mean that the random effect of stream can be removed?</P><P>2) Is it necessary to nest Stream within AgInt? Each stream has a different name... or perhaps I should be doing more complicated nesting since there are multiple categorical variables????</P><P>3) For the repeated statement, should Month be nested within Year? I don't think so, as I am assuming the correlations between months are similar from year to year...</P><P>4) For the Random Month effect, I would expect an ANTE(1) or TOEP variance-covariance matrix would be better than CS, but I get the following error when I use anything more complicated than CS:</P><P> WARNING: Obtaining minimum variance quadratic unbiased estimates as starting values for the covariance parameters failed. How might I be able to fix this?</P><P>5) If I wanted to consider temporal autocorrelation between years, would I include an ADDITIONAL random statement?? i.e. Random Year / Subject= Stream(AgInt)</P><P>6) When I include interaction terms with EcoA(AgInt), the model does not converge... I am clearly not understanding something... can anyone explain this or enlighten me? Perhaps this is because I have interaction terms with AgInt, within which EcoA is nested?</P><P>7) When I include three-way interaction terms (i.e. RunoffDepth*AgInt*Year RunoffDepth*Month*Year Year*AgInt*Month) the model does not converge. Should I allow more iterations, or is the model perhaps simply too complicated? Is it acceptable to not include three-way interaction terms? It would be onerous to interpret such a complicated model anyways... Is it acceptable to simply choose interaction terms you have reason to believe would exist / you are interested in?</P><P>8) If you happen to notice any potential errors, I would love to know!!!</P><P></P><P>THANK YOU SO MUCH TO ANYONE WHO HAS ANY INSIGHT!!!!</P><P></P><P>Madison</P></BODY></HTML>
Fri, 08 Feb 2013 18:08:50 GMT (Kojema)