08-03-2012 08:09 PM
I am new to SAS, so please bear with me. I am trying to figure out the right statistical model to use for a data set I am working with.
Background: The data gives a score, on a scale of 0 to 4, of how severe a lesion is for 4 separate areas (regions). There are 4 different treatment groups. The treatment group would be the independent variable and the score would be the dependent variable. Using PROC FREQ each treatment group had a different distribution of scores.
This is what I used for the PROC FREQ
tables trt cage score trt*score region*trt*score/chisq;
We are most interested in the trt*score output, but the region*trt*score output would be interesting as well. What we want to find out (and we are not sure whether we can with PROC FREQ or whether we need something else) is whether there are any significant differences between the score distributions of the treatment groups.
We had used an ANOVA and were able to find out this information; however, using ANOVA may be controversial because the data are categorical rather than truly continuous. Someone had mentioned using a Poisson regression, but I am not sure whether that model would answer my question.
If any more information is needed, please let me know.
I am super new to SAS, so please detail everything as much as possible. Thanks for all the help!! (From a super confused girl)
08-06-2012 08:34 AM
This might be a bit daunting for someone new to SAS, but I would suggest you look at PROC GENMOD, which uses a generalized linear model to fit what looks like (to me) a multinomial response.
Perhaps something like the following (see also Example 39.4, Ordinal Model for Multinomial Data, in the documentation for the GENMOD procedure):
proc genmod data=yourdata;
class trt cage region;
model score = trt cage region trt*cage trt*region cage*region trt*cage*region / dist=multinomial link=cumlogit aggregate type3;
/* some ESTIMATE statements will need to be added here to get the odds ratios */
run;
This will provide tests about the effects of interest. Comparisons will depend on the odds ratios. This assumes that the regions within a subject are independent. If you want to treat them as a repeated observation on the subject, then PROC GLIMMIX will be the tool to use. However, as a first cut at the data, try the genmod approach.
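To give an idea of what an ESTIMATE statement looks like, here is a minimal sketch. It assumes a reduced, main-effects-only model (not the full model above), a data set named yourdata, and a trt variable with exactly four levels; the labels are made up for illustration. With interaction terms in the model, the coefficients would also have to cover the interaction parameters, so treat this only as a starting point.

```sas
proc genmod data=yourdata;
   class trt region;
   /* main effects only: an assumption made here to keep the contrasts simple */
   model score = trt region / dist=multinomial link=cumlogit type3;
   /* coefficients follow the sorted order of the four trt levels */
   estimate 'trt level 1 vs level 2' trt 1 -1 0 0 / exp;
   estimate 'trt level 1 vs level 4' trt 1 0 0 -1 / exp;
run;
```

The EXP option exponentiates each estimate, so the output reports a cumulative odds ratio along with the log odds ratio. Which end of the score scale the odds refer to depends on how the response levels are ordered, so check the response profile table before interpreting the direction.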
08-08-2012 06:53 PM
Thank you for the quick answer. Now that I have gotten a chance to try out this method I am incredibly confused (your instruction was straightforward and great, but the output really had me scratching my head)...I think I need a lot more help :s
When I read up on a multinomial response, I agree that my data does fit this approach. To my understanding the regions are independent within a subject (i.e. different sections of the intestine). I would also like to look at trt*score and region*trt*score. Would I be able to add that into the model statement like this:
model score = trt cage region trt*score trt*cage trt*region cage*region region*trt*score trt*cage*region / dist=multinomial link=cumlogit aggregate type3;
When I do this I get told: "WARNING: The relative Hessian convergence criterion of 8.8068819395 is greater than the limit of 0.0001. The convergence is questionable."
When I set it up the way you suggested I get told: "WARNING: Negative of Hessian not positive definite."
I tried to look into how to write estimate statements, but with each factor having at least 4 levels I was unsure how to set up the coding without at least 2 levels ending up looking the same.
Also, when looking at the Analysis Of Maximum Likelihood Parameter Estimates, I am getting confused as to what the Pr > ChiSq is telling me is significant. If I had trt*region, would it be telling me that the specific trt and the specific region are significantly associated? (correlated?...am I even on the right track?)
I am so sorry for how novice I am at this...I kind of got shoved into doing these types of statistics when I really only understand the bare minimum.
Thanks for any and all of your help, I really appreciate it!!
08-09-2012 07:52 AM
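I think the first warning (the relative Hessian convergence criterion) comes from including the trt*score and region*trt*score terms as independent variables. Score is the response, so it cannot also appear on the right-hand side of the MODEL statement. The warning that comes from my suggestion is a hint to me that the data are sparse, or quasi-separated.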
So, take a quick look back at the PROC FREQ output. For the tables for score*trt*cage*region, are there tables where all of the values are missing? Are there tables where all of the trt*cage*region have the same score value? Either of these will just hammer the chances that a full model with as many levels as are specified will ever converge, or end up with decent estimates (Hessian needs to be positive definite to get standard errors).
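One way to make that check easier is to print the full cross-classification as a single list, including the combinations that never occur. This is only a sketch, assuming the data set is named yourdata and the variables are named as in the original PROC FREQ call:

```sas
proc freq data=yourdata;
   /* LIST prints one row per combination instead of many two-way tables; */
   /* SPARSE also lists combinations that have zero observations          */
   tables trt*cage*region*score / list sparse;
run;
```

Rows with a frequency of zero, or a trt*cage*region combination where only one score value has a nonzero count, are the cells that are likely to cause trouble for the full model.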
I am kind of curious about the cage variable. It appears from your PROC FREQ code that you are not really interested in its effect--it may be a "nuisance" variable whose effect is only to add variability that you want to remove. If so, I offer PROC GLIMMIX (I really am taking someone who is just learning to swim and throwing them in the pool with Missy Franklin). How about:
proc glimmix data=yourdata oddsratio;
class trt cage region;
model score = trt region trt*region / dist=multinomial link=cumlogit ;
/* some ESTIMATE statements will need to be added here to get the odds ratios comparing treatments at different regions from the interaction term */
run;
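Note that cage appears in the CLASS statement above but not in the model. If the idea is to absorb cage-to-cage variability, one possible way (an assumption on my part, not something the code above does) is to add cage as a random effect. A minimal sketch, assuming the data set is named yourdata and that cage identifiers are unique across treatments:

```sas
proc glimmix data=yourdata oddsratio;
   class trt cage region;
   model score = trt region trt*region / dist=multinomial link=cumlogit;
   /* random intercept for each cage; if cage numbers repeat across    */
   /* treatments, subject=cage(trt) would be the appropriate form      */
   random intercept / subject=cage;
run;
```

With few cages or sparse data this model may also have trouble converging, in which case dropping the trt*region interaction is a reasonable first simplification.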
Last bit of info is about the tests in GENMOD (and GLIMMIX). If the type 3 test is "significant", it means something is different from something else. If trt*region is significant, it means that the odds ratios for some set of scores are not the same for all treatments in all regions--that's why the ESTIMATE statement becomes important. It teases out which of the odds ratios differ. And I agree, getting them to do exactly what you want is not easy. It is not even mildly difficult. It's hard.
And so, after all of this, I want to go back to PROC FREQ and suggest the use of Cochran-Mantel-Haenszel tests. If cage is nothing more than a rep/blocking factor, try the following:
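A minimal sketch of that request, assuming the data set is named yourdata as in the earlier examples and that region is the stratifying variable, so it is listed first in the TABLES request:

```sas
proc freq data=yourdata;
   /* region is the stratum; trt is the row variable and score the column variable */
   tables region*trt*score / cmh;
run;
```

The CMH option requests the Cochran-Mantel-Haenszel statistics for trt by score, adjusted for region.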
At the end of all of the output should be a table that looks like this (taken from Example 36.7 for the FREQ Procedure; yours will differ):
Statistic   Alternative Hypothesis    DF    Value     Prob
2           Row Mean Scores Differ     1    8.3052    0.0040
Of these, look at the test for Row Mean Scores Differ. This will tell you if there is a relationship between treatment and score after controlling for region. This approach is simpler, and may address all of your concerns. If you have unequal representation of subjects by cage (especially subjects per treatment), then change the tables statement to: