Hi everyone,
I am working with a data set of patients containing many demographic covariates and some health-related explanatory variables. I want to include some of them (those of direct interest) in the MODEL statement and some in STRATA, but I am unsure how to choose between them. Any helpful comments would be appreciated.
Thanks!
Isaac,
Harrell's book "Regression Modeling Strategies" has good advice on model building. It needs a new edition for the examples (both the S+ and SAS code are old), but the thought processes are good.
I generally only put a variable in as a STRATA variable if the proportional hazards assumption is not met; otherwise I put it in the MODEL statement, using a CLASS statement if it is categorical. One way to check proportionality is to plot the unadjusted survival curves by the class variable on the log(-log) scale. If they are approximately parallel, then the assumption holds reasonably. Another, more formal way to test it is to use the ASSESS statement in PHREG.
Doc Muhlbaier
Duke
Doc Muhlbaier,
I have tested the PH assumption on some covariates and seen violations. After I put them in CLASS, I ended up with another problem. Most of them have a large number of levels, e.g. Primary Care Team with 54 levels, Diagnosis Related Group (DRG) with more than 85 levels, and Principal Diagnosis with nearly 60 levels, leading to a huge design matrix and ambiguous global test results (P-Value {LR} = 0.67; P-Value {Score} < 0.001; P-Value {Wald} = 1). I also tried different effect selection methods, but saw no improvement. What would you recommend to deal with this? Thanks so much!
Your model is probably overspecified. Another reason to look at Harrell's book is for his sage advice on how many parameters a model can support relative to the number of outcomes (not total sample size, but the number of failures). I don't have the book at home, but I think that it is 10-15 outcomes per degree of freedom in the fully specified model. You have about 200 d.f., which requires 2000-3000 failures. Likely you don't have that. This requires some hard choices and likely needs some clinical input to collapse the categories in a meaningful way.
The data set has 3108 records, with 372 event times (nearly 88% of records are right-censored). So, short of running clinical trials, is there any way to overcome this problem? Perhaps grouping levels with some technique? Thanks!
88% right censoring is not unusual; after all, most patients survive (we certainly hope so!). Your target is about 35 d.f. I would start by dumping either DRG or Principal Dx. They are highly correlated (if you do a PROC FREQ on DX*DRG you will see a sparse matrix). Then you are going to need to combine the different levels of the remaining ones; there are already published schemes in the literature for combining either Dx or DRG into groups with some cohesiveness. Lastly, you've got to get the Physician Care Teams pared down; maybe combine by specialty or location.
I say this, fully expecting that your management would like to compare the Physician Care teams. You just don't have enough data to do that.
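The arithmetic behind the 10-15 outcomes per d.f. guideline can be sketched in a few lines of Python. This is only an illustration: the level counts are the approximate ones quoted in this thread ("more than 85", "near 60"), each CLASS variable with k levels is assumed to cost k - 1 d.f., and the function names are made up for the example.

```python
def df_budget(n_events, events_per_df):
    """Largest whole number of model d.f. supportable at a given events-per-d.f. ratio."""
    return n_events // events_per_df


def events_needed(n_df, events_per_df):
    """Number of failures the guideline asks for to support n_df degrees of freedom."""
    return n_df * events_per_df


# Approximate level counts quoted in the thread (assumed exact here for illustration).
levels = {"Primary Care Team": 54, "DRG": 85, "Principal Diagnosis": 60}

# A CLASS variable with k levels contributes k - 1 d.f. to the model.
class_df = sum(k - 1 for k in levels.values())

n_events = 372  # failures reported in the thread (3108 records in total)

print("d.f. from the three CLASS variables:", class_df)
print("failures needed at 10 per d.f.:", events_needed(class_df, 10))
print("failures needed at 15 per d.f.:", events_needed(class_df, 15))
print("d.f. budget with 372 failures at 10 per d.f.:", df_budget(n_events, 10))
print("d.f. budget with 372 failures at 15 per d.f.:", df_budget(n_events, 15))
```

With 372 failures the budget comes out between roughly 24 and 37 d.f., which is where the "about 35 d.f." target comes from at the generous end of the guideline.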
One possibility to explore is to totally shift gears out of survival analysis. Maybe some sort of cost measure would be appropriate. Then you have a continuous outcome and can reasonably have more d.f. Check Tsiatis and Angstrom for some papers on analyzing cost data; there are some important nuances to be aware of. (They will have some references or be referred to by others in the field.)
Dr Muhlbaier
Thanks so much for your helpful comments. I found Tsiatis's papers on the topic but couldn't find anything by Angstrom. Is there a specific keyword I should search for?
Typo. Anstrom. This paper may also be of interest:
Techniques for estimating health ca... [Clinicoecon Outcomes Res. 2012] - PubMed - NCBI
Dr. Muhlbaier
Should the 35 d.f. be reserved for the MODEL variables only, or for the MODEL and STRATA variables together?
Isaac,
I don't really know here. Remember these are guidelines, not mathematical proofs, so there is some wiggle room. You might be able to not count the strata in the total d.f., but what you risk are some false positives. If you include the strata in an interaction term, then the d.f. there definitely count.
You might want to do some bootstrap resampling to get some handle on the variability of the estimates.
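As a rough illustration of the resampling idea (not the PHREG workflow itself), here is a minimal Python sketch of a nonparametric bootstrap. It resamples patients with replacement and recomputes a statistic each time. The statistic used here, a crude event rate per unit of follow-up, is only a simple stand-in for a Cox coefficient, and the ten records are made-up toy data, not values from this thread.

```python
import random


def crude_rate(records):
    """Events per unit of follow-up time; records are (time, event) pairs."""
    total_time = sum(t for t, _ in records)
    return sum(e for _, e in records) / total_time


def bootstrap(records, statistic, n_boot=1000, seed=0):
    """Resample whole records with replacement and recompute the statistic each time."""
    rng = random.Random(seed)
    n = len(records)
    reps = []
    for _ in range(n_boot):
        sample = [records[rng.randrange(n)] for _ in range(n)]
        reps.append(statistic(sample))
    return reps


def percentile_interval(reps, alpha=0.05):
    """Simple percentile interval taken from the sorted bootstrap replicates."""
    s = sorted(reps)
    lo = s[int((alpha / 2) * (len(s) - 1))]
    hi = s[int((1 - alpha / 2) * (len(s) - 1))]
    return lo, hi


# Hypothetical toy data: (follow-up time, event indicator with 1 = failure, 0 = censored).
records = [(5.0, 1), (12.0, 0), (3.5, 1), (20.0, 0), (8.0, 0),
           (15.0, 1), (9.0, 0), (30.0, 0), (2.0, 1), (18.0, 0)]

reps = bootstrap(records, crude_rate, n_boot=500, seed=42)
low, high = percentile_interval(reps)
print("point estimate:", crude_rate(records))
print("95% percentile interval:", (low, high))
```

For the real analysis, the statistic would be replaced by a refit of the model on each resample, and the spread of the resulting estimates gives a feel for how stable they are.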
Doc
Dr Muhlbaier;
I have got some answers for DRG and Principal Diagnosis. The VA systems assign the Principal Diagnosis based on the International Classification of Diseases (ICD-9), and by looking at the codes at this link
http://icd9cm.chrisendres.com/index.php?action=contents , I can group them into broader categories and reduce the number of levels. The same applies to DRG, since I found that "DRGs may be further grouped into Major Diagnostic Categories (MDCs)", so using this data set, http://www.cms.hhs.gov/AcuteInpatientPPS/downloads/FY_2010_FR_Table_5.zip, I want to do the same thing for DRG.
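One way the ICD-9 grouping could look is sketched below in Python. It collapses a diagnosis code to its ICD-9-CM chapter using the standard three-digit category ranges. It assumes codes are stored with the decimal point (e.g. "428.0") or as bare three-digit categories; the function name and the short chapter labels are made up for the example, and whether chapters are the right clinical grouping for this data is a separate question.

```python
# Standard ICD-9-CM chapter ranges by three-digit category; labels are abbreviated.
ICD9_CHAPTERS = [
    (1, 139, "Infectious and parasitic diseases"),
    (140, 239, "Neoplasms"),
    (240, 279, "Endocrine, nutritional, metabolic, immunity"),
    (280, 289, "Blood and blood-forming organs"),
    (290, 319, "Mental disorders"),
    (320, 389, "Nervous system and sense organs"),
    (390, 459, "Circulatory system"),
    (460, 519, "Respiratory system"),
    (520, 579, "Digestive system"),
    (580, 629, "Genitourinary system"),
    (630, 679, "Pregnancy, childbirth, puerperium"),
    (680, 709, "Skin and subcutaneous tissue"),
    (710, 739, "Musculoskeletal and connective tissue"),
    (740, 759, "Congenital anomalies"),
    (760, 779, "Perinatal conditions"),
    (780, 799, "Symptoms, signs, ill-defined conditions"),
    (800, 999, "Injury and poisoning"),
]


def icd9_chapter(code):
    """Map an ICD-9-CM diagnosis code such as '428.0' to a chapter label."""
    code = code.strip().upper()
    if code.startswith("V"):
        return "Supplementary (V codes)"
    if code.startswith("E"):
        return "External causes (E codes)"
    try:
        # Assumes the part before the decimal point is the three-digit category.
        category = int(code.split(".")[0])
    except ValueError:
        return "Unknown"
    for lo, hi, name in ICD9_CHAPTERS:
        if lo <= category <= hi:
            return name
    return "Unknown"


for example in ["428.0", "250.00", "486", "V58.1"]:
    print(example, "->", icd9_chapter(example))
```

Grouping to chapters brings Principal Diagnosis down to at most 19 levels, which may still need further collapsing of rare chapters to fit the d.f. budget discussed earlier in the thread.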
Dr Muhlbaier;
For a CLASS variable, I find that the PH assumption is satisfied for one level but not for another. In other words, PH holds for (admissionsource = NHCU) but not for (admissionsource = domiciliary). So what should I do? Should I put "admissionsource" in STRATA or not?
Thanks!