Solved: Re: How to choose strata variable(s) in PROC PHREG?

issac · Posted 08-01-2012 03:17 PM

Hi everyone;

I am working with a data set of patients containing lots of demographic covariates and some health-related explanatory variables. I want to include some of them (those of direct interest) in the MODEL statement and some in STRATA, but I am confused about how to choose among them. Any helpful comments would be appreciated.

Thanks!

Doc_Duke · Posted 08-01-2012 03:41 PM

Isaac,

Harrell's book "Regression Modeling Strategies" has good advice on model building. It needs a new edition for examples (both the S+ and SAS code are old), but the thought processes are good.

I generally only put a variable in as a STRATA variable if the proportional hazards assumption is not met; otherwise use a CLASS statement. One way to check proportionality is to plot the unadjusted survival curves by the class variable. If they are approximately parallel, then the assumption holds reasonably. Another way, to more formally test it, is to use the ASSESS statement in PHREG.

Doc Muhlbaier

Duke
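
A minimal sketch of the two checks described in the post above, not code from the thread. It assumes a hypothetical data set `patients` with follow-up time `futime`, an event indicator `status` (0 = censored), a continuous covariate `age`, and a categorical covariate `admissionsource`; all of these names are placeholders.

```sas
ods graphics on;

/* Graphical check: unadjusted Kaplan-Meier curves by group, plus the   */
/* log(-log(S)) plot, whose curves should be roughly parallel under PH. */
proc lifetest data=patients plots=(survival loglogs);
   time futime*status(0);
   strata admissionsource;
run;

/* More formal check: ASSESS with the PH option examines proportional   */
/* hazards through cumulative score residuals; RESAMPLE requests a      */
/* simulation-based supremum test for each parameter.                   */
proc phreg data=patients;
   class admissionsource / param=ref;
   model futime*status(0) = age admissionsource;
   assess ph / resample;
run;
```

A covariate that clearly fails either check is a candidate for the STRATA statement instead of the MODEL statement.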

issac · Posted 08-01-2012 06:22 PM

Doc Muhlbaier;

I have tested the PH assumption on some covariates and seen violations. After I put them in the CLASS statement, I ended up with another problem. Most of them have a large number of levels: Primary Care Team has 54 levels, Diagnosis-Related Group (DRG) has more than 85, and Principal Diagnosis has nearly 60. This leads to a huge design matrix and ambiguous global test results (P-Value {LR} = 0.67; P-Value {Score} < 0.001; P-Value {Wald} = 1). I also tried different effect selection methods, but saw no improvement. What would you recommend to deal with this? Thanks so much!

Doc_Duke · Posted 08-01-2012 07:06 PM

Your model is probably over specified. Another reason to look at Harrell's book is for his sage advice on the sample size needed relative to the number of outcomes (not total sample size, but the number of failures). I don't have the book at home, but I think that it is 10-15 outcomes per degree of freedom in the fully specified model. You have about 200 d.f. which requires 2000-3000 failures. Likely you don't have that. This requires some hard choices and likely needs some clinical input to collapse the categories in a meaningful way.
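
The arithmetic behind that estimate can be sketched in a short DATA step. The 10 to 15 events per d.f. range and the figure of about 200 d.f. are the ones quoted in the post; the data set and variable names are made up for illustration.

```sas
/* Rule of thumb quoted above: 10 to 15 events (failures) per d.f. */
data events_needed;
   model_df    = 200;             /* approximate d.f. cited in the post */
   events_low  = 10 * model_df;   /* 2000 failures */
   events_high = 15 * model_df;   /* 3000 failures */
   put events_low= events_high=;
run;
```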

issac · Posted 08-01-2012 07:41 PM

The data set has 3108 records, with 372 event times (nearly 88% of records are right-censored). So instead of performing clinical trials, is there any way to overcome this problem? Perhaps by grouping levels with some technique? Thanks!

Doc_Duke · Posted 08-01-2012 09:02 PM

88% right censoring is not unusual; after all, most patients survive (we certainly hope so!). Your target is about 35 d.f. I would start by dumping either DRG or Principal Dx. They are highly correlated (if you do a PROC FREQ on DX*DRG you will see a sparse matrix). Then you are going to need to combine the different levels of the remaining ones; there are already published ways to combine either Dx or DRG into groups with some cohesiveness. Lastly, you've got to get the Physician Care Teams pared down; maybe combine by specialty or location.

I say this, fully expecting that your management would like to compare the Physician Care teams. You just don't have enough data to do that.

One possibility to explore is to totally shift gears out of survival analysis. Maybe some sort of cost measure would be appropriate. Then you have a continuous outcome and can reasonably have more d.f. (Check Tsiatis and Angstrom for some papers on analyzing cost data; there are some important nuances to be aware of. They will have some references or be referred to by others in the field.)
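
As a rough sketch of the two checks suggested in this post: the DATA step applies the 10 to 15 events per d.f. guideline to the 372 events reported earlier, and the PROC FREQ step cross-tabulates the two diagnosis variables. The data set name `patients` and the variable names `principal_dx` and `drg` are assumptions, not names given in the thread.

```sas
/* Degrees of freedom that 372 events can support under the guideline. */
data df_budget;
   n_events        = 372;
   df_conservative = floor(n_events / 15);   /* 24 */
   df_liberal      = floor(n_events / 10);   /* 37 */
   put df_conservative= df_liberal=;
run;

/* Cross-tabulate diagnosis by DRG. Many empty cells indicate that the */
/* two variables carry largely overlapping information.                */
proc freq data=patients;
   tables principal_dx*drg / norow nocol nopercent;
run;
```

The range of 24 to 37 is consistent with the target of about 35 d.f. mentioned above.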

issac · Posted 08-02-2012 09:22 AM

Dr Muhlbaier

Thanks so much for your helpful comments. I have found Tsiatis's papers on the topic but didn't find anything by Angstrom. Is there a specific keyword I should search for?

Doc_Duke · Posted 08-02-2012 11:11 AM

Typo. Anstrom. This paper may also be of interest:

Techniques for estimating health ca... [Clinicoecon Outcomes Res. 2012] - PubMed - NCBI

issac · Posted 08-02-2012 03:45 PM

Dr. Muhlbaier

Should the 35 d.f. be reserved for the MODEL variables only, or for the MODEL and STRATA variables together?

Doc_Duke · Posted 08-03-2012 09:17 AM

Issac,

I don't really know here. Remember these are guidelines, not mathematical proofs, so there is some wiggle room. You might be able to not count the strata in the total d.f., but what you risk are some false positives. If you include the strata in an interaction term, then the d.f. there definitely count.

You might want to do some bootstrap resampling to get some handle on the variability of the estimates.

Doc
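
One common way to set up the bootstrap suggested above is to resample with PROC SURVEYSELECT and refit the model by replicate. This is only a sketch under assumptions: the data set `patients`, the variables `futime`, `status`, `age`, and `admissionsource`, the seed, and the choice of 200 replicates are all placeholders.

```sas
/* Draw 200 bootstrap resamples: sampling with replacement (URS),      */
/* each the same size as the original data; OUTHITS writes one row     */
/* per selection so repeated records appear repeatedly.                */
proc surveyselect data=patients out=boot seed=20120803
                  method=urs samprate=1 outhits reps=200;
run;

/* Refit the stratified model in every resample and keep the estimates. */
ods exclude all;                      /* suppress printed output */
proc phreg data=boot;
   by replicate;
   model futime*status(0) = age;
   strata admissionsource;
   ods output ParameterEstimates=boot_est;
run;
ods exclude none;

/* Spread of the bootstrap estimates for each parameter. */
proc means data=boot_est n mean std p5 p95;
   class parameter;
   var estimate;
run;
```

A wide spread between the 5th and 95th percentiles suggests the estimate is unstable at the current number of events.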

issac · Posted 08-06-2012 12:45 PM

Dr Muhlbaier;

I have found some answers for DRG and Principal Diagnosis. The VA systems assign the Principal Diagnosis based on the International Classification of Diseases (ICD-9), and by looking at the codes at this link, http://icd9cm.chrisendres.com/index.php?action=contents , I can group them into broader categories and reduce the number of levels. The same applies to DRG, since I found that "DRGs may be further grouped into Major Diagnostic Categories (MDCs)", so using this data set, http://www.cms.hhs.gov/AcuteInpatientPPS/downloads/FY_2010_FR_Table_5.zip, I want to do the same thing for DRG.
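
A sketch of how the diagnosis grouping might be coded, collapsing ICD-9 codes into their chapter-level ranges with a user-defined format. It assumes the principal diagnosis is stored in a character variable `principal_dx` with leading zeros (for example '038.9'); the data set, variable, and format names are hypothetical. The DRG-to-MDC mapping would have to come from the CMS table and is not shown.

```sas
/* ICD-9 chapters, keyed on the numeric three-digit category. */
proc format;
   value icd9chap
        1 - 139 = 'Infectious and parasitic'
      140 - 239 = 'Neoplasms'
      240 - 279 = 'Endocrine, nutritional, metabolic'
      280 - 289 = 'Blood and blood-forming organs'
      290 - 319 = 'Mental disorders'
      320 - 389 = 'Nervous system and sense organs'
      390 - 459 = 'Circulatory system'
      460 - 519 = 'Respiratory system'
      520 - 579 = 'Digestive system'
      580 - 629 = 'Genitourinary system'
      630 - 679 = 'Pregnancy and childbirth'
      680 - 709 = 'Skin and subcutaneous tissue'
      710 - 739 = 'Musculoskeletal system'
      740 - 759 = 'Congenital anomalies'
      760 - 779 = 'Perinatal conditions'
      780 - 799 = 'Symptoms and ill-defined conditions'
      800 - 999 = 'Injury and poisoning'
      other     = 'Unclassified';
run;

data patients_grouped;
   set patients;
   length dx_group $40;
   /* V and E codes start with a letter and get their own group. */
   if anyalpha(substr(principal_dx, 1, 1)) then
      dx_group = 'Supplementary (V and E codes)';
   else
      dx_group = put(input(substr(principal_dx, 1, 3), 3.), icd9chap.);
run;
```

Some chapters will still have few events, so further merging on clinical grounds may be needed to stay within the d.f. budget discussed earlier.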

issac · Posted 08-08-2012 03:13 PM

Dr Muhlbaier;

For a CLASS variable, I find that the PH assumption is satisfied for one level but not for another. In other words, PH holds for (admissionsource = NHCU) but not for (admissionsource = domiciliary). So what should I do? Should I put "admissionsource" in STRATA or not?

Thanks!