issac
Fluorite | Level 6

Hi everyone;

I am working with a data set of patients containing many demographic covariates and some health-related explanatory variables. I want to include some of them (those of direct interest) in the MODEL statement and some in STRATA, but I am unsure how to choose between the two. Any helpful comments would be appreciated.

Thanks!

Accepted Solution
Doc_Duke
Rhodochrosite | Level 12

Isaac,

Harrell's book "Regression Modeling Strategies" has good advice on model building. It needs a new edition for its examples (both the S+ and SAS code are old), but the thought processes are good.

I generally only put a variable in as a STRATA variable if the proportional hazards assumption is not met; otherwise I use a CLASS statement. One way to check proportionality is to plot the unadjusted survival curves by the class variable on the log(-log(S)) scale. If they are approximately parallel, then the assumption holds reasonably well. Another way, which tests it more formally, is to use the ASSESS statement in PHREG.
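A minimal sketch of both checks, followed by a stratified fit, might look like the code below. The data set `patients` and the variables `time`, `status` (0 = censored), `age`, `sex`, and `admissionsource` are placeholders for illustration, not names from the original data.

```sas
ods graphics on;

/* Graphical check: under proportional hazards, the log(-log(S)) curves */
/* for the groups should be roughly parallel.                           */
proc lifetest data=patients plots=(survival loglogs) notable;
  time time*status(0);
  strata admissionsource;
run;

/* Formal check: supremum test based on cumulative martingale residuals. */
/* The PH option assesses each covariate in the MODEL statement.         */
proc phreg data=patients;
  model time*status(0) = age;
  assess ph / resample seed=12345;
run;

/* If PH looks untenable for a categorical variable, stratify on it:     */
/* each stratum gets its own baseline hazard and no coefficient is       */
/* estimated for the stratifying variable.                               */
proc phreg data=patients;
  class sex;
  model time*status(0) = age sex;
  strata admissionsource;
run;
```

Note that a stratified variable no longer has a hazard ratio in the output, so this only suits variables that are adjusted for rather than of direct interest.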

Doc Muhlbaier

Duke

issac
Fluorite | Level 6

Doc Muhlbaier;

I have tested the PH assumption on some covariates and seen violations. I then put them in the CLASS statement and ended up with another problem. Most of them have a large number of levels, e.g. Primary Care Team with 54 levels, Diagnosis Related Group (DRG) with more than 85 levels, and Principal Diagnosis with nearly 60 levels, which leads to a huge design matrix and ambiguous global test results (p-value {LR} = 0.67; p-value {Score} < 0.001; p-value {Wald} = 1). I also tried different effect selection methods, but saw no improvement. What would you recommend to deal with this? Thanks so much!

Doc_Duke
Rhodochrosite | Level 12

Your model is probably overspecified. Another reason to look at Harrell's book is for his sage advice on the sample size needed relative to the number of outcomes (not the total sample size, but the number of failures). I don't have the book at home, but I think that it is 10-15 outcomes per degree of freedom in the fully specified model. You have about 200 d.f., which requires 2000-3000 failures. Likely you don't have that. This requires some hard choices and likely needs some clinical input to collapse the categories in a meaningful way.
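The arithmetic behind that estimate can be sketched in a DATA step. The level counts are the approximate ones quoted earlier in the thread ("more than 85", "near 60"), so the result is a rough lower bound, and the variable names are made up for illustration.

```sas
data _null_;
  /* A CLASS variable with k levels uses k-1 degrees of freedom. */
  df_team = 54 - 1;   /* Primary Care Team                        */
  df_drg  = 85 - 1;   /* DRG, "more than 85" levels                */
  df_dx   = 60 - 1;   /* Principal Diagnosis, "near 60" levels     */
  df_total = df_team + df_drg + df_dx;

  /* Rule of thumb: 10 to 15 failures per degree of freedom. */
  events_low  = 10 * df_total;
  events_high = 15 * df_total;

  put df_total= events_low= events_high=;
run;
```

This gives 196 d.f. and roughly 1960 to 2940 required failures from these three variables alone, before any other covariates are counted.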

issac
Fluorite | Level 6

The data set has 3108 records, with 372 event times (nearly 88% of records are right-censored). So instead of performing clinical trials, is there no other way to overcome this problem? Perhaps grouping levels using some technique? Thanks!

Doc_Duke
Rhodochrosite | Level 12

88% right censoring is not unusual; after all, most patients survive (we certainly hope so!). Your target is about 35 d.f. I would start by dumping either DRG or Principal Dx. They are highly correlated (if you do a PROC FREQ on DX*DRG, you will see a sparse matrix). Then you are going to need to combine the different levels of the remaining ones; there are already some documents in the literature on ways to combine either Dx or DRG into groups with some cohesiveness. Lastly, you've got to get the Physician Care Teams pared down; maybe combine them by specialty or location.
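As a rough sketch of those first two steps: the d.f. budget follows from the 372 failures reported above, and the sparseness of the DX*DRG table can be summarized without printing a very wide crosstab. The data set `patients` and the variables `principal_dx` and `drg` are assumed names.

```sas
/* Degrees-of-freedom budget from the 10-15 failures per d.f. guideline. */
data _null_;
  events = 372;                               /* failures reported in the thread */
  df_max_conservative = floor(events / 15);   /* 15 failures per d.f.            */
  df_max_liberal      = floor(events / 10);   /* 10 failures per d.f.            */
  put df_max_conservative= df_max_liberal=;
run;

/* Cross-tabulate diagnosis by DRG; the OUT= data set has one row per */
/* combination that actually occurs in the data.                      */
proc freq data=patients noprint;
  tables principal_dx*drg / out=dx_by_drg;
run;

/* Compare observed combinations with the number of possible cells. */
proc sql;
  select count(*) as observed_cells,
         count(distinct principal_dx) * count(distinct drg) as possible_cells
  from dx_by_drg;
quit;
```

The budget works out to between 24 and 37 d.f., which brackets the target of about 35. If `observed_cells` is a small fraction of `possible_cells`, the two variables carry largely overlapping information.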

I say this, fully expecting that your management would like to compare the Physician Care teams.  You just don't have enough data to do that. 

One possibility to explore is to totally shift gears out of survival analysis. Maybe some sort of cost measure would be appropriate. Then you have a continuous outcome and can reasonably have more d.f. (Check Tsiatis and Angstrom for some papers on analyzing cost data; there are some important nuances to be aware of. They will have some references or be referred to by others in the field.)

issac
Fluorite | Level 6

Dr Muhlbaier


Thanks so much for your helpful comments. I have found Tsiatis's papers on the topic but didn't find anything from Angstrom. Is there a specific keyword I should search for?

issac
Fluorite | Level 6

Dr. Muhlbaier

Should the 35 d.f. be reserved for the MODEL variables only, or for the MODEL and STRATA variables all together?

Doc_Duke
Rhodochrosite | Level 12

Issac,

I don't really know here.  Remember these are guidelines, not mathematical proofs, so there is some wiggle room.  You might be able to not count the strata in the total d.f., but what you risk are some false positives.  If you include the strata in an interaction term, then the d.f. there definitely count.

You might want to do some bootstrap resampling to get some handle on the variability of the estimates.
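One way this could be set up is with PROC SURVEYSELECT followed by a BY-group fit. This is only a sketch: `patients`, `time`, `status` (0 = censored), `age`, and `los_prior` are placeholder names, and 200 replicates is an arbitrary choice.

```sas
/* Draw 200 bootstrap samples: unrestricted random sampling (with     */
/* replacement), each the same size as the original data set.         */
/* OUTHITS writes one record per selection; REPS= adds the variable   */
/* Replicate to the output data set.                                   */
proc surveyselect data=patients out=boot seed=20100101
                  method=urs samprate=1 outhits reps=200;
run;

/* Refit the same Cox model in every bootstrap sample. */
ods select none;                 /* suppress the printed output */
proc phreg data=boot;
  by replicate;
  model time*status(0) = age los_prior;
  ods output ParameterEstimates=boot_est;
run;
ods select all;

/* Spread of each coefficient across the bootstrap samples. */
proc means data=boot_est n mean std p5 p95;
  class parameter;
  var estimate;
run;
```

A wide spread, or replicates in which the model fails to converge, would be one more sign that the model asks too much of 372 events.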

Doc

issac
Fluorite | Level 6

Dr Muhlbaier;

I have found some answers for DRG and Principal Diagnosis. The VA systems assign the Principal Diagnosis based on the International Classification of Diseases (ICD-9), and by looking at the codes at this link, http://icd9cm.chrisendres.com/index.php?action=contents , I can group them into broader categories and reduce the number of levels. The same is possible for DRG, since I found that "DRGs may be further grouped into Major Diagnostic Categories (MDCs)", so I want to use this data set, http://www.cms.hhs.gov/AcuteInpatientPPS/downloads/FY_2010_FR_Table_5.zip, to do the same thing for DRG.
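A hedged sketch of that kind of grouping for the diagnosis codes, using a format over the standard ICD-9-CM chapter ranges. It assumes a character variable `principal_dx` holding codes such as '410.01' or '41001'; that name, the data set names, and the short labels are illustrative only.

```sas
proc format;
  value icd9chap
      1 - 139 = 'Infectious and parasitic'
    140 - 239 = 'Neoplasms'
    240 - 279 = 'Endocrine, nutritional, metabolic'
    280 - 289 = 'Blood and blood-forming organs'
    290 - 319 = 'Mental disorders'
    320 - 389 = 'Nervous system and sense organs'
    390 - 459 = 'Circulatory system'
    460 - 519 = 'Respiratory system'
    520 - 579 = 'Digestive system'
    580 - 629 = 'Genitourinary system'
    630 - 679 = 'Pregnancy and childbirth'
    680 - 709 = 'Skin and subcutaneous tissue'
    710 - 739 = 'Musculoskeletal system'
    740 - 759 = 'Congenital anomalies'
    760 - 779 = 'Perinatal conditions'
    780 - 799 = 'Symptoms and ill-defined conditions'
    800 - 999 = 'Injury and poisoning'
    other     = 'Unknown';
run;

data patients_grouped;
  set patients;
  length dx_group $40;
  /* V and E codes are supplementary classifications, not numeric categories. */
  if upcase(substr(left(principal_dx), 1, 1)) in ('V', 'E') then
    dx_group = 'Supplementary (V/E codes)';
  else
    /* The first three characters give the ICD-9 category, e.g. '410'. */
    dx_group = put(input(substr(left(principal_dx), 1, 3), ?? 3.), icd9chap.);
run;
```

This still leaves up to 19 groups, so a further clinically motivated collapse may be needed to stay within the d.f. budget discussed above.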

issac
Fluorite | Level 6

Dr Muhlbaier;


For one CLASS variable, I find that the PH assumption is satisfied for one level but not for another. In other words, PH holds for (admissionsource = NHCU) but not for (admissionsource = domiciliary). So what should I do? Should I put "admissionsource" in STRATA or not?


Thanks!
