tarajenson
Calcite | Level 5

Hi friends,

I've been spinning my wheels for weeks trying to get valid, rational SEs and confidence intervals out of my Cox models for complex survey data (NHANES). Per CDC/NHANES guidelines for obtaining valid variance estimates when analyzing subgroups, I'm using a DOMAIN statement with an indicator variable in PROC SURVEYPHREG to select my observations of interest (not subsetting the analysis group, and not using BY or WHERE). Note: my subgroup (domain) of interest is only ~5k out of ~100k total observations, and when I run my models I get huge SEs and no CIs at all. For the other domain (i.e., everyone not in the 5k subgroup), the SEs and CIs are rational.

 

The only hints I've garnered from trawling the SAS documentation, Google, and crowd-sourcing are that it relates to the smallness of the domain being analyzed (hence the DF of "infinity" in my likelihood ratio test output), and that "the usual assumptions that are required for a likelihood ratio test do not hold for the pseudo-likelihood that is used by PROC SURVEYPHREG (Rao, Scott, and Skinner 1998), leading to other methods of testing the global null hypothesis, such as the Wald test discussed in the following paragraph" (from the SURVEYPHREG documentation). However, I can't find specifics anywhere on alternatives for generating SEs/CIs in this scenario beyond what SURVEYPHREG kicks out.

 

I think I'm starting to grok why I'm getting missing CIs and massive SEs. I know SURVEYPHREG is behaving as it should, and nothing I do to tweak its options and parameters makes any difference to the SEs/CIs. The only relevant suggestion I've received came from someone who reviewed a manuscript where the researchers hit the same issue and simply footnoted in their tables that valid CIs could not be generated due to the small domain relative to the overall superset.

 

Quick screenshot of my output for my domain of interest, followed by my code:

[Screenshot of SURVEYPHREG output for the domain of interest: SASoutput.jpg]

 

proc surveyphreg data=diss.superset_99to18_wmort_v7;
   cluster SDMVPSU;                  /* NHANES primary sampling units */
   strata SDMVSTRA;                  /* NHANES strata */
   class RIDRETH1 marstat edulev truesmkstat;
   model PERMTH_EXM*alz_mort_dichot(0) = UCD_creatadj INDFMPIR RIDAGEYR RIAGENDR
         RIDRETH1 marstat edulev truesmkstat LBXBPB / rl;
   domain indic_ucd99thru18('1');    /* select the ~5k subgroup as a domain */
   weight wgt_ucd_99thru18;
run;

 

3 REPLIES
OsoGris
SAS Employee

I noticed you have huge standard errors, which are indicative of a monotone likelihood problem. We have a SAS Note on this:

 

Problem Note 13679: A binary or categorical covariate with no events in one level may cause monotone likelihood convergence problems resulting in large standard errors

http://support.sas.com/techsup/notes/v8/13/679.html
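
If it helps, a quick way to check for this is to cross-tabulate each CLASS variable against the event indicator within the domain. This is only a sketch, using the data set and variable names from your posted code:

proc freq data=diss.superset_99to18_wmort_v7;
   where indic_ucd99thru18 = '1';   /* look inside the domain only */
   /* any cell with zero events flags a problem level */
   tables (RIDRETH1 marstat edulev truesmkstat)*alz_mort_dichot / norow nocol nopercent;
run;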

tarajenson
Calcite | Level 5

Hi OsoGris,

Thank you for your reply and guidance! I looked into categorical variables with no events in some categories, as this was suggested by other colleagues and peers as well. However (and I should have led with this in my post), this happens even in my bare-bones model stripped of all covariates, when my model statement is just:

 

model PERMTH_EXM*alz_mort_dichot(0) = UCD_creatadj;

 

For good measure, though, I did check the categorical variables from my fully adjusted models, and there was one variable with no events in two of its levels, so I collapsed that variable so that's no longer the case. Re-running the fully adjusted model, having confirmed there are no zero-event levels in any categorical variable, produces the same large SEs and missing CIs (just as in the base no-other-covariates model).
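
(For anyone following along, the collapse was just a data-step recode, along the lines of the sketch below; the variable, its level values, and the numeric coding shown here are placeholders, not my actual recode.)

data work.superset_collapsed;
   set diss.superset_99to18_wmort_v7;
   /* placeholder example: merge two sparse levels into one;      */
   /* assumes a numerically coded variable                        */
   if truesmkstat in (3, 4) then truesmkstat_c = 3;
   else truesmkstat_c = truesmkstat;
run;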

 

Also, this does not happen with PROC PHREG (even with a WEIGHT statement), only with SURVEYPHREG. PROC PHREG produces normal SEs and CIs, presumably because its variance estimates are generated using only the ~5k subset of interest (selected via a WHERE statement), not the ~101k "superset" of all the data that CDC directs must be used for NHANES analyses in SURVEYPHREG, where the subsample of interest is selected in the model by a DOMAIN statement and an indicator variable rather than a WHERE statement that would just subset the data.
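
For reference, the comparison run looks like the sketch below (same names as my earlier code, shown here with just the bare-bones model; note that PHREG treats the WEIGHT as case weights and knows nothing about the survey design, so its SEs are not design-correct):

proc phreg data=diss.superset_99to18_wmort_v7;
   where indic_ucd99thru18 = '1';   /* subset directly instead of DOMAIN */
   weight wgt_ucd_99thru18;
   model PERMTH_EXM*alz_mort_dichot(0) = UCD_creatadj / rl;
run;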

 

Again I'm back to the hints I've seen that it may be related to the "superset" (~101k) being so much larger than the subsample going into the model (~5k), but I have no idea how to get around that in a more valid way than just using a WHERE statement instead of DOMAIN and footnoting that the likelihood-based approach cannot provide valid SEs or CIs in this case.

 

Other thoughts? 🙂

 

SteveDenham
Jade | Level 19

For some problems where the SE blows up, rescaling can help a lot. I noticed that the sum of weights is about 4.6 x 10**7 for the ~5K records used, so you are talking about an average weight of roughly 10,000 per record. So what happens if you divide all of your weights by 1,000? The relative weights remain the same, and perhaps convergence and estimability can be improved. Of course, you have to remember this scaling when interpreting any results.
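
Something along these lines (a sketch; the new variable name is just an example):

data work.rescaled;
   set diss.superset_99to18_wmort_v7;
   wgt_rescaled = wgt_ucd_99thru18 / 1000;   /* relative weights unchanged */
run;

Then point the WEIGHT statement in the SURVEYPHREG step at wgt_rescaled.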

 

The other issue may be inconsistency in weights by class or strata. Potentially, you have 150 possible class-level combinations (5 x 2 x 5 x 3), and if the sample sizes and/or weights are such that some class levels consistently have very small values and others very large ones, you may again see this sort of behavior.
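
A quick way to eyeball that (a sketch, using the names from the original post):

proc means data=diss.superset_99to18_wmort_v7 n min mean max;
   where indic_ucd99thru18 = '1';   /* inspect within the domain */
   class RIDRETH1 marstat edulev truesmkstat;
   var wgt_ucd_99thru18;
run;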

 

Unfortunately, for the latter problem, I don't see an easy out, other than collapsing some class levels or eliminating one or more class variables.

 

SteveDenham

