DLBarker
Fluorite | Level 6

Simple Question:

 

Is there a way to alter the value of "n" used in the calculation of Standard Errors for the Logistic Procedure?

 

 

Details:

In order to produce an unbiased sample that represents the proper sampling rates of events and parameter values, the independent-observations assumption must sometimes be violated.  Additionally, certain hazard models use individual-time-period observations across time, but the sample is independent only at the level of individuals (each individual's time process is independent of the others, but the individual observations within a process are not).  Without correcting the standard errors for this downward bias, the test statistics are inflated and the risk of Type I error is greatly increased.  Is there a way to alter how the standard errors are calculated directly, in order to avoid making manual corrections to the Wald chi-square statistics and p-values?

 

As an example, consider the following data.  We wish to estimate the hazard rate on these data (ignoring many forms of potential bias for simplicity of the example).  The raw test statistics will use n = 13 in the calculation, but I want them to use n = 3 instead, since there are only 3 individual, independent processes:

Individual  Time Period  Y  X1  X2
1           1            0  4   5
1           2            0  4   5
1           3            0  2   5
1           4            0  2   4
1           5            0  2   4
1           6            1  1   4
2           3            0  4   3
2           4            0  5   4
2           5            0  6   4
3           2            0  3   2
3           3            0  2   2
3           4            0  2   1
3           5            1  2   1
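
For reference, a minimal DATA step sketch of this sample (the data set name indata is illustrative and matches the code used later in the thread):

DATA indata;
   INPUT Individual TimePeriod Y X1 X2;
   DATALINES;
1 1 0 4 5
1 2 0 4 5
1 3 0 2 5
1 4 0 2 4
1 5 0 2 4
1 6 1 1 4
2 3 0 4 3
2 4 0 5 4
2 5 0 6 4
3 2 0 3 2
3 3 0 2 2
3 4 0 2 1
3 5 1 2 1
;
RUN;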

 

SAS 9.4 EG 6.1 32-bit  SAS 9.4 EG 64-bit


12 REPLIES
Ksharp
Super User
I am not sure whether you are talking about conditional logistic regression?
If so, check the STRATA statement.
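
A minimal sketch of that approach, using the variable names from the sample data (illustrative only):

PROC LOGISTIC DATA=indata;
   STRATA Individual;                  /* conditions on each individual */
   MODEL Y(EVENT='1') = X1 X2;
RUN;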



SteveDenham
Jade | Level 19

My suggestion (as it almost always is, it seems) is to look at doing the logistic regression in PROC GLIMMIX, with subject as a RANDOM effect.  This should correctly cluster the data, and result in the correct degrees of freedom for tests and confidence intervals.  For a worked example, see Example 45.18 Weighted Multilevel Model for Survey Data in the PROC GLIMMIX documentation (SAS/STAT 14.1).
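
A rough sketch of what that could look like with the sample data (the option choices are illustrative; METHOD=QUAD requests adaptive quadrature):

PROC GLIMMIX DATA=indata METHOD=QUAD;
   CLASS Individual;
   MODEL Y(EVENT='1') = X1 X2 / DIST=BINARY LINK=LOGIT SOLUTION;
   RANDOM INTERCEPT / SUBJECT=Individual;   /* subject as a random effect */
RUN;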

 

Steve Denham

Rick_SAS
SAS Super FREQ

@SteveDenham I thought that you might suggest this. For those of us who are not experts in this area, could you briefly explain why you did not recommend GENMOD and the REPEATED statement (GEE approach)?
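
For context, the GENMOD/GEE formulation being referred to would look roughly like this (a sketch only; TYPE=EXCH is just one possible working correlation):

PROC GENMOD DATA=indata DESCENDING;
   CLASS Individual;
   MODEL Y = X1 X2 / DIST=BINOMIAL LINK=LOGIT;
   REPEATED SUBJECT=Individual / TYPE=EXCH;   /* GEE with clustering by individual */
RUN;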

SteveDenham
Jade | Level 19

@Rick_SAS, my concern here was not with the repeated nature of the data, but with the clustered nature by subject, which would constitute a random effect that isn't modeled in GENMOD or GEE.  The sample data really look like those in Example 45.18, without the sampling weights, which could be added in a subsequent analysis.

 

Steve Denham

DLBarker
Fluorite | Level 6

The sample data was cobbled together purely to describe the problem.  Specific information would be a violation of my company's intellectual property.  However, I found a solution to my problem and will share it.  It was far simpler than I imagined.

 

If the inflation factor is known, as in the example above:

 

13 observations / 3 independent subpopulations = 4.33 observations per independent group (the variance inflation factor)

 

However, the constant supplied with SCALE=<constant> is squared to form the heterogeneity (dispersion) factor applied to the covariance matrix, so we must take the square root:

 

sqrt(4.33)=2.08

 

From this, we can rescale the confidence intervals and p-value estimates directly using the SCALE option:

 

 

PROC LOGISTIC DATA=indata;
   MODEL Y = X1 X2 / SCALE=2.08;
RUN;
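
For larger data sets the constant need not be hard-coded; a sketch of computing it from the observation and individual counts (the column name Individual is from the sample data, and the macro variable name is illustrative):

PROC SQL NOPRINT;
   SELECT SQRT(COUNT(*) / COUNT(DISTINCT Individual))
      INTO :scale_const TRIMMED
      FROM indata;
QUIT;

PROC LOGISTIC DATA=indata;
   MODEL Y = X1 X2 / SCALE=&scale_const;
RUN;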

 

Despite not sharing the real data or approach, Steve is correct in identifying the missing random effect.  This simple scaling embeds assumptions about the relevance of start points for each individual's process.  Depending on how start points and end points are captured in the data-generating process, and on the underlying reasons, this can create substantial bias in the model estimates.  For a clearer understanding of this example and its applicability to event modeling generally, and more specifically why this was not a substantial concern for my real data and problem, see Allison (1982) and Shumway (2001, 2004).  Duration-capturing elements such as age or event life can be included in the model to verify that the start-point concerns are not driving the results.

 

 

Rick_SAS
SAS Super FREQ

Okay. Interesting. Do you also want to AGGREGATE over the individuals?

 

DLBarker
Fluorite | Level 6

I was trying to aggregate over individuals, since that is the effect I am trying to proxy via SCALE.  But the real data can include anywhere between 150 and 10,000 individuals.  Additionally, I was having trouble figuring out how to globally assign these groups, since they are not a predictor in the model.  I reviewed everything I could find on the AGGREGATE and SCALE options trying to figure this out, and gave up when I found that I could simplify by just embedding the assumptions via a direct scaling of the standard errors.  It remains unclear to me how substantial the risk is given other assumptions in the approach, but so far testing seems to be in line with expectations.

 

I should also note that moderate Type I error is not generally the end of the world for what is being done here, but it should be tested somewhat accurately to prevent potentially serious and dangerous misspecification.  I find that it is very easy to overstate the importance of p-values in this field, especially when there are micronumerosity concerns.  Nonetheless, the CIs and p-values should be close to accurate, or harmful decisions can be made.

SteveDenham
Jade | Level 19

@DLBarker, I love the term micronumerosity.  In my field it is generally referred to as pseudo-replication, and is associated with experimental-unit confusion.  Here, though, it appears you have both random and repeated effects, and one of the built-in benefits of the mixed-model approach is the correct assignment of degrees of freedom (provided that a correctly specified model exists and is applied).  The field that seems most concerned about using a mixed-model approach, in my experience, is econometrics, and the reason often given is the "bias" associated with minimized variance estimators.  Laplace or adaptive quadrature methods go a long way toward alleviating this, but that is only my opinion.  I would cite Stroup (2012), Bolker (2009), and Bates (2014) for approaches on how to minimize bias in a generalized linear mixed modeling schema.

 

Steve Denham

DLBarker
Fluorite | Level 6

Thank you @SteveDenham

I am going to be doing further research on this and the applications of the GLIMMIX procedure to these problems. I just ordered a copy of Stroup's book, which seems to deal exclusively with GLMM approaches in SAS. I look forward to the insight it may provide for future approaches...however, I am unsure how well I will be able to create a scorecard specification from it. It will be another year before I have to do a methodological review though, so I have time to research, play around and test.

DLBarker
Fluorite | Level 6

@SteveDenham  I do still have the fear that GLMM may overcorrect when there are too few observations per individual to carry much information.  Again, I will just have to test this concern in the future.  Testing and simulation are generally easier than simply pondering the impact.

SteveDenham
Jade | Level 19

@DLBarker, I agree on the problem of insufficient info--it makes the solutions unstable, if they converge at all.  Simpler models are then often used, and the true research question is not addressed.  The simulation approach is where I would go--and simulating correlated/clustered data that does not fit a multivariate normal is a daunting task in itself.  Check out Bolker's text on ecological models.  You'll have to translate from R to SAS in a lot of places, but the theoretical approach should help.
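
As a rough illustration of that kind of simulation, here is a sketch of generating clustered binary data from a subject-level random intercept (all parameter values below are purely illustrative):

DATA sim;
   CALL STREAMINIT(12345);
   DO Individual = 1 TO 200;
      b = RAND('NORMAL', 0, 1.5);          /* subject-level random intercept */
      DO TimePeriod = 1 TO 6;
         X1 = RAND('NORMAL', 0, 1);
         eta = -2 + 0.8*X1 + b;            /* linear predictor */
         p = 1 / (1 + EXP(-eta));          /* inverse logit */
         Y = RAND('BERNOULLI', p);
         OUTPUT;
      END;
   END;
   DROP b eta p;
RUN;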

 

Steve Denham

DLBarker
Fluorite | Level 6

To be fair, in this industry I am not sure I ever want the research question "addressed."  The beauty of working in small-data modeling is that the efforts are necessary AND unending.  My industry employs a ton of quants; I like knowing that they will retain job security in the decades to come!  Chasing information amidst dozens of unstable econometric models is like a game of whack-a-mole, and when the game eventually becomes boring, that is the time to leave it to the next generation and retire into senior management.  Generically stated (and this may offend more academic practitioners of my craft), my job is to cobble together a series of illustrative and useful lies in such a way that they pass the scrutiny of federal oversight, while being good enough to create a competitive advantage and provide useful estimates to my firm.  This field requires a form of relaxed cynicism that many people in econometric practice could benefit from.  There is no right answer, everything we do is wrong, but some answers are more useful than others!  The constant search for methodological improvements that enhance the usefulness of these outputs should not be measured against the black-and-white notion of right and wrong.  Even between competing approaches, the one that is "more wrong" on a particular concern may still produce "more useful" outputs simply because it is "more practical" in application.  I am an industry model theorist (a unicorn) and dedicate my time to balancing these two things while creating employable and interpretable methodologies that our model engineers can easily and efficiently apply.  Nonetheless, I am overjoyed to see your comments.  As a result of them, I have recently discovered a new vault of research (not yet well known in the field) that applies directly to a few of my concerns.  I have little doubt that the next methodology I create to mitigate these concerns will take a broader look at the potential of GLMM and hybrid-Bayesian approaches.  However, these approaches are difficult to deploy in our current environment due to end-user issues.  I came to this board with a simple question about scaling estimates of the variance-covariance matrix (VCVM) for certain tests on an existing methodology, and came out with a head start on research for the next generation of these models.  I can't thank you enough.  Keep on @SteveDenham

