How do you perform Ad Hoc Correction test statisti...

04-04-2016 02:33 PM

Simple Question:

Is there a way to alter the value of "n" used in the calculation of Standard Errors for the Logistic Procedure?

Details:

In order to produce an unbiased sample that represents the proper sampling rates of events and parameter values, sometimes the independent observations assumption must be violated. Additionally, certain hazard models allow individual-time-period observations across time, but the sample is independent only at the level of individuals (meaning that each individual's set of time processes is independent of the others, but the observations within it are not). Unless the standard errors are corrected for this downward bias, the test statistics are inflated and the risk of Type I error is greatly increased. Is there a way to alter how this is calculated directly, in order to avoid having to make manual calculations for the Wald chi-square tests and p-values?

For example, consider the following data. We wish to estimate the hazard rate on this data (ignoring many forms of potential bias for simplicity of the example). The raw test statistics will use 13 for "n" in the calculation, but I want them to use 3 instead, since there are only 3 independent individual processes:

| Individual | Time Period | Y | X1 | X2 |
|------------|-------------|---|----|----|
| 1          | 1           | 0 | 4  | 5  |
| 1          | 2           | 0 | 4  | 5  |
| 1          | 3           | 0 | 2  | 5  |
| 1          | 4           | 0 | 2  | 4  |
| 1          | 5           | 0 | 2  | 4  |
| 1          | 6           | 1 | 1  | 4  |
| 2          | 3           | 0 | 4  | 3  |
| 2          | 4           | 0 | 5  | 4  |
| 2          | 5           | 0 | 6  | 4  |
| 3          | 2           | 0 | 3  | 2  |
| 3          | 3           | 0 | 2  | 2  |
| 3          | 4           | 0 | 2  | 1  |
| 3          | 5           | 1 | 2  | 1  |
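
For anyone who wants to reproduce the example, the sample rows can be loaded with a short DATA step. This is only a sketch: the data set name indata matches the one used in the accepted solution, while the variable name TimePeriod is an assumed spelling of the "Time Period" column.

```sas
/* Sketch: load the 13 illustrative person-period rows into WORK.indata. */
/* TimePeriod is an assumed variable name for the "Time Period" column.  */
data indata;
   input Individual TimePeriod Y X1 X2;
   datalines;
1 1 0 4 5
1 2 0 4 5
1 3 0 2 5
1 4 0 2 4
1 5 0 2 4
1 6 1 1 4
2 3 0 4 3
2 4 0 5 4
2 5 0 6 4
3 2 0 3 2
3 3 0 2 2
3 4 0 2 1
3 5 1 2 1
;
run;

/* Count the rows and the distinct individuals (13 and 3 for this sample). */
proc sql;
   select count(*)                   as n_obs,
          count(distinct Individual) as n_individuals
   from indata;
quit;
```

The two counts are the "n" that the default test statistics use and the "n" the question would like them to use.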

SAS 9.4 EG 6.1 32-bit SAS 9.4 EG 64-bit

Accepted Solutions

04-05-2016 10:53 AM

The sample data was cobbled together purely to describe the problem. Specific information would be a violation of my company's intellectual property. However, I found a solution to my problem and will share it. It was far simpler than I imagined.

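If the inflation factor is known, as in the example above, it is the number of observations divided by the number of independent individuals:

13 observations / 3 individuals = 4.33 observations per independent individual

However, the constant supplied to SCALE=<constant> is squared to obtain the heterogeneity (dispersion) correction, so we must take the square root:

sqrt(4.33) = 2.08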

From this, we can rescale the confidence intervals and p-value estimates directly using the SCALE option:

/* SCALE=2.08 requests a dispersion of about 2.08**2 = 4.33 */
PROC LOGISTIC DATA=indata;
   MODEL Y = X1 X2 / SCALE=2.08;
RUN;
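
The arithmetic above can also be done in SAS so that the constant is not typed by hand. The following is a minimal sketch, not part of the original post: the names n_obs, n_ind, inflation, and the macro variable scale are illustrative, and event='1' is added so that the probability of Y=1 is modeled (PROC LOGISTIC otherwise models the lower ordered response value).

```sas
/* Sketch only: n_obs, n_ind, inflation and &scale are illustrative names. */
data _null_;
   n_obs     = 13;                /* person-period rows in the sample    */
   n_ind     = 3;                 /* independent individuals             */
   inflation = n_obs / n_ind;     /* about 4.33 rows per individual      */
   scale     = sqrt(inflation);   /* about 2.08, the value SCALE= takes  */
   put inflation= 8.2 scale= 8.2; /* write both values to the log        */
   call symputx('scale', put(scale, 8.2)); /* store 2.08 in &scale       */
run;

/* Same MODEL statement as in the post, with the computed constant.      */
proc logistic data=indata;
   model Y(event='1') = X1 X2 / scale=&scale;
run;
```

Because the AGGREGATE option comes up later in the thread in connection with SCALE, it is worth checking the log and output to confirm that the requested scale was actually applied.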

Even though I did not share the real data or approach, Steve is correct in identifying the missing random effect (RE). This simple scaling embeds assumptions about the relevance of the start-points of each individual's process. Depending on how start-points and end-points are captured in the data-generating process (DGP), and on the underlying reasons, this can create substantial bias in estimating the model. For a clearer understanding of this example and its applicability to various kinds of event modeling, and more specifically of why this was not a substantial concern for my real data and problem, consider (Paul Allison, 1982), (Tyler Shumway, 2001), (Tyler Shumway, 2004). Duration-capturing elements such as age or event-life can be included in the model to verify that the start-point concerns are not driving the model.

All Replies

04-04-2016 11:14 PM

I am not sure whether you are talking about conditional logistic regression. If so, check the STRATA statement.
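
A minimal sketch of what the STRATA suggestion could look like, assuming the sample is stored in a data set named indata with the subject identifier in a variable named Individual (event='1' is an added assumption so that Y=1 is the modeled event):

```sas
/* Sketch: conditional logistic regression, conditioning on each individual. */
proc logistic data=indata;
   strata Individual;               /* one stratum per individual */
   model Y(event='1') = X1 X2;
run;
```

This answers a somewhat different question than rescaling the standard errors: the model is fit within individuals, and a stratum in which Y never changes (such as individual 2 in the sample) contributes no information to the estimates.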

04-05-2016 08:49 AM - edited 04-05-2016 08:53 AM

My suggestion (as it almost always is, it seems) is to look at doing the logistic regression in PROC GLIMMIX, with subject as a RANDOM effect. This should correctly cluster the data, and result in the correct degrees of freedom for tests and confidence intervals. For a worked example, see Example 45.18 Weighted Multilevel Model for Survey Data in the PROC GLIMMIX documentation (SAS/STAT 14.1).

Steve Denham
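
As a rough illustration of the GLIMMIX suggestion, a random-intercept logistic model might be written as follows. This is a sketch under assumptions, not the specification from the documentation example: it assumes a data set named indata with a subject variable named Individual, and METHOD=LAPLACE and event='1' are choices made here for illustration.

```sas
/* Sketch: logistic regression with a random intercept for each individual. */
proc glimmix data=indata method=laplace;
   class Individual;
   model Y(event='1') = X1 X2 / dist=binary link=logit solution;
   random intercept / subject=Individual;   /* clusters rows by individual */
run;
```

With only three individuals the toy sample is far too small for a stable fit; the sketch is meant for data with many subjects.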

04-05-2016 09:07 AM

@SteveDenham I thought that you might suggest this. For those of us who are not experts in this area, could you briefly explain why you did not recommend GENMOD and the REPEATED statement (GEE approach)?
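
For readers comparing the two approaches, a GEE fit with GENMOD and the REPEATED statement could be sketched like this. The data set name indata, the subject variable Individual, and the exchangeable working correlation (TYPE=EXCH) are assumptions made for illustration.

```sas
/* Sketch: GEE logistic model; rows from one individual form a cluster. */
proc genmod data=indata descending;           /* DESCENDING models Y=1 */
   class Individual;
   model Y = X1 X2 / dist=binomial link=logit;
   repeated subject=Individual / type=exch;   /* working correlation   */
run;
```

The GEE parameter table reports empirical (robust) standard errors by default, which is how this approach accounts for within-subject correlation.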

04-05-2016 09:30 AM

@Rick_SAS, my concern here was not with the repeated nature of the data, but with the clustering by subject, which would constitute a random effect that isn't modeled in GENMOD or GEE. The sample data really looks like that in Ex. 45.18, without the sampling weights, which could be added in a subsequent analysis.

Steve Denham

04-05-2016 11:08 AM

Okay. Interesting. Do you also want to AGGREGATE over the individuals?

04-05-2016 11:46 AM - edited 04-05-2016 11:58 AM

I was trying to aggregate over individuals, as this is the effect I am trying to proxy via SCALE. But the real data can include anywhere between 150 and 10,000 individuals. Additionally, I was having trouble figuring out how to use AGGREGATE to globally assign these groups, since the individual is not a predictor in the model. I reviewed everything I could find on the AGGREGATE and SCALE options trying to figure this out, and gave up when I found that I could simplify by just embedding assumptions via a direct scaling of the standard errors. It remains unclear to me how substantial the risk is given other assumptions in the approach, but so far, testing seems to be in line with expectations.

I should also note that moderate Type I error is not generally the end of the world for what is being done here...but it should be tested somewhat accurately to prevent potentially serious and dangerous misspecification. I find that it is very easy to overstate the importance of p-values in this field, especially when there are micronumerosity concerns. Nonetheless, the CIs and p-values should be close to accurate, or harmful decisions can be made.
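
On the AGGREGATE question, one possibility is the variable-list form of the option, which defines the subpopulations from named variables. The sketch below assumes a data set named indata with a subject variable named Individual; as far as I know the listed variables do not have to appear as predictors in the model, but that should be confirmed against the PROC LOGISTIC documentation for your release.

```sas
/* Sketch: define subpopulations by individual and let SAS estimate the  */
/* dispersion from the Pearson chi-square rather than supplying a value. */
proc logistic data=indata;
   model Y(event='1') = X1 X2 / aggregate=(Individual) scale=pearson;
run;
```

This estimates the dispersion from the data instead of fixing it at observations per individual, so it is a related correction and not the same one as SCALE=2.08. The 13-row toy sample has too few subpopulations for a meaningful estimate; the approach only makes sense with many individuals.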

04-05-2016 12:21 PM - edited 04-05-2016 12:22 PM

@DLBarker, I love the term micronumerosity. In my field, it is generally referred to as pseudo-replication, and is associated with experimental unit confusion. Here though, it appears you have both random and repeated effects, and one of the built-in benefits of the mixed model approach is the correct assignment of degrees of freedom (provided that a correctly specified model exists and is applied). The field that seems most concerned about using a mixed model approach, in my experience, is econometrics, and the reason often given is the "bias" associated with minimized variance estimators. Laplace or adaptive quadrature methods go a long way toward alleviating this, but that is only my opinion. I guess I would cite Stroup (2012), Bolker (2009) and Bates (2014) for approaches on how to minimize bias in a generalized linear mixed modeling schema.

Steve Denham

04-05-2016 01:03 PM - edited 04-05-2016 01:04 PM

Thank you @SteveDenham

I am going to be doing further research on this and the applications of the GLIMMIX procedure to these problems. I just ordered a copy of Stroup's book, which seems to deal exclusively with GLMM approaches in SAS. I look forward to the insight it may provide for future approaches...however, I am unsure how well I will be able to create a scorecard specification from it. It will be another year before I have to do a methodological review though, so I have time to research, play around and test.

04-05-2016 01:14 PM

@SteveDenham I do still have the fear that GLMM may overcorrect when each individual contributes too few observations to carry much information. Again, I will just have to test this concern in the future. Testing and simulation are generally easier than simply pondering the impact.

04-07-2016 11:32 AM

@DLBarker, I agree on the problem of insufficient info--it makes the solutions unstable, if they converge at all. Simpler models are then often used, and the true research question is not addressed. The simulation approach is where I would go--and simulating correlated/clustered data that does not fit a multivariate normal is a daunting task in itself. Check out Bolker's text on ecological models. You'll have to translate from R to SAS in a lot of places, but the theoretical approach should help.

Steve Denham

04-07-2016 12:01 PM

To be fair, in this industry, I am not sure I ever want the research question "addressed." The beauty of working in small data modeling is that the efforts are necessary AND unending. My industry employs a ton of quants, and I like knowing that they will retain job security in the decades to come! Chasing information amidst dozens of unstable econometric models is like a game of whack-a-mole. And when, at some point, the game becomes boring, that is the time to leave it to the next generation and retire into senior management.

Generically stated (and this may offend more academic practitioners of my craft), my job is to cobble together a series of illustrative and useful lies in such a way that it passes the scrutiny of federal oversight, while being good enough to create a competitive advantage and provide useful estimates to my firm. This field requires a form of relaxed cynicism that many people in econometric practices could benefit from. There is no right answer, everything we do is wrong, but some answers are more useful than others! The constant search for methodological improvements that will enhance the usefulness of these outputs should not be measured against the black-and-white notion of right and wrong. Even among competing approaches, one that is "more wrong" may produce "more useful" outputs, simply because "more wrong" on a particular concern can still be "more practical" in application. I am an industry model theorist (a unicorn) and dedicate my time to trying to balance these two things while creating employable and interpretable methodologies that our model engineers can easily and efficiently apply.

Nonetheless, I am overjoyed to see your comments. As a result of them, I have recently discovered a new vault of research (which is not yet well known in the field) that applies directly to a few of my concerns. I have little doubt that the next methodology I create to mitigate these concerns will take a broader look at the potential of GLMM and hybrid-Bayesian approaches. However, these approaches are difficult to deploy in our current environment due to end-user issues. I came to this board with a simple question about scaling estimates for the variance-covariance matrix (VCVM) for certain tests on an existing methodology, and came out with a head start on research for the next generation of these models. I can't thank you enough. Keep on @SteveDenham