Solved: How to analyze aggregated data with the number of outcome events and p...

dwhitney · Posted 01-02-2026 03:11 PM

I am interested in assessing the association between Healthcare Provider Shortage Area (HPSA; a categorical variable with 4 levels- low, medium, high, unknown) and incident fracture (binary outcome) with aggregated data. As described below, this dataset does NOT contain individual-level data, so I am not sure which effect estimate (e.g., HR, OR, IRR) or model is appropriate.

I am working with CMS data in a highly secure environment. This environment prohibits bringing any data or variables into the environment, even the HSPA variable. However, we are allowed to export data from the CMS computing environment, as long as it is aggregated data.

From ~15,000 unique individuals, we calculated the total number of fracture events and person-years per county, where mortality, end of follow-up, or lost to follow-up were used as a right censors to calculate individual-level person years. We then exported the data as county-level aggregated data from ~75 counties. The dataset is 89 rows because some counties changed HPSA status during the follow-up period, so each row is per county per HPSA status.

The exposure data is HPSA with 4 levels: low, medium, high, unknown. The outcome data is (1) total number of fracture events per county (each person is allowed up to 1 event, so not a count model) and (2) person-years per county. We did calculate the mean age and female proportion per county in the hopes of using these as covariates in a model.

From this, I believe we can calculate the crude IR and IRR across the 4 HPSA designations, but let me know if that is not right.

Where we are really stuck is what type of model to use to account for age and sex. We don't have individual-level data. We only have the total number of fracture events and total follow-up time, so we can't calculate a time-updated or instantaneous rate, and thus a standard Cox model does not seem to apply.

The other important caveat is that some of the HPSA designations changed during the follow-up period. We have already binned these time changes in anticipation for a time-varying Cox model. So, if a county changed HPSA status, we summed all the fracture events and person-time until the date of the change so that information can go towards the first HPSA status, then summed all fracture events and person-time from the change so that information goes towards the next HPSA status.

Is there a statistical approach we can use to compare the rates between HPSA categories using this aggregated data? Is there something additional needed to account for the variance of aggregated data? E.g., we have 89 rows of data, but in total these 89 rows represents ~15,000 people. I just want to make sure the variance estimation and confidence intervals are not unnecessarily wide.

quickbluefish · Posted 01-02-2026 07:37 PM

Out of curiosity, why can't you run models on the individual level data while still within the secure environment, then export the results? About your questions -- it would definitely be a bit easier for people to answer if you provided some sample data. Your description is very thorough; it's just that it does take a bit of effort for us to reconstruct. Anyway, one option might be to use the PROC STDRATE procedure, which will allow you to calculate (in your case) county-level incidence rates adjusted for age and sex from aggregated data (you'd have to decide between 'direct' and 'indirect' adjustment). I don't see how you'd do something like a Cox model (whether with time varying exposure or not) without individual person time (i.e., not already aggregated). As for your question about how to handle variance given that some counties appear more than once (with unique HPSA levels), I'm really not sure - perhaps others can answer.

View solution in original post

quickbluefish · Posted 01-02-2026 07:37 PM

Out of curiosity, why can't you run models on the individual level data while still within the secure environment, then export the results? About your questions -- it would definitely be a bit easier for people to answer if you provided some sample data. Your description is very thorough; it's just that it does take a bit of effort for us to reconstruct. Anyway, one option might be to use the PROC STDRATE procedure, which will allow you to calculate (in your case) county-level incidence rates adjusted for age and sex from aggregated data (you'd have to decide between 'direct' and 'indirect' adjustment). I don't see how you'd do something like a Cox model (whether with time varying exposure or not) without individual person time (i.e., not already aggregated). As for your question about how to handle variance given that some counties appear more than once (with unique HPSA levels), I'm really not sure - perhaps others can answer.

dwhitney · Posted 01-04-2026 04:42 PM

Thank you, kindly, for the response! I was not familiar with PROC STDRATE, which looks very relevant to some of the work I do; thank you!

Re: "why can't you run models on the individual level data while still within the secure environment, then export the results?" HPSA is the primary exposure of interest, which is not part of the dataset, so we wouldn't know how to group the data inside the enclave. To answer your potential follow-up question; HPSA is literally "written" into the dataset by typing the values based on the county code. Super simple, but the administrators consider that taking "external information" and putting it into the enclave and won't allow it...

Thanks for the heads up about providing sample data. I will do that next time.

After reviewing the PROC STDRATE, it occurred to me that I was wrong about this not being a count model. At the individual-level, it is a binary outcome (yes/no to a fracture), but after aggregating the data, we are actually left with the count of fractures in that county and person-time. We did some crude analyses and tested an unadjusted Poisson model which is giving us the exact same IRR estimate.

Thank you again!

How to analyze aggregated data with the number of outcome events and person time?

Re: How to analyze aggregated data with the number of outcome events and person time?

Re: How to analyze aggregated data with the number of outcome events and person time?

Re: How to analyze aggregated data with the number of outcome events and person time?

Catch up on SAS Innovate 2026