jsnake000777
Calcite | Level 5

I have a dataset with a row for each event and some classification data, but no data on subjects from the population without an event. Do I need to write a new row to the data set with an event column of 0 in order to make contingency tables, do chi-square significance testing, ANOVA, etc.? Or is there a way to simply provide a total population number for each classification so that these procedures can reference the number of events against the total population? Writing rows seems like the easy solution, but having 700k+ rows in my dataset that simply say "no event" seems unnecessary.

5 REPLIES
PaigeMiller
Diamond | Level 26

@jsnake000777 wrote:

I have a dataset with a row for each event and some classification data, but no data on subjects from the population without an event. Do I need to write a new row to the data set with an event column of 0 in order to make contingency tables, do chi-square significance testing, ANOVA, etc.?


Yes. Leaving out the zeros would not be an acceptable analysis (and I point out that leaving out zeros from an analysis was one of the mistakes that led to the explosion of the Space Shuttle Challenger).

 

Or is there a way to simply provide a total population number for each classification and these procedures can reference the number of events against the total population?

 

Yes. It depends on what you are trying to do, which hasn't really been explained.

--
Paige Miller
jsnake000777
Calcite | Level 5

I'm trying to explore how the event rate differs among strata. I have 3 categorical variables (event type with 4 levels, process with 2 levels, and treatment with 2 levels). I want to see whether the event rate is significantly different among the levels of process, treatment, and event type. I do have total population information by event type, process, and treatment, e.g., 125,764 products with event type 1, process 2, and treatment 1, and so on for each group.

 

I was hoping not to grow my data set from 6k rows to 700k by writing rows for non-event cases, but it sounds like that is necessary.

 

I was planning to use contingency tables, chi-square tests, and ANOVA to explore these relationships, but if you have a suggestion for another way to identify whether process and treatment affect event rates, I'd appreciate it. Thanks for the help.

 

 

PaigeMiller
Diamond | Level 26

Since you have only categorical variables, I don't see how ANOVA fits here.

 

You have to have non-events to do contingency tables and Chi-squared tests properly.
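As an illustration of how the non-events can be supplied without writing one row per subject: PROC FREQ accepts cell counts through its WEIGHT statement, so one row per combination of classification and outcome is enough. The sketch below assumes you already have the event count and the population total for each process-by-treatment group; the data set name `cells`, the variable names, and all of the numbers are hypothetical placeholders, not the poster's data.

```sas
/* Hypothetical input: one line per process x treatment group with the   */
/* number of events and the total population (placeholder values).       */
data cells;
   input process treatment events total;
   /* Split each group into an event row and a non-event row. */
   event = 1; count = events;         output;
   event = 0; count = total - events; output;
   keep process treatment event count;
   datalines;
1 1 1500 171500
1 2 1400 181400
2 1 1600 176600
2 2 1500 170000
;
run;

proc freq data=cells;
   weight count;   /* each row stands for COUNT subjects */
   tables process*event treatment*event / chisq;
run;
```

With this layout the non-events are still in the analysis, but as 8 weighted rows rather than roughly 700k individual ones.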

--
Paige Miller
PGStats
Opal | Level 21

As @PaigeMiller said, you need the zeros for most analyses. But would you have the proper classification data corresponding to those non-events?

PG
StatDave
SAS Super FREQ

Yes, if your data only records the observed events, I believe you will need to add the nonevents to the data set. And since it sounds like each of your subjects is measured on all four event types, the correlation among these values within subjects should be accounted for. This could be done in several ways, but one typical way is with a Generalized Estimating Equations model. See the GEE example in the Getting Started section of the PROC GENMOD documentation which shows the data structure and analysis code. You will need a binary variable indicating an event or nonevent for each of your event types. Then your model can include event type (and process and treatment if desired) as predictors in the binomial (logistic) model.
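For comparison, a simpler binomial model can be fit directly from aggregated counts using the events/trials form of the MODEL statement in PROC GENMOD. This is only a sketch under stated assumptions: the data set `rates`, its variable names, and the counts are hypothetical, and because there is no REPEATED statement it does not account for the within-subject correlation described above. The GEE approach needs subject-level rows with a subject identifier.

```sas
/* Hypothetical aggregated data: events and population total per group   */
/* (placeholder values, not taken from the thread).                      */
data rates;
   input process treatment events total;
   datalines;
1 1 1500 171500
1 2 1400 181400
2 1 1600 176600
2 2 1500 170000
;
run;

proc genmod data=rates;
   class process treatment;
   /* events/trials syntax: EVENTS successes out of TOTAL subjects */
   model events/total = process treatment / dist=binomial link=logit type3;
run;
```

The TYPE3 option requests a test for each predictor, which is one way to ask whether process or treatment is associated with the event rate.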

