☑ This topic is solved.
samp945
Obsidian | Level 7

I am using SAS EG 8.3-U3. My question is more of a methodological question than a coding question - hopefully that is OK to post here? Apologies if not.

 

I am creating logistic regression models predicting outcomes of criminal arrest events, e.g., whether an arrestee hired a private attorney or not.

 

The code is straightforward:

 

/* &INDIVIDUAL, &ARREST, and &KEYIVS are macro variables holding the lists of predictors */
proc logistic data = Multivariate descending;   /* DESCENDING models the probability of the higher value of Attorney */
   model Attorney = &INDIVIDUAL &ARREST &KEYIVS;
run;

 

My confusion concerns roughly 20 dummy variables I have created (as predictors), one for each mutually exclusive type of crime, indicating the type(s) of crime that led to the arrest, e.g., drug crime, property crime, violent crime, etc. An arrestee could potentially be charged with every type of crime in my list of dummies, or they could be charged with only a single type of crime.
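For reference, dummies like these can be built from a charge-level table with something along the lines of the sketch below (the dataset and variable names Charges, ArrestID, and CrimeType are simplified stand-ins, not my actual names):

/* Charges: one row per charge; CrimeFlags: one row per arrest with 0/1 crime-type flags */
proc sql;
   create table CrimeFlags as
   select ArrestID,
          max(CrimeType = 'Drug')     as DrugCrime,
          max(CrimeType = 'Property') as PropertyCrime,
          max(CrimeType = 'Violent')  as ViolentCrime
   from Charges
   group by ArrestID;
quit;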

 

I interpret each dummy (crime type) as showing whether that type of charge increased or decreased the odds of the arrestee hiring a private attorney. In other words, the odds ratio for the drug crime dummy compares the odds of hiring a private attorney for an arrestee with a drug crime charge against the odds for an arrestee whose charges did not include a drug crime.

 

However, examples of similar logistic regressions in the research literature that I’m familiar with do not use this strategy. Instead, they assign *one* type of crime as the reference category (e.g., property crime). Each dummy is then interpreted as the odds of an arrestee with a drug crime charge hiring a private attorney compared to an arrestee with a property crime charge. However, this strategy means that each arrest can only account for a single type of crime, even when there are multiple charges. One way that researchers accommodate this is by dropping all charges except for the most serious charge. I’d like to avoid this strategy if possible.
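In code, that reference-category approach would look something like the sketch below, where CrimeType is a hypothetical single variable holding the one retained charge per arrest (taking the place of my separate crime dummies) and property crime is the reference level:

proc logistic data = Multivariate descending;
   class CrimeType (param=ref ref='Property');
   model Attorney = CrimeType &INDIVIDUAL &ARREST;
run;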

 

I assumed all was well with my proposed strategy: the regressions ran and the results made intuitive sense. I was a bit apprehensive, though, because almost all crime types increased the odds of the predicted outcome, which made me wonder whether the results were driven by the fact that some arrestees have only a single charge while others have multiple charges. I therefore added a control for the number of charges at arrest, and this changed the results drastically: nearly all predictors that previously increased the odds of the predicted outcome now reduced them, while the number-of-charges control itself greatly increased the odds. I do not understand why including this control would change the results so dramatically.
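For concreteness, the augmented model was along these lines (NumCharges is a stand-in name for the count-of-charges control):

proc logistic data = Multivariate descending;
   model Attorney = &INDIVIDUAL &ARREST &KEYIVS NumCharges;
run;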

 

Is the strategy that I'm proposing even valid? If not, what options are available for modelling a predictor like "crime charged" that has many mutually exclusive categories but where a single arrest can involve several of them? As described above, I understand that the typical strategy is to reduce each arrest to a single charge, but I'd like to know whether other options are available that do not involve so much data reduction. Your advice is appreciated. Thanks!

3 REPLIES
PaigeMiller
Diamond | Level 26

This sounds like multicollinearity. When you add another predictor variable (in this case, number of charges at time of arrest) that is correlated with the other predictor variables, the results can and sometimes do change dramatically.

 

By the way, isn't number of charges at time of arrest simply the sum of the dummy variables? If so, this is a form of multicollinearity where a linear combination of the x-variables is equal to (or nearly equal to) another linear combination of the x-variables.
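If you want to confirm that, one option (since PROC LOGISTIC does not report variance inflation factors) is to run the same predictors through PROC REG purely for its collinearity diagnostics; the linear model is just a vehicle for the VIF and condition-index output, and NumCharges here is a stand-in name for your count-of-charges variable:

/* If NumCharges is (nearly) the sum of the dummies, expect very large VIFs and a near-zero eigenvalue */
proc reg data = Multivariate;
   model Attorney = &INDIVIDUAL &ARREST &KEYIVS NumCharges / vif collin;
run;
quit;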

 

There are several potential ways to mitigate this problem. In your case, I would simply remove the number of charges at the time of arrest, and then the results make sense. Other ways to mitigate this problem (not a complete list) are stepwise selection and logistic partial least squares, each of which has advantages and disadvantages.
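For example, stepwise selection in PROC LOGISTIC would be along these lines (the entry and stay significance levels are placeholders to adjust, not recommendations); logistic partial least squares is not a single PROC and needs a more involved setup:

proc logistic data = Multivariate descending;
   model Attorney = &INDIVIDUAL &ARREST &KEYIVS NumCharges
         / selection=stepwise slentry=0.05 slstay=0.05;
run;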

--
Paige Miller
samp945
Obsidian | Level 7
Paige, thank you for your helpful reply!

I agree that the issue seems to be that the number of charges and the arrest-charge dummies are obviously highly correlated; the number of charges increases as the variety of arrest charges increases.

If my strategy is indeed valid, then simply dropping the control for number of charges does not pose a big problem. This control should have explanatory power, however (i.e., arrestees with more charges are likely to have different outcomes), so maybe one of the other approaches you have suggested (stepwise selection or partial least squares) would allow me to include all arrest charges in the model as well as a control for the number of charges. I'm not familiar with either of these approaches, so I'll have to read up on them.

Do you have any thoughts as to why researchers (to my knowledge) do not take the approach that I am using, instead preferring to drop all but one arrest charge per arrestee and then use a reference category for comparison? I think this may be done to avoid the "dummy variable trap". However, when cases can have *multiple* responses to the dummies (as in my case), I think the "dummy variable trap" is not an issue, as illustrated below. So I can't understand why other researchers drop data rather than take the approach that I am suggesting.
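To spell out my reasoning: when the dummies are mutually exclusive and exhaustive, every case has exactly one flag set, so the dummies sum to 1 and are perfectly collinear with the intercept, and one category must be dropped as the reference. When an arrest can carry several charges, the dummies do not sum to a constant, so that exact collinearity does not arise. A toy illustration with made-up flags:

data Toy;
   input Drug Property Violent;
   SumFlags = sum(Drug, Property, Violent);   /* constant (=1) only when the flags are exclusive and exhaustive */
   datalines;
1 0 0
0 1 0
1 1 0
1 1 1
;
run;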

Thanks again for your help!
PaigeMiller
Diamond | Level 26

Do you have any thoughts as to why it seems that researchers (to my knowledge) do not take the approach that I am using

 

No, this is not my field of endeavor and so I really can't comment on this.

--
Paige Miller

