I am using SAS EG 8.3-U3. My question is more of a methodological question than a coding question - hopefully that is OK to post here? Apologies if not.
I am creating logistic regression models predicting outcomes of criminal arrest events, e.g., whether an arrestee hired a private attorney or not.
The code is straightforward:
proc logistic data=Multivariate descending;
   model Attorney = &INDIVIDUAL &ARREST &KEYIVS;
run;
My confusion concerns about 20 dummy variables I have created (as predictors) indicating the type(s) of crime that led to the arrest, e.g., drug crime, property crime, violent crime, etc. These dummies are not mutually exclusive: an arrestee could potentially be charged with every type of crime in my list, or with only a single type of crime.
My interpretation is that each dummy's odds ratio shows whether that crime type increased or decreased the odds of the arrestee hiring a private attorney. For example, the odds ratio for the drug crime dummy is interpreted as the odds of an arrestee with a drug crime charge hiring a private attorney compared to an arrestee whose charges did not include a drug crime.
However, examples of similar logistic regressions in the research literature that I’m familiar with do not use this strategy. Instead, they assign *one* type of crime as the reference category (e.g., property crime). Each dummy is then interpreted as the odds of an arrestee with a drug crime charge hiring a private attorney compared to an arrestee with a property crime charge. However, this strategy means that each arrest can only account for a single type of crime, even when there are multiple charges. One way that researchers accommodate this is by dropping all charges except for the most serious charge. I’d like to avoid this strategy if possible.
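To make the contrast concrete, here is a small sketch of the reference-category coding used in that literature (in Python rather than SAS, purely for illustration; the variable names and data are hypothetical). Each arrest is reduced to one "most serious charge," one level is omitted as the reference, and a row of all zeros denotes the reference category:

```python
import numpy as np

# Hypothetical data: one "most serious charge" per arrest.
charges = ["drug", "violent", "property", "drug", "property"]

# Reference-cell coding: "property" is omitted as the reference level,
# so each remaining level gets one dummy column.
levels = ["drug", "violent"]
X = np.array([[1 if c == lvl else 0 for lvl in levels] for c in charges])

# A row of all zeros is a property-crime arrest (the reference);
# each dummy's odds ratio compares that crime type to property crime.
print(X)
```

In SAS, the same coding can be produced directly with a CLASS statement (e.g., `class CrimeType (param=ref ref='Property');` in PROC LOGISTIC), rather than by building the dummies by hand.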
I assumed all was well with my proposed strategy: I ran my regressions and the results made intuitive sense. Still, I was a bit apprehensive because almost all crime types increased the odds of the predicted outcome, which made me suspect the results reflected the fact that some arrestees have only a single charge while others have multiple charges. I therefore added a control for the number of charges at arrest, which changed the results drastically: nearly all predictors that had previously increased the odds of the predicted outcome now reduced them, while the number-of-charges control itself greatly increased the odds. I do not understand why including this control would change the results so dramatically.
Is the strategy that I’m proposing even valid? If not, what options are available for modelling a predictor like “crime charged” with multiple, potentially overlapping categories? As described above, I understand that the strategy typically used is to reduce all arrests to a single charge, but I’d like to know if other options are available that do not involve so much data reduction. Your advice is appreciated. Thanks!
This sounds like multicollinearity. When you add another predictor variable (in this case, number of charges at time of arrest) that is correlated with the other predictor variables, the results can and sometimes do change dramatically.
By the way, isn't number of charges at time of arrest simply the sum of the dummy variables? If so, this is a form of multicollinearity where a linear combination of the x-variables is equal to (or nearly equal to) another linear combination of the x-variables.
There are several potential ways to mitigate this problem. In your case, I would simply remove the number of charges at the time of arrest, and then the results make sense. Other ways to mitigate this problem (not a complete list) are stepwise selection and logistic partial least squares, each of which has advantages and disadvantages.
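The linear dependence described above can be demonstrated with a small numerical sketch (in Python rather than SAS, purely for illustration; the data are hypothetical). If the charge count is exactly the row sum of the crime-type dummies, appending it to the design matrix adds a column but not rank:

```python
import numpy as np

# Toy design: 6 arrests, 3 crime-type dummies (hypothetical data).
dummies = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
])

# "Number of charges" here is exactly the row sum of the dummies.
n_charges = dummies.sum(axis=1, keepdims=True)

# Intercept + dummies: 4 columns, full column rank 4.
X_without = np.hstack([np.ones((6, 1)), dummies])
print(np.linalg.matrix_rank(X_without))  # 4

# Adding the charge count appends a column that is a linear
# combination of existing columns, so the rank stays at 4.
X_with = np.hstack([X_without, n_charges])
print(np.linalg.matrix_rank(X_with))  # still 4, not 5
```

With real data the dependence may only be approximate (e.g., if an arrestee can carry several charges of the same type), but near-collinearity produces the same symptom: individually unstable coefficients that can flip sign when the extra column is added.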
Do you have any thoughts as to why researchers (to my knowledge) do not take the approach that I am using?
No, this is not my field of endeavor and so I really can't comment on this.