I have several things I'm trying to do.
(1) Create a predictive model for a binary outcome based on all available variables (75 total). I have a lot of missing data that I was told not to impute. To my understanding, decision trees and random forests handle missing data well and should still be able to produce a decent prediction model. However, I am using SAS University Edition, which does not seem to support HPFOREST. Is there an alternative?
ERROR: Procedure HPFOREST not found.
(2) Build a logistic regression prediction model based on the subset of participants who have most of their variables filled in (>95%). The problem I run into is:
WARNING: There is a complete separation of data points in Step 2. The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood
iteration. Validity of the model fit is questionable.
I assume this is a (quasi-)separation issue. However, the FIRTH option does not work with the SELECTION= methods. Are there other ways to remedy this? Or would it be better to go through the purposeful selection steps individually?
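For context, a Firth-penalized fit without any selection method can be sketched as below. This is only an illustration: outcome, x1, x2 and x3 are placeholder names rather than variables from this dataset, and event='1' assumes the outcome is coded 0/1.

```sas
/* Sketch only: outcome, x1, x2 and x3 are placeholder names. */
proc logistic data=CLEANED.CompleteCases95;
    /* FIRTH requests penalized maximum likelihood; no SELECTION= is used. */
    /* CLPARM=PL asks for profile-likelihood confidence limits.            */
    model outcome(event='1') = x1 x2 x3 / firth clparm=pl;
run;
```

With this form, the candidate predictors have to be chosen by hand before each fit, which is what going through the purposeful selection steps individually would amount to.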
I read that reducing explanatory variables may help, which loops back to HPFOREST. I'd like to use a random forest to narrow down my variable candidates for the logistic model.
(3) Build a logistic regression prediction model using all participants (230) and only the variables that are at least 90% complete.
Data Information:
N = 230 total participants
n = 115 participants with at least 95% of variables filled
75 Total variables of interest
Subset data created from:
DATA CLEANED.CompleteCases95;
    set CLEANED.FilteredAnalytic;
    if cmiss(of _ALL_)/75 <= 0.05; *don't count visit_date or id;
RUN;
*Total rows: 115, Total columns: 77;
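Note that cmiss(of _ALL_) counts missing values across all 77 columns, including id and visit_date. If those two columns are never missing, the result is unchanged. A variant of the subsetting step above that excludes them explicitly could look like the sketch below, assuming the columns are literally named id and visit_date (taken from the comment). The output dataset name CompleteCases95_alt is hypothetical.

```sas
/* Sketch only: assumes the two bookkeeping columns are named id and visit_date. */
DATA CLEANED.CompleteCases95_alt;   /* hypothetical output name */
    set CLEANED.FilteredAnalytic;
    /* Subtract missing values in id and visit_date so that only */
    /* the 75 variables of interest contribute to the fraction.  */
    if (cmiss(of _ALL_) - cmiss(id, visit_date)) / 75 <= 0.05;
RUN;
```

With 75 variables, a 5% threshold keeps rows with at most 3 missing values, since 0.05 × 75 = 3.75.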
I am not set on using a random forest. Any technique that handles a large amount of missingness well will do. Thank you in advance!