amarikow57
Obsidian | Level 7

I have several things I'm trying to do.

(1) Create a predictive model for a binary outcome based on all available variables (75 total). I have a lot of missing data that I was told not to impute. To my understanding, decision trees and random forests handle missing data well and should still produce a decent prediction model. However, I am using SAS University Edition, which does not seem to support HPFOREST. Is there an alternative?

 ERROR: Procedure HPFOREST not found.

(2) Build a logistic regression prediction model based on the subset of participants who have at least 95% of the variables filled in. The problem I run into is:

 WARNING: There is a complete separation of data points in Step 2. The maximum likelihood estimate does not exist.
 WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood 
          iteration. Validity of the model fit is questionable.

I assume this is a separation (or quasi-separation) issue. However, the FIRTH option does not work with the selection methods. Are there other ways to remedy this? Or would it be better to go through the purposeful selection steps individually?

I read that reducing explanatory variables may help, which loops back to HPFOREST. I'd like to use a random forest to narrow down my variable candidates for the logistic model.

 

(3) Build a logistic regression prediction model with all participants (230), using only the variables that are at least 90% complete.

 

Data Information: 

N = 230 total participants

n = 115 participants with at least 95% of variables filled

75 Total variables of interest

 

Subset data created from:

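DATA CLEANED.CompleteCases95;
 set CLEANED.FilteredAnalytic;
 /* Keep rows with at most 5% of the 75 analysis variables missing. */
 /* id and visit_date are part of _ALL_ but are left out of the denominator. */
 if cmiss(of _ALL_)/75 <= 0.05;
RUN; *Total rows: 115, Total columns: 77;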
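
For goal (3), one way to see which variables are at least 90% complete is to tabulate missing counts per variable. This is a minimal sketch, assuming the analysis variables are numeric; PROC MEANS skips character variables, so those would need a separate check.

```sas
/* Sketch: non-missing (N) and missing (NMISS) counts for every numeric
   variable. With 230 rows, NMISS <= 23 means a variable is at least
   90% complete. */
proc means data=CLEANED.FilteredAnalytic n nmiss;
   var _numeric_;
run;
```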

 

I am not set on using a random forest. Any technique that handles a large amount of missing data well will do. Thank you in advance!

Accepted Solution
gcjfernandez
SAS Employee
Please do variable selection and optimal binning of the interval inputs, then try gradient boosting. Finally, compare its performance with a decision tree model.
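
For the decision tree part, here is a minimal sketch using PROC HPSPLIT, assuming it is available in your SAS/STAT installation (it is a different procedure from HPFOREST). The names `outcome`, `sex`, `age`, and `bmi` are placeholders for your binary target and inputs, and the event level '1' is an assumption about how the outcome is coded.

```sas
/* Sketch only: outcome, sex, age and bmi are hypothetical variable names. */
proc hpsplit data=CLEANED.FilteredAnalytic seed=12345
             assignmissing=similar;   /* controls how rows with missing inputs are assigned to a branch */
   class outcome sex;                 /* target and categorical inputs */
   model outcome(event='1') = age bmi sex;
   grow entropy;                      /* splitting criterion */
   prune costcomplexity;              /* cost-complexity pruning */
run;
```

The variable importance table in the output should help shortlist candidates for the logistic model. With only 230 participants, treat that ranking as a rough guide rather than a final selection.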
