amarikow57
Obsidian | Level 7

I have several things I'm trying to do.

(1) Create a predictive model for a binary outcome based on all available variables (75 total). I have a lot of missing data that I was told not to impute. As I understand it, decision trees and random forests handle missing data well and can still produce a decent prediction model. However, I am using SAS University Edition, which does not seem to support HPFOREST. Is there an alternative?

 ERROR: Procedure HPFOREST not found.

(2) Build a logistic regression prediction model on the subset of participants with the most complete data (at least 95% of variables non-missing). The problem I run into is:

 WARNING: There is a complete separation of data points in Step 2. The maximum likelihood estimate does not exist.
 WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood 
          iteration. Validity of the model fit is questionable.

I assume this is a separation issue (the log reports complete separation). However, the FIRTH option does not work with the selection methods. Are there other ways to remedy this? Or would it be better to go through the purposeful selection steps individually?
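A minimal sketch of one way to sidestep that limitation, assuming the candidate set has already been narrowed by some other means so that no SELECTION= option is needed. The variable names (outcome, age, sex, biomarker1) are hypothetical placeholders, and event='1' assumes the outcome is coded 0/1:

```sas
/* Firth penalized-likelihood fit on a fixed, pre-chosen set of predictors. */
/* outcome, age, sex and biomarker1 are hypothetical names, not from the data. */
proc logistic data=CLEANED.CompleteCases95;
  class sex / param=ref;
  model outcome(event='1') = age sex biomarker1
        / firth        /* penalized likelihood; estimates stay finite under separation */
          clparm=pl;   /* profile-likelihood confidence limits for the parameters */
run;
```

With only 115 complete rows, keeping the predictor list short also reduces the chance that some combination of variables separates the outcome perfectly.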

I read that reducing explanatory variables may help, which loops back to HPFOREST. I'd like to use a random forest to narrow down my variable candidates for the logistic model.

(3) Build a logistic regression prediction model with all participants (230), using only variables that are at least 90% complete.
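To pick out the variables that are at least 90% complete, the per-variable missing rate has to be tabulated first. A rough sketch using a common PROC FORMAT / PROC FREQ idiom follows; the format names missfmt and $missfmt are made up for illustration, and it assumes the data set contains both character and numeric variables:

```sas
/* Collapse every value to Missing/Present so PROC FREQ reports one      */
/* two-level table per variable. The format names are illustrative only. */
proc format;
  value $missfmt ' ' = 'Missing' other = 'Present';
  value  missfmt  .  = 'Missing' other = 'Present';
run;

proc freq data=CLEANED.FilteredAnalytic;
  format _character_ $missfmt. _numeric_ missfmt.;
  tables _all_ / missing nocum;   /* MISSING counts missing values as a level */
run;
```

A variable would qualify for model (3) when its 'Missing' percentage is 10% or less, which is at most 23 of the 230 participants.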

Data Information: 

N = 230 total participants

n = 115 participants with at least 95% of variables filled

75 Total variables of interest

Subset data created from:

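DATA CLEANED.CompleteCases95;
 set CLEANED.FilteredAnalytic;
 * Keep rows missing at most 5% of the 75 analysis variables;
 * _ALL_ also covers visit_date and id, assumed never missing, so the denominator stays 75;
 if cmiss(of _ALL_)/75 <= 0.05;
RUN; *Total rows: 115, Total columns: 77;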

I am not set on using a random forest. Any technique that handles a large amount of missingness well will do. Thank you in advance!

1 ACCEPTED SOLUTION

gcjfernandez
SAS Employee
Please do variable selection and optimal binning of the interval inputs, then try gradient boosting. Finally, compare its performance with that of a decision tree model.
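For the decision tree comparison, a minimal sketch using PROC HPSPLIT is shown here, assuming that procedure is included in the SAS/STAT release you have (this is worth checking, since HPFOREST is not). The variable names are hypothetical, and event='1' assumes a 0/1 outcome:

```sas
/* Single classification tree fitted to all 230 participants, with no imputation. */
/* outcome, age, sex, biomarker1 and biomarker2 are hypothetical names.           */
proc hpsplit data=CLEANED.FilteredAnalytic
             seed=2024
             assignmissing=similar;  /* send missing values to the most similar branch */
  class outcome sex;                 /* the target and nominal inputs go in CLASS */
  model outcome(event='1') = age sex biomarker1 biomarker2;
  grow entropy;                      /* splitting criterion */
  prune costcomplexity;              /* cost-complexity pruning of the grown tree */
run;
```

The other ASSIGNMISSING= choices are BRANCH and POPULAR. The variable importance results that the procedure reports could also serve as a rough screen for candidates to carry into the logistic model.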

