- Alternative to HPFOREST in SAS University (data contains missing obser...

Posted 03-06-2021 03:30 PM

I have several things I'm trying to do.

**(1)** Create a predictive model based on all available variables (75 total) for a binary outcome. I have a lot of missing data that I was __told not to impute__. As I understand it, decision trees and random forests handle missing data well and should still produce a decent prediction model. However, I am using SAS University, which does not seem to support HPFOREST. Is there an alternative?

`ERROR: Procedure HPFOREST not found.`
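One option that may be worth checking is a single decision tree with PROC HPSPLIT, which ships with SAS/STAT rather than Enterprise Miner. The sketch below assumes that SAS/STAT is included in the installation, that the outcome is coded 0/1, and that the variable names (`outcome`, `cat_var1`, `cat_var2`, `num_var1`-`num_var3`) are hypothetical placeholders for the real columns. `ASSIGNMISSING=BRANCH` asks the procedure to send missing values down their own branch instead of requiring imputation.

```sas
/* Sketch only: all variable names below are hypothetical placeholders. */
proc hpsplit data=CLEANED.FilteredAnalytic assignmissing=branch;
   /* Declare the binary target and any categorical inputs as CLASS. */
   class outcome cat_var1 cat_var2;
   /* Assumes the event of interest is coded as 1. */
   model outcome(event='1') = cat_var1 cat_var2 num_var1 num_var2 num_var3;
   grow entropy;            /* split criterion for a classification tree */
   prune costcomplexity;    /* prune back to limit overfitting */
run;
```

The procedure's variable importance output could then serve as a rough screen for candidate predictors, although a single tree on 230 rows is likely to be unstable, so the ranking should be read with caution.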

**(2)** Build a logistic regression prediction model based on the subset of participants who have most of their variables filled (>95%). The problem I run into is:

```
WARNING: There is a complete separation of data points in Step 2. The maximum likelihood estimate does not exist.
WARNING: The LOGISTIC procedure continues in spite of the above warning. Results shown are based on the last maximum likelihood
iteration. Validity of the model fit is questionable.
```

I assume this is a quasi-separation issue. However, the FIRTH option does not work with the selection procedures. Are there other ways to remedy this? Or would it be better to go through the purposeful selection steps individually?

I read that reducing explanatory variables may help, which loops back to HPFOREST. I'd like to use a random forest to narrow down my variable candidates for the logistic model.
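A minimal sketch of one way to approach the separation warning: first cross-tabulate the categorical predictors against the outcome to look for cells with zero counts, then fit a Firth-penalized model on a short, manually chosen list of predictors with no `SELECTION=` option. The variable names (`outcome`, `cat_var1`, `cat_var2`, `num_var1`) are hypothetical, and the outcome is assumed to be coded 0/1.

```sas
/* Sketch only: variable names are hypothetical placeholders. */

/* Step 1: look for predictor levels where only one outcome value occurs. */
proc freq data=CLEANED.CompleteCases95;
   tables (cat_var1 cat_var2)*outcome / missing norow nocol nopercent;
run;

/* Step 2: Firth penalized likelihood on a fixed, reduced predictor set. */
proc logistic data=CLEANED.CompleteCases95;
   class cat_var1 cat_var2 / param=ref;
   model outcome(event='1') = cat_var1 cat_var2 num_var1
         / firth clparm=pl;   /* profile-likelihood confidence limits */
run;
```

With roughly 115 complete cases, keeping the predictor list short matters as much as the penalty, since a model with many parameters relative to the number of events is prone to separation regardless of the estimation method.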

**(3)** Build a logistic regression prediction model with all participants (230) and variables with at least 90% of information.
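To decide which variables clear a 90% completeness threshold, the per-variable missing counts are needed first. The sketch below uses a common PROC FORMAT and PROC FREQ idiom to tabulate missing versus non-missing values for every column. The format name `missfmt` is an arbitrary choice, and the numeric format only catches the ordinary missing value (`.`), not special missing values such as `.A`.

```sas
/* Sketch only: collapse every value to 'Missing' or 'Not Missing'. */
proc format;
   value $missfmt ' ' = 'Missing' other = 'Not Missing';
   value  missfmt  .  = 'Missing' other = 'Not Missing';
run;

/* One small table per variable showing how many rows are missing. */
proc freq data=CLEANED.FilteredAnalytic;
   format _character_ $missfmt. _numeric_ missfmt.;
   tables _all_ / missing nocum;
run;
```

With N = 230, a variable meets the 90% threshold when its 'Missing' count is 23 or fewer.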

Data Information:

N = 230 total participants

n = 115 participants with at least 95% of variables filled

75 Total variables of interest

Subset data created from:

    DATA CLEANED.CompleteCases95;
        set CLEANED.FilteredAnalytic;
        if cmiss(of _ALL_)/75 <= 0.05;  *don't count visit_date or id;
    RUN;
    *Total rows: 115, Total columns: 77;

I am not set on using random forest. Any technique that handles a large amount of missingness well will do. Thank you in advance!

1 ACCEPTED SOLUTION

Please do variable selection and optimal binning of the interval inputs, then try Gradient Boosting. Finally, compare the performance with the Decision Tree model.
