YG1992
Obsidian | Level 7

Hi everyone,

In my project I have to build different models on large datasets, some of which have more than 3 million observations and hundreds of input variables. For logistic regression (LR) and decision trees (DT) the corresponding nodes work fine, but for some machine learning methods such as SVM, Random Forest, Gradient Boosting, and k-Nearest-Neighbors, the nodes sometimes fail to finish running and return error messages.

If I draw a small subsample and apply those methods with exactly the same hyper-parameter settings, everything works fine. That is why I conclude that the errors are related to sample size.

So my question is whether there is a way to use all of the training data (e.g. 3 million x 0.7 = 2.1 million training observations) to build SVM, RF, GBDT, kNN, and similar models. I think the "Group" nodes may help with something like "batching" the data, but I am not sure, and it is not clear to me how that would work in practice.

If you have any suggestions, I would be glad to discuss them and would really appreciate the help.

Thanks very much!

1 REPLY
MikeStockstill
SAS Employee

Hello YG1992 -

A first step is to examine the text of the error message for more specific information about the problem. Then search for that text on this page:

http://support.sas.com/notes/

For example, if the errors are out-of-memory errors, try a note such as this one:

61376 - Overcoming "insufficient memory ." and "parameter larger than documented limit" error messag...
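To judge whether memory is a plausible cause before digging through the notes, a back-of-the-envelope estimate can help. The sketch below is only an illustration in Python, not a SAS tool: `dense_size_gib` is a hypothetical helper, the 300 input variables and the 100,000-row subsample are assumed values (the post only says "hundreds" of inputs and "a small subsample"), and it assumes every value is held in memory as an 8-byte number. The real footprint of a given node depends on the algorithm and may be several times larger.

```python
def dense_size_gib(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> float:
    """Approximate size in GiB of a dense table of fixed-width numeric values."""
    return n_rows * n_cols * bytes_per_value / 2**30


n_train = 2_100_000   # 3 million x 0.7, as stated in the question
n_inputs = 300        # assumption: the post only says "hundreds of input variables"
n_sample = 100_000    # assumption: size of the small subsample that runs fine

full = dense_size_gib(n_train, n_inputs)
sample = dense_size_gib(n_sample, n_inputs)

print(f"full training set: {full:.2f} GiB")
print(f"subsample:         {sample:.2f} GiB")
print(f"ratio:             {full / sample:.0f}x")
```

If the estimate for the full training set is close to or above the memory available to the SAS session, an out-of-memory failure on the full data but not on the subsample would be consistent with what you describe.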

 

 

If none of that information leads you to a resolution, then turn on the MPRINT option, create a model package, and contact technical support for assistance.

Have a great day.
