Hi everyone,
In my project I have to build models on large datasets, some with more than 3 million observations and hundreds of input variables. For logistic regression (LR) and decision trees (DT) the corresponding nodes work fine, but for machine learning methods such as SVM, Random Forest, Gradient Boosting, and k-Nearest Neighbors, the nodes sometimes fail to finish and stop with error messages.
If I draw a small subsample and apply those methods with exactly the same hyperparameter settings, everything runs fine. That is why I conclude the errors are related to sample size.
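For reference, drawing that kind of subsample can be done with PROC SURVEYSELECT. A minimal sketch, assuming a source table WORK.BIG (the table name, sampling rate, and seed are placeholders):

    proc surveyselect data=work.big out=work.big_sample
                      method=srs    /* simple random sampling */
                      samprate=0.10 /* keep 10% of the rows   */
                      seed=12345;   /* fixed seed for repeatable draws */
    run;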
So I wonder whether there is a way to use all of the training data (e.g., 3 million x 0.7 = 2.1 million training observations) to build SVM, RF, GBDT, kNN, and similar models. I think "Group" nodes might help with something like "batching" the data, but I am not sure what that would look like specifically.
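On the "batching" idea: one simple way to assign rows to batches is a DATA step that cycles a batch id, which a Group (Start/End) loop could then iterate over. A minimal sketch, assuming a training table WORK.TRAIN (the table name and batch count are placeholders):

    data work.train_batched;
      set work.train;
      batch = mod(_n_, 4) + 1;  /* assign each row to one of 4 batches */
    run;

Each batch could then be modeled separately, though how to combine the resulting models is a separate question.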
Any suggestions are welcome, and I would really appreciate them.
Thanks very much!
Hello YG1992 -
A first step is to examine the text of the error for more specific information about the problem, and then search on that text.
For example, if the errors are out-of-memory errors, SAS notes on memory settings may apply.
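If memory does turn out to be the bottleneck, a first check is the session's memory ceiling. A minimal sketch (note that MEMSIZE itself can only be changed at SAS start-up, e.g., via -memsize on the command line or in the configuration file, not inside a running session):

    proc options option=memsize value;  /* report the current memory limit */
    run;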
If none of that leads to a resolution, turn on the MPRINT option, create a model package, and contact SAS Technical Support for assistance.
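For reference, MPRINT (along with the related macro-debugging options) is switched on with an OPTIONS statement, so the generated code and the full error text appear in the log you send to support:

    options mprint     /* show SAS code generated by macros     */
            mlogic     /* trace macro execution logic           */
            symbolgen; /* show how macro variables resolve      */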
Have a great day.