Hi everyone,
In my project I have to build models on large datasets, some with more than 3 million observations and hundreds of input variables. For logistic regression (LR) and decision trees (DT) the corresponding nodes work fine, but for machine learning methods such as SVM, Random Forest, Gradient Boosting, and k-Nearest Neighbors, the nodes sometimes fail to finish and stop with error messages.
If I draw a small subsample and apply those methods with exactly the same hyperparameter settings, everything runs fine. That is why I conclude the errors are related to sample size.
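For reference, drawing that kind of subsample can be done with PROC SURVEYSELECT. A minimal sketch, assuming a source table WORK.BIG (the table name, sampling rate, and seed are placeholders):

    proc surveyselect data=work.big out=work.big_sample
                      method=srs    /* simple random sampling */
                      samprate=0.10 /* keep 10% of the rows   */
                      seed=12345;   /* fixed seed for repeatable draws */
    run;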
So I wonder whether there is a way to use all of the training data (e.g., 3 million x 0.7 = 2.1 million training observations) to build SVM, RF, GBDT, kNN, and similar models. I think "Group" nodes might help with something like "batching" the data, but I am not sure what that would look like specifically.
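On the "batching" idea: one simple way to assign rows to batches is a DATA step that cycles a batch id, which a Group (Start/End) loop could then iterate over. A minimal sketch, assuming a training table WORK.TRAIN (the table name and batch count are placeholders):

    data work.train_batched;
      set work.train;
      batch = mod(_n_, 4) + 1;  /* assign each row to one of 4 batches */
    run;

Each batch could then be modeled separately, though how to combine the resulting models is a separate question.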
Any suggestions are welcome, and I would really appreciate them.
Thanks very much!
Hello YG1992 -
A first step is to examine the text of the error for more specific information about the problem, and then search on that text.
For example, if the errors are out-of-memory errors, SAS notes on memory settings may apply.
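If memory does turn out to be the bottleneck, a first check is the session's memory ceiling. A minimal sketch (note that MEMSIZE itself can only be changed at SAS start-up, e.g., via -memsize on the command line or in the configuration file, not inside a running session):

    proc options option=memsize value;  /* report the current memory limit */
    run;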
If none of that leads to a resolution, turn on the MPRINT option, create a model package, and contact SAS Technical Support for assistance.
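For reference, MPRINT (along with the related macro-debugging options) is switched on with an OPTIONS statement, so the generated code and the full error text appear in the log you send to support:

    options mprint     /* show SAS code generated by macros     */
            mlogic     /* trace macro execution logic           */
            symbolgen; /* show how macro variables resolve      */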
Have a great day.