Ullsokk
Pyrite | Level 9

Are there limits to how many variables SAS Enterprise Miner can handle? I am feature engineering a data set that might potentially contain tens of thousands of variables across 50,000 to 100,000 cases. At which point does the data become too large for conventional methods?

1 ACCEPTED SOLUTION

DougWielenga
SAS Employee

It really depends on what 'classical methods' you are trying to employ and the amount of computing power at your disposal.  Even with huge data sets, there are strategies that let you obtain results when the available data dwarfs your computing resources.  For example, one strategy is to partition the data and/or choose a smaller training sample for initial variable selection, and then fit a model against the larger data set using the now-smaller set of variables.  SAS Enterprise Miner is designed to handle a huge number of observations and variables.

One approach to an excessive number of variables is to run HPFOREST for variable selection.  HPFOREST takes samples of the variables and samples of the input data and runs tree models to identify which variables are the most useful across many trees.  Not every observation and not every variable is used in any given tree, but the ability of tree-based methods to identify relationships without needing their exact functional form makes this a very powerful pre-processing tool.  In practice, doing this type of variable selection prior to fitting a model can greatly reduce the time it takes to build a model and yield a much more powerful model.
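As a rough sketch of that idea (the data set, target, and input names below are placeholders, not from this thread, and exact ODS table names can vary by release), an HPFOREST run for variable screening might look like:

```sas
/* Sketch only: work.train, bad, and x1-x10000 are hypothetical names. */
proc hpforest data=work.train maxtrees=200;
   target bad / level=binary;          /* the variable you are predicting  */
   input x1-x10000 / level=interval;   /* the candidate inputs             */
   ods output VariableImportance=work.varimp;  /* capture importance table */
run;
```

You can then sort `work.varimp` by the importance measure and keep only the top-ranked variables as inputs for the downstream modeling nodes.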

If by classical models you mean regression-based methods, there are ways to do variable selection there as well prior to running some type of stepwise model, which will likewise greatly reduce the number of inputs.  If you are running into problems given your data and computing power, provide some details and we can likely recommend strategies for how to proceed.  There is no hard limit, however; the strategies you might need to employ are really a function of your modeling approach, your data, and your computing resources.
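For regression-based screening, something along these lines with PROC GLMSELECT can trim the input list before a conventional stepwise fit (again, the data set and variable names are placeholders):

```sas
/* Sketch only: y and x1-x10000 are hypothetical names. */
proc glmselect data=work.train;
   model y = x1-x10000 / selection=lasso(choose=cv);  /* penalized selection,
                                                         chosen by cross-validation */
run;
```

The variables GLMSELECT retains become the much smaller candidate set for whatever classical model you fit afterward.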

 

Hope this helps!

Doug

View solution in original post

2 REPLIES
RW9
Diamond | Level 26

I don't know of limits on data sets themselves, but the maximum read length is 32,767, so you might hit problems around that (i.e. concatenate all your variable names together and check whether the result exceeds it).

To my mind, however, even data with 50-100 columns becomes unwieldy and hard to look at (imagine trying to pick out 2 or 3 data points in a 1000 x 10000 grid of data).

It is far simpler to model the data either by using relational methodology (i.e. breaking the data out into blocks of like data with mergeable IDs) or by normalising it so the data goes down rather than across; you can have billions of rows of data that way with no problem.
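For illustration (the data set, ID, and variable names here are placeholders), PROC TRANSPOSE will stand a wide table on end like this:

```sas
/* Sketch: turn one-row-per-id, one-column-per-variable data
   into one row per id/variable pair. */
proc transpose data=work.wide
               out=work.long(rename=(_name_=varname col1=value));
   by id;            /* requires work.wide to be sorted by id */
   var x1-x10000;    /* the columns to rotate downward        */
run;
```

The long layout (`id`, `varname`, `value`) grows in rows rather than columns, which is exactly the shape that stays manageable at scale.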



Discussion stats
  • 2 replies
  • 7302 views
  • 2 likes
  • 3 in conversation