Ullsokk
Pyrite | Level 9

Are there limits to how many variables SAS Enterprise Miner can handle? I am feature engineering a data set that might potentially contain tens of thousands of variables across 50,000 to 100,000 cases. At which point does the data become too large for conventional methods?

1 ACCEPTED SOLUTION

DougWielenga
SAS Employee

It really depends on what 'classical methods' you are trying to employ and the amount of computing power at your disposal.  Even with huge data sets, there are strategies that let you obtain results when the available data dwarfs your computing resources.  For example, one strategy is to partition the data and/or choose a smaller training sample for initial variable selection, and then fit a model against the larger data set using the now-smaller set of variables.  SAS Enterprise Miner is designed to handle a huge number of observations and variables.

One approach to an excessive number of variables is to run HPFOREST for variable selection.  HPFOREST takes samples of the variables and samples of the input data and runs tree models to identify which variables are the most useful across many trees.  Not every observation and not every variable is used in any given tree, but the ability of tree-based methods to identify relationships without needing their exact functional form makes this a very powerful pre-processing tool.  In practice, doing this type of variable selection prior to fitting a model can greatly reduce the time it takes to build a model and yield a much more powerful model.
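As a rough sketch of that idea (the data set, target, and input names below are placeholders, not from this thread, and exact ODS table names can vary by release), an HPFOREST run for variable screening might look like:

```sas
/* Sketch only: work.train, bad, and x1-x10000 are hypothetical names. */
proc hpforest data=work.train maxtrees=200;
   target bad / level=binary;          /* the variable you are predicting  */
   input x1-x10000 / level=interval;   /* the candidate inputs             */
   ods output VariableImportance=work.varimp;  /* capture importance table */
run;
```

You can then sort `work.varimp` by the importance measure and keep only the top-ranked variables as inputs for the downstream modeling nodes.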

If by classical models you mean regression-based methods, there are ways to do variable selection there as well prior to running some type of stepwise model, which will likewise greatly reduce the number of inputs.  If you are running into problems given your data and computing power, provide some details and we can likely recommend strategies for how to proceed.  There is no hard limit, however; the strategies you might need to employ are really a function of your modeling approach, your data, and your computing resources.
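For regression-based screening, something along these lines with PROC GLMSELECT can trim the input list before a conventional stepwise fit (again, the data set and variable names are placeholders):

```sas
/* Sketch only: y and x1-x10000 are hypothetical names. */
proc glmselect data=work.train;
   model y = x1-x10000 / selection=lasso(choose=cv);  /* penalized selection,
                                                         chosen by cross-validation */
run;
```

The variables GLMSELECT retains become the much smaller candidate set for whatever classical model you fit afterward.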

 

Hope this helps!

Doug

View solution in original post

2 REPLIES
RW9
Diamond | Level 26

I don't know of limits on data sets themselves, but the maximum read length is 32,767, so you might hit problems around that (i.e. concatenate all your variable names together and check whether the result exceeds it).

To my mind, however, even data with 50-100 columns becomes unwieldy and hard to look at (imagine trying to pick out 2 or 3 data points in a 1000 x 10000 grid of data).

It is far simpler to model the data either by using relational methodology (i.e. breaking the data out into blocks of like data with mergeable IDs) or by normalising it so the data goes down rather than across; you can have billions of rows of data that way with no problem.
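For illustration (the data set, ID, and variable names here are placeholders), PROC TRANSPOSE will stand a wide table on end like this:

```sas
/* Sketch: turn one-row-per-id, one-column-per-variable data
   into one row per id/variable pair. */
proc transpose data=work.wide
               out=work.long(rename=(_name_=varname col1=value));
   by id;            /* requires work.wide to be sorted by id */
   var x1-x10000;    /* the columns to rotate downward        */
run;
```

The long layout (`id`, `varname`, `value`) grows in rows rather than columns, which is exactly the shape that stays manageable at scale.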



Discussion stats
  • 2 replies
  • 7302 views
  • 2 likes
  • 3 in conversation