Building models with SAS Enterprise Miner, SAS Factory Miner, SAS Visual Data Mining and Machine Learning or just with programming

Please help. What's the minimum number of responses required to build a model? Thank you

Hi,

What's the minimum number of responses required to build a decent model, in terms of volume? For example, I contacted 4,000 people and received 700 "Yes" responses.

Is it enough to build a model?

Many Thanks

Alice


Accepted Solutions
Solution
Wednesday
SAS Employee
Posts: 121

Re: Please help. What's the minimum number of responses required to build a model? Thank you

The short answer is that "your mileage may vary" depending on your analytical needs and business objectives.  

 

The minimum number of responses needed to build a model depends on the modeling approach. For instance, a rule of thumb that is commonly cited but rarely discussed in textbooks is that an ordinary least squares regression model needs at least 5 observations for each model parameter. To count the parameters, start with the intercept (1 parameter), add one for each interval input variable (say, J parameters), add k-1 for each categorical input variable, where k is the number of levels of that variable, and add more for any interactions or higher-order terms you want to consider. For neural network models, you might be better off with at least 15-20 observations per parameter, and a comparable neural network has far more parameters than the corresponding regression. Decision trees do not have 'parameters' in this sense, so no similar per-parameter rule can be stated for them.
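To make the counting above concrete, here is a minimal Python sketch. The function names and the example inputs (10 interval variables, two categorical variables with 4 and 3 levels, 8 hidden units) are hypothetical and chosen only for illustration. The neural network count assumes a fully connected network with one hidden layer, bias terms, and a single output unit, which is just one possible architecture.

```python
def regression_parameter_count(n_interval, categorical_levels, n_extra_terms=0):
    """Intercept + one per interval input + (k - 1) per categorical input + extras."""
    dummy_params = sum(k - 1 for k in categorical_levels)
    return 1 + n_interval + dummy_params + n_extra_terms


def min_observations(n_params, obs_per_param):
    """Rule-of-thumb minimum: a fixed number of observations per parameter."""
    return n_params * obs_per_param


def single_hidden_layer_parameter_count(n_inputs, n_hidden):
    """Assumed architecture: one hidden layer, biases, one output unit."""
    hidden_layer = (n_inputs + 1) * n_hidden  # weights plus bias for each hidden unit
    output_layer = n_hidden + 1               # weights plus bias for the output unit
    return hidden_layer + output_layer


# Hypothetical inputs, not taken from the original question.
n_interval = 10
categorical_levels = [4, 3]

reg_params = regression_parameter_count(n_interval, categorical_levels)
reg_min = min_observations(reg_params, obs_per_param=5)

# The network sees the same inputs after dummy coding the categorical variables.
n_inputs = n_interval + sum(k - 1 for k in categorical_levels)
nn_params = single_hidden_layer_parameter_count(n_inputs, n_hidden=8)
nn_min = min_observations(nn_params, obs_per_param=15)

print(f"regression: {reg_params} parameters, at least {reg_min} observations")
print(f"neural net: {nn_params} parameters, at least {nn_min} observations")
```

With these made-up inputs the regression has 16 parameters and the network has 137, which shows how quickly the rule-of-thumb minimum grows once a neural network is involved.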

 

In the end, you can consider the following:

   * data mining problems typically involve a large number of observations

   * when you have relatively few observations, you have to consider simpler models

   * a model built on few observations will likely predict less well than one built on a larger sample from the same population

   * data is expensive, and obtaining more data (let alone a great deal more) is often not feasible or practical

   * data-based decisions are generally better than purely perception-based decisions, since the data improves your understanding of what is happening

   * different modeling methods have different requirements

   * modeling methods will typically return errors or clearly problematic results when there are too few observations

   * this often happens when a categorical target has only a small number of events of interest

   * your confidence in your conclusions should be lower when you have relatively few observations

   * the accuracy of the prediction and the stability of the relationship being modeled must both be considered when assessing the strength of your conclusions

 

In many cases people are modeling rare-event scenarios. You will likely learn from experience how strong your conclusions can be for a given sample of data. I spoke with a direct marketing company that only needed a 2% response rate and didn't have much confidence in their models unless they had at least 5,000 respondents. You can't use this number directly, because you probably have different model requirements, which affect model complexity, and almost certainly a different analysis problem. Even if it is a similar problem in the same general area, you are likely analyzing data for a different company.
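As a rough illustration of the arithmetic behind that anecdote, the Python sketch below computes the response rate from the question (700 "Yes" out of 4,000 contacts) and the number of contacts that would be needed to collect 5,000 respondents. It assumes the 2% figure is the response rate actually observed, which is my reading of the anecdote and not something stated precisely; the function names are hypothetical.

```python
import math
from fractions import Fraction


def response_rate(responses, contacts):
    """Share of contacted people who responded."""
    return responses / contacts


def contacts_needed(target_respondents, rate):
    """Contacts required to expect `target_respondents` at a given rate.

    `rate` is passed as a string (e.g. "0.02") so the division is exact.
    """
    return math.ceil(Fraction(target_respondents) / Fraction(rate))


# Figures from the original question.
asker_rate = response_rate(700, 4000)

# Figures from the direct marketing anecdote, assuming 2% is the observed rate.
needed = contacts_needed(5000, "0.02")

print(f"response rate in the question: {asker_rate:.1%}")
print(f"contacts needed for 5,000 respondents at 2%: {needed:,}")
```

The comparison is only meant to show scale: the response rate in the question is far higher than 2%, so the 5,000-respondent threshold from a rare-event setting does not transfer directly.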

 

I hope this helps!

Doug 

 


