Question
Fluorite | Level 6

Hi,

What is the minimum number of responses required to build a decent model, in terms of volume? For example, I contacted 4,000 people and 700 of them responded (Yes).

Is it enough to build a model?

Many Thanks

Alice

1 ACCEPTED SOLUTION

DougWielenga
SAS Employee

The short answer is that "your mileage may vary" depending on your analytical needs and business objectives.  

 

The minimum number of responses needed to build a model depends on the modeling approach.  For instance, a rule of thumb that is commonly quoted but rarely discussed in textbooks says that an ordinary least squares regression model needs at least 5 observations for each model parameter.  You usually estimate the intercept (1 parameter), add one parameter for each interval input variable (say, J parameters), and add k-1 parameters for each categorical input variable, where k is the number of levels of that variable.  Add more parameters if you want to consider interactions or higher-order terms.  For neural network models, you might be better off with at least 15-20 observations per parameter, and a corresponding neural network model has far more parameters.  Decision trees do not have 'parameters' in this sense, so it is not really possible to give a comparable number.
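To make the parameter counting above concrete, here is a minimal Python sketch. The function names and the example inputs (10 interval variables, three categorical variables with 3, 5 and 4 levels, 5 hidden units) are made up for illustration, and the neural network count assumes a fully connected network with a single hidden layer and bias terms, which is only one possible architecture. Treat the results as rough rules of thumb, not requirements.

```python
def regression_parameter_count(n_interval, categorical_levels, n_extra_terms=0):
    """Intercept + 1 per interval input + (k - 1) per categorical input
    + any interaction or higher-order terms you choose to add."""
    if n_interval < 0 or n_extra_terms < 0:
        raise ValueError("counts must be non-negative")
    if any(k < 2 for k in categorical_levels):
        raise ValueError("each categorical input needs at least 2 levels")
    return 1 + n_interval + sum(k - 1 for k in categorical_levels) + n_extra_terms


def mlp_parameter_count(n_inputs, n_hidden, n_outputs=1):
    """Weights and biases of a fully connected network with one hidden layer
    (an assumed architecture, chosen only for illustration)."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


def min_observations(n_parameters, obs_per_parameter):
    """Rule-of-thumb minimum sample size for a given parameter count."""
    return n_parameters * obs_per_parameter


# Hypothetical inputs: 10 interval variables and three categorical
# variables with 3, 5 and 4 levels.
reg_params = regression_parameter_count(10, [3, 5, 4])

# The same inputs after dummy coding (everything except the intercept)
# feed a network with 5 hidden units.
nn_params = mlp_parameter_count(n_inputs=reg_params - 1, n_hidden=5)

print("regression:", reg_params, "parameters,",
      min_observations(reg_params, 5), "observations at 5 per parameter")
print("neural net:", nn_params, "parameters,",
      min_observations(nn_params, 15), "observations at 15 per parameter")
```

With these made-up inputs the regression has 20 parameters and the network has 106, which shows why the neural network needs far more data even before the higher per-parameter multiplier is applied.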

 

In the end, you can consider the following:

   * data mining problems typically have a large number of observations

   * when you have a relatively small number of observations, you have to consider simpler models

   * the predictive capability of those models with few observations will likely be less than that of a model computed on a larger sample from a population

   * data is expensive and obtaining more data (let alone a great deal more data) is often not feasible or practical

   * data-based decisions are generally better than purely perception-based decisions since the data improves your understanding of what is happening

   * different modeling methods have different requirements

   * the modeling methods will typically return errors or clearly problematic results when there are too few observations

   * this often happens when there are a small number of events of interest for a categorical target

   * your confidence in your conclusions should be lower when you have relatively few observations

   * the accuracy of the prediction and the stability of the relationship being modeled must be considered in assessing the strength of your conclusions

 

In many cases people are modeling rare event scenarios.  You will likely learn from experience how strong your conclusions can be for a given sample of data.  I spoke with a direct marketing company that only needed a 2% response rate and did not have much confidence in its models unless it had at least 5,000 respondents.  You can't use this number directly, because you are probably working with different model requirements (which affect model complexity) and almost certainly a different analysis problem.  Even if it is a similar problem in the same general area, you are likely analyzing data for a different company.
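For a sense of scale, the arithmetic behind those figures can be sketched in a few lines of Python. The helper names are hypothetical, and the calculation only converts a response rate and a respondent count into a number of contacts. It says nothing about whether that volume is right for your problem.

```python
import math
from fractions import Fraction


def response_rate(responders, contacted):
    """Share of contacted people who responded."""
    if contacted <= 0:
        raise ValueError("contacted must be positive")
    return responders / contacted


def contacts_needed(target_responders, rate):
    """Contacts required to expect `target_responders` at a given response rate.
    Fraction(str(rate)) keeps the division exact for rates such as 0.02."""
    exact_rate = Fraction(str(rate))
    if exact_rate <= 0:
        raise ValueError("rate must be positive")
    return math.ceil(Fraction(target_responders) / exact_rate)


# The question's figures: 700 responders out of 4,000 contacted.
print(response_rate(700, 4000))

# The direct marketing example: 5,000 respondents at a 2% response rate.
print(contacts_needed(5000, 0.02))
```

The question's data has a 17.5% response rate, so it is far less of a rare event problem than the 2% case, where 5,000 respondents imply about 250,000 contacts.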

 

I hope this helps!

Doug 

 



