Please Help. What's the minimum number of responses...


04-26-2013 10:31 AM

Hi,

What's the minimum number of responses required to build a decent model, in terms of volume? For example, I contacted 4,000 people and received 700 responses (Yes).

Is it enough to build a model?

Many Thanks

Alice

Accepted Solutions

Solution

Wednesday


The short answer is that "your mileage may vary" depending on your analytical needs and business objectives.

The minimum number of responses needed to build a model depends on the modeling approach. For instance, a commonly cited rule of thumb, though one rarely discussed in textbooks, is that an ordinary least squares regression model needs at least 5 observations for each model parameter. To count the parameters, start with the intercept (1 parameter), add one for each interval input variable (say, J parameters), and add k-1 for each categorical input variable, where k is the number of levels of that variable. Add more if you want to consider any interactions or higher-order terms. For neural network models, you might be better off with at least 15-20 observations per parameter, and a neural network typically has far more parameters than the corresponding regression model. Decision trees do not have 'parameters' in this sense, so it is not really possible to give a comparable figure.
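The parameter count described above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not a recommendation: the function names and the example inputs (10 interval variables, three categorical variables with 3, 5, and 2 levels) are made up for the example.

```python
def count_parameters(n_interval, categorical_levels, extra_terms=0):
    """Count regression parameters as described in the post:
    1 intercept + 1 per interval input + (k - 1) per categorical input,
    plus any interaction or higher-order terms."""
    intercept = 1
    categorical = sum(k - 1 for k in categorical_levels)
    return intercept + n_interval + categorical + extra_terms


def min_observations(n_params, per_param=5):
    """Rule-of-thumb minimum sample size (5 per parameter for OLS)."""
    return n_params * per_param


# Hypothetical inputs, chosen only to show the arithmetic.
params = count_parameters(n_interval=10, categorical_levels=[3, 5, 2])
print(params)                    # 18  (1 + 10 + 2 + 4 + 1)
print(min_observations(params))  # 90
```

Note how quickly categorical variables with many levels inflate the count: a single 50-level input adds 49 parameters on its own.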

In the end, you can consider the following:

* data mining problems typically have a large number of observations

* when you have a relatively small number of observations, you have to consider simpler models

* the predictive capability of those models with few observations will likely be less than that of a model computed on a larger sample from a population

* data is expensive and obtaining more data (let alone a great deal more data) is often not feasible or practical

* data-based decisions are generally better than purely perception-based decisions, since the data improves your understanding of what is happening

* different modeling methods have different requirements

* the modeling methods will typically return errors or clearly problematic results when there are too few observations

* such errors often occur when there are only a small number of events of interest for a categorical target

* your confidence in your conclusions should be lower when you have relatively few observations

* the accuracy of the prediction and the stability of the relationship being modeled must be considered in assessing the strength of your conclusions

In many cases people are modeling rare event scenarios. You will likely learn from experience how strong your conclusions can be for a given sample of data. I spoke with a direct marketing company that only needed a 2% response rate and didn't have much confidence in their models unless they had at least 5,000 respondents. You can't use this number directly, because you probably have different model requirements, which affect model complexity, and almost certainly a different analysis problem. Even if it is a similar problem in the same general area, you are likely analyzing data for a different company.
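For a rough sense of scale, here is the same kind of arithmetic applied to the figures in the question (4,000 contacted, 700 Yes responses). Two assumptions in this sketch go beyond what is said above: it borrows the 5-observations-per-parameter ratio from the OLS rule of thumb, and it applies that ratio to the smaller of the two outcome groups as a conservative reading for a yes/no target. The function name is made up for the example.

```python
def max_parameters(n_limiting, per_param=5):
    """Largest parameter count a sample supports at a given
    observations-per-parameter ratio (a heuristic, not a guarantee)."""
    return n_limiting // per_param


contacted, responses = 4000, 700
rate = responses / contacted
print(f"response rate: {rate:.1%}")

# Assumption: for a binary target, the smaller outcome group is the
# limiting count, since that is where events of interest run short.
limiting = min(responses, contacted - responses)

print(max_parameters(contacted))  # 800, counting every observation
print(max_parameters(limiting))   # 140, counting only the 700 responders
```

Either way, these numbers only bound how complex a model the sample might support; they say nothing about whether the resulting model will predict well.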

I hope this helps!

Doug
