One model vs multiple models?

06-18-2012 09:46 PM

Hello, All

Suppose I have a dataset with several variables (say Var1-Var20), and suppose Var20 is a categorical variable with the following values: low, medium, high. I have two choices:

Choice 1: build one single model using all variables (Var1-Var20)

Choice 2: split the original dataset into three so that dataset1 contains only those observations for which Var20 is "low", dataset2 only those for which Var20 is "medium", and dataset3 only those for which Var20 is "high"; then build three models using Var1-Var19, one each for dataset1, dataset2, and dataset3.

My question: which choice should I go for? If it is choice 1, what is the advantage of this method? What does choice 2 miss?

Thank you very much for educating me.

Accepted Solutions

Solution


Posted in reply to doudou66

06-19-2012 08:06 AM

Advantage of choice1: You use all the data.

Disadvantage of choice1: You use all the data. (Yes, that's what I said)

Suppose your data were such that 10% of the observations were from the low group, 20% from the medium group, and 70% from the high group. If you fit a simple model in which var20 (as you name it) enters only as a main effect, all the other parameters will be fit based predominantly on the values seen in the high group. You might follow your first instinct to fit separate models for each group, but then you run into the problem of comparing across models. You really have no way of testing directly whether the parameter for var1 is the same in each of the groups. You could examine the confidence limits from each separate model, but that would not be as satisfying as creating confidence limits on the difference under choice 1.
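To make the imbalance concrete, here is a minimal sketch in plain Python. Everything in it is made up for illustration: a single hypothetical predictor `x`, noise-free responses, and true slopes of 1, 2, and 3 for the low, medium, and high groups. Only the 10/20/70 split comes from the paragraph above. It compares the per-group slopes with the one common slope that a pooled model with group-specific intercepts would estimate.

```python
# Illustrative data only: group sizes follow the 10/20/70 split above;
# the slopes (1, 2, 3) and the predictor x are invented for this sketch.
GROUPS = {"low": (10, 1.0), "medium": (20, 2.0), "high": (70, 3.0)}


def make_group(n, slope):
    """Return noise-free (xs, ys) with x cycling through 0..9."""
    xs = [float(i % 10) for i in range(n)]
    ys = [slope * x for x in xs]
    return xs, ys


def centered_sums(xs, ys):
    """Return (Sxx, Sxy): sums of squares and cross-products about the means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, sxy


data = {name: make_group(n, slope) for name, (n, slope) in GROUPS.items()}
sums = {name: centered_sums(xs, ys) for name, (xs, ys) in data.items()}

# Choice 2: one least-squares slope per group.
separate_slopes = {name: sxy / sxx for name, (sxx, sxy) in sums.items()}

# Choice 1 without interactions: group-specific intercepts, one common slope.
# Its least-squares estimate pools the within-group sums.
common_slope = (
    sum(sxy for _, sxy in sums.values()) / sum(sxx for sxx, _ in sums.values())
)

print(separate_slopes)
print(round(common_slope, 3))
```

With these invented numbers the common slope comes out at 2.6, a weighted average pulled toward the high group's slope of 3, while the low group's slope of 1 barely registers.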

You might wish to consider a more complex model that includes interactions between the categorical variable and the continuous variables, thus giving parameter estimates that can be compared directly. However, you need to watch for having enough data to adequately fit the additional parameters, as you would be going from estimating 21 (20 vars plus an intercept) to as many as 58 (19 vars at three levels each plus an intercept, depending on the parameterization you use). To get good estimates, you really need three times as much data.
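The parameter arithmetic above can be written out as a small sketch. The function names are hypothetical, and the counts simply reproduce the figures quoted in the paragraph; other parameterizations give slightly different totals.

```python
def main_effects_params(n_vars=20):
    """Count used above: one parameter per variable plus an intercept."""
    return n_vars + 1


def interaction_params(n_continuous=19, n_levels=3):
    """Count used above: a slope per variable per level, plus an intercept."""
    return n_continuous * n_levels + 1


simple = main_effects_params()
interacted = interaction_params()

# The ratio of parameter counts is what motivates the "three times as much
# data" rule of thumb.
print(simple, interacted, round(interacted / simple, 2))  # 21 58 2.76
```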

This is an opinion, and everyone has them, so take what you like and leave the rest.

Steve Denham

All Replies


Posted in reply to doudou66

06-19-2012 10:46 AM

Interesting question. I like Steve's answer. For my part, there are so many kinds of models that it is impossible to give anything but a very general (vague) answer.

Choice 1 : compare the data from three groups

Choice 2 : compare models describing the three groups

The choice depends essentially on the purpose of your model and the properties of your sample. If it is suspected that the phenomenon that you want to model operates differently in the three groups, for example, that it is associated with different variables, then it might be preferable to develop separate models (choice 2) for each group. If, on the other hand, you are interested in discriminating among the three groups, or in modeling something more general on which Var20 might or might not have an effect, then you must model the data as a whole (choice 1) and, if necessary, adjust priors or weights for the properties of your sample.
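As one illustration of what "adjust weights" could look like, here is a minimal sketch in plain Python. Weighting each observation by the inverse of its group size is an assumption made for this example, not something specified in the thread, and the data (one hypothetical predictor `x`, group sizes 10/20/70, noise-free slopes of 1, 2, and 3) are invented.

```python
# Hypothetical setup: group sizes and slopes are invented for illustration.
GROUPS = {"low": (10, 1.0), "medium": (20, 2.0), "high": (70, 3.0)}


def within_group_sums(n, slope):
    """Sxx and Sxy for noise-free data y = slope * x, x cycling through 0..9."""
    xs = [float(i % 10) for i in range(n)]
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxx, slope * sxx  # with no noise, Sxy equals slope * Sxx


def common_slope(weights):
    """Weighted least-squares common slope with group-specific intercepts.

    Weights are constant within a group, so the estimate is a weighted
    pooling of the within-group sums.
    """
    num = den = 0.0
    for name, (n, slope) in GROUPS.items():
        sxx, sxy = within_group_sums(n, slope)
        num += weights[name] * sxy
        den += weights[name] * sxx
    return num / den


# Every observation counts equally: the large group dominates.
unweighted = common_slope({name: 1.0 for name in GROUPS})

# Assumed scheme: weight by 1 / group size so each group counts equally.
balanced = common_slope({name: 1.0 / n for name, (n, _) in GROUPS.items()})

print(round(unweighted, 3), round(balanced, 3))
```

Under this assumed scheme the common slope moves from 2.6 to 2.0, the plain average of the three group slopes. Whether that is desirable depends on what the model is for.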

Hth.

PG


Posted in reply to PGStats

06-19-2012 10:59 AM

I want to expand on what PG said:

**The choice depends essentially on the purpose of your model and the properties of your sample**.

This is the motto that should be posted over the top of every data analyst's computer screen. And it is what makes answering the question so difficult. What is the question that you are trying to address? Is this to be an exercise in data exploration, or are there well-defined questions to be addressed? Are you looking for a parsimonious model that is suited for prediction, or are you interested in the interplay between predictors and the response in an attempt to find support for hypothetical relationships? If you add these questions to what PG said, then you will have a very good starting place to guide your decision.

Steve Denham


Posted in reply to SteveDenham

06-19-2012 11:17 PM

Dear Steve and PG,

Thank you very much for your help. I really appreciate it. I do not have specific questions; I am asking this one because someone asked me and I don't know the answer.

Referring to "a more complex model that includes interactions" (in Steve's post), if I have enough data, will such a model be my best pick?


Posted in reply to doudou66

06-20-2012 07:43 AM

**The choice depends essentially on the purpose of your model and the properties of your sample**.

I can't stress that enough. Just because a model is more complex does not make it better or worse; it all depends on the context of its use. If I were doing a confirmatory analysis, and I believed that var1-var19 were all important factors to examine, and I had enough data, then I would probably fit a model that included all variables plus interactions of var20 with each of var1 through var19. I would then look at the results and see if they made sense, and see if I could eliminate some of the interactions as either nuisance or nonsensical. That would mean looking at a LOT of plots, because I really don't have the ability to think in a 20-dimensional space. Even after a model reduction, I would probably be left with something that would take substantial effort to interpret. I would be worried about collinearity in the continuous variables and highly leveraged points that probably tell me more about my data collection efforts than about the response. I would start to wonder if all of those continuous variables were truly independent of one another, and if any dependence was due to the presence of the categorical variable.
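As a rough illustration of checking whether an interaction can be dropped, the sketch below computes an extra-sum-of-squares F statistic for a single hypothetical predictor `x` interacting with a three-level group. This is one common approach, not necessarily the one described in this post, and the simulated data (group sizes, slopes, noise level, seed) are all invented.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: sizes, slopes, and noise level are invented.
GROUPS = {"low": (10, 1.0), "medium": (20, 2.0), "high": (70, 3.0)}


def simulate(n, slope):
    """Return (xs, ys) with x cycling through 0..9 and unit-variance noise."""
    xs = [float(i % 10) for i in range(n)]
    ys = [slope * x + random.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def centered(xs, ys):
    """Return Sxx, Sxy, Syy about the group means."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    return sxx, sxy, syy


stats = [centered(*simulate(n, slope)) for n, slope in GROUPS.values()]
n_total = sum(n for n, _ in GROUPS.values())
k = len(GROUPS)

# Full model: separate intercept and slope per group (group-by-x interaction).
rss_full = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in stats)
df_full = n_total - 2 * k

# Reduced model: separate intercepts, one common slope (no interaction).
tot_sxx = sum(s[0] for s in stats)
tot_sxy = sum(s[1] for s in stats)
tot_syy = sum(s[2] for s in stats)
rss_reduced = tot_syy - tot_sxy ** 2 / tot_sxx
df_reduced = n_total - (k + 1)

# Extra-sum-of-squares F statistic for the hypothesis "all slopes are equal".
f_stat = ((rss_reduced - rss_full) / (df_reduced - df_full)) / (rss_full / df_full)
print(round(f_stat, 1))
```

A value that is large relative to an F distribution with 2 and 94 degrees of freedom would argue against dropping the interaction. With 19 predictors the same comparison would have to be repeated or generalized for each interaction term, which is part of why the reduction takes so much effort.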

At some point, I would reach in the closet and get out my old texts on Mathematical Biology, and see if I might be better off trying to write some sort of structural system of equations, rather than piling everyone into the van and seeing who ended up in good seats. So once again I say:

**The choice depends essentially on the purpose of your model and the properties of your sample**.

Good luck.

Steve Denham


Posted in reply to SteveDenham

06-20-2012 06:54 PM

Thank you very much for your advice; it is really helpful and enlightening.