doudou66
Calcite | Level 5

Hello, all,

Suppose I have a dataset with several variables (say Var1-Var20), and suppose Var20 is a categorical variable with the following values: low, medium, high. I have two choices:

Choice 1: build one single model using all variables (Var1-Var20)

Choice 2: split the original dataset into three, so that dataset1 contains only the observations where Var20 is "low", dataset2 only those where Var20 is "medium", and dataset3 only those where Var20 is "high"; then build three models using Var1-Var19, one each for dataset1, dataset2, and dataset3.

My question: which choice should I go for? If it is choice 1, what is the advantage of this method? What does choice 2 miss?

Thank you very much for educating me.

1 ACCEPTED SOLUTION

SteveDenham
Jade | Level 19

Advantage of choice1: You use all the data.

Disadvantage of choice1: You use all the data.  (Yes, that's what I said)

Suppose your data were such that 10% of the observations were from the low group, 20% from the medium group, and 70% from the high group.  If you fit a simplistic model in which var20 (as you name it) enters only as a main effect, you will fit all the other parameters based predominantly on the values seen in the high group.  You might follow your first instinct to fit separate models for each group, but then you run into the problem of comparing across models.  You really have no way of testing directly whether the parameter for var1 is the same in each of the groups.  You could examine the confidence limits, but it would not be as satisfying as creating confidence limits on the difference under choice1.
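To make the imbalance point concrete, here is a minimal Python sketch (standard library only). Everything in it is an assumption made for illustration: the data are synthetic and noise-free, there is a single predictor, the true slopes of 1, 2, and 3 are made up, and `ols_slope` is a hypothetical helper. It is not a recommended analysis, only a picture of how a single pooled slope gets pulled toward the largest group.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on x for a simple one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Assumed (made-up) true slopes and the 10/20/70 group sizes from the example.
true_slopes = {"low": 1.0, "medium": 2.0, "high": 3.0}
sizes = {"low": 10, "medium": 20, "high": 70}

# Noise-free synthetic data: x evenly spaced on [0, 10] within each group.
data = {}
for group, n in sizes.items():
    xs = [10.0 * i / (n - 1) for i in range(n)]
    ys = [true_slopes[group] * x for x in xs]
    data[group] = (xs, ys)

# Choice 2: one fit per group recovers each group's own slope.
group_slopes = {g: ols_slope(xs, ys) for g, (xs, ys) in data.items()}

# One pooled slope that ignores the grouping entirely.
all_x = [x for xs, _ in data.values() for x in xs]
all_y = [y for _, ys in data.values() for y in ys]
pooled_slope = ols_slope(all_x, all_y)

print(group_slopes)
print(pooled_slope)
```

In this toy setup the per-group fits return 1, 2, and 3, while the pooled slope lands between 2.5 and 2.6, well above the unweighted average of 2 and close to the slope of the high group that supplies 70% of the rows.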

You might wish to consider a more complex model that includes interactions between the categorical variable and the continuous variables, thus giving parameter estimates that can be compared directly.  However, you need to watch for having enough data to adequately fit the additional parameters, as you would be going from estimating 21 (20 vars plus an intercept) to as many as 58 (19 vars at three levels each plus an intercept, depending on the parameterization you use).  To get good estimates, you really need three times as much data.
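The parameter counts above depend on how the categorical variable is coded, so a small counting sketch in Python may help. The function `count_parameters` and the labels it returns are hypothetical names used only for illustration, and the four parameterizations listed are just some common ones, not an exhaustive list.

```python
def count_parameters(n_continuous, n_levels):
    """Count regression coefficients under a few common parameterizations."""
    return {
        # Counting convention used in the post: intercept + one term per variable.
        "main effects, one term per variable": 1 + n_continuous + 1,
        # Dummy (reference) coding: intercept + slopes + (levels - 1) group offsets.
        "main effects, dummy-coded": 1 + n_continuous + (n_levels - 1),
        # One shared intercept, a separate slope per level for every predictor.
        "separate slopes, common intercept": 1 + n_continuous * n_levels,
        # Separate intercept and slopes per level: the same number of
        # coefficients as three separate per-group fits.
        "separate slopes and intercepts": n_levels * (1 + n_continuous),
    }

# 19 continuous predictors (Var1-Var19) and a 3-level Var20.
counts = count_parameters(n_continuous=19, n_levels=3)
for name, k in counts.items():
    print(f"{name}: {k}")
```

Under these assumptions the 21 and 58 quoted above correspond to the first and third entries; dummy coding of Var20 with group-specific intercepts gives 22 and 60 instead. Either way, the interaction model carries close to three times as many coefficients.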

This is an opinion, and everyone has them, so take what you like and leave the rest.

Steve Denham

6 REPLIES
PGStats
Opal | Level 21

Interesting question. I like Steve's answer. For my part, there are so many kinds of models that it is impossible to give anything but a very general (vague) answer.

Choice 1 : compare the data from three groups

Choice 2 : compare models describing the three groups


The choice depends essentially on the purpose of your model and the properties of your sample. If you suspect that the phenomenon you want to model operates differently in the three groups, for example, that it is associated with different variables, then it might be preferable to develop separate models (choice 2) for each group. If, on the other hand, you are interested in discriminating among the three groups, or in modeling something more general on which Var20 might or might not have an effect, then you must model the data as a whole (choice 1) and, if necessary, adjust priors or weights for the properties of your sample.
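As one possible reading of "adjust weights", here is a minimal Python sketch of inverse-frequency weighting, using the 10/20/70 split from Steve's example. This is only one common convention, not necessarily the adjustment intended here, and `inverse_frequency_weights` is a hypothetical helper name.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each level by n / (k * count) so every level has equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {level: n / (k * count) for level, count in counts.items()}

# Assumed sample: 10 "low", 20 "medium", and 70 "high" observations.
labels = ["low"] * 10 + ["medium"] * 20 + ["high"] * 70
weights = inverse_frequency_weights(labels)

for level, w in weights.items():
    print(f"{level}: weight per observation = {w:.3f}")
```

With these weights each group contributes the same total weight to a weighted fit, so the high group no longer dominates simply because it has the most rows. Whether that is appropriate depends on the population the model is meant to describe.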

Hth.

PG

SteveDenham
Jade | Level 19

I want to expand on what PG said:

The choice depends essentially on the purpose of your model and the properties of your sample.

This is the motto that should be posted over the top of every data analyst's computer screen.  And it is what makes answering the question so difficult.  What is the question that you are trying to address?  Is this to be an exercise in data exploration, or are there well defined questions to be addressed?  Are you looking for a parsimonious model that is suited for prediction, or are you interested in the interplay between predictors and the response in an attempt to find support for hypothetical relationships?  If you add these questions on to what PG said, then you will have a very good beginning place to guide your decision.

Steve Denham

doudou66
Calcite | Level 5

Dear Steve and PG,

Thank you very much for your help. I really appreciate it. I do not have specific questions; I am asking this one because someone asked me and I don't know the answer.

Referring to "a more complex model that includes interactions" (in Steve's post), if I have enough data, will such a model be my best pick?

SteveDenham
Jade | Level 19

The choice depends essentially on the purpose of your model and the properties of your sample.

I can't stress that enough.  Just because a model is more complex does not make it better or worse--it all depends on the context of its use.  If I were doing a confirmatory analysis, and I believed that var1-var19 were all important factors to examine, and I had enough data, then I would probably fit a model that included all variables plus interactions of var20 with each of var1 through var19.  I would then look at the results and see if they made sense, and see if I could eliminate some of the interactions as either nuisance or nonsensical.  That would mean looking at a LOT of plots, because I really don't have the ability to think in a 20 dimensional space.  Even after a model reduction, I would probably be left with something that would take substantial effort to interpret.  I would be worried about collinearity in the continuous variables and highly leveraged points that probably tell me more about my data collection efforts than about the response.  I would start to wonder if all of those continuous variables were truly independent of one another, and if any dependence was due to the presence of the categorical variable.

At some point, I would reach in the closet and get out my old texts on Mathematical Biology, and see if I might be better off trying to write some sort of structural system of equations, rather than piling everyone into the van and seeing who ended up in good seats. So once again I say:

The choice depends essentially on the purpose of your model and the properties of your sample.

Good luck.

Steve Denham

doudou66
Calcite | Level 5

Thank you very much for your advice; it is really helpful and enlightening.
