How to "Split Data" (By Group Processing)

madpumpkinpie · Posted 01-26-2018 03:15 PM

Hi there,

I have googled similar questions but could not find answers that I can understand. So here I am asking for your help!

What I want to know is how to get results separately in a group.

Example (just copied and pasted from Excel)

ID Weight Treatment kcal
1   NW   A   400
2   NW   A   500
3   OW   A   560
4   NW   A   800
5   OW   A   490
6   NW   A   500
7   OW   A   400
8   OW   A   700
9   NW   A   900
1   NW   B   580
2   NW   B   600
3   OW   B   800
4   NW   B   500
5   OW   B   600
6   NW   B   800
7   OW   B   700
8   OW   B   500
9   NW   B   780
1   NW   C   570
2   NW   C   670
3   OW   C   570
4   NW   C   400
5   OW   C   600
6   NW   C   800
7   OW   C   800
8   OW   C   500
9   NW   C   800

In this example, I am going to run one-way repeated measures ANOVA to see if there is a treatment effect (A, B, C) on subsequent calorie intake (kcal). However, I want to see the result separately for each body weight status (NW=normal weight & OW=overweight). Please ignore the small sample size because this is just an example.

In SPSS, we can "split file" and then get results for both NW and OW separately in any analysis conducted after that.

I am an absolute beginner with SAS and have never edited code. I just import Excel files for each analysis and select some commands. Therefore, I'd appreciate it if you could explain in a beginner-friendly way if it involves code!

TIA

PaigeMiller · Posted 01-26-2018 03:18 PM

First, there's no need to split anything here ... no need to split the dataset ... no need to split the analysis; in fact, splitting the analysis would be the wrong thing to do from a statistical point of view.

proc glm;
    class weight treatment;
    model kcal=weight|treatment;
run;
quit;

Optionally you may want an LSMEANS statement.
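
For illustration, here is a minimal sketch of what that could look like. It assumes the example data has been read into a dataset named have (the name is an assumption here) and that Tukey-adjusted pairwise comparisons are wanted; neither is stated above.

```sas
/* Sketch only: dataset name "have" and the Tukey adjustment are assumptions */
proc glm data=have;
    class weight treatment;
    model kcal = weight|treatment;   /* weight, treatment, and their interaction */
    /* least-squares means for each weight-by-treatment cell,
       with p-values for all pairwise differences */
    lsmeans weight*treatment / pdiff adjust=tukey;
run;
quit;
```

The code goes into a program window (the "Code" tab in SAS Studio or Enterprise Guide) and is run as a whole.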

--
Paige Miller

madpumpkinpie · Posted 01-26-2018 05:52 PM

Thank you for replying!

I understand what you mean, but then how can I compare the effect of treatment on food intake between NW and OW participants? I ask because I found a significant treatment effect in the NW group (n=41) but not in the OW group (n=12) when I ran a one-way repeated measures ANOVA for each body weight group.

And what is this code for? Where should I put it?

proc glm;
    class weight treatment;
    model kcal=weight|treatment;
run;
quit;

Thank you,

novinosrin · Posted 01-26-2018 03:39 PM

@PaigeMiller a class act.

@madpumpkinpie just in case you need to split a dataset for other purposes in the future:

data have;
input ID $  Weight $ Treatment $ kcal;
datalines;
1    NW    A    400
2    NW    A    500
3    OW    A    560
4    NW    A    800
5    OW    A    490
6    NW    A    500
7    OW    A    400
8    OW    A    700
9    NW    A    900
1    NW    B    580
2    NW    B    600
3    OW    B    800
4    NW    B    500
5    OW    B    600
6    NW    B    800
7    OW    B    700
8    OW    B    500
9    NW    B    780
1    NW    C    570
2    NW    C    670
3    OW    C    570
4    NW    C    400
5    OW    C    600
6    NW    C    800
7    OW    C    800
8    OW    C    500
9    NW    C    800
;
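proc sort data=have;
by weight;
run;
/* Write one dataset per Weight value (NW, OW); the input must be sorted by weight */
data _null_;
if _n_=1 then do;
  if 0 then set have;  /* define the host variables without reading any data */
  dcl hash h(dataset:'have(obs=0)',multidata:'y');
  h.definekey('Weight');
  h.definedata(all:'y');
  h.definedone();
end;
set have;
by weight;
if first.weight then h.clear();  /* start a fresh group */
h.add();
if last.weight then h.output(dataset: weight);  /* dataset is named after the group */
run;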
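
For readers new to hash objects, the same split can also be sketched with a plain DATA step. Unlike the hash version, this one hard-codes the two group names, so it only fits data where the levels of Weight are known in advance (an assumption that holds for the example, NW and OW).

```sas
/* Sketch only: hard-codes the two Weight levels from the example data */
data NW OW;
    set have;
    if weight = 'NW' then output NW;        /* normal-weight rows */
    else if weight = 'OW' then output OW;   /* overweight rows */
run;
```

No sort is needed here, because each row is routed to its dataset as it is read.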

madpumpkinpie · Posted 01-26-2018 05:56 PM

Thank you for replying!

Does this code include both the command for splitting file and running ANOVA?

novinosrin · Posted 01-26-2018 05:59 PM

No ANOVA. Funny enough, I have just enrolled in statistics courses as a full-time student 🙂 The code just splits the data. Sorry about that.

PaigeMiller · Posted 01-27-2018 10:13 AM

@madpumpkinpie wrote:

Thank you for replying!

Does this code include both the command for splitting file and running ANOVA?

No, as I said, there is no need to do any splitting when you do an ANOVA.

--
Paige Miller

madpumpkinpie · Posted 01-27-2018 10:32 AM

Hi there,

I'm sorry, but I don't get why there is no need to split... :( Could you explain?

Then if I want to see the separate result of the effect of treatment on calorie intake for different weight status, may I conduct one-way ANOVA twice (one for NW group and the other one for OW group)?

Thanks,

PaigeMiller · Posted 01-27-2018 12:51 PM

You conduct one ANOVA, where the effect of treatment and the effect of weight are both in the model. Splitting the data and conducting an ANOVA for OW and another ANOVA for NW is the wrong thing to do, statistically.

From this one analysis, you can determine the effect of treatment for the OW case, and a different effect of treatment for the NW case.
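
One way to read those per-group treatment effects off the single model is the SLICE= option of the LSMEANS statement. Treat this as a sketch under assumptions, not the definitive analysis: it assumes the data are in a dataset named have, and it uses the same model as above, which has no term for the repeated measurements on each ID.

```sas
/* Sketch only: dataset name "have" is an assumption */
proc glm data=have;
    class weight treatment;
    model kcal = weight|treatment;
    /* SLICE=weight tests the treatment effect separately within NW and within OW,
       using the error term from the full model */
    lsmeans weight*treatment / slice=weight;
run;
quit;
```

The output then contains one F-test of treatment for each level of weight, alongside the overall test of the weight*treatment interaction.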

--
Paige Miller

art297 · Posted 01-27-2018 07:24 PM

@madpumpkinpie: Just wanted to point out a couple of things. First, I agree with @PaigeMiller, if you're doing the analysis to test hypotheses, the correct method is to use a single analysis.

However, so that you gain a better understanding of how SAS works, you could still get exactly what you asked for without having to "split" the data.

e.g., @PaigeMiller recommended:

proc glm data=have;
    class weight treatment;
    model kcal=weight|treatment;
run;

To do the same thing, separately for each level of treatment, one could use:

proc sort data=have;
  by treatment;
run;

proc glm data=have;
    class weight;
    by treatment;
    model kcal=weight;
run;
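
For readers coming from SPSS, BY-group processing is the closest counterpart of "split file": sort once, then add a BY statement to the procedure. Here is a minimal sketch using descriptive statistics rather than a model. It assumes the dataset have from earlier in the thread; have_sorted is a made-up name.

```sas
/* Sketch only: "have_sorted" is a hypothetical output dataset name */
proc sort data=have out=have_sorted;
    by weight;
run;

/* One block of output per weight group (NW, OW), as with SPSS "split file" */
proc means data=have_sorted n mean std;
    by weight;
    class treatment;
    var kcal;
run;
```

The BY statement works the same way in most procedures, provided the data are sorted by the BY variable first.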

@novinosrin: While splitting the data isn't needed to solve the problem, this question was cross-posted on SAS-L, where @SAShole (i.e., Paul Dorfman) pointed out that no sort is needed when using the hash object to split such a file. I ran a test comparing your method with the one that Paul suggested, and I have to agree with him. He suggested:

data _null_ ; 
  dcl hash x() ; 
  x.defineKey ('weight') ; 
  x.defineData ('weight','h') ; 
  x.defineDone () ; 
  dcl hash h ; 
  do until (z) ; 
    set have end = z ; 
    if x.find() ne 0 then do ; 
      h = _new_ hash (dataset:'have(obs=0)', multidata:'y') ;
      h.defineKey ('weight') ; 
      h.defineData (all:'y') ; 
      h.defineDone () ; 
      x.add() ; 
    end ; 
    h.add() ; 
  end ; 
  dcl hiter i('x') ; 
  do while (i.next()=0) ; 
    h.output (dataset: weight) ; 
  end ; 
  stop ; 
run ;

Your solution would only be (very) slightly faster IF the data were already sorted.

Art, CEO, AnalystFinder.com

novinosrin · Posted 01-27-2018 07:52 PM

Thank you @art297 for filling me in on such amazing discussions. It's really a privilege, although it makes me very nervous to participate when you champs and your contemporaries do.

Anyway, I have tried another 9.4 method that avoids the sort:

data have;
input ID $  Weight $ Treatment $ kcal;
datalines;
1    NW    A    400
2    NW    A    500
3    OW    A    560
4    NW    A    800
5    OW    A    490
6    NW    A    500
7    OW    A    400
8    OW    A    700
9    NW    A    900
1    NW    B    580
2    NW    B    600
3    OW    B    800
4    NW    B    500
5    OW    B    600
6    NW    B    800
7    OW    B    700
8    OW    B    500
9    NW    B    780
1    NW    C    570
2    NW    C    670
3    OW    C    570
4    NW    C    400
5    OW    C    600
6    NW    C    800
7    OW    C    800
8    OW    C    500
9    NW    C    800
;


data _null_;
if _n_=1 then do;
  if 0 then set have;
  /* h: all rows keyed by Weight; h1: distinct Weight values; h2: scratch table for one group */
  dcl hash h(dataset:'have',multidata:'y');
  h.definekey('Weight');
  h.definedata(all:'y');
  h.definedone();
  dcl hash h1(dataset:'have',duplicate:'r');
  h1.definekey('Weight');
  h1.definedata('weight');
  h1.definedone();
  dcl hiter i('h1');
  dcl hash h2(dataset:'have(obs=0)',multidata:'y');
  h2.definekey('Weight');
  h2.definedata(all:'y');
  h2.definedone();
end;
rc = i.first();
do while (rc = 0);
  h2.clear();
  /* copy every row of the current Weight value into h2 */
  do while(h.do_over(key:weight) eq 0);
    h2.add();
  end;
  h2.output(dataset:weight);
  rc = i.next();
end;
stop;  /* no SET statement ever executes, so end the step explicitly */
run;

PaigeMiller · Posted 01-28-2018 07:17 AM

@art297 wrote:

@madpumpkinpie: Just wanted to point out a couple of things. First, I agree with @PaigeMiller, if you're doing the analysis to test hypotheses, the correct method is to use a single analysis.

However, so that you gain a better understanding of how SAS works, you could still get exactly what you asked for without having to "split" the data.

e.g., @PaigeMiller recommended:
proc glm data=have;
    class weight treatment;
    model kcal=weight|treatment;
run;
To do the same thing, separately for each level of treatment, one could use:
proc sort data=have;
  by treatment;
run;

proc glm data=have;
    class weight;
    by treatment;
    model kcal=weight;
run;

Well, I feel that I should point out that this is NOT the same thing. The F-tests will be different: they are correct if you don't split the analysis using BY groups, and incorrect if you do.

--
Paige Miller

art297 · Posted 01-28-2018 09:41 AM

@PaigeMiller: Poor choice of words on my part. I didn't mean to imply that the BY-group analyses were either correct or supplied the same result. I'm well familiar with the effects (particularly on alpha) of doing multiple tests.

My post was simply to address the question that @madpumpkinpie originally asked.

Conversely, if one isn't doing hypothesis testing, but rather only data snooping (however frowned upon that may be), the approach does exist.

Art, CEO, AnalystFinder.com
