Re: How to "Split Data" (By Group Processing)

madpumpkinpie · Posted 01-26-2018 03:15 PM

Hi there,

I have googled similar questions but could not find answers that I can understand. So here I am asking for your help!

What I want to know is how to get results separately in a group.

Example (just copied and pasted from Excel)

ID Weight Treatment kcal
1   NW   A   400
2   NW   A   500
3   OW   A   560
4   NW   A   800
5   OW   A   490
6   NW   A   500
7   OW   A   400
8   OW   A   700
9   NW   A   900
1   NW   B   580
2   NW   B   600
3   OW   B   800
4   NW   B   500
5   OW   B   600
6   NW   B   800
7   OW   B   700
8   OW   B   500
9   NW   B   780
1   NW   C   570
2   NW   C   670
3   OW   C   570
4   NW   C   400
5   OW   C   600
6   NW   C   800
7   OW   C   800
8   OW   C   500
9   NW   C   800

In this example, I am going to run one-way repeated measures ANOVA to see if there is a treatment effect (A, B, C) on subsequent calorie intake (kcal). However, I want to see the result separately for each body weight status (NW=normal weight & OW=overweight). Please ignore the small sample size because this is just an example.

In SPSS, we can "split file" and then get results for both NW and OW separately in any analysis conducted after that.

I am an absolute beginner with SAS and have never edited the code. I just import Excel files for each analysis and select some commands. Therefore, I'd appreciate it if you could explain in an easy-to-follow way if it involves code!

TIA

PaigeMiller · Posted 01-26-2018 03:18 PM

First, there's no need to split anything here ... no need to split the dataset ... no need to split the analysis; in fact, splitting the analysis would be the wrong thing to do from a statistical point of view.

proc glm;
    class weight treatment;
    model kcal=weight|treatment;
run;
quit;

Optionally you may want an LSMEANS statement.

--
Paige Miller

madpumpkinpie · Posted 01-26-2018 05:52 PM

Thank you for replying!

I understand what you mean, but then how can I compare the effect of treatment on food intake between NW and OW participants? I ask because I found a significant treatment effect in the NW group (n=41) but not in the OW group (n=12) when I ran a one-way repeated measures ANOVA for each body weight group.

And what is this code for? Where should I put it?

proc glm;
    class weight treatment;
    model kcal=weight|treatment;
run;
quit;

Thank you,

novinosrin · Posted 01-26-2018 03:39 PM

@PaigeMiller a class act.

@madpumpkinpie just in case you need to split the data for other purposes in the future:

data have;
input ID $  Weight $ Treatment $ kcal;
datalines;
1    NW    A    400
2    NW    A    500
3    OW    A    560
4    NW    A    800
5    OW    A    490
6    NW    A    500
7    OW    A    400
8    OW    A    700
9    NW    A    900
1    NW    B    580
2    NW    B    600
3    OW    B    800
4    NW    B    500
5    OW    B    600
6    NW    B    800
7    OW    B    700
8    OW    B    500
9    NW    B    780
1    NW    C    570
2    NW    C    670
3    OW    C    570
4    NW    C    400
5    OW    C    600
6    NW    C    800
7    OW    C    800
8    OW    C    500
9    NW    C    800
;
proc sort data=have;
  by weight;
run;
data _null_;
  if _n_=1 then do;
    if 0 then set have;   /* never executes; only defines the variables for the hash */
    dcl hash h(dataset:'have(obs=0)',multidata:'y');
    h.definekey('Weight');
    h.definedata(all:'y');
    h.definedone();
  end;
  set have;
  by weight;
  if first.weight then h.clear();   /* start a new group */
  h.add();
  if last.weight then h.output(dataset: weight);   /* write the group to a dataset named NW or OW */
run;
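
For readers newer to SAS, the same split can also be sketched with an ordinary DATA step. This assumes the two weight codes (NW and OW) are known in advance and that the data are in the dataset have created above; the output dataset names NW and OW are chosen to match what the hash code produces.

```sas
/* Sketch: split HAVE into one dataset per weight group.
   Assumes Weight only takes the values 'NW' and 'OW'. */
data NW OW;
  set have;
  if weight = 'NW' then output NW;        /* normal-weight rows */
  else if weight = 'OW' then output OW;   /* overweight rows */
run;
```

Unlike the hash version, this needs the group values to be hard-coded, but it needs no sort.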

madpumpkinpie · Posted 01-26-2018 05:56 PM

Thank you for replying!

Does this code include both the command for splitting file and running ANOVA?

novinosrin · Posted 01-26-2018 05:59 PM

Nope, no ANOVA. Funnily enough, I have just enrolled in statistics courses as a full-time student 🙂 The code just splits the data. Sorry about that.

PaigeMiller · Posted 01-27-2018 10:13 AM

@madpumpkinpie wrote:

Thank you for replying!

Does this code include both the command for splitting file and running ANOVA?

No, as I said, there is no need to do any splitting when you do an ANOVA.

--
Paige Miller

madpumpkinpie · Posted 01-27-2018 10:32 AM

Hi there,

I'm sorry, but I don't understand why there is no need to split... :( Could you explain?

Then if I want to see the separate result of the effect of treatment on calorie intake for different weight status, may I conduct one-way ANOVA twice (one for NW group and the other one for OW group)?

Thanks,

PaigeMiller · Posted 01-27-2018 12:51 PM

You conduct one ANOVA, where the effect of treatment and the effect of weight are both in the model. Splitting the data and conducting an ANOVA for OW and another ANOVA for NW is the wrong thing to do, statistically.

From this one analysis, you can determine the effect of treatment for the OW case, and a different effect of treatment for the NW case.
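
One way to get those per-group treatment tests from the single model is the SLICE= option of the LSMEANS statement. A minimal sketch, assuming the data are in a dataset named have (as in the DATA step posted earlier in the thread) and keeping the model from this thread as-is:

```sas
/* Sketch: one model, then treatment tested within each weight level.
   Assumes the dataset is named HAVE. */
proc glm data=have;
    class weight treatment;
    model kcal = weight|treatment;
    /* SLICE=weight tests the treatment effect separately at NW and at OW,
       using the error term from the single model */
    lsmeans weight*treatment / slice=weight;
run;
quit;
```

The slice output reports one F-test of treatment at NW and one at OW, each based on the error term of the full model rather than on a separate fit.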

--
Paige Miller

art297 · Posted 01-27-2018 07:24 PM

@madpumpkinpie: Just wanted to point out a couple of things. First, I agree with @PaigeMiller, if you're doing the analysis to test hypotheses, the correct method is to use a single analysis.

However, so that you gain a better understanding of how SAS works, you could still get exactly what you asked for without having to "split" the data.

e.g., @PaigeMiller recommended:

proc glm data=have;
    class weight treatment;
    model kcal=weight|treatment;
run;

To do the same thing, separately for each level of treatment, one could use:

proc sort data=have;
  by treatment;
run;

proc glm data=have;
    class weight;
    by treatment;
    model kcal=weight;
run;
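
The closest analogue to SPSS's "split file" for the original question would put weight, rather than treatment, in the BY statement. This is a sketch only, with have_by_weight as a made-up name for the sorted copy; as noted earlier in the thread, the resulting F-tests are not the ones to rely on for hypothesis testing.

```sas
/* Sketch: BY-group processing, the SAS counterpart of SPSS "split file".
   HAVE_BY_WEIGHT is a hypothetical name for the sorted copy of HAVE. */
proc sort data=have out=have_by_weight;
    by weight;
run;

proc glm data=have_by_weight;
    by weight;                 /* one separate analysis per weight group */
    class treatment;
    model kcal = treatment;
run;
quit;
```

BY-group processing requires the input to be sorted (or indexed) by the BY variable, which is why the PROC SORT step comes first.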

@novinosrin: While splitting the data isn't needed to solve the problem, this question was cross-posted on SAS-L, where @SAShole (i.e., Paul Dorfman) pointed out that no sort is needed when using the hash object to split such a file. I ran a test comparing your method with the one that Paul suggested, and I totally have to agree with him. He suggested:

data _null_ ; 
  dcl hash x() ; 
  x.defineKey ('weight') ; 
  x.defineData ('weight','h') ; 
  x.defineDone () ; 
  dcl hash h ; 
  do until (z) ; 
    set have end = z ; 
    if x.find() ne 0 then do ; 
      h = _new_ hash (dataset:'have(obs=0)', multidata:'y') ;
      h.defineKey ('weight') ; 
      h.defineData (all:'y') ; 
      h.defineDone () ; 
      x.add() ; 
    end ; 
    h.add() ; 
  end ; 
  dcl hiter i('x') ; 
  do while (i.next()=0) ; 
    h.output (dataset: weight) ; 
  end ; 
  stop ; 
run ;

Your solution would only be (very) slightly faster IF the data were already sorted.

Art, CEO, AnalystFinder.com

novinosrin · Posted 01-27-2018 07:52 PM

Thank you @art297 for filling me in on such amazing discussions. It's really a privilege, although it makes me very nervous to participate when you champs and your contemporaries do.

Anyway, I have tried another method (SAS 9.4) that avoids the sort:

data have;
input ID $  Weight $ Treatment $ kcal;
datalines;
1    NW    A    400
2    NW    A    500
3    OW    A    560
4    NW    A    800
5    OW    A    490
6    NW    A    500
7    OW    A    400
8    OW    A    700
9    NW    A    900
1    NW    B    580
2    NW    B    600
3    OW    B    800
4    NW    B    500
5    OW    B    600
6    NW    B    800
7    OW    B    700
8    OW    B    500
9    NW    B    780
1    NW    C    570
2    NW    C    670
3    OW    C    570
4    NW    C    400
5    OW    C    600
6    NW    C    800
7    OW    C    800
8    OW    C    500
9    NW    C    800
;


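data _null_;
  if _n_=1 then do;
    if 0 then set have;   /* never executes; only defines the variables for the hashes */
    /* h: every row of HAVE, keyed by Weight */
    dcl hash h(dataset:'have',multidata:'y');
    h.definekey('Weight');
    h.definedata(all:'y');
    h.definedone();
    /* h1: one entry per distinct Weight value */
    dcl hash h1(dataset:'have',duplicate:'r');
    h1.definekey('Weight');
    h1.definedata('weight');
    h1.definedone();
    dcl hiter i('h1');
    /* h2: empty work table that holds one group at a time */
    dcl hash h2(dataset:'have(obs=0)',multidata:'y');
    h2.definekey('Weight');
    h2.definedata(all:'y');
    h2.definedone();
  end;
  rc = i.first();
  do while (rc = 0);
    h2.clear();
    /* copy every row of the current Weight group from h into h2 */
    do while (h.do_over(key:weight) eq 0);
      h2.add();
    end;
    h2.output(dataset:weight);   /* dataset named after the group: NW or OW */
    rc = i.next();
  end;
  stop;   /* no SET statement executes, so end the step explicitly */
run;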

PaigeMiller · Posted 01-28-2018 07:17 AM

@art297 wrote:

@madpumpkinpie: Just wanted to point out a couple of things. First, I agree with @PaigeMiller, if you're doing the analysis to test hypotheses, the correct method is to use a single analysis.

However, so that you gain a better understanding of how SAS works, you could still get exactly what you asked for without having to "split" the data.

e.g., @PaigeMiller recommended:
proc glm data=have;
    class weight treatment;
    model kcal=weight|treatment;
run;
To do the same thing, separately for each level of treatment, one could use:
proc sort data=have;
  by treatment;
run;

proc glm data=have;
    class weight;
    by treatment;
    model kcal=weight;
run;

Well, I feel that I should point out that this is NOT the same thing. The F-tests will be different: they are correct if you don't split the analysis using BY groups, and incorrect if you do.

--
Paige Miller

art297 · Posted 01-28-2018 09:41 AM

@PaigeMiller: Poor choice of words on my part. I didn't mean to imply that the BY-group analyses were either correct or supplied the same result. I'm well familiar with the effects (particularly on alpha) of doing multiple tests.

My post was simply to address the question that @madpumpkinpie originally asked.

Conversely, if one isn't doing hypothesis testing, but rather only data snooping (however frowned upon that may be), the approach does exist.

Art, CEO, AnalystFinder.com

hashman · Posted 04-29-2024 12:47 AM

Art,

Just need to point out that this method was invented by Richard A. DeVenezia. In fact, the whole idea of storing hash object references in another hash object came from his SAS-L post following SUGI in Montreal, where Koen Vyverman and I did the first user-created SUGI presentation on the SAS hash object unveiled by SAS in 2003. During the aftertalk Q&A someone (sorry, don't remember who) asked if a hash object could contain other hash objects as data. I did not think so - and replied accordingly.

A few days after the conference, Richard posted a message on SAS-L in which he demonstrated not only that HOH (hash-of-hashes) was possible but also how to use it in conjunction with the hash iterator to split an unsorted file. Basically, with this post alone, Richard laid the foundation of HOH and gave a practical example sufficient for anyone with a couple of hash crevices to expand its usage way beyond the task of file splitting. Later on, HOH became a major part of the hash object book Don Henderson and I published in 2018.

art297 · Posted 04-29-2024 08:20 AM

Paul, Nice to hear from you but, while my name is attached to it, I don't think I wrote the post to which you were replying. I never got into incorporating hash into my code like you, Don, Richard and Kilovolt did. However, as I mentioned in response to another post in this thread, I haven't used SAS in years and am simply enjoying retirement. On a different note, I think that Montreal was the first global forum I ever attended. You bring back a bunch of memories.
