Solved: calculating mean by id and year

zihdonv19 · Posted 02-01-2024 11:28 AM

I have a dataset like this:

data your_dataset;
    input id year weight seq;
    datalines;
1 2000 12 0
1 2005 14 1
1 2006 23 2
2 2001 12 0
2 2001 45 1
2 2001 42 2
2 2002 75 3
2 2002 45 4
3 2003 97 0
3 2004 90 1
4 2010 56 0
4 2011 38 1
4 2012 91 2
4 2012 31 3
4 2014 16 4
5 2013 46 0
5 2013 46 1
5 2014 55 2
5 2015 64 3
5 2016 26 4
5 2017 36 5
5 2018 87 6
5 2019 46 7
;
run;

This is what I would like to get:

data your_dataset;
    input id time0 time1 time2 time3 time4 time5 time6;
    datalines;
1 12 14 23 . . . .
2 33 60 . . . . .
3 97 90 . . . . .
4 56 38 61 16 . . .
5 46 55 64 26 36 87 46
;
run;

The rule is:

for people with same id and year, I only want the mean weight of this year. For example, for id2 with year 2001, the value in the desired table is 33, because (12+45+42)/3=33. Due to 2001 is the year that id2 had the first weight record, it is located in the colomn "time0" in the desired table.

PaigeMiller · Posted 02-01-2024 12:25 PM

First, thankk you for providing the data as working SAS data step code. You make our job much easier, and thus you get a faster and more correct answer. Along those lines, there is no point in attaching data in a Microsoft Office document, as many of us will not download this attachment as it could be a security threat.

The first step is to create a sequence number from the original data. This is done in data set HAVE2. Once that is done, PROC REPORT does what PROC REPORT does.

data have;
    input id year weight seq;
    datalines;
1 2000 12 0
1 2005 14 1
1 2006 23 2
2 2001 12 0
2 2001 45 1
2 2001 42 2
2 2002 75 3
2 2002 45 4
3 2003 97 0
3 2004 90 1
4 2010 56 0
4 2011 38 1
4 2012 91 2
4 2012 31 3
4 2014 16 4
5 2013 46 0
5 2013 46 1
5 2014 55 2
5 2015 64 3
5 2016 26 4
5 2017 36 5
5 2018 87 6
5 2019 46 7
;

data have2;
    set have;
    by id year;
    if first.id then sequence=0;
    if first.year and not first.id then sequence+1;
run;

proc report data=have2;
    columns id sequence,weight;
    define id/group;
    define sequence/across 'Time';
    define weight/mean;
run;

--
Paige Miller

View solution in original post

ballardw · Posted 02-01-2024 12:20 PM

A big question is do you actually need a data set? A data set in that "wide" format is hard to use for almost any purpose.

But a Report is actually fairly easy, and easier to understand.

One quick example:

proc tabulate data= your_dataset;
   class id year;
   var weight;
   table id,
         year*weight*mean
   ;
run;

Personally I find your use of Time0 etc very confusing as shown in your desired set as it completely hides details like for Id=1 there is a multiyear gap between time0 and time1.

There are lots of appearance option changes that can be made to the tabulate code, or alternately Proc Report will do similar.

You would have to convince me that there is a real advantage to that desired data set before actually making one.

zihdonv19 · Posted 02-05-2024 10:25 AM

Hi!

Thank you for your reply.

The reason that I want a wide format is that I need to use proc traj to do trajectory analysis which requires this kind of dataset. In my real data, I only have data for 5 years, so even using wide format my desired dataset won't look too messy..i guess

Tom · Posted 02-05-2024 10:56 AM

Once you have the sequence variable you can use it with the ID statement of PROC TRANSPOSE. Use the PREFIX= option to set the base name of the new variables.

proc transpose data=have out=want(drop=_name_) prefix=YEAR ;
  by id;
  id sequence ;
  var xxx ;
run;

zihdonv19 · Posted 02-05-2024 11:17 AM

Thanks for your additional comments. but it gives error "ERROR: The ID value "YEAR0" occurs twice in the same BY group." I guess it's because for some person, they have multiple records in the same year?

PaigeMiller · Posted 02-01-2024 12:25 PM

First, thankk you for providing the data as working SAS data step code. You make our job much easier, and thus you get a faster and more correct answer. Along those lines, there is no point in attaching data in a Microsoft Office document, as many of us will not download this attachment as it could be a security threat.

The first step is to create a sequence number from the original data. This is done in data set HAVE2. Once that is done, PROC REPORT does what PROC REPORT does.

data have;
    input id year weight seq;
    datalines;
1 2000 12 0
1 2005 14 1
1 2006 23 2
2 2001 12 0
2 2001 45 1
2 2001 42 2
2 2002 75 3
2 2002 45 4
3 2003 97 0
3 2004 90 1
4 2010 56 0
4 2011 38 1
4 2012 91 2
4 2012 31 3
4 2014 16 4
5 2013 46 0
5 2013 46 1
5 2014 55 2
5 2015 64 3
5 2016 26 4
5 2017 36 5
5 2018 87 6
5 2019 46 7
;

data have2;
    set have;
    by id year;
    if first.id then sequence=0;
    if first.year and not first.id then sequence+1;
run;

proc report data=have2;
    columns id sequence,weight;
    define id/group;
    define sequence/across 'Time';
    define weight/mean;
run;

--
Paige Miller

Tom · Posted 02-01-2024 07:03 PM

It might be better to number by the number of years since the first year for that ID.

data have2;
  set have;
  by id year;
  if first.id then baseyr = year;
  retain baseyr ;
  sequence = year - baseyr;
run;

proc report data=have2;
  columns id baseyr weight,sequence;
  define id/group;
  define baseyr / group;
  define sequence/across 'Time';
  define weight/mean ' ';
run;

                                                  Time
id     baseyr          0          1          2          3          4          5          6
 1       2000         12          .          .          .          .         14         23
 2       2001         33         60          .          .          .          .          .
 3       2003         97         90          .          .          .          .          .
 4       2010         56         38         61          .         16          .          .
 5       2013         46         55         64         26         36         87         46

calculating mean by id and year

Re: calculating mean by id and year

Re: calculating mean by id and year

Re: calculating mean by id and year

Re: calculating mean by id and year

Re: calculating mean by id and year

Re: calculating mean by id and year

Re: calculating mean by id and year

Classroom Training Available!