DATA Step, Macro, Functions and more

How to create a new variable based on the max value of a repeated measurement for each a subject

Accepted Solution Solved
Reply
Contributor
Posts: 51
Accepted Solution

How to create a new variable based on the max value of a repeated measurement for each a subject

Each subject in my dataset potentially has repeat measurements (where Period>1) and I am trying to figure out who has repeat measurements and how many measurements total each subject has. Some sample code is below (the dates are not reading in properly for some reason but I think the code still works for the purposes of this question):

 

data temporary;

INPUT 

studyid$ Period StartDate MMDDYY10. GI_C ;

format Startdate yymmdd10.;

datalines;

097531  1         02/01/2007        0

222482  1         06/02/2012        0

092203  1         05/07/2010        1

090624  1         06/16/2007        0

118646  1         04/28/2007        0

093507  1         03/16/2007        0

080279  1         04/18/2007        0

1045110  1        05/11/2007        0

1633011  1        08/01/2007        0

1633011  2        02/21/2008        0

0607212  1        02/06/2008        1

1932914  1        01/31/2007        0

1932914  2        02/11/2009        0

1932914  3        05/19/2009        0

1932914  4        09/22/2009        0

1932914  5        01/09/2010        0

1932914  6        02/23/2010        0

1932914  7        12/01/2010        0

1932914  8        05/10/2011        0

1932914  9        08/03/2013        0

;

run;

proc print data=temporary; run;

 

Subjects with only one measurement have only one row with Period=1. In the sample code above, while many people just have one period (Period=1) some have multiple periods such as Study ID 1633011 who has 2 periods and studyID 1932914 who has 9 periods. Is there a way to create a code that will identify and apply the highest value available in a column (the highest value of a variable) for a particular subject? I want to count how many periods are available for each person.  I would like to create a new variable (Periods_no) that tells you how many total periods that subject has overall, as follows:

 

studyid  Period   StartDate           GI_C  Periods_no

097531  1          02/01/2007         0                      1

222482  1          06/02/2012         0                      1

092203  1          05/07/2010         1                      1

090624  1          06/16/2007         0                      1

118646  1          04/28/2007         0                      1

093507  1          03/16/2007         0                      1

080279  1          04/18/2007         0                      1

1045110 1          05/11/2007         0                      1

1633011 1          08/01/2007         0                      2

1633011 2          02/21/2008         0                      2

0607212 1          02/06/2008         1                      1

1932914 1          01/31/2007         0                      9

1932914 2          02/11/2009         0                      9

1932914 3          05/19/2009         0                      9

1932914 4          09/22/2009         0                      9

1932914 5          01/09/2010         0                      9

1932914 6          02/23/2010         0                      9

1932914 7          12/01/2010         0                      9

1932914 8          05/10/2011         0                      9

1932914 9          08/03/2013         0                      9

 

Thank you very much in advance.

 


Accepted Solutions
Solution
‎02-07-2017 11:57 AM
SAS Employee
Posts: 20

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

How about this:

data temporary;
	INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
	format Startdate yymmdd10.;
datalines;
097531  1         02/01/2007        0
222482  1         06/02/2012        0
092203  1         05/07/2010        1
090624  1         06/16/2007        0
118646  1         04/28/2007        0
093507  1         03/16/2007        0
080279  1         04/18/2007        0
1045110  1        05/11/2007        0
1633011  1        08/01/2007        0
1633011  2        02/21/2008        0
0607212  1        02/06/2008        1
1932914  1        01/31/2007        0
1932914  2        02/11/2009        0
1932914  3        05/19/2009        0
1932914  4        09/22/2009        0
1932914  5        01/09/2010        0
1932914  6        02/23/2010        0
1932914  7        12/01/2010        0
1932914  8        05/10/2011        0
1932914  9        08/03/2013        0
;
run;
proc sql;
  create table temporary2 as
	  select a.*,
		    count(*) as Periods_no,
				max(StartDate) as MaxStartDate format=yymmdd10.
		from temporary a
		group by studyid 
    order by studyid, StartDate;
quit;

proc print data=temporary2; 
run;

I have added the highest StartDate within each StudyID, anticipating your next question Smiley Happy

 

View solution in original post


All Replies
Super User
Super User
Posts: 7,942

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Why is subject number given column name studyid?  Do you need the count in the data you have presented, or just as a summary?  If its just a sequential count, then maybe:

proc sql;
  create table WANT as
  select A.*,
         B.COUNT
  from   HAVE A
  left join (select count(*) as COUNT from HAVE group by STUDYID) B
  on     A.STUDYID=B.STUDYID;
quit;
Solution
‎02-07-2017 11:57 AM
SAS Employee
Posts: 20

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

How about this:

data temporary;
	INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
	format Startdate yymmdd10.;
datalines;
097531  1         02/01/2007        0
222482  1         06/02/2012        0
092203  1         05/07/2010        1
090624  1         06/16/2007        0
118646  1         04/28/2007        0
093507  1         03/16/2007        0
080279  1         04/18/2007        0
1045110  1        05/11/2007        0
1633011  1        08/01/2007        0
1633011  2        02/21/2008        0
0607212  1        02/06/2008        1
1932914  1        01/31/2007        0
1932914  2        02/11/2009        0
1932914  3        05/19/2009        0
1932914  4        09/22/2009        0
1932914  5        01/09/2010        0
1932914  6        02/23/2010        0
1932914  7        12/01/2010        0
1932914  8        05/10/2011        0
1932914  9        08/03/2013        0
;
run;
proc sql;
  create table temporary2 as
	  select a.*,
		    count(*) as Periods_no,
				max(StartDate) as MaxStartDate format=yymmdd10.
		from temporary a
		group by studyid 
    order by studyid, StartDate;
quit;

proc print data=temporary2; 
run;

I have added the highest StartDate within each StudyID, anticipating your next question Smiley Happy

 

Valued Guide
Posts: 505

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Posted in reply to MichaelLarsen
Adding the max date and count of dates to each observation for groups of studyids

see
https://goo.gl/010WH7
https://communities.sas.com/t5/Base-SAS-Programming/How-to-create-a-new-variable-based-on-the-max-value-of-a/m-p/330453

Assumes that Studyids are grouped.
This also maintains the order if that is important.
Should be the fastest because of 'locality of reference' and
cacheing.


HAVE
====

Obs    STUDYID    PERIOD     STARTDATE    GI_C

  1    097531        1      2007-02-01      0
  2    222482        1      2012-06-02      0
  3    092203        1      2010-05-07      1
  4    090624        1      2007-06-16      0
  5    118646        1      2007-04-28      0
  6    093507        1      2007-03-16      0
  7    080279        1      2007-04-18      0
  8    1045110       1      2007-05-11      0

  9    1633011       1      2007-08-01      0
 10    1633011       2      2008-02-21      0

 11    0607212       1      2008-02-06      1

 12    1932914       1      2007-01-31      0
 13    1932914       2      2009-02-11      0
 14    1932914       3      2009-05-19      0
 15    1932914       4      2009-09-22      0
 16    1932914       5      2010-01-09      0
 17    1932914       6      2010-02-23      0
 18    1932914       7      2010-12-01      0
 19    1932914       8      2011-05-10      0
 20    1932914       9      2013-08-03      0


WANT

Obs    STUDYID    PERIOD     STARTDATE    GI_C    COUNT

  1    0607212       1      2008-02-06      1       1
  2    080279        1      2007-04-18      0       1
  3    090624        1      2007-06-16      0       1
  4    092203        1      2010-05-07      1       1
  5    093507        1      2007-03-16      0       1
  6    097531        1      2007-02-01      0       1
  7    1045110       1      2007-05-11      0       1
  8    118646        1      2007-04-28      0       1

  9    1633011       2      2008-02-21      0       2 2 of these
 10    1633011       1      2007-08-01      0       2

 11    1932914       3      2009-05-19      0       9  9 of these
 12    1932914       4      2009-09-22      0       9
 13    1932914       2      2009-02-11      0       9
 14    1932914       1      2007-01-31      0       9
 15    1932914       9      2013-08-03      0       9
 16    1932914       8      2011-05-10      0       9
 17    1932914       7      2010-12-01      0       9
 18    1932914       6      2010-02-23      0       9
 19    1932914       5      2010-01-09      0       9

 20    222482        1      2012-06-02      0       1

* create some data;
data have;
INPUT
studyid $ Period StartDate MMDDYY10. GI_C ;
format Startdate yymmdd10.;
cards4;
097531 1 02/01/2007 0
222482 1 06/02/2012 0
092203 1 05/07/2010 1
090624 1 06/16/2007 0
118646 1 04/28/2007 0
093507 1 03/16/2007 0
080279 1 04/18/2007 0
1045110 1 05/11/2007 0
1633011 1 08/01/2007 0
1633011 2 02/21/2008 0
0607212 1 02/06/2008 1
1932914 1 01/31/2007 0
1932914 2 02/11/2009 0
1932914 3 05/19/2009 0
1932914 4 09/22/2009 0
1932914 5 01/09/2010 0
1932914 6 02/23/2010 0
1932914 7 12/01/2010 0
1932914 8 05/10/2011 0
1932914 9 08/03/2013 0
;;;;
run;quit;


data want;
   retain maxdte cnt .;
   do until (last.studyid);
     set have;
     by studyid notsorted;
     cnt=sum(cnt,1);
     if startdate > maxdte then maxdte=startdate;
   end;
   do until (last.studyid);
     set have;
     by studyid notsorted;
     output;
   end;
   call missing(maxdte,cnt);
run;quit;


proc print data=want;
format startdate maxdte yymmdd10.;
run;quit;

Obs        MAXDTE    CNT    STUDYID    PERIOD     STARTDATE    GI_C

  1    2007-02-01     1     097531        1      2007-02-01      0
  2    2012-06-02     1     222482        1      2012-06-02      0
  3    2010-05-07     1     092203        1      2010-05-07      1
  4    2007-06-16     1     090624        1      2007-06-16      0
  5    2007-04-28     1     118646        1      2007-04-28      0
  6    2007-03-16     1     093507        1      2007-03-16      0
  7    2007-04-18     1     080279        1      2007-04-18      0
  8    2007-05-11     1     1045110       1      2007-05-11      0
  9    2008-02-21     2     1633011       1      2007-08-01      0
 10    2008-02-21     2     1633011       2      2008-02-21      0
 11    2008-02-06     1     0607212       1      2008-02-06      1
 12    2013-08-03     9     1932914       1      2007-01-31      0
 13    2013-08-03     9     1932914       2      2009-02-11      0
 14    2013-08-03     9     1932914       3      2009-05-19      0
 15    2013-08-03     9     1932914       4      2009-09-22      0
 16    2013-08-03     9     1932914       5      2010-01-09      0
 17    2013-08-03     9     1932914       6      2010-02-23      0
 18    2013-08-03     9     1932914       7      2010-12-01      0
 19    2013-08-03     9     1932914       8      2011-05-10      0
 20    2013-08-03     9     1932914       9      2013-08-03      0

Contributor
Posts: 51

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Posted in reply to MichaelLarsen
This code does EXACTLY what I needed it to do. Thank you so much!
Trusted Advisor
Posts: 1,019

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Your data appear to be grouped (but not sorted) by studyid.  And within studyid, it is sorted by period.  You can take advantage of that attribute with a "do until " loop with "set" and "by" statements inside the loop:

 

data want;
  do until (last.studyid);
    set temporary (rename=(period=periods_no));
    by studyid notsorted;
  end;
  do until (last.studyid);
    set temporary;
    by studyid notsorted;
    output;
  end;
run;

 

The first do group advances through a studyid to the final period.  With the PERIODS_NO in hand, the second do group re-reads the studyid and outputs each record.

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 233 views
  • 0 likes
  • 5 in conversation