Solved: How to create a new variable based on the max value of a repeated meas...

Kels123 · Posted 02-07-2017 08:07 AM

Each subject in my dataset potentially has repeat measurements (where Period>1) and I am trying to figure out who has repeat measurements and how many measurements total each subject has. Some sample code is below (the dates are not reading in properly for some reason but I think the code still works for the purposes of this question):

data temporary;

INPUT

studyid$ Period StartDate MMDDYY10. GI_C ;

format Startdate yymmdd10.;

datalines;

097531 1 02/01/2007 0

222482 1 06/02/2012 0

092203 1 05/07/2010 1

090624 1 06/16/2007 0

118646 1 04/28/2007 0

093507 1 03/16/2007 0

080279 1 04/18/2007 0

1045110 1 05/11/2007 0

1633011 1 08/01/2007 0

1633011 2 02/21/2008 0

0607212 1 02/06/2008 1

1932914 1 01/31/2007 0

1932914 2 02/11/2009 0

1932914 3 05/19/2009 0

1932914 4 09/22/2009 0

1932914 5 01/09/2010 0

1932914 6 02/23/2010 0

1932914 7 12/01/2010 0

1932914 8 05/10/2011 0

1932914 9 08/03/2013 0

;

run;

proc print data=temporary; run;

Subjects with only one measurement have only one row with Period=1. In the sample code above, while many people just have one period (Period=1) some have multiple periods such as Study ID 1633011 who has 2 periods and studyID 1932914 who has 9 periods. Is there a way to create a code that will identify and apply the highest value available in a column (the highest value of a variable) for a particular subject? I want to count how many periods are available for each person. I would like to create a new variable (Periods_no) that tells you how many total periods that subject has overall, as follows:

studyid Period StartDate GI_C Periods_no

097531 1 02/01/2007 0 1

222482 1 06/02/2012 0 1

092203 1 05/07/2010 1 1

090624 1 06/16/2007 0 1

118646 1 04/28/2007 0 1

093507 1 03/16/2007 0 1

080279 1 04/18/2007 0 1

1045110 1 05/11/2007 0 1

1633011 1 08/01/2007 0 2

1633011 2 02/21/2008 0 2

0607212 1 02/06/2008 1 1

1932914 1 01/31/2007 0 9

1932914 2 02/11/2009 0 9

1932914 3 05/19/2009 0 9

1932914 4 09/22/2009 0 9

1932914 5 01/09/2010 0 9

1932914 6 02/23/2010 0 9

1932914 7 12/01/2010 0 9

1932914 8 05/10/2011 0 9

1932914 9 08/03/2013 0 9

Thank you very much in advance.

MichaelLarsen · Posted 02-07-2017 08:25 AM

How about this:

data temporary;
	INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
	format Startdate yymmdd10.;
datalines;
097531  1         02/01/2007        0
222482  1         06/02/2012        0
092203  1         05/07/2010        1
090624  1         06/16/2007        0
118646  1         04/28/2007        0
093507  1         03/16/2007        0
080279  1         04/18/2007        0
1045110  1        05/11/2007        0
1633011  1        08/01/2007        0
1633011  2        02/21/2008        0
0607212  1        02/06/2008        1
1932914  1        01/31/2007        0
1932914  2        02/11/2009        0
1932914  3        05/19/2009        0
1932914  4        09/22/2009        0
1932914  5        01/09/2010        0
1932914  6        02/23/2010        0
1932914  7        12/01/2010        0
1932914  8        05/10/2011        0
1932914  9        08/03/2013        0
;
run;
proc sql;
  create table temporary2 as
	  select a.*,
		    count(*) as Periods_no,
				max(StartDate) as MaxStartDate format=yymmdd10.
		from temporary a
		group by studyid 
    order by studyid, StartDate;
quit;

proc print data=temporary2; 
run;

I have added the highest StartDate within each StudyID, anticipating your next question

View solution in original post

RW9 · Posted 02-07-2017 08:20 AM

Why is subject number given column name studyid? Do you need the count in the data you have presented, or just as a summary? If its just a sequential count, then maybe:

proc sql;
  create table WANT as
  select A.*,
         B.COUNT
  from   HAVE A
  left join (select count(*) as COUNT from HAVE group by STUDYID) B
  on     A.STUDYID=B.STUDYID;
quit;

MichaelLarsen · Posted 02-07-2017 08:25 AM

How about this:

data temporary;
	INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
	format Startdate yymmdd10.;
datalines;
097531  1         02/01/2007        0
222482  1         06/02/2012        0
092203  1         05/07/2010        1
090624  1         06/16/2007        0
118646  1         04/28/2007        0
093507  1         03/16/2007        0
080279  1         04/18/2007        0
1045110  1        05/11/2007        0
1633011  1        08/01/2007        0
1633011  2        02/21/2008        0
0607212  1        02/06/2008        1
1932914  1        01/31/2007        0
1932914  2        02/11/2009        0
1932914  3        05/19/2009        0
1932914  4        09/22/2009        0
1932914  5        01/09/2010        0
1932914  6        02/23/2010        0
1932914  7        12/01/2010        0
1932914  8        05/10/2011        0
1932914  9        08/03/2013        0
;
run;
proc sql;
  create table temporary2 as
	  select a.*,
		    count(*) as Periods_no,
				max(StartDate) as MaxStartDate format=yymmdd10.
		from temporary a
		group by studyid 
    order by studyid, StartDate;
quit;

proc print data=temporary2; 
run;

I have added the highest StartDate within each StudyID, anticipating your next question

rogerjdeangelis · Posted 02-07-2017 10:28 AM

Adding the max date and count of dates to each observation for groups of studyids

see
https://goo.gl/010WH7
https://communities.sas.com/t5/Base-SAS-Programming/How-to-create-a-new-variable-based-on-the-max-value-of-a/m-p/330453

Assumes that Studyids are grouped.
This also maintains the order if that is important.
Should be the fastest because of 'locality of reference' and
cacheing.


HAVE
====

Obs    STUDYID    PERIOD     STARTDATE    GI_C

  1    097531        1      2007-02-01      0
  2    222482        1      2012-06-02      0
  3    092203        1      2010-05-07      1
  4    090624        1      2007-06-16      0
  5    118646        1      2007-04-28      0
  6    093507        1      2007-03-16      0
  7    080279        1      2007-04-18      0
  8    1045110       1      2007-05-11      0

  9    1633011       1      2007-08-01      0
 10    1633011       2      2008-02-21      0

 11    0607212       1      2008-02-06      1

 12    1932914       1      2007-01-31      0
 13    1932914       2      2009-02-11      0
 14    1932914       3      2009-05-19      0
 15    1932914       4      2009-09-22      0
 16    1932914       5      2010-01-09      0
 17    1932914       6      2010-02-23      0
 18    1932914       7      2010-12-01      0
 19    1932914       8      2011-05-10      0
 20    1932914       9      2013-08-03      0


WANT

Obs    STUDYID    PERIOD     STARTDATE    GI_C    COUNT

  1    0607212       1      2008-02-06      1       1
  2    080279        1      2007-04-18      0       1
  3    090624        1      2007-06-16      0       1
  4    092203        1      2010-05-07      1       1
  5    093507        1      2007-03-16      0       1
  6    097531        1      2007-02-01      0       1
  7    1045110       1      2007-05-11      0       1
  8    118646        1      2007-04-28      0       1

  9    1633011       2      2008-02-21      0       2 2 of these
 10    1633011       1      2007-08-01      0       2

 11    1932914       3      2009-05-19      0       9  9 of these
 12    1932914       4      2009-09-22      0       9
 13    1932914       2      2009-02-11      0       9
 14    1932914       1      2007-01-31      0       9
 15    1932914       9      2013-08-03      0       9
 16    1932914       8      2011-05-10      0       9
 17    1932914       7      2010-12-01      0       9
 18    1932914       6      2010-02-23      0       9
 19    1932914       5      2010-01-09      0       9

 20    222482        1      2012-06-02      0       1

* create some data;
data have;
INPUT
studyid $ Period StartDate MMDDYY10. GI_C ;
format Startdate yymmdd10.;
cards4;
097531 1 02/01/2007 0
222482 1 06/02/2012 0
092203 1 05/07/2010 1
090624 1 06/16/2007 0
118646 1 04/28/2007 0
093507 1 03/16/2007 0
080279 1 04/18/2007 0
1045110 1 05/11/2007 0
1633011 1 08/01/2007 0
1633011 2 02/21/2008 0
0607212 1 02/06/2008 1
1932914 1 01/31/2007 0
1932914 2 02/11/2009 0
1932914 3 05/19/2009 0
1932914 4 09/22/2009 0
1932914 5 01/09/2010 0
1932914 6 02/23/2010 0
1932914 7 12/01/2010 0
1932914 8 05/10/2011 0
1932914 9 08/03/2013 0
;;;;
run;quit;


data want;
   retain maxdte cnt .;
   do until (last.studyid);
     set have;
     by studyid notsorted;
     cnt=sum(cnt,1);
     if startdate > maxdte then maxdte=startdate;
   end;
   do until (last.studyid);
     set have;
     by studyid notsorted;
     output;
   end;
   call missing(maxdte,cnt);
run;quit;


proc print data=want;
format startdate maxdte yymmdd10.;
run;quit;

Obs        MAXDTE    CNT    STUDYID    PERIOD     STARTDATE    GI_C

  1    2007-02-01     1     097531        1      2007-02-01      0
  2    2012-06-02     1     222482        1      2012-06-02      0
  3    2010-05-07     1     092203        1      2010-05-07      1
  4    2007-06-16     1     090624        1      2007-06-16      0
  5    2007-04-28     1     118646        1      2007-04-28      0
  6    2007-03-16     1     093507        1      2007-03-16      0
  7    2007-04-18     1     080279        1      2007-04-18      0
  8    2007-05-11     1     1045110       1      2007-05-11      0
  9    2008-02-21     2     1633011       1      2007-08-01      0
 10    2008-02-21     2     1633011       2      2008-02-21      0
 11    2008-02-06     1     0607212       1      2008-02-06      1
 12    2013-08-03     9     1932914       1      2007-01-31      0
 13    2013-08-03     9     1932914       2      2009-02-11      0
 14    2013-08-03     9     1932914       3      2009-05-19      0
 15    2013-08-03     9     1932914       4      2009-09-22      0
 16    2013-08-03     9     1932914       5      2010-01-09      0
 17    2013-08-03     9     1932914       6      2010-02-23      0
 18    2013-08-03     9     1932914       7      2010-12-01      0
 19    2013-08-03     9     1932914       8      2011-05-10      0
 20    2013-08-03     9     1932914       9      2013-08-03      0

Kels123 · Posted 02-07-2017 11:57 AM

This code does EXACTLY what I needed it to do. Thank you so much!

mkeintz · Posted 02-07-2017 10:28 AM

Your data appear to be grouped (but not sorted) by studyid. And within studyid, it is sorted by period. You can take advantage of that attribute with a "do until " loop with "set" and "by" statements inside the loop:

data want;
  do until (last.studyid);
    set temporary (rename=(period=periods_no));
    by studyid notsorted;
  end;
  do until (last.studyid);
    set temporary;
    by studyid notsorted;
    output;
  end;
run;

The first do group advances through a studyid to the final period. With the PERIODS_NO in hand, the second do group re-reads the studyid and outputs each record.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Re: How to create a new variable based on the max value of a repeated measurement for each a subject

Registration is open

SAS Training: Just a Click Away