BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Kels123
Quartz | Level 8

Each subject in my dataset potentially has repeat measurements (where Period>1) and I am trying to figure out who has repeat measurements and how many measurements total each subject has. Some sample code is below (the dates are not reading in properly for some reason but I think the code still works for the purposes of this question):

 

data temporary;

INPUT 

studyid$ Period StartDate MMDDYY10. GI_C ;

format Startdate yymmdd10.;

datalines;

097531  1         02/01/2007        0

222482  1         06/02/2012        0

092203  1         05/07/2010        1

090624  1         06/16/2007        0

118646  1         04/28/2007        0

093507  1         03/16/2007        0

080279  1         04/18/2007        0

1045110  1        05/11/2007        0

1633011  1        08/01/2007        0

1633011  2        02/21/2008        0

0607212  1        02/06/2008        1

1932914  1        01/31/2007        0

1932914  2        02/11/2009        0

1932914  3        05/19/2009        0

1932914  4        09/22/2009        0

1932914  5        01/09/2010        0

1932914  6        02/23/2010        0

1932914  7        12/01/2010        0

1932914  8        05/10/2011        0

1932914  9        08/03/2013        0

;

run;

proc print data=temporary; run;

 

Subjects with only one measurement have only one row with Period=1. In the sample code above, while many people just have one period (Period=1) some have multiple periods such as Study ID 1633011 who has 2 periods and studyID 1932914 who has 9 periods. Is there a way to create a code that will identify and apply the highest value available in a column (the highest value of a variable) for a particular subject? I want to count how many periods are available for each person.  I would like to create a new variable (Periods_no) that tells you how many total periods that subject has overall, as follows:

 

studyid  Period   StartDate           GI_C  Periods_no

097531  1          02/01/2007         0                      1

222482  1          06/02/2012         0                      1

092203  1          05/07/2010         1                      1

090624  1          06/16/2007         0                      1

118646  1          04/28/2007         0                      1

093507  1          03/16/2007         0                      1

080279  1          04/18/2007         0                      1

1045110 1          05/11/2007         0                      1

1633011 1          08/01/2007         0                      2

1633011 2          02/21/2008         0                      2

0607212 1          02/06/2008         1                      1

1932914 1          01/31/2007         0                      9

1932914 2          02/11/2009         0                      9

1932914 3          05/19/2009         0                      9

1932914 4          09/22/2009         0                      9

1932914 5          01/09/2010         0                      9

1932914 6          02/23/2010         0                      9

1932914 7          12/01/2010         0                      9

1932914 8          05/10/2011         0                      9

1932914 9          08/03/2013         0                      9

 

Thank you very much in advance.

 

1 ACCEPTED SOLUTION

Accepted Solutions
MichaelLarsen
SAS Employee

How about this:

data temporary;
	INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
	format Startdate yymmdd10.;
datalines;
097531  1         02/01/2007        0
222482  1         06/02/2012        0
092203  1         05/07/2010        1
090624  1         06/16/2007        0
118646  1         04/28/2007        0
093507  1         03/16/2007        0
080279  1         04/18/2007        0
1045110  1        05/11/2007        0
1633011  1        08/01/2007        0
1633011  2        02/21/2008        0
0607212  1        02/06/2008        1
1932914  1        01/31/2007        0
1932914  2        02/11/2009        0
1932914  3        05/19/2009        0
1932914  4        09/22/2009        0
1932914  5        01/09/2010        0
1932914  6        02/23/2010        0
1932914  7        12/01/2010        0
1932914  8        05/10/2011        0
1932914  9        08/03/2013        0
;
run;
proc sql;
  create table temporary2 as
	  select a.*,
		    count(*) as Periods_no,
				max(StartDate) as MaxStartDate format=yymmdd10.
		from temporary a
		group by studyid 
    order by studyid, StartDate;
quit;

proc print data=temporary2; 
run;

I have added the highest StartDate within each StudyID, anticipating your next question Smiley Happy

 

View solution in original post

5 REPLIES 5
RW9
Diamond | Level 26 RW9
Diamond | Level 26

Why is subject number given column name studyid?  Do you need the count in the data you have presented, or just as a summary?  If its just a sequential count, then maybe:

proc sql;
  create table WANT as
  select A.*,
         B.COUNT
  from   HAVE A
  left join (select count(*) as COUNT from HAVE group by STUDYID) B
  on     A.STUDYID=B.STUDYID;
quit;
MichaelLarsen
SAS Employee

How about this:

data temporary;
	INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
	format Startdate yymmdd10.;
datalines;
097531  1         02/01/2007        0
222482  1         06/02/2012        0
092203  1         05/07/2010        1
090624  1         06/16/2007        0
118646  1         04/28/2007        0
093507  1         03/16/2007        0
080279  1         04/18/2007        0
1045110  1        05/11/2007        0
1633011  1        08/01/2007        0
1633011  2        02/21/2008        0
0607212  1        02/06/2008        1
1932914  1        01/31/2007        0
1932914  2        02/11/2009        0
1932914  3        05/19/2009        0
1932914  4        09/22/2009        0
1932914  5        01/09/2010        0
1932914  6        02/23/2010        0
1932914  7        12/01/2010        0
1932914  8        05/10/2011        0
1932914  9        08/03/2013        0
;
run;
proc sql;
  create table temporary2 as
	  select a.*,
		    count(*) as Periods_no,
				max(StartDate) as MaxStartDate format=yymmdd10.
		from temporary a
		group by studyid 
    order by studyid, StartDate;
quit;

proc print data=temporary2; 
run;

I have added the highest StartDate within each StudyID, anticipating your next question Smiley Happy

 

rogerjdeangelis
Barite | Level 11
Adding the max date and count of dates to each observation for groups of studyids

see
https://goo.gl/010WH7
https://communities.sas.com/t5/Base-SAS-Programming/How-to-create-a-new-variable-based-on-the-max-value-of-a/m-p/330453

Assumes that Studyids are grouped.
This also maintains the order if that is important.
Should be the fastest because of 'locality of reference' and
cacheing.


HAVE
====

Obs    STUDYID    PERIOD     STARTDATE    GI_C

  1    097531        1      2007-02-01      0
  2    222482        1      2012-06-02      0
  3    092203        1      2010-05-07      1
  4    090624        1      2007-06-16      0
  5    118646        1      2007-04-28      0
  6    093507        1      2007-03-16      0
  7    080279        1      2007-04-18      0
  8    1045110       1      2007-05-11      0

  9    1633011       1      2007-08-01      0
 10    1633011       2      2008-02-21      0

 11    0607212       1      2008-02-06      1

 12    1932914       1      2007-01-31      0
 13    1932914       2      2009-02-11      0
 14    1932914       3      2009-05-19      0
 15    1932914       4      2009-09-22      0
 16    1932914       5      2010-01-09      0
 17    1932914       6      2010-02-23      0
 18    1932914       7      2010-12-01      0
 19    1932914       8      2011-05-10      0
 20    1932914       9      2013-08-03      0


WANT

Obs    STUDYID    PERIOD     STARTDATE    GI_C    COUNT

  1    0607212       1      2008-02-06      1       1
  2    080279        1      2007-04-18      0       1
  3    090624        1      2007-06-16      0       1
  4    092203        1      2010-05-07      1       1
  5    093507        1      2007-03-16      0       1
  6    097531        1      2007-02-01      0       1
  7    1045110       1      2007-05-11      0       1
  8    118646        1      2007-04-28      0       1

  9    1633011       2      2008-02-21      0       2 2 of these
 10    1633011       1      2007-08-01      0       2

 11    1932914       3      2009-05-19      0       9  9 of these
 12    1932914       4      2009-09-22      0       9
 13    1932914       2      2009-02-11      0       9
 14    1932914       1      2007-01-31      0       9
 15    1932914       9      2013-08-03      0       9
 16    1932914       8      2011-05-10      0       9
 17    1932914       7      2010-12-01      0       9
 18    1932914       6      2010-02-23      0       9
 19    1932914       5      2010-01-09      0       9

 20    222482        1      2012-06-02      0       1

* create some data;
data have;
INPUT
studyid $ Period StartDate MMDDYY10. GI_C ;
format Startdate yymmdd10.;
cards4;
097531 1 02/01/2007 0
222482 1 06/02/2012 0
092203 1 05/07/2010 1
090624 1 06/16/2007 0
118646 1 04/28/2007 0
093507 1 03/16/2007 0
080279 1 04/18/2007 0
1045110 1 05/11/2007 0
1633011 1 08/01/2007 0
1633011 2 02/21/2008 0
0607212 1 02/06/2008 1
1932914 1 01/31/2007 0
1932914 2 02/11/2009 0
1932914 3 05/19/2009 0
1932914 4 09/22/2009 0
1932914 5 01/09/2010 0
1932914 6 02/23/2010 0
1932914 7 12/01/2010 0
1932914 8 05/10/2011 0
1932914 9 08/03/2013 0
;;;;
run;quit;


data want;
   retain maxdte cnt .;
   do until (last.studyid);
     set have;
     by studyid notsorted;
     cnt=sum(cnt,1);
     if startdate > maxdte then maxdte=startdate;
   end;
   do until (last.studyid);
     set have;
     by studyid notsorted;
     output;
   end;
   call missing(maxdte,cnt);
run;quit;


proc print data=want;
format startdate maxdte yymmdd10.;
run;quit;

Obs        MAXDTE    CNT    STUDYID    PERIOD     STARTDATE    GI_C

  1    2007-02-01     1     097531        1      2007-02-01      0
  2    2012-06-02     1     222482        1      2012-06-02      0
  3    2010-05-07     1     092203        1      2010-05-07      1
  4    2007-06-16     1     090624        1      2007-06-16      0
  5    2007-04-28     1     118646        1      2007-04-28      0
  6    2007-03-16     1     093507        1      2007-03-16      0
  7    2007-04-18     1     080279        1      2007-04-18      0
  8    2007-05-11     1     1045110       1      2007-05-11      0
  9    2008-02-21     2     1633011       1      2007-08-01      0
 10    2008-02-21     2     1633011       2      2008-02-21      0
 11    2008-02-06     1     0607212       1      2008-02-06      1
 12    2013-08-03     9     1932914       1      2007-01-31      0
 13    2013-08-03     9     1932914       2      2009-02-11      0
 14    2013-08-03     9     1932914       3      2009-05-19      0
 15    2013-08-03     9     1932914       4      2009-09-22      0
 16    2013-08-03     9     1932914       5      2010-01-09      0
 17    2013-08-03     9     1932914       6      2010-02-23      0
 18    2013-08-03     9     1932914       7      2010-12-01      0
 19    2013-08-03     9     1932914       8      2011-05-10      0
 20    2013-08-03     9     1932914       9      2013-08-03      0

Kels123
Quartz | Level 8
This code does EXACTLY what I needed it to do. Thank you so much!
mkeintz
PROC Star

Your data appear to be grouped (but not sorted) by studyid.  And within studyid, it is sorted by period.  You can take advantage of that attribute with a "do until " loop with "set" and "by" statements inside the loop:

 

data want;
  do until (last.studyid);
    set temporary (rename=(period=periods_no));
    by studyid notsorted;
  end;
  do until (last.studyid);
    set temporary;
    by studyid notsorted;
    output;
  end;
run;

 

The first do group advances through a studyid to the final period.  With the PERIODS_NO in hand, the second do group re-reads the studyid and outputs each record.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

SAS Innovate 2025: Register Today!

 

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 1560 views
  • 0 likes
  • 5 in conversation