Each subject in my dataset potentially has repeat measurements (where Period>1) and I am trying to figure out who has repeat measurements and how many measurements total each subject has. Some sample code is below (the dates are not reading in properly for some reason but I think the code still works for the purposes of this question):
data temporary;
INPUT
studyid$ Period StartDate MMDDYY10. GI_C ;
format Startdate yymmdd10.;
datalines;
097531 1 02/01/2007 0
222482 1 06/02/2012 0
092203 1 05/07/2010 1
090624 1 06/16/2007 0
118646 1 04/28/2007 0
093507 1 03/16/2007 0
080279 1 04/18/2007 0
1045110 1 05/11/2007 0
1633011 1 08/01/2007 0
1633011 2 02/21/2008 0
0607212 1 02/06/2008 1
1932914 1 01/31/2007 0
1932914 2 02/11/2009 0
1932914 3 05/19/2009 0
1932914 4 09/22/2009 0
1932914 5 01/09/2010 0
1932914 6 02/23/2010 0
1932914 7 12/01/2010 0
1932914 8 05/10/2011 0
1932914 9 08/03/2013 0
;
run;
proc print data=temporary; run;
Subjects with only one measurement have only one row with Period=1. In the sample code above, while many people just have one period (Period=1) some have multiple periods such as Study ID 1633011 who has 2 periods and studyID 1932914 who has 9 periods. Is there a way to create a code that will identify and apply the highest value available in a column (the highest value of a variable) for a particular subject? I want to count how many periods are available for each person. I would like to create a new variable (Periods_no) that tells you how many total periods that subject has overall, as follows:
studyid Period StartDate GI_C Periods_no
097531 1 02/01/2007 0 1
222482 1 06/02/2012 0 1
092203 1 05/07/2010 1 1
090624 1 06/16/2007 0 1
118646 1 04/28/2007 0 1
093507 1 03/16/2007 0 1
080279 1 04/18/2007 0 1
1045110 1 05/11/2007 0 1
1633011 1 08/01/2007 0 2
1633011 2 02/21/2008 0 2
0607212 1 02/06/2008 1 1
1932914 1 01/31/2007 0 9
1932914 2 02/11/2009 0 9
1932914 3 05/19/2009 0 9
1932914 4 09/22/2009 0 9
1932914 5 01/09/2010 0 9
1932914 6 02/23/2010 0 9
1932914 7 12/01/2010 0 9
1932914 8 05/10/2011 0 9
1932914 9 08/03/2013 0 9
Thank you very much in advance.
How about this:
data temporary;
INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
format Startdate yymmdd10.;
datalines;
097531 1 02/01/2007 0
222482 1 06/02/2012 0
092203 1 05/07/2010 1
090624 1 06/16/2007 0
118646 1 04/28/2007 0
093507 1 03/16/2007 0
080279 1 04/18/2007 0
1045110 1 05/11/2007 0
1633011 1 08/01/2007 0
1633011 2 02/21/2008 0
0607212 1 02/06/2008 1
1932914 1 01/31/2007 0
1932914 2 02/11/2009 0
1932914 3 05/19/2009 0
1932914 4 09/22/2009 0
1932914 5 01/09/2010 0
1932914 6 02/23/2010 0
1932914 7 12/01/2010 0
1932914 8 05/10/2011 0
1932914 9 08/03/2013 0
;
run;
proc sql;
create table temporary2 as
select a.*,
count(*) as Periods_no,
max(StartDate) as MaxStartDate format=yymmdd10.
from temporary a
group by studyid
order by studyid, StartDate;
quit;
proc print data=temporary2;
run;
I have added the highest StartDate within each StudyID, anticipating your next question
Why is subject number given column name studyid? Do you need the count in the data you have presented, or just as a summary? If its just a sequential count, then maybe:
proc sql; create table WANT as select A.*, B.COUNT from HAVE A left join (select count(*) as COUNT from HAVE group by STUDYID) B on A.STUDYID=B.STUDYID; quit;
How about this:
data temporary;
INPUT studyid$ Period StartDate : MMDDYY10. GI_C ;
format Startdate yymmdd10.;
datalines;
097531 1 02/01/2007 0
222482 1 06/02/2012 0
092203 1 05/07/2010 1
090624 1 06/16/2007 0
118646 1 04/28/2007 0
093507 1 03/16/2007 0
080279 1 04/18/2007 0
1045110 1 05/11/2007 0
1633011 1 08/01/2007 0
1633011 2 02/21/2008 0
0607212 1 02/06/2008 1
1932914 1 01/31/2007 0
1932914 2 02/11/2009 0
1932914 3 05/19/2009 0
1932914 4 09/22/2009 0
1932914 5 01/09/2010 0
1932914 6 02/23/2010 0
1932914 7 12/01/2010 0
1932914 8 05/10/2011 0
1932914 9 08/03/2013 0
;
run;
proc sql;
create table temporary2 as
select a.*,
count(*) as Periods_no,
max(StartDate) as MaxStartDate format=yymmdd10.
from temporary a
group by studyid
order by studyid, StartDate;
quit;
proc print data=temporary2;
run;
I have added the highest StartDate within each StudyID, anticipating your next question
Adding the max date and count of dates to each observation for groups of studyids
see
https://goo.gl/010WH7
https://communities.sas.com/t5/Base-SAS-Programming/How-to-create-a-new-variable-based-on-the-max-value-of-a/m-p/330453
Assumes that Studyids are grouped.
This also maintains the order if that is important.
Should be the fastest because of 'locality of reference' and
cacheing.
HAVE
====
Obs STUDYID PERIOD STARTDATE GI_C
1 097531 1 2007-02-01 0
2 222482 1 2012-06-02 0
3 092203 1 2010-05-07 1
4 090624 1 2007-06-16 0
5 118646 1 2007-04-28 0
6 093507 1 2007-03-16 0
7 080279 1 2007-04-18 0
8 1045110 1 2007-05-11 0
9 1633011 1 2007-08-01 0
10 1633011 2 2008-02-21 0
11 0607212 1 2008-02-06 1
12 1932914 1 2007-01-31 0
13 1932914 2 2009-02-11 0
14 1932914 3 2009-05-19 0
15 1932914 4 2009-09-22 0
16 1932914 5 2010-01-09 0
17 1932914 6 2010-02-23 0
18 1932914 7 2010-12-01 0
19 1932914 8 2011-05-10 0
20 1932914 9 2013-08-03 0
WANT
Obs STUDYID PERIOD STARTDATE GI_C COUNT
1 0607212 1 2008-02-06 1 1
2 080279 1 2007-04-18 0 1
3 090624 1 2007-06-16 0 1
4 092203 1 2010-05-07 1 1
5 093507 1 2007-03-16 0 1
6 097531 1 2007-02-01 0 1
7 1045110 1 2007-05-11 0 1
8 118646 1 2007-04-28 0 1
9 1633011 2 2008-02-21 0 2 2 of these
10 1633011 1 2007-08-01 0 2
11 1932914 3 2009-05-19 0 9 9 of these
12 1932914 4 2009-09-22 0 9
13 1932914 2 2009-02-11 0 9
14 1932914 1 2007-01-31 0 9
15 1932914 9 2013-08-03 0 9
16 1932914 8 2011-05-10 0 9
17 1932914 7 2010-12-01 0 9
18 1932914 6 2010-02-23 0 9
19 1932914 5 2010-01-09 0 9
20 222482 1 2012-06-02 0 1
* create some data;
data have;
INPUT
studyid $ Period StartDate MMDDYY10. GI_C ;
format Startdate yymmdd10.;
cards4;
097531 1 02/01/2007 0
222482 1 06/02/2012 0
092203 1 05/07/2010 1
090624 1 06/16/2007 0
118646 1 04/28/2007 0
093507 1 03/16/2007 0
080279 1 04/18/2007 0
1045110 1 05/11/2007 0
1633011 1 08/01/2007 0
1633011 2 02/21/2008 0
0607212 1 02/06/2008 1
1932914 1 01/31/2007 0
1932914 2 02/11/2009 0
1932914 3 05/19/2009 0
1932914 4 09/22/2009 0
1932914 5 01/09/2010 0
1932914 6 02/23/2010 0
1932914 7 12/01/2010 0
1932914 8 05/10/2011 0
1932914 9 08/03/2013 0
;;;;
run;quit;
data want;
retain maxdte cnt .;
do until (last.studyid);
set have;
by studyid notsorted;
cnt=sum(cnt,1);
if startdate > maxdte then maxdte=startdate;
end;
do until (last.studyid);
set have;
by studyid notsorted;
output;
end;
call missing(maxdte,cnt);
run;quit;
proc print data=want;
format startdate maxdte yymmdd10.;
run;quit;
Obs MAXDTE CNT STUDYID PERIOD STARTDATE GI_C
1 2007-02-01 1 097531 1 2007-02-01 0
2 2012-06-02 1 222482 1 2012-06-02 0
3 2010-05-07 1 092203 1 2010-05-07 1
4 2007-06-16 1 090624 1 2007-06-16 0
5 2007-04-28 1 118646 1 2007-04-28 0
6 2007-03-16 1 093507 1 2007-03-16 0
7 2007-04-18 1 080279 1 2007-04-18 0
8 2007-05-11 1 1045110 1 2007-05-11 0
9 2008-02-21 2 1633011 1 2007-08-01 0
10 2008-02-21 2 1633011 2 2008-02-21 0
11 2008-02-06 1 0607212 1 2008-02-06 1
12 2013-08-03 9 1932914 1 2007-01-31 0
13 2013-08-03 9 1932914 2 2009-02-11 0
14 2013-08-03 9 1932914 3 2009-05-19 0
15 2013-08-03 9 1932914 4 2009-09-22 0
16 2013-08-03 9 1932914 5 2010-01-09 0
17 2013-08-03 9 1932914 6 2010-02-23 0
18 2013-08-03 9 1932914 7 2010-12-01 0
19 2013-08-03 9 1932914 8 2011-05-10 0
20 2013-08-03 9 1932914 9 2013-08-03 0
Your data appear to be grouped (but not sorted) by studyid. And within studyid, it is sorted by period. You can take advantage of that attribute with a "do until " loop with "set" and "by" statements inside the loop:
data want;
do until (last.studyid);
set temporary (rename=(period=periods_no));
by studyid notsorted;
end;
do until (last.studyid);
set temporary;
by studyid notsorted;
output;
end;
run;
The first do group advances through a studyid to the final period. With the PERIODS_NO in hand, the second do group re-reads the studyid and outputs each record.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.