BACKGROUND INFO:
I have a database that shows periods of time (marked by StartDate and EndDate) during which study subjects were in a certain environment (SCHOOL vs. COMMUNITY). During these time periods, a lab test was done sporadically (sometimes multiple times but sometimes not at all) for which I have the dates (Lab_Date) and results (Lab_Result). The variable “count” keeps track of the fact that the same study subject can move back and forth between school and community multiple times (the first time period which is in the COMMUNITY in this case has count=1, the second time period SCHOOL has count=2, the third time period COMMUNITY has count=3, and then the next unique study subject starts at count=1 again).
Sample Data:
environment StartDate EndDate SubjectNo MvmtDate count daysbwn short_inc Lab_date Lab_Result
COMMUNIT 2007-01-01 2007-01-30 00001 01/31/2007 1 29 1 . 6720
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 . 6720
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-02-22 400
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-05-22 48
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-08-14
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-12-18 48
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2008-03-18 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 . 6720
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-06-11 3545
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-07-21 159
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-10-09 68400
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-11-16 1650
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-12-29 2530
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-05 7700
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-11 21154
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-22 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-23 54
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-08-11 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-11-15 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-02-16 20
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-05-17 65
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-10-04 32
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-01-19 34
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-03-31 72
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 . .
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-03-08 .
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-07-05 466
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-08-17 410
MY QUESTION:
I want to simplify this database to the first and last lab_dates for each time period (dropping all others). The problem is that some periods having missing Lab_Dates, and I only want to drop these missing values in CERTAIN CIRCUMSTANCES. For example, for count=1 for the first subject, I want to keep that first row because the missing Lab_Date represents that a lab was never drawn during this time period (COMMUNITY) and I don't want to have to drop that whole time period (the lack of data is meaningful for analysis later on). However, for the second time period (SCHOOL; same subject, count=2), rows #2 and #3 tell you that while the Lab_Date is missing in row #2, this subject had multiple labs drawn during that time period that are captured by row #3 and beyond. So I want to keep the Lab_Date from row #3 and drop row #2 because it is unnecessary.
I know how to use first.StartDate to select the first row in each period:
data clean.firstlastvl;
set work.allvl;
by SubjectNo startdate;
if first.startdate;
run;
And I think I was able to modify the code to select the first AND last rows for each period:
data clean.firstlastvl;
set work.allvl;
by SubjectNo startdate;
if first.startdate or last.startdate;
run;
However, is there a way to select first, second, and last dates that fall into a certain time period? For example, I would like to make the sample dataset look like this (by dropping the rows marked in red):
environment StartDate EndDate SubjectNo MvmtDate count daysbwn short_inc Lab_date Lab_Result
COMMUNIT 2007-01-01 2007-01-30 00001 01/31/2007 1 29 1 . 6720
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 . 6720
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-02-22 400
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-05-22 48
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-08-14
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-12-18 48
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2008-03-18 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 . 6720
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-06-11 3545
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-07-21 159
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-10-09 68400
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-11-16 1650
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-12-29 2530
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-05 7700
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-11 21154
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-22 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-23 54
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-08-11 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-11-15 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-02-16 20
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-05-17 65
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-10-04 32
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-01-19 34
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-03-31 72
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 . .
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-03-08 .
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-07-05 466
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-08-17 410
In a perfect world, it would actually look more like this (below), but I don’t see how SAS could be smart enough to know which missing values from Lab_Date to drop:
environment StartDate EndDate SubjectNo MvmtDate count daysbwn short_inc Lab_date Lab_Result
COMMUNIT 2007-01-01 2007-01-30 00001 01/31/2007 1 29 1 . 6720
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-02-22 400
SCHOOL 2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2008-03-18 47
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-06-11 3545
COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-03-31 72
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-03-08 .
COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-08-17 410
My ultimate goal would be transpose the data wide, creating Lab_Date1, Lab_Date2, and Lab_Date3 that fit into the corresponding time periods (between StartDate and EndDate). Then if Lab_Date1 is missing, I would rename Lab_Date2 to Lab_Date1 and rename Lab_Date3 to Lab_Date2. I am hoping this would accurately capture the first and last non-missing Lab_Dates, if they do indeed exist, while providing a missing value for Lab_Date1 and Lab_Date2 if no labs were drawn during a certain time period.
Any help would be very much appreciated. I feel that I have been running in circles for a while now. Thank you in advance.
If you keep a missing value, will it always be the first observation of a group or could you keep a missing value that falls after a nonmissing value?
Assuming that the answer is "no", here's a possible approach to get started.
data missing;
set have;
by SubjectNo Count;
if first.count and lab_date=.;
run;
data nonmissing;
set have;
by SubjectNo Count;
where lab_date > .;
if first.count or last.count;
run;
Here's one way to proceed, that should at least come close to the requirements:
data want;
set missing nonmissing;
by SubjectNo Count;
if last.count=0 and Lab_date=. then delete;
run;
When you can successfully identify the start of a group (as you define it), you can use that same condition to initialize a RETAINed counter variable to zero, which you can then use to identify the sequence number within a group.
if first.something
then do;
......
counter = 0;
end;
counter + 1; * automatically RETAINS counter;
if counter = 2 ......
If you keep a missing value, will it always be the first observation of a group or could you keep a missing value that falls after a nonmissing value?
Assuming that the answer is "no", here's a possible approach to get started.
data missing;
set have;
by SubjectNo Count;
if first.count and lab_date=.;
run;
data nonmissing;
set have;
by SubjectNo Count;
where lab_date > .;
if first.count or last.count;
run;
Here's one way to proceed, that should at least come close to the requirements:
data want;
set missing nonmissing;
by SubjectNo Count;
if last.count=0 and Lab_date=. then delete;
run;
Thank you both! I was able to piece together code from both of your responses to get exactly what I wanted, although I likely generated more code than necessary. I created a new counter variable to fit my needs and then used the original count variable and new counter variable (counter_lab) to select the first and last nonmissing lab values while leaving missing lab values when appropriate (i.e., when no lab values were available for those time periods). Here is my code for reference. Please feel free to send additional suggestions if you notice potential errors or other issues.
*Add a count to the Lab Dates*;
proc sort data=clean.firstlab by subjectID startdate count; run;
data clean.firstlab2;
set clean.firstlab;
retain counter_lab;
by subjectID count;
if first.count then counter_lab = 0 ;
counter_lab= counter_lab + 1;
run;
proc print data = clean.firstlab2 (obs=50); run;
proc sort data=clean.firstlab2; by subjectID Counter_lab Lab_date; run;
*Select the desired first VL_dates*;
data nonmissingfirst;
set clean.firstlab2;
by subjectid Counter_lab;
where lab_date > .;
if counter_lab=1 or counter_lab=2;
run;
proc print data = nonmissingfirst (obs=50); run;
*Select the LAST VL_dates, this time using COUNT instead of COUNTER_VL. This will include real missing VL_dates.
Start by sorting, then later resort so it can be merged with the other two databases*;
proc sort data=clean.firstlab2; by subjectid count lab_date; run;
data nonmissinglast;
set clean.firstlab2;
by subjectid Count;
if last.count;
run;
proc print data = nonmissinglast (obs=50); run;
proc sort data=nonmissinglast; by subjectid counter_lab lab_date; run;
*Modified from the code you sent titled "Here's one way to proceed, that should at least come close to the requirements" *;
data want;
set nonmissingfirst nonmissinglast;
by subjectid Counter_lab;
run;
proc sort data = work.want ; by subjectid startdate counter_lab count lab_date; run;
proc print data = work.want (obs=50); run;
*Locate duplicate observations and check if they make sense based on how the lab values were selected and merged*;
proc print data=clean.firstLab2;
WHERE subjectID="x" or subjectID="y" or subjectID="z" ;
run;
*Remove duplicates and make a permanent dataset*;
PROC SORT DATA=WANT OUT=CLEAN.FIRSTLASTLab NODUPLICATES ;
BY SubjectID StartDate EndDate environment count counter_lab lab_date ;
RUN ;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.