SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Is there a code that will allow me to select first, second, and last dates for the same subject?

Accepted Solution Solved
Reply
Contributor
Posts: 51
Accepted Solution

Is there a code that will allow me to select first, second, and last dates for the same subject?

BACKGROUND INFO:

I have a database that shows periods of time (marked by StartDate and EndDate) during which study subjects were in a certain environment (SCHOOL vs. COMMUNITY). During these time periods, a lab test was done sporadically (sometimes multiple times but sometimes not at all) for which I have the dates (Lab_Date) and results (Lab_Result). The variable “count” keeps track of the fact that the same study subject can move back and forth between school and community multiple times (the first time period which is in the COMMUNITY in this case has count=1, the second time period SCHOOL has count=2, the third time period COMMUNITY has count=3, and then the next unique study subject starts at count=1 again).

 

Sample Data:

environment   StartDate   EndDate   SubjectNo   MvmtDate   count   daysbwn   short_inc   Lab_date   Lab_Result

COMMUNIT 2007-01-01 2007-01-30 00001 01/31/2007 1 29  1 . 6720

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 . 6720

SCHOOL      2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-02-22 400

SCHOOL      2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-05-22 48

SCHOOL      2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-08-14

SCHOOL      2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-12-18 48

SCHOOL      2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2008-03-18 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 . 6720

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-06-11 3545

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-07-21 159

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-10-09 68400

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-11-16 1650

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-12-29 2530

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-05 7700

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-11 21154

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-22 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-23 54

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-08-11 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-11-15 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-02-16 20

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-05-17 65

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-10-04 32

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-01-19 34

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-03-31 72

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 .  .

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-03-08 .

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-07-05 466

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-08-17 410

 

MY QUESTION:

I want to simplify this database to the first and last lab_dates for each time period (dropping all others). The problem is that some periods having missing Lab_Dates, and I only want to drop these missing values in CERTAIN CIRCUMSTANCES. For example, for count=1 for the first subject, I want to keep that first row because the missing Lab_Date represents that a lab was never drawn during this time period (COMMUNITY) and I don't want to have to drop that whole time period (the lack of data is meaningful for analysis later on). However, for the second time period (SCHOOL; same subject, count=2), rows #2 and #3 tell you that while the Lab_Date is missing in row #2, this subject had multiple labs drawn during that time period that are captured by row #3 and beyond. So I want to keep the Lab_Date from row #3 and drop row #2 because it is unnecessary.

 

I know how to use first.StartDate to select the first row in each period:

data clean.firstlastvl;

         set work.allvl;

         by SubjectNo startdate;

         if first.startdate;

run;

 

And I think I was able to modify the code to select the first AND last rows for each period:

data clean.firstlastvl;

         set work.allvl;

         by SubjectNo startdate;

         if first.startdate or last.startdate;

run;

 

However, is there a way to select first, second, and last dates that fall into a certain time period? For example, I would like to make the sample dataset look like this (by dropping the rows marked in red):

 

environment   StartDate   EndDate   SubjectNo   MvmtDate   count   daysbwn   short_inc   Lab_date   Lab_Result

COMMUNIT 2007-01-01 2007-01-30 00001 01/31/2007 1 29 1 . 6720

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 . 6720

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-02-22 400

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-05-22 48

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-08-14

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-12-18 48

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2008-03-18 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 . 6720

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-06-11 3545

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-07-21 159

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-10-09 68400

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-11-16 1650

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2009-12-29 2530

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-05 7700

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-02-11 21154

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-22 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-04-23 54

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-08-11 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2010-11-15 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-02-16 20

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-05-17 65

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2011-10-04 32

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-01-19 34

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-03-31 72

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 . .

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-03-08 .

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-07-05 466

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-08-17 410

 

In a perfect world, it would actually look more like this (below), but I don’t see how SAS could be smart enough to know which missing values from Lab_Date to drop:

 

environment   StartDate   EndDate   SubjectNo   MvmtDate   count   daysbwn   short_inc   Lab_date   Lab_Result

COMMUNIT 2007-01-01 2007-01-30 00001 01/31/2007 1 29 1 . 6720

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2007-02-22 400

 

SCHOOL     2007-01-31 2008-04-30 00001 05/01/2008 2 455 0 2008-03-18 47

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2008-06-11 3545

 

COMMUNIT 2008-05-01 2015-01-01 00001 05/01/2008 3 2436 0 2012-03-31 72

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-03-08 .

COMMUNIT 2007-01-01 2012-05-31 00002 06/01/2012 1 1977 0 2007-08-17 410

 

My ultimate goal would be transpose the data wide, creating Lab_Date1, Lab_Date2, and Lab_Date3 that fit into the corresponding time periods (between StartDate and EndDate). Then if Lab_Date1 is missing, I would rename Lab_Date2 to Lab_Date1 and rename Lab_Date3 to Lab_Date2. I am hoping this would accurately capture the first and last non-missing Lab_Dates, if they do indeed exist, while providing a missing value for Lab_Date1 and Lab_Date2 if no labs were drawn during a certain time period.

 

Any help would be very much appreciated. I feel that I have been running in circles for a while now. Thank you in advance.

 


Accepted Solutions
Solution
‎11-30-2016 07:20 PM
Super User
Posts: 5,499

Re: Is there a code that will allow me to select first, second, and last dates for the same subject?

[ Edited ]

If you keep a missing value, will it always be the first observation of a group or could you keep a missing value that falls after a nonmissing value?

 

Assuming that the answer is "no", here's a possible approach to get started.

 

data missing;

set have;

by SubjectNo Count;

if first.count and lab_date=.;

run;

 

data nonmissing;

set have;

by SubjectNo Count;

where lab_date > .;

if first.count or last.count;

run;

 

Here's one way to proceed, that should at least come close to the requirements:

 

data want;

set missing nonmissing;

by SubjectNo Count;

if last.count=0 and Lab_date=. then delete;

run;

View solution in original post


All Replies
Super User
Posts: 7,768

Re: Is there a code that will allow me to select first, second, and last dates for the same subject?

When you can successfully identify the start of a group (as you define it), you can use that same condition to initialize a RETAINed counter variable to zero, which you can then use to identify the sequence number within a group.

if first.something
then do;
  ......
  counter = 0;
end;
counter + 1; * automatically RETAINS counter;
if counter = 2 ......
---------------------------------------------------------------------------------------------
Maxims of Maximally Efficient SAS Programmers
Solution
‎11-30-2016 07:20 PM
Super User
Posts: 5,499

Re: Is there a code that will allow me to select first, second, and last dates for the same subject?

[ Edited ]

If you keep a missing value, will it always be the first observation of a group or could you keep a missing value that falls after a nonmissing value?

 

Assuming that the answer is "no", here's a possible approach to get started.

 

data missing;

set have;

by SubjectNo Count;

if first.count and lab_date=.;

run;

 

data nonmissing;

set have;

by SubjectNo Count;

where lab_date > .;

if first.count or last.count;

run;

 

Here's one way to proceed, that should at least come close to the requirements:

 

data want;

set missing nonmissing;

by SubjectNo Count;

if last.count=0 and Lab_date=. then delete;

run;

Contributor
Posts: 51

Re: Is there a code that will allow me to select first, second, and last dates for the same subject?

Posted in reply to Astounding

Thank you both! I was able to piece together code from both of your responses to get exactly what I wanted, although I likely generated more code than necessary. I created a new counter variable to fit my needs and then used the original count variable and new counter variable (counter_lab) to select the first and last nonmissing lab values while leaving missing lab values when appropriate (i.e., when no lab values were available for those time periods). Here is my code for reference. Please feel free to send additional suggestions if you notice potential errors or other issues.

 

*Add a count to the Lab Dates*;

proc sort data=clean.firstlab by subjectID startdate count; run;

 

 

data clean.firstlab2;

  set clean.firstlab;

  retain counter_lab;

  by subjectID count;

  if first.count then counter_lab = 0 ;

  counter_lab= counter_lab + 1;

run;

proc print data = clean.firstlab2 (obs=50); run;

 

proc sort data=clean.firstlab2; by subjectID Counter_lab Lab_date; run;

 

*Select the desired first VL_dates*;

data nonmissingfirst;

set clean.firstlab2;

by subjectid Counter_lab;

where lab_date > .;

if counter_lab=1 or counter_lab=2;

run;

proc print data = nonmissingfirst (obs=50); run;

 

*Select the LAST VL_dates, this time using COUNT instead of COUNTER_VL. This will include real missing VL_dates.

Start by sorting, then later resort so it can be merged with the other two databases*;

proc sort data=clean.firstlab2; by subjectid count lab_date; run;

data nonmissinglast;

set clean.firstlab2;

by subjectid Count;

if last.count;

run;

proc print data = nonmissinglast (obs=50); run;

proc sort data=nonmissinglast; by subjectid counter_lab lab_date; run;

 

*Modified from the code you sent titled "Here's one way to proceed, that should at least come close to the requirements" *;

data want;

set nonmissingfirst nonmissinglast;

by subjectid Counter_lab;

run;

proc sort data = work.want ; by subjectid startdate counter_lab count lab_date; run;

proc print data = work.want (obs=50); run;

 

*Locate duplicate observations and check if they make sense based on how the lab values were selected and merged*; 

proc print data=clean.firstLab2;

WHERE subjectID="x" or subjectID="y" or subjectID="z" ;

run;

 

*Remove duplicates and make a permanent dataset*;

PROC SORT DATA=WANT OUT=CLEAN.FIRSTLASTLab NODUPLICATES ;

  BY SubjectID StartDate EndDate environment count counter_lab lab_date ;

RUN ;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 3 replies
  • 252 views
  • 1 like
  • 3 in conversation