Solved: Re: How do I combine time periods based on date and environment?

Kels123 · Posted 12-04-2016 01:14 PM

I am interested in studying subjects in their home environment. I have a database of time periods that looks like this (sample data):

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0

HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0

SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0

HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1

SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0

HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0

I simplified it to look like this because I decided that periods of time at home that last less than one full day (daysbwn=0) are not meaningful to this study:

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0

SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0

SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0

Now I would like to effectively combine periods at school, whenever possible so that I can use one big period of time as opposed to periods of time at school fragmented by these <1 day periods of time at home. Basically, I would like to create a database that looks like this:

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2015-01-01 0001 02/13/2009 1 2921 0

SCHOOL 2007-01-01 2012-12-12 0002 12/13/2012 1 2170 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 2 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0

The part I am struggling with is how to combine the time periods like this while ended up with the correct environment, StartDate, and EndDate variables. (MvmtDate is more or less irrelevant but I have left it in the data and sample data for tracking purposes.)

Thanks very much in advance!

Patrick · Posted 12-04-2016 09:13 PM

Try this:


proc sort data=have out=want;
  by studyID StartDate EndDate;
  where daysbwn>0;
run;

data want(drop=_:);
  set want end=lastObs;
  by studyID environment notsorted;

  retain _r_StartDate _r_EndDate;
  _r_StartDate=min(StartDate,_r_StartDate);
  _r_EndDate=max(EndDate,_r_EndDate);

  if lastObs or last.environment then _outFlg=1;
  else 
    do;
      _nextRow=_n_+1;
      set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
      if intck('day',EndDate,_NextStartDate) >1 then _outFlg=1;
    end;

  if _outFlg=1 then
    do;
      StartDate=_r_StartDate;
      EndDate=_r_EndDate;
      output;
      call missing(_r_StartDate, _r_EndDate);
    end;
run;

View solution in original post

Shmuel · Posted 12-04-2016 04:43 PM

In the next HOME lines the strat_date is greater then the end_date:

HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1

HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1

HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1

that gives the idea to filter lines where start_date > end_date.

The next step is to check:

is envirinment and studyID are same as on previous line and start_date - lag(end_dtae) in (0, 1)

then combine the two periods into one big period - BUT:

How to treat the other variables: count daysbwn short_period - use: min / max/ sum / first / last or other function ?

The daysbwn - seems you have SUMed.

The count - seems you saved the first or the MIN value or calulated ?

Is short_period - always = 0 ?

Try next code:

data temp;

set have;

if start_date > end_date then delete;

run;

proc sort data=temp; by environment studyID StartDate ; run;

data want(keep= env stdID StartDT daysTot cnt

rename = (env = environment stdID = studyID startDT = start_date end_dt = end_date

days_tot = daysbwn cnt = count));

set temp;

by environment studyID ;

retain env stdID StartDT endDT daysTot cnt;

if first.studyID then do;

env = environment ;

stdID = studyID ;

startDT = start_date;

endDT = end_date;

daysTot = daysbwn;

If env = "SCHOOL" then cnt = 1; /* this line was edited */

end;

else do;

if start_date - endDT in (0, 1) then do;

endDT = end_date;

daysTot = sum(of daysTot, daysbwn);

end; else do;

output;

startDT = start_date;

endDT = end_date;

daysTot = daysbwn;

cnt = cnt + 1;

end;

if last.studyID then output;

run;

Patrick · Posted 12-04-2016 05:07 PM

Please provide in the future a SAS datastep creating sample data so that we don't have to do the work.

Below code selects the start and end dates and only outputs a single record for consecutive dates. I haven't added the calculations for the other variables but this shouldn't be too hard to add.

data have;
  infile datalines truncover dlm=' ';
  input 
    environment :$9.
    (StartDate EndDate) (:yymmdd.)
    studyID 
    MvmtDate :mmddyy.
    count 
    daysbwn 
    short_period;

  format
    StartDate EndDate MvmtDate date9.;

  datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
;
run;

proc sort data=have out=want;
  by studyID StartDate;
  where daysbwn>0;
run;

data want(drop=_:);
  set want end=lastObs;
  by studyID environment notsorted;

  if lastObs or last.environment then output;
  else 
    do;
      _nextRow=_n_+1;
      set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
      if intck('day',EndDate,_NextStartDate) ne 1 then output;
    end;
run;

Kels123 · Posted 12-04-2016 08:07 PM

Thank you so much for sending this code (sorry for creating more work for you than necessary)! It is almost exactly what I need. The only issue is that the StartDate is not always accurate in that it does not automatically include the very first StartDate of the period. For example, if this is the original output (desired StartDate in orange, desired EndDate in blue)...

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0

SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0

SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0

Instead of this...

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2015-01-01 0001 02/13/2009 1 2921 0

SCHOOL 2007-01-01 2012-12-12 0002 12/13/2012 1 2170 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 2 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0

I get this (dates in red represent the problematic start dates):

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 1 2921 0

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 1 2170 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 2 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0

Do you have any additional advice on how to potentially modify your code to address this? Thank you again, in advance.

Patrick · Posted 12-04-2016 09:13 PM

Try this:


proc sort data=have out=want;
  by studyID StartDate EndDate;
  where daysbwn>0;
run;

data want(drop=_:);
  set want end=lastObs;
  by studyID environment notsorted;

  retain _r_StartDate _r_EndDate;
  _r_StartDate=min(StartDate,_r_StartDate);
  _r_EndDate=max(EndDate,_r_EndDate);

  if lastObs or last.environment then _outFlg=1;
  else 
    do;
      _nextRow=_n_+1;
      set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
      if intck('day',EndDate,_NextStartDate) >1 then _outFlg=1;
    end;

  if _outFlg=1 then
    do;
      StartDate=_r_StartDate;
      EndDate=_r_EndDate;
      output;
      call missing(_r_StartDate, _r_EndDate);
    end;
run;

Kels123 · Posted 01-07-2017 07:34 PM

Hi,

I just noticed that the codes provided earlier on work for almost all scenarios, but I found a few exceptions shown as sample data below where the code did not combine all periods. (This is probably the first period's enddate is not immediately adjacent to the next period's startdate; there is a day missing in between the two periods because at least one time period <= 24 hours at SCHOOL was dropped. This is fine, but means that I might need a different code to combine these periods.)

data tempfile;

infile datalines truncover;

INPUT

environment$ StartDate yymmdd10.@+1 EndDate yymmdd10.@+1

studyID$4.@+1 MvmtDate mmddyy10.@+1

Movementcode Movecode daysbwn Movementcode2 Movecode2 count ;

format Startdate EndDate MvmtDate yymmdd10.;

datalines;

SCHOOL 2007-01-01 2007-01-09 0001 01/09/2007 3 2 8 . . 1

HOME 2007-01-10 2008-12-01 0001 12/01/2008 1 1 691 3 2 2

HOME 2008-12-03 2009-01-02 0001 01/02/2009 1 1 30 8 2 4

SCHOOL 2009-01-03 2009-05-04 0001 05/04/2009 3 2 121 1 1 5

HOME 2007-01-01 2007-07-25 0002 07/25/2007 1 1 205 . . 1

SCHOOL 2007-07-26 2008-02-22 0002 02/22/2008 3 2 211 1 1 2

HOME 2008-02-23 2010-06-11 0002 06/11/2010 1 1 839 3 2 3

HOME 2010-06-13 2015-01-01 0002 06/12/2010 8 2 1663 8 2 5

;

run;

proc print data=tempfile; run;

I would like for the adjacent time periods that take place in the same environment (in red text) to be combined to look like this:

SCHOOL 2007-01-01 2007-01-09 0001 01/09/2007 3 2 8 . . 1

HOME 2007-01-10 2009-01-02 0001 01/02/2009 1 1 721 3 2 2

SCHOOL 2009-01-03 2009-05-04 0001 05/04/2009 3 2 121 1 1 5

HOME 2007-01-01 2007-07-25 0002 07/25/2007 1 1 205 . . 1

SCHOOL 2007-07-26 2008-02-22 0002 02/22/2008 3 2 211 1 1 2

HOME 2008-02-23 2015-01-01 0002 06/12/2010 1 1 2502 3 2 3

Specifically (and most importantly), I would like to combine those periods using the startdate from the first period and the enddate from the last period.

If possible, I would also like the Movementcode2 and Movecode2 variables from the first period to be the ones applied to the combined period.

Less importantly are the daysbwn and count variables because I can easily just recode those myself. Movementcode and Movecode can be dropped.

Is there a code that can help me do this systematically and properly? It could be a modification to the code provided previously, or a second code I can apply after the initial code is applied.

Thank you very much in advance,

Kelsey

Kels123 · Posted 01-07-2017 10:25 PM

In response to my recent follow-up question, it looks like the code previously provided by Shmuel will do the trick, with one small modification:

data have;
infile datalines truncover;
INPUT environment $ StartDate yymmdd10. @+1 EndDate yymmdd10.
@+1 studyID $4. @+1 MvmtDate $10. count daysbwn short_period ;
format Startdate EndDate ddmmyy10.;
datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
; run;

/*

data temp;
set have;
if StartDate > EndDate then delete; /* ignore "invalid" periods */
run;

*/

*I preferred using this code to do the initial drop of invalid periods, but it should more or less do the same thing as the commented code above*;

data temp;

set have

where daysbwn>0; /* ignore "invalid" periods */

run;

*IMPORTANT CODE STARTS HERE*;

proc sort data=temp; by studyID StartDate ; run;

data want(keep= env stdID StartDT endDT daysTot cnt shortPD
rename = (env = environment stdID = studyID startDT = StartDate endDT = EndDate
daystot = daysbwn cnt = count shortPD = short_period));
set temp;
by studyID;
retain env stdID StartDT endDT daysTot cnt shortPD;

format startDT endDT ddmmyy10.;

if first.studyID then do;
env = environment ;
stdID = studyID ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = 1;
end;

else do;
if environment = env and
StartDate - endDT in (0, 1, 2) then do;

*I added the "2" here and it seems to have fixed the problem where adjacent time periods in the same environment were not combined because they had a <= 24-hour period in between them in the other environment*;

endDT = EndDate;
daysTot = sum(of daysTot, daysbwn);
end; else do;
output;
env = environment ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = cnt + 1;
end;end;
if last.studyID then output;
run;

Thank you, Shmuel!

mkeintz · Posted 12-04-2016 10:14 PM


data want (drop=nxt_: total_: I grpsize);

  do grpsize=1 to 1000;
    merge have (drop=count)
          have (firstobs=2 keep=studyid environment startdate
            rename=(studyid=nxt_i environment=nxt_e  startdate=nxt_s));
    total_days=sum(total_days,daysbwn);
    if not(studyid=nxt_i and environment=nxt_e and
           nxt_s=enddate+1 and environment='SCHOOL') then leave;
  end;

  daysbwn=total_days;
  if studyid^=lag(studyid) then count=1;
  else count+1;

  do I=1 to grpsize;
    set have (drop=count daysbwn mvmtdate enddate);
    if I=1 then output;
  end;
run;

This program has two loops in the data step The first loop gets the grpsize and calculates the totaldays. The second loop re-read the same records, but outputs only the first in the group.

Notes:

The MERGE in the first loop merges HAVE with itself. The "right-hand" HAVE starts at observation 2, providing a way to know when the current record is not part of the same group as the next record. During that loop TOTALDAYS is accumulated.
The "do grpsie=1 to 1000; I use 1000 as a number that should exceed any record count for a single group. Inside the loop is an IF test the leaves the loop when the condiition is met.
Between the loops
1. DAYSBWN is set to totaldays.
2. COUNT is reset to 1 when a new studyid is started, otherwise it's just incremented.
In the second loop the SET statement does not include the variables daysbwn mvmtdate enddate which means those values will be taken from the end-of-group record read in by the first loop. But it does take all the other vars which means that they come from the first record in a group.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Shmuel · Posted 12-05-2016 02:57 AM

This time I run my code and got the desired output:

data have;
infile datalines truncover;
INPUT environment $ StartDate yymmdd10. @+1 EndDate yymmdd10.
@+1 studyID $4. @+1 MvmtDate $10. count daysbwn short_period ;
format Startdate EndDate ddmmyy10.;
datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
; run;

data temp;
set have;
if StartDate > EndDate then delete; /* ignore "invalid" periods */
run;

proc sort data=temp; by studyID StartDate ; run;

data want(keep= env stdID StartDT endDT daysTot cnt shortPD
rename = (env = environment stdID = studyID startDT = StartDate endDT = EndDate
daystot = daysbwn cnt = count shortPD = short_period));
set temp;
by studyID;
retain env stdID StartDT endDT daysTot cnt shortPD;

format startDT endDT ddmmyy10.;

if first.studyID then do;
env = environment ;
stdID = studyID ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = 1;
end;

else do;
if environment = env and
StartDate - endDT in (0, 1) then do;
endDT = EndDate;
daysTot = sum(of daysTot, daysbwn);
end; else do;
output;
env = environment ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = cnt + 1;
end;end;
if last.studyID then output;
run;

Kels123 · Posted 12-12-2016 02:33 PM

Thank you so much!

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away