I am interested in studying subjects in their home environment. I have a database of time periods that looks like this (sample data):
environment StartDate EndDate studyID MvmtDate count daysbwn short_period
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
I simplified it to look like this because I decided that periods of time at home that last less than one full day (daysbwn=0) are not meaningful to this study:
environment StartDate EndDate studyID MvmtDate count daysbwn short_period
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
Now I would like to effectively combine periods at school, whenever possible so that I can use one big period of time as opposed to periods of time at school fragmented by these <1 day periods of time at home. Basically, I would like to create a database that looks like this:
environment StartDate EndDate studyID MvmtDate count daysbwn short_period
SCHOOL 2007-01-01 2015-01-01 0001 02/13/2009 1 2921 0
SCHOOL 2007-01-01 2012-12-12 0002 12/13/2012 1 2170 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 2 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0
The part I am struggling with is how to combine the time periods like this while ended up with the correct environment, StartDate, and EndDate variables. (MvmtDate is more or less irrelevant but I have left it in the data and sample data for tracking purposes.)
Thanks very much in advance!
Try this:
proc sort data=have out=want;
by studyID StartDate EndDate;
where daysbwn>0;
run;
data want(drop=_:);
set want end=lastObs;
by studyID environment notsorted;
retain _r_StartDate _r_EndDate;
_r_StartDate=min(StartDate,_r_StartDate);
_r_EndDate=max(EndDate,_r_EndDate);
if lastObs or last.environment then _outFlg=1;
else
do;
_nextRow=_n_+1;
set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
if intck('day',EndDate,_NextStartDate) >1 then _outFlg=1;
end;
if _outFlg=1 then
do;
StartDate=_r_StartDate;
EndDate=_r_EndDate;
output;
call missing(_r_StartDate, _r_EndDate);
end;
run;
In the next HOME lines the strat_date is greater then the end_date:
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
that gives the idea to filter lines where start_date > end_date.
The next step is to check:
is envirinment and studyID are same as on previous line and start_date - lag(end_dtae) in (0, 1)
then combine the two periods into one big period - BUT:
How to treat the other variables: count daysbwn short_period - use: min / max/ sum / first / last or other function ?
The daysbwn - seems you have SUMed.
The count - seems you saved the first or the MIN value or calulated ?
Is short_period - always = 0 ?
Try next code:
data temp;
set have;
if start_date > end_date then delete;
run;
proc sort data=temp; by environment studyID StartDate ; run;
data want(keep= env stdID StartDT daysTot cnt
rename = (env = environment stdID = studyID startDT = start_date end_dt = end_date
days_tot = daysbwn cnt = count));
set temp;
by environment studyID ;
retain env stdID StartDT endDT daysTot cnt;
if first.studyID then do;
env = environment ;
stdID = studyID ;
startDT = start_date;
endDT = end_date;
daysTot = daysbwn;
If env = "SCHOOL" then cnt = 1; /* this line was edited */
end;
else do;
if start_date - endDT in (0, 1) then do;
endDT = end_date;
daysTot = sum(of daysTot, daysbwn);
end; else do;
output;
startDT = start_date;
endDT = end_date;
daysTot = daysbwn;
cnt = cnt + 1;
end;
if last.studyID then output;
run;
Please provide in the future a SAS datastep creating sample data so that we don't have to do the work.
Below code selects the start and end dates and only outputs a single record for consecutive dates. I haven't added the calculations for the other variables but this shouldn't be too hard to add.
data have;
infile datalines truncover dlm=' ';
input
environment :$9.
(StartDate EndDate) (:yymmdd.)
studyID
MvmtDate :mmddyy.
count
daysbwn
short_period;
format
StartDate EndDate MvmtDate date9.;
datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
;
run;
proc sort data=have out=want;
by studyID StartDate;
where daysbwn>0;
run;
data want(drop=_:);
set want end=lastObs;
by studyID environment notsorted;
if lastObs or last.environment then output;
else
do;
_nextRow=_n_+1;
set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
if intck('day',EndDate,_NextStartDate) ne 1 then output;
end;
run;
Thank you so much for sending this code (sorry for creating more work for you than necessary)! It is almost exactly what I need. The only issue is that the StartDate is not always accurate in that it does not automatically include the very first StartDate of the period. For example, if this is the original output (desired StartDate in orange, desired EndDate in blue)...
environment StartDate EndDate studyID MvmtDate count daysbwn short_period
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
Instead of this...
environment StartDate EndDate studyID MvmtDate count daysbwn short_period
SCHOOL 2007-01-01 2015-01-01 0001 02/13/2009 1 2921 0
SCHOOL 2007-01-01 2012-12-12 0002 12/13/2012 1 2170 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 2 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0
I get this (dates in red represent the problematic start dates):
environment StartDate EndDate studyID MvmtDate count daysbwn short_period
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 1 2921 0
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 1 2170 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 2 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0
Do you have any additional advice on how to potentially modify your code to address this? Thank you again, in advance.
Try this:
proc sort data=have out=want;
by studyID StartDate EndDate;
where daysbwn>0;
run;
data want(drop=_:);
set want end=lastObs;
by studyID environment notsorted;
retain _r_StartDate _r_EndDate;
_r_StartDate=min(StartDate,_r_StartDate);
_r_EndDate=max(EndDate,_r_EndDate);
if lastObs or last.environment then _outFlg=1;
else
do;
_nextRow=_n_+1;
set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
if intck('day',EndDate,_NextStartDate) >1 then _outFlg=1;
end;
if _outFlg=1 then
do;
StartDate=_r_StartDate;
EndDate=_r_EndDate;
output;
call missing(_r_StartDate, _r_EndDate);
end;
run;
Hi,
I just noticed that the codes provided earlier on work for almost all scenarios, but I found a few exceptions shown as sample data below where the code did not combine all periods. (This is probably the first period's enddate is not immediately adjacent to the next period's startdate; there is a day missing in between the two periods because at least one time period <= 24 hours at SCHOOL was dropped. This is fine, but means that I might need a different code to combine these periods.)
data tempfile;
infile datalines truncover;
INPUT
environment$ StartDate yymmdd10.@+1 EndDate yymmdd10.@+1
studyID$4.@+1 MvmtDate mmddyy10.@+1
Movementcode Movecode daysbwn Movementcode2 Movecode2 count ;
format Startdate EndDate MvmtDate yymmdd10.;
datalines;
SCHOOL 2007-01-01 2007-01-09 0001 01/09/2007 3 2 8 . . 1
HOME 2007-01-10 2008-12-01 0001 12/01/2008 1 1 691 3 2 2
HOME 2008-12-03 2009-01-02 0001 01/02/2009 1 1 30 8 2 4
SCHOOL 2009-01-03 2009-05-04 0001 05/04/2009 3 2 121 1 1 5
HOME 2007-01-01 2007-07-25 0002 07/25/2007 1 1 205 . . 1
SCHOOL 2007-07-26 2008-02-22 0002 02/22/2008 3 2 211 1 1 2
HOME 2008-02-23 2010-06-11 0002 06/11/2010 1 1 839 3 2 3
HOME 2010-06-13 2015-01-01 0002 06/12/2010 8 2 1663 8 2 5
;
run;
proc print data=tempfile; run;
I would like for the adjacent time periods that take place in the same environment (in red text) to be combined to look like this:
SCHOOL 2007-01-01 2007-01-09 0001 01/09/2007 3 2 8 . . 1
HOME 2007-01-10 2009-01-02 0001 01/02/2009 1 1 721 3 2 2
SCHOOL 2009-01-03 2009-05-04 0001 05/04/2009 3 2 121 1 1 5
HOME 2007-01-01 2007-07-25 0002 07/25/2007 1 1 205 . . 1
SCHOOL 2007-07-26 2008-02-22 0002 02/22/2008 3 2 211 1 1 2
HOME 2008-02-23 2015-01-01 0002 06/12/2010 1 1 2502 3 2 3
Specifically (and most importantly), I would like to combine those periods using the startdate from the first period and the enddate from the last period.
If possible, I would also like the Movementcode2 and Movecode2 variables from the first period to be the ones applied to the combined period.
Less importantly are the daysbwn and count variables because I can easily just recode those myself. Movementcode and Movecode can be dropped.
Is there a code that can help me do this systematically and properly? It could be a modification to the code provided previously, or a second code I can apply after the initial code is applied.
Thank you very much in advance,
Kelsey
In response to my recent follow-up question, it looks like the code previously provided by Shmuel will do the trick, with one small modification:
data have;
infile datalines truncover;
INPUT environment $ StartDate yymmdd10. @+1 EndDate yymmdd10.
@+1 studyID $4. @+1 MvmtDate $10. count daysbwn short_period ;
format Startdate EndDate ddmmyy10.;
datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
; run;
/*
data temp;
set have;
if StartDate > EndDate then delete; /* ignore "invalid" periods */
run;
*/
*I preferred using this code to do the initial drop of invalid periods, but it should more or less do the same thing as the commented code above*;
data temp;
set have
where daysbwn>0; /* ignore "invalid" periods */
run;
*IMPORTANT CODE STARTS HERE*;
proc sort data=temp; by studyID StartDate ; run;
data want(keep= env stdID StartDT endDT daysTot cnt shortPD
rename = (env = environment stdID = studyID startDT = StartDate endDT = EndDate
daystot = daysbwn cnt = count shortPD = short_period));
set temp;
by studyID;
retain env stdID StartDT endDT daysTot cnt shortPD;
format startDT endDT ddmmyy10.;
if first.studyID then do;
env = environment ;
stdID = studyID ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = 1;
end;
else do;
if environment = env and
StartDate - endDT in (0, 1, 2) then do;
*I added the "2" here and it seems to have fixed the problem where adjacent time periods in the same environment were not combined because they had a <= 24-hour period in between them in the other environment*;
endDT = EndDate;
daysTot = sum(of daysTot, daysbwn);
end; else do;
output;
env = environment ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = cnt + 1;
end;end;
if last.studyID then output;
run;
Thank you, Shmuel!
data want (drop=nxt_: total_: I grpsize);
do grpsize=1 to 1000;
merge have (drop=count)
have (firstobs=2 keep=studyid environment startdate
rename=(studyid=nxt_i environment=nxt_e startdate=nxt_s));
total_days=sum(total_days,daysbwn);
if not(studyid=nxt_i and environment=nxt_e and
nxt_s=enddate+1 and environment='SCHOOL') then leave;
end;
daysbwn=total_days;
if studyid^=lag(studyid) then count=1;
else count+1;
do I=1 to grpsize;
set have (drop=count daysbwn mvmtdate enddate);
if I=1 then output;
end;
run;
This program has two loops in the data step The first loop gets the grpsize and calculates the totaldays. The second loop re-read the same records, but outputs only the first in the group.
Notes:
This time I run my code and got the desired output:
data have;
infile datalines truncover;
INPUT environment $ StartDate yymmdd10. @+1 EndDate yymmdd10.
@+1 studyID $4. @+1 MvmtDate $10. count daysbwn short_period ;
format Startdate EndDate ddmmyy10.;
datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
; run;
data temp;
set have;
if StartDate > EndDate then delete; /* ignore "invalid" periods */
run;
proc sort data=temp; by studyID StartDate ; run;
data want(keep= env stdID StartDT endDT daysTot cnt shortPD
rename = (env = environment stdID = studyID startDT = StartDate endDT = EndDate
daystot = daysbwn cnt = count shortPD = short_period));
set temp;
by studyID;
retain env stdID StartDT endDT daysTot cnt shortPD;
format startDT endDT ddmmyy10.;
if first.studyID then do;
env = environment ;
stdID = studyID ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = 1;
end;
else do;
if environment = env and
StartDate - endDT in (0, 1) then do;
endDT = EndDate;
daysTot = sum(of daysTot, daysbwn);
end; else do;
output;
env = environment ;
startDT = StartDate;
endDT = EndDate;
daysTot = daysbwn;
shortPD = short_period;
cnt = cnt + 1;
end;end;
if last.studyID then output;
run;
Thank you so much!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.