BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Kels123
Quartz | Level 8

I am interested in studying subjects in their home environment. I have a database of time periods that looks like this (sample data):

 

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0

HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0

SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0

HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1

SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0

HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0

 

I simplified it to look like this because I decided that periods of time at home that last less than one full day (daysbwn=0) are not meaningful to this study:

 

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0

SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0

SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0

 

Now I would like to effectively combine periods at school, whenever possible so that I can use one big period of time as opposed to periods of time at school fragmented by these <1 day periods of time at home. Basically, I would like to create a database that looks like this:

 

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2015-01-01 0001 02/13/2009 1 2921 0

SCHOOL 2007-01-01 2012-12-12 0002 12/13/2012 1 2170 0

HOME 2012-12-13 2014-01-01 0002 01/02/2014 2 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0

 

The part I am struggling with is how to combine the time periods like this while ended up with the correct environment, StartDate, and EndDate variables. (MvmtDate is more or less irrelevant but I have left it in the data and sample data for tracking purposes.)

 

Thanks very much in advance!

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

Try this:


proc sort data=have out=want;
  by studyID StartDate EndDate;
  where daysbwn>0;
run;

data want(drop=_:);
  set want end=lastObs;
  by studyID environment notsorted;

  retain _r_StartDate _r_EndDate;
  _r_StartDate=min(StartDate,_r_StartDate);
  _r_EndDate=max(EndDate,_r_EndDate);

  if lastObs or last.environment then _outFlg=1;
  else 
    do;
      _nextRow=_n_+1;
      set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
      if intck('day',EndDate,_NextStartDate) >1 then _outFlg=1;
    end;

  if _outFlg=1 then
    do;
      StartDate=_r_StartDate;
      EndDate=_r_EndDate;
      output;
      call missing(_r_StartDate, _r_EndDate);
    end;
run;

View solution in original post

9 REPLIES 9
Shmuel
Garnet | Level 18

In the next HOME lines the strat_date is greater then the end_date:

HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1

HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1

HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1

 

that gives the idea to filter lines where start_date > end_date.

 

The next step is to check:

    is envirinment and studyID are same as on previous line and start_date - lag(end_dtae) in (0, 1)

    then combine the two periods into one big period - BUT:

How to treat the other variables:  count daysbwn short_period - use: min / max/  sum / first / last or other function ?

The daysbwn - seems you have SUMed. 

The count -  seems you saved the first or the MIN value or calulated ?

Is short_period  -  always = 0 ?

 

Try next code: 

 

data temp;

  set have;

        if start_date > end_date then delete;

run;

proc sort data=temp; by environment  studyID StartDate ; run;

data want(keep= env stdID StartDT daysTot cnt

                  rename = (env = environment   stdID = studyID  startDT = start_date end_dt = end_date

                                     days_tot = daysbwn cnt = count));

 set temp;

   by environment  studyID ;

        retain env stdID StartDT endDT daysTot cnt;

  if first.studyID  then do;

     env = environment  ;

     stdID = studyID ;

     startDT start_date;

     endDT = end_date;

     daysTot = daysbwn;

     If env = "SCHOOL" then cnt = 1;     /* this line was edited */

  end;

  else do;

     if  start_date -  endDT in (0, 1) then  do;

         endDT = end_date;

         daysTot = sum(of daysTot, daysbwn);

     end; else do;

              output;              

             startDT start_date;

             endDT = end_date;

             daysTot = daysbwn;

             cnt = cnt + 1;

  end;

 if last.studyID  then output;

run;

 

Patrick
Opal | Level 21

Please provide in the future a SAS datastep creating sample data so that we don't have to do the work.

 

Below code selects the start and end dates and only outputs a single record for consecutive dates. I haven't added the calculations for the other variables but this shouldn't be too hard to add.

data have;
  infile datalines truncover dlm=' ';
  input 
    environment :$9.
    (StartDate EndDate) (:yymmdd.)
    studyID 
    MvmtDate :mmddyy.
    count 
    daysbwn 
    short_period;

  format
    StartDate EndDate MvmtDate date9.;

  datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
;
run;

proc sort data=have out=want;
  by studyID StartDate;
  where daysbwn>0;
run;

data want(drop=_:);
  set want end=lastObs;
  by studyID environment notsorted;

  if lastObs or last.environment then output;
  else 
    do;
      _nextRow=_n_+1;
      set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
      if intck('day',EndDate,_NextStartDate) ne 1 then output;
    end;
run;
Kels123
Quartz | Level 8

Thank you so much for sending this code (sorry for creating more work for you than necessary)! It is almost exactly what I need. The only issue is that the StartDate is not always accurate in that it does not automatically include the very first StartDate of the period. For example, if this is the original output (desired StartDate in orange, desired EndDate in blue)...

 

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0

SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0

SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0

HOME     2012-12-13 2014-01-01 0002 01/02/2014 6 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0

 

Instead of this...

 

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2007-01-01 2015-01-01 0001 02/13/2009 1 2921 0

SCHOOL 2007-01-01 2012-12-12 0002 12/13/2012 1 2170 0

HOME     2012-12-13 2014-01-01 0002 01/02/2014 2 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0

 

I get this (dates in red represent the problematic start dates):

 

environment StartDate EndDate studyID MvmtDate count daysbwn short_period

SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 1 2921 0

SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 1 2170 0

HOME     2012-12-13 2014-01-01 0002 01/02/2014 2 384 0

SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 3 365 0

 

Do you have any additional advice on how to potentially modify your code to address this? Thank you again, in advance.

Patrick
Opal | Level 21

Try this:


proc sort data=have out=want;
  by studyID StartDate EndDate;
  where daysbwn>0;
run;

data want(drop=_:);
  set want end=lastObs;
  by studyID environment notsorted;

  retain _r_StartDate _r_EndDate;
  _r_StartDate=min(StartDate,_r_StartDate);
  _r_EndDate=max(EndDate,_r_EndDate);

  if lastObs or last.environment then _outFlg=1;
  else 
    do;
      _nextRow=_n_+1;
      set want(keep=StartDate rename=(StartDate=_NextStartDate)) point=_nextRow nobs=_nobs;
      if intck('day',EndDate,_NextStartDate) >1 then _outFlg=1;
    end;

  if _outFlg=1 then
    do;
      StartDate=_r_StartDate;
      EndDate=_r_EndDate;
      output;
      call missing(_r_StartDate, _r_EndDate);
    end;
run;
Kels123
Quartz | Level 8

Hi,

 

I just noticed that the codes provided earlier on work for almost all scenarios, but I found a few exceptions shown as sample data below where the code did not combine all periods. (This is probably the first period's enddate is not immediately adjacent to the next period's startdate; there is a day missing in between the two periods because at least one time period <= 24 hours at SCHOOL was dropped. This is fine, but means that I might need a different code to combine these periods.)

 

data tempfile;

infile datalines truncover;

INPUT

environment$ StartDate yymmdd10.@+1 EndDate yymmdd10.@+1

studyID$4.@+1 MvmtDate mmddyy10.@+1

Movementcode Movecode daysbwn Movementcode2 Movecode2 count ;

format Startdate EndDate MvmtDate yymmdd10.;

datalines;

SCHOOL 2007-01-01 2007-01-09 0001 01/09/2007 3 2 8 . . 1

HOME 2007-01-10 2008-12-01 0001 12/01/2008 1 1 691 3 2 2

HOME 2008-12-03 2009-01-02 0001 01/02/2009 1 1 30 8 2 4

SCHOOL 2009-01-03 2009-05-04 0001 05/04/2009 3 2 121 1 1 5

HOME 2007-01-01 2007-07-25 0002 07/25/2007 1 1 205 . . 1

SCHOOL 2007-07-26 2008-02-22 0002 02/22/2008 3 2 211 1 1 2

HOME 2008-02-23 2010-06-11 0002 06/11/2010 1 1 839 3 2 3

HOME 2010-06-13 2015-01-01 0002 06/12/2010 8 2 1663 8 2 5

;

run;

proc print data=tempfile; run;

 

I would like for the adjacent time periods that take place in the same environment (in red text) to be combined to look like this:

 

SCHOOL 2007-01-01 2007-01-09 0001 01/09/2007 3 2 8 . . 1

HOME 2007-01-10 2009-01-02 0001 01/02/2009 1 1 721 3 2 2

SCHOOL 2009-01-03 2009-05-04 0001 05/04/2009 3 2 121 1 1 5

HOME 2007-01-01 2007-07-25 0002 07/25/2007 1 1 205 . . 1

SCHOOL 2007-07-26 2008-02-22 0002 02/22/2008 3 2 211 1 1 2

HOME 2008-02-23 2015-01-01 0002 06/12/2010 1 1 2502 3 2 3

 

Specifically (and most importantly), I would like to combine those periods using the startdate from the first period and the enddate from the last period.

If possible, I would also like the Movementcode2 and Movecode2 variables from the first period to be the ones applied to the combined period.

Less importantly are the daysbwn and count variables because I can easily just recode those myself. Movementcode and Movecode can be dropped.

 

Is there a code that can help me do this systematically and properly? It could be a modification to the code provided previously, or a second code I can apply after the initial code is applied.

 

Thank you very much in advance,

 

Kelsey

 

 

Kels123
Quartz | Level 8

In response to my recent follow-up question, it looks like the code previously provided by Shmuel will do the trick, with one small modification:

 

data have;
infile datalines truncover;
INPUT environment $ StartDate yymmdd10. @+1 EndDate yymmdd10.
@+1 studyID $4. @+1 MvmtDate $10. count daysbwn short_period ;
format Startdate EndDate ddmmyy10.;
datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
; run;

 

/*

data temp;
  set have;
        if StartDate > EndDate then delete;   /* ignore "invalid" periods */
run;

*/

*I preferred using this code to do the initial drop of invalid periods, but it should more or less do the same thing as the commented code above*; 

data temp;

set have

where daysbwn>0;  /* ignore "invalid" periods */

run;

 

*IMPORTANT CODE STARTS HERE*;

 

proc sort data=temp; by studyID StartDate ; run;

 

data want(keep= env stdID StartDT endDT daysTot cnt shortPD
            rename = (env = environment stdID = studyID startDT = StartDate endDT = EndDate
                              daystot = daysbwn cnt = count shortPD = short_period));
  set temp;
    by studyID;
        retain env stdID StartDT endDT daysTot cnt shortPD;

       format startDT endDT ddmmyy10.;


        if first.studyID then do;
           env = environment ;
           stdID = studyID ;
           startDT = StartDate;
          endDT = EndDate;
          daysTot = daysbwn;
          shortPD = short_period;
          cnt = 1;        
     end;

    else do;
         if environment = env and
            StartDate - endDT in (0, 1, 2) then do;   

*I added the "2" here and it seems to have fixed the problem where adjacent time periods in the same environment were not combined because they had a <= 24-hour period in between them in the other environment*;


            endDT = EndDate;
            daysTot = sum(of daysTot, daysbwn);
       end; else do;
            output;
            env = environment ;
            startDT = StartDate;
            endDT = EndDate;
            daysTot = daysbwn;
            shortPD = short_period;
           cnt = cnt + 1;
   end;end;
   if last.studyID then output;
run;

 

Thank you, Shmuel!

mkeintz
PROC Star

data want (drop=nxt_: total_: I grpsize);

  do grpsize=1 to 1000;
    merge have (drop=count)
          have (firstobs=2 keep=studyid environment startdate
            rename=(studyid=nxt_i environment=nxt_e  startdate=nxt_s));
    total_days=sum(total_days,daysbwn);
    if not(studyid=nxt_i and environment=nxt_e and
           nxt_s=enddate+1 and environment='SCHOOL') then leave;
  end;

  daysbwn=total_days;
  if studyid^=lag(studyid) then count=1;
  else count+1;

  do I=1 to grpsize;
    set have (drop=count daysbwn mvmtdate enddate);
    if I=1 then output;
  end;
run;

This program has two loops in the data step  The first loop gets the grpsize and calculates the totaldays. The second loop re-read the same records, but outputs only the first in the group.

 

 

Notes:

  1. The MERGE in the first loop merges HAVE with itself.  The "right-hand" HAVE starts at observation 2, providing a way to know when the current record is not part of the same group as the next record.  During that loop TOTALDAYS is accumulated.

  2. The "do grpsie=1 to 1000;  I use 1000 as a number that should exceed any record count for a single group.  Inside the loop is an IF test the leaves the loop when the condiition is met.

  3. Between the loops
    1. DAYSBWN is set to totaldays.
    2. COUNT is reset to 1 when a new studyid is started, otherwise it's just incremented.
  4. In the second loop the SET statement does not include the variables daysbwn mvmtdate enddate which means those values will be taken from the end-of-group record read in by the first loop.  But it does take all the other vars which means that they come from the first record in a group.
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
Shmuel
Garnet | Level 18

This time I run my code and got the desired output:

 

data have;
infile datalines truncover;
INPUT environment $ StartDate yymmdd10. @+1 EndDate yymmdd10.
@+1 studyID $4. @+1 MvmtDate $10. count daysbwn short_period ;
format Startdate EndDate ddmmyy10.;
datalines;
SCHOOL 2007-01-01 2009-02-12 0001 02/13/2009 1 773 0
HOME 2009-02-13 2009-02-12 0001 02/13/2009 2 0 1
SCHOOL 2009-02-13 2015-01-01 0001 02/13/2009 3 2148 0
SCHOOL 2007-01-01 2010-04-05 0002 04/06/2010 1 1190 0
HOME 2010-04-06 2010-04-05 0002 04/06/2010 2 0 1
SCHOOL 2010-04-06 2010-07-05 0002 07/06/2010 3 90 0
HOME 2010-07-06 2010-07-05 0002 07/06/2010 4 0 1
SCHOOL 2010-07-06 2012-12-12 0002 12/13/2012 5 890 0
HOME 2012-12-13 2014-01-01 0002 01/02/2014 6 384 0
SCHOOL 2014-01-02 2015-01-01 0002 01/02/2014 7 365 0
; run;

 

data temp;
  set have;
        if StartDate > EndDate then delete;   /* ignore "invalid" periods */
run;

 

proc sort data=temp; by studyID StartDate ; run;

 

data want(keep= env stdID StartDT endDT daysTot cnt shortPD
            rename = (env = environment stdID = studyID startDT = StartDate endDT = EndDate
                              daystot = daysbwn cnt = count shortPD = short_period));
  set temp;
    by studyID;
        retain env stdID StartDT endDT daysTot cnt shortPD;

       format startDT endDT ddmmyy10.;


        if first.studyID then do;
           env = environment ;
           stdID = studyID ;
           startDT = StartDate;
          endDT = EndDate;
          daysTot = daysbwn;
          shortPD = short_period;
          cnt = 1;        
     end;

    else do;
         if environment = env and
            StartDate - endDT in (0, 1) then do;
            endDT = EndDate;
            daysTot = sum(of daysTot, daysbwn);
       end; else do;
            output;
            env = environment ;
            startDT = StartDate;
            endDT = EndDate;
            daysTot = daysbwn;
            shortPD = short_period;
           cnt = cnt + 1;
   end;end;
   if last.studyID then output;
run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 9 replies
  • 2705 views
  • 2 likes
  • 4 in conversation