Solved: Split dates and update values of a variable

NewUsrStat · Posted 08-22-2022 05:04 PM

Hi guys,

I have two data sets and, for each ID, I have to split each row of the first including the intervals of the second. Furthermore, while doing this, I need to divide the values in the Count variable of the first data set by the days in the new interval but only when the period refers to the first data set and not to the second because in this last case I will have a Label. Can anyone help me please? Here the data set and desired output:

data work;

input ID :$20. Start :date9. End :date9. Counts :$20.;

format start end yymmdd10.;

cards;

0001 01JAN2015 18JAN2015 32499

0001 19JAN2015 30APR2015 23458

0001 01MAY2015 13JAN2016 987

0002 06FEB2019 15FEB2019 3567

0002 16FEB2019 31DEC2019 1254

0003 01JAN2020 18MAR2020 652

0003 19MAR2020 02APR2020 765

0003 03APR2020 31DEC2020 200

data vacations;

input ID :$20. Start :date9. End :date9. Label :$20.;

format start end yymmdd10.;

cards;

0001 03JAN2015 10JAN2015 A

0001 21JAN2015 21JAN2015 B

0001 03MAY2015 09JAN2016 C

0002 08FEB2019 08FEB2019 B

0002 09FEB2019 15FEB2019 A

0003 20FEB2020 25FEB2020 C

0003 21MAR2020 21MAR2020 B

0003 04JAN2020 10JAN2020 A

;

data want;

input ID :$20. Start :date9. End :date9. Label :$20. Counts_resized :$20.;

format start end yymmdd10.;

cards;

0001 01JAN2015 02JAN2015 ?

0001 03JAN2015 10JAN2015 A

0001 11JAN2015 18JAN2015 ?

0001 19JAN2015 20JAN2015 ?

0001 21JAN2015 21JAN2015 B

0001 22JAN2015 30APR2015 ?

0001 01MAY2015 02MAY2015 ?

0001 03MAY2015 09JAN2016 C

0001 10MAY2016 13JAN2016 ?

………….

;

Thank you in advance

mkeintz · Posted 08-23-2022 01:53 AM

By "split dates", you apparently mean you want to create one observation per time period, where a period is either.

The time span of a vacation, which will keep the label of the vacation.
Residual contiguous components of a WORK time span, which will be assigned a label of "?".

In particular you do not want to join the vacationless end of one work span with the vacationless start of the next work span.

You didn't say so, but I guess you want to allocate COUNTS value evenly for each day in the work span. And then, for the time periods above, add up those daily counts values.

I think the best strategy for this is to create a daily COUNTS history, in the form of an ARRAY, for the entire date range of the ID.

Also a daily LABELS history array, initialized with daily values of "?" for the first WORK observation, "!" for the second WORK observation, then back to "?" for the third etc. Then update the LABELS history with the labels from the VACATIONS dataset.

Once the histories are created, loop through them (from _min_date to _max_date) to create the new updated time spans.

In a test, I tweaked some of your data to consistently have start<=end. And to make vacations not overlap each other.


/* Use these macrovars to establish the needed upper/lower limits of the HISTORY arrays */
%let beg_study=01jan2015;
%let end_study=01jan2021;  /* Actually one day AFTER end of study */

data want (drop=_:);
  /* For each ID, read WORK first, then VACATIONS*/
  set work (in=inw)  vacations (in=inv) ;
  by id;

  array _count_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)}    _temporary_;
  array _label_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} $1 _temporary_;
  format _min_date _max_date _d date9.;

  if first.id then do;
    call missing(of _count_history{*},of _label_history{*},_max_date);
    _min_date=Start;
    _initial_label='?';
  end;
  retain _min_date _max_date _initial_label;

  if inw then do;
    _count_per_day=counts/(1+End-Start);
    _max_date=End;
    do _d=Start to End;  
      _count_history{_d}=_count_per_day;
      _label_history{_d}=_initial_label;
    end;
    if _initial_label='?' then _initial_label='!';
    else _initial_label='?';
  end;
  else if inv then do _d=Start to End;
      _label_history{_d}=Label;
  end;

  if last.id;
  do Start=_min_date by 0 until (Start>_max_date);
    label=_label_history{Start};
    counts_resized=0;
    do End=Start by 1 until (_label_history{End+1}^=label);
      counts_resized=counts_resized+_count_history{End};
    end;
    if label='!' then label='?';
    output;
    Start=End+1;
  end;

run;

This program assumes that the first WORK observation is the earliest, and the last WORK obs is the latest. It actually doesn't matter whether the intermediate WORK obs are also sorted by date. And it doesn't matter whether VACATIONS are sorted within an ID.

And if vacations cross work boundaries, this program keeps the vacation as a single time span.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

ballardw · Posted 08-22-2022 05:28 PM

Your want doesn't show an example of the "need to divide the values in the Count variable of the first data set by the days in the new interval".

I think that you need to confirm that this is correct in the Work set:

0003 19MAR2020 02JAN2020 765
0003 02JAN2020 31DEC2020 200

First the 02JAN2020 "end" date is considerably before the "start" which means the "interval" is a negative number of days.

Then the 02JAN2020 starts on the same day the previous "ended" but considerably overlaps the first interval for ID 0003

You have similar issues in the Vacation set where the "end" is before the "start" and then you have a very significant overlap from Jan2020 to Aug2020 that overlaps the first 2.

0003 20FEB2020 10FEB2020 C
0003 21MAR2020 21MAR2020 B
0003 04JAN2020 09AUG2020 A

You have to tell use how you intend/expect the "overlap" intervals to be compared and what the end before start (generally a bad thing) means and how that impacts this process.

What are we supposed to output for the ?

Your Want data set should address all the values of the example data. You have only done Id=0001

NewUsrStat · Posted 08-23-2022 01:31 AM

Sorry, I made errors while typing. I edited. ? means the split of for example 32.499 based on the time periods (days) 01JAN2015 02JAN2015 and 11JAN2015 18JAN2015 in want data set. 18JAN2015 is the end date 32.499 refers to in the work dataset. I only reported the output relative to 0001 for simplicity.

mkeintz · Posted 08-23-2022 01:53 AM

By "split dates", you apparently mean you want to create one observation per time period, where a period is either.

The time span of a vacation, which will keep the label of the vacation.
Residual contiguous components of a WORK time span, which will be assigned a label of "?".

In particular you do not want to join the vacationless end of one work span with the vacationless start of the next work span.

You didn't say so, but I guess you want to allocate COUNTS value evenly for each day in the work span. And then, for the time periods above, add up those daily counts values.

I think the best strategy for this is to create a daily COUNTS history, in the form of an ARRAY, for the entire date range of the ID.

Also a daily LABELS history array, initialized with daily values of "?" for the first WORK observation, "!" for the second WORK observation, then back to "?" for the third etc. Then update the LABELS history with the labels from the VACATIONS dataset.

Once the histories are created, loop through them (from _min_date to _max_date) to create the new updated time spans.

In a test, I tweaked some of your data to consistently have start<=end. And to make vacations not overlap each other.


/* Use these macrovars to establish the needed upper/lower limits of the HISTORY arrays */
%let beg_study=01jan2015;
%let end_study=01jan2021;  /* Actually one day AFTER end of study */

data want (drop=_:);
  /* For each ID, read WORK first, then VACATIONS*/
  set work (in=inw)  vacations (in=inv) ;
  by id;

  array _count_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)}    _temporary_;
  array _label_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} $1 _temporary_;
  format _min_date _max_date _d date9.;

  if first.id then do;
    call missing(of _count_history{*},of _label_history{*},_max_date);
    _min_date=Start;
    _initial_label='?';
  end;
  retain _min_date _max_date _initial_label;

  if inw then do;
    _count_per_day=counts/(1+End-Start);
    _max_date=End;
    do _d=Start to End;  
      _count_history{_d}=_count_per_day;
      _label_history{_d}=_initial_label;
    end;
    if _initial_label='?' then _initial_label='!';
    else _initial_label='?';
  end;
  else if inv then do _d=Start to End;
      _label_history{_d}=Label;
  end;

  if last.id;
  do Start=_min_date by 0 until (Start>_max_date);
    label=_label_history{Start};
    counts_resized=0;
    do End=Start by 1 until (_label_history{End+1}^=label);
      counts_resized=counts_resized+_count_history{End};
    end;
    if label='!' then label='?';
    output;
    Start=End+1;
  end;

run;

This program assumes that the first WORK observation is the earliest, and the last WORK obs is the latest. It actually doesn't matter whether the intermediate WORK obs are also sorted by date. And it doesn't matter whether VACATIONS are sorted within an ID.

And if vacations cross work boundaries, this program keeps the vacation as a single time span.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

NewUsrStat · Posted 08-23-2022 05:36 AM

Yes, basically I have to assign counts only to days out of vacations, i.e. working days. I think you addressed the point. Thank you for your help!

NewUsrStat · Posted 09-06-2022 10:36 AM

Thank you dottor mkeintz. Your code works perfectly but counts are calculated also for vacations that is what is not desirable. In other words the counts must be 0 where the label is not "?" and the total, for each id must be scaled by the rows where the label is "?". I tried to put an if statement if(label ="?") then _count_per_day=counts/(1+End-Start) etc but it only sets 0 where the label is not "?" without scaling the counts on the days corresponding to "?".

Ksharp · Posted 09-07-2022 07:43 AM

data work;
  input ID :$20. Start :date9. End :date9. Counts ;
  count_per_day=counts/(1+End-Start);
  format start end yymmdd10.;
cards;
0001 01JAN2015 18JAN2015 32499
0001 19JAN2015 30APR2015 23458
0001 01MAY2015 13JAN2016 987
0002 06FEB2019 15FEB2019 3567
0002 16FEB2019 31DEC2019 1254
0003 01JAN2020 18MAR2020 652
0003 19MAR2020 02APR2020 765
0003 03APR2020 31DEC2020 200
;
data vacations;
  input ID :$20. Start :date9. End :date9. Label :$20.;
  format start end yymmdd10.;
cards;
0001 03JAN2015 10JAN2015 A   
0001 21JAN2015 21JAN2015 B
0001 03MAY2015 09JAN2016 C
0002 08FEB2019 08FEB2019 B
0002 09FEB2019 15FEB2019 A
0003 20FEB2020 25FEB2020 C
0003 21MAR2020 21MAR2020 B
0003 04JAN2020 10JAN2020 A
;
proc summary data=work;
by id;
var start end;
output out=temp(drop=_:) min(start)=start max(end)=end;
run;
data all;
 set temp;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
 keep id date ;
run;
data count_per_day;
 set work;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
 keep id date count_per_day;
run;
data k(rename=(end=date));
 set work(keep=id End);
run;
data key(index=(x=(id date)));
 set vacations;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
 keep id date Label;
run;
data all2;
if _n_=1 then do;
 if 0 then set count_per_day;
 declare hash c(dataset:'count_per_day');
 c.definekey('id','date');
 c.definedata('count_per_day');
 c.definedone();
 if 0 then set k;
 declare hash h(dataset:'k');
 h.definekey('id','date');
 h.definedone();
end;
 merge all key;
 by id date;
 call missing(count_per_day);
 rc=c.find();
 if h.check()=0 then label2='X';
 drop rc;
run;
data all3;
 set all2;
 if id ne lag(id) or label ne lag(label) or lag(label2)='X' then group+1;
run;
proc sql;
create table want as
select id,min(date) as start format=date9.,max(date) as end format=date9.,max(count_per_day) as count_per_day,max(label) as label
 from all3
  group by id,group;
quit;
data want;
 set want;
 if not missing(label) then count_per_day=0;
run;

Split dates and update values of a variable

Re: Split dates and update values of a variable

Re: Split dates and update values of a variable

Re: Split dates and update values of a variable

Re: Split dates and update values of a variable

Re: Split dates and update values of a variable

Re: Split dates and update values of a variable

Re: Split dates and update values of a variable