BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
NewUsrStat
Pyrite | Level 9

Hi guys, 

I have two data sets and, for each ID, I have to split each row of the first including the intervals of the second. Furthermore, while doing this, I need to divide the values in the Count variable of the first data set by the days in the new interval but only when the period refers to the first data set and not to the second because in this last case I will have a Label. Can anyone help me please? Here the data set and desired output:

 

data work;

  input ID :$20. Start :date9. End :date9. Counts :$20.;

  format start end yymmdd10.;

cards;

0001 01JAN2015 18JAN2015 32499

0001 19JAN2015 30APR2015 23458

0001 01MAY2015 13JAN2016 987

0002 06FEB2019 15FEB2019 3567

0002 16FEB2019 31DEC2019 1254

0003 01JAN2020 18MAR2020 652

0003 19MAR2020 02APR2020 765

0003 03APR2020 31DEC2020 200

 

data vacations;

  input ID :$20. Start :date9. End :date9. Label :$20.;

  format start end yymmdd10.;

cards;

0001 03JAN2015 10JAN2015 A   

0001 21JAN2015 21JAN2015 B

0001 03MAY2015 09JAN2016 C

0002 08FEB2019 08FEB2019 B

0002 09FEB2019 15FEB2019 A

0003 20FEB2020 25FEB2020 C

0003 21MAR2020 21MAR2020 B

0003 04JAN2020 10JAN2020 A

;

 

data want;

  input ID :$20. Start :date9. End :date9. Label :$20. Counts_resized :$20.;

  format start end yymmdd10.;

cards;

0001 01JAN2015 02JAN2015 ?

0001 03JAN2015 10JAN2015 A

0001 11JAN2015 18JAN2015 ?

0001 19JAN2015 20JAN2015 ?

0001 21JAN2015 21JAN2015 B

0001 22JAN2015 30APR2015 ?

0001 01MAY2015 02MAY2015 ?

0001 03MAY2015 09JAN2016 C

0001 10MAY2016 13JAN2016 ?

………….

;

Thank you in advance

 

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

By "split dates", you apparently mean you want to create one observation per time period, where a period is either.

  1. The time span of a vacation, which will keep the label of the vacation.
  2. Residual contiguous components of a WORK time span, which will be assigned a label of "?".

In particular you do not want to join the vacationless end of one work span with the vacationless start of the next work span.

 

You didn't say so, but I guess you want to allocate COUNTS value evenly for each day in the work span.  And then, for the time periods above, add up those daily counts values.

 

I think the best strategy for this is to create a daily COUNTS history, in the form of an ARRAY, for the entire date range of the ID.

 

Also a daily LABELS history array, initialized with daily values of "?" for the first WORK observation, "!" for the second WORK observation, then back to "?" for the third etc.  Then update the LABELS history with the labels from the VACATIONS dataset.

 

Once the histories are created, loop through them (from _min_date to _max_date) to create the new updated time spans.

 

In a test, I tweaked some of your data to consistently have start<=end. And to make vacations not overlap each other.

 


/* Use these macrovars to establish the needed upper/lower limits of the HISTORY arrays */
%let beg_study=01jan2015;
%let end_study=01jan2021;  /* Actually one day AFTER end of study */

data want (drop=_:);
  /* For each ID, read WORK first, then VACATIONS*/
  set work (in=inw)  vacations (in=inv) ;
  by id;

  array _count_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)}    _temporary_;
  array _label_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} $1 _temporary_;
  format _min_date _max_date _d date9.;

  if first.id then do;
    call missing(of _count_history{*},of _label_history{*},_max_date);
    _min_date=Start;
    _initial_label='?';
  end;
  retain _min_date _max_date _initial_label;

  if inw then do;
    _count_per_day=counts/(1+End-Start);
    _max_date=End;
    do _d=Start to End;  
      _count_history{_d}=_count_per_day;
      _label_history{_d}=_initial_label;
    end;
    if _initial_label='?' then _initial_label='!';
    else _initial_label='?';
  end;
  else if inv then do _d=Start to End;
      _label_history{_d}=Label;
  end;

  if last.id;
  do Start=_min_date by 0 until (Start>_max_date);
    label=_label_history{Start};
    counts_resized=0;
    do End=Start by 1 until (_label_history{End+1}^=label);
      counts_resized=counts_resized+_count_history{End};
    end;
    if label='!' then label='?';
    output;
    Start=End+1;
  end;

run;

 

This program assumes that the first WORK observation is the earliest, and the last WORK obs is the latest.  It actually doesn't matter whether the intermediate WORK obs are also sorted by date.  And it doesn't matter whether VACATIONS are sorted within an ID.

 

And if vacations cross work boundaries, this program keeps the vacation as a single time span.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

6 REPLIES 6
ballardw
Super User

Your want doesn't show an example of the "need to divide the values in the Count variable of the first data set by the days in the new interval".

 

I think that you need to confirm that this is correct in the Work set:

0003 19MAR2020 02JAN2020 765
0003 02JAN2020 31DEC2020 200

First the 02JAN2020 "end" date is considerably before the "start" which means the "interval" is a negative number of days.

Then the 02JAN2020 starts on the same day the previous "ended"  but considerably overlaps the first interval for ID 0003

 

You have similar issues in the Vacation set where the "end" is before the "start" and then you have a very significant overlap from Jan2020 to Aug2020 that overlaps the first 2.

0003 20FEB2020 10FEB2020 C
0003 21MAR2020 21MAR2020 B
0003 04JAN2020 09AUG2020 A

You have to tell use how you intend/expect the "overlap" intervals to be compared and what the end before start (generally a bad thing) means and how that impacts this process.

 

What are we supposed to output for the ?

 

Your Want data set should address all the values of the example data. You have only done Id=0001

 

 

NewUsrStat
Pyrite | Level 9
Sorry, I made errors while typing. I edited. ? means the split of for example 32.499 based on the time periods (days) 01JAN2015 02JAN2015 and 11JAN2015 18JAN2015 in want data set. 18JAN2015 is the end date 32.499 refers to in the work dataset. I only reported the output relative to 0001 for simplicity.
mkeintz
PROC Star

By "split dates", you apparently mean you want to create one observation per time period, where a period is either.

  1. The time span of a vacation, which will keep the label of the vacation.
  2. Residual contiguous components of a WORK time span, which will be assigned a label of "?".

In particular you do not want to join the vacationless end of one work span with the vacationless start of the next work span.

 

You didn't say so, but I guess you want to allocate COUNTS value evenly for each day in the work span.  And then, for the time periods above, add up those daily counts values.

 

I think the best strategy for this is to create a daily COUNTS history, in the form of an ARRAY, for the entire date range of the ID.

 

Also a daily LABELS history array, initialized with daily values of "?" for the first WORK observation, "!" for the second WORK observation, then back to "?" for the third etc.  Then update the LABELS history with the labels from the VACATIONS dataset.

 

Once the histories are created, loop through them (from _min_date to _max_date) to create the new updated time spans.

 

In a test, I tweaked some of your data to consistently have start<=end. And to make vacations not overlap each other.

 


/* Use these macrovars to establish the needed upper/lower limits of the HISTORY arrays */
%let beg_study=01jan2015;
%let end_study=01jan2021;  /* Actually one day AFTER end of study */

data want (drop=_:);
  /* For each ID, read WORK first, then VACATIONS*/
  set work (in=inw)  vacations (in=inv) ;
  by id;

  array _count_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)}    _temporary_;
  array _label_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} $1 _temporary_;
  format _min_date _max_date _d date9.;

  if first.id then do;
    call missing(of _count_history{*},of _label_history{*},_max_date);
    _min_date=Start;
    _initial_label='?';
  end;
  retain _min_date _max_date _initial_label;

  if inw then do;
    _count_per_day=counts/(1+End-Start);
    _max_date=End;
    do _d=Start to End;  
      _count_history{_d}=_count_per_day;
      _label_history{_d}=_initial_label;
    end;
    if _initial_label='?' then _initial_label='!';
    else _initial_label='?';
  end;
  else if inv then do _d=Start to End;
      _label_history{_d}=Label;
  end;

  if last.id;
  do Start=_min_date by 0 until (Start>_max_date);
    label=_label_history{Start};
    counts_resized=0;
    do End=Start by 1 until (_label_history{End+1}^=label);
      counts_resized=counts_resized+_count_history{End};
    end;
    if label='!' then label='?';
    output;
    Start=End+1;
  end;

run;

 

This program assumes that the first WORK observation is the earliest, and the last WORK obs is the latest.  It actually doesn't matter whether the intermediate WORK obs are also sorted by date.  And it doesn't matter whether VACATIONS are sorted within an ID.

 

And if vacations cross work boundaries, this program keeps the vacation as a single time span.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
NewUsrStat
Pyrite | Level 9

Yes, basically I have to assign counts only to days out of vacations, i.e. working days. I think you addressed the point. Thank you for your help!

NewUsrStat
Pyrite | Level 9
Thank you dottor mkeintz. Your code works perfectly but counts are calculated also for vacations that is what is not desirable. In other words the counts must be 0 where the label is not "?" and the total, for each id must be scaled by the rows where the label is "?". I tried to put an if statement if(label ="?") then _count_per_day=counts/(1+End-Start) etc but it only sets 0 where the label is not "?" without scaling the counts on the days corresponding to "?".
Ksharp
Super User
data work;
  input ID :$20. Start :date9. End :date9. Counts ;
  count_per_day=counts/(1+End-Start);
  format start end yymmdd10.;
cards;
0001 01JAN2015 18JAN2015 32499
0001 19JAN2015 30APR2015 23458
0001 01MAY2015 13JAN2016 987
0002 06FEB2019 15FEB2019 3567
0002 16FEB2019 31DEC2019 1254
0003 01JAN2020 18MAR2020 652
0003 19MAR2020 02APR2020 765
0003 03APR2020 31DEC2020 200
;
data vacations;
  input ID :$20. Start :date9. End :date9. Label :$20.;
  format start end yymmdd10.;
cards;
0001 03JAN2015 10JAN2015 A   
0001 21JAN2015 21JAN2015 B
0001 03MAY2015 09JAN2016 C
0002 08FEB2019 08FEB2019 B
0002 09FEB2019 15FEB2019 A
0003 20FEB2020 25FEB2020 C
0003 21MAR2020 21MAR2020 B
0003 04JAN2020 10JAN2020 A
;
proc summary data=work;
by id;
var start end;
output out=temp(drop=_:) min(start)=start max(end)=end;
run;
data all;
 set temp;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
 keep id date ;
run;
data count_per_day;
 set work;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
 keep id date count_per_day;
run;
data k(rename=(end=date));
 set work(keep=id End);
run;
data key(index=(x=(id date)));
 set vacations;
 do date=start to end;
  output;
 end;
 format date yymmdd10.;
 keep id date Label;
run;
data all2;
if _n_=1 then do;
 if 0 then set count_per_day;
 declare hash c(dataset:'count_per_day');
 c.definekey('id','date');
 c.definedata('count_per_day');
 c.definedone();
 if 0 then set k;
 declare hash h(dataset:'k');
 h.definekey('id','date');
 h.definedone();
end;
 merge all key;
 by id date;
 call missing(count_per_day);
 rc=c.find();
 if h.check()=0 then label2='X';
 drop rc;
run;
data all3;
 set all2;
 if id ne lag(id) or label ne lag(label) or lag(label2)='X' then group+1;
run;
proc sql;
create table want as
select id,min(date) as start format=date9.,max(date) as end format=date9.,max(count_per_day) as count_per_day,max(label) as label
 from all3
  group by id,group;
quit;
data want;
 set want;
 if not missing(label) then count_per_day=0;
run;

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 6 replies
  • 2285 views
  • 2 likes
  • 4 in conversation