Hi guys,
I have two data sets and, for each ID, I have to split each row of the first including the intervals of the second. Furthermore, while doing this, I need to divide the values in the Count variable of the first data set by the days in the new interval but only when the period refers to the first data set and not to the second because in this last case I will have a Label. Can anyone help me please? Here the data set and desired output:
data work;
input ID :$20. Start :date9. End :date9. Counts :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 18JAN2015 32499
0001 19JAN2015 30APR2015 23458
0001 01MAY2015 13JAN2016 987
0002 06FEB2019 15FEB2019 3567
0002 16FEB2019 31DEC2019 1254
0003 01JAN2020 18MAR2020 652
0003 19MAR2020 02APR2020 765
0003 03APR2020 31DEC2020 200
data vacations;
input ID :$20. Start :date9. End :date9. Label :$20.;
format start end yymmdd10.;
cards;
0001 03JAN2015 10JAN2015 A
0001 21JAN2015 21JAN2015 B
0001 03MAY2015 09JAN2016 C
0002 08FEB2019 08FEB2019 B
0002 09FEB2019 15FEB2019 A
0003 20FEB2020 25FEB2020 C
0003 21MAR2020 21MAR2020 B
0003 04JAN2020 10JAN2020 A
;
data want;
input ID :$20. Start :date9. End :date9. Label :$20. Counts_resized :$20.;
format start end yymmdd10.;
cards;
0001 01JAN2015 02JAN2015 ?
0001 03JAN2015 10JAN2015 A
0001 11JAN2015 18JAN2015 ?
0001 19JAN2015 20JAN2015 ?
0001 21JAN2015 21JAN2015 B
0001 22JAN2015 30APR2015 ?
0001 01MAY2015 02MAY2015 ?
0001 03MAY2015 09JAN2016 C
0001 10MAY2016 13JAN2016 ?
………….
;
Thank you in advance
By "split dates", you apparently mean you want to create one observation per time period, where a period is either.
In particular you do not want to join the vacationless end of one work span with the vacationless start of the next work span.
You didn't say so, but I guess you want to allocate COUNTS value evenly for each day in the work span. And then, for the time periods above, add up those daily counts values.
I think the best strategy for this is to create a daily COUNTS history, in the form of an ARRAY, for the entire date range of the ID.
Also a daily LABELS history array, initialized with daily values of "?" for the first WORK observation, "!" for the second WORK observation, then back to "?" for the third etc. Then update the LABELS history with the labels from the VACATIONS dataset.
Once the histories are created, loop through them (from _min_date to _max_date) to create the new updated time spans.
In a test, I tweaked some of your data to consistently have start<=end. And to make vacations not overlap each other.
/* Use these macrovars to establish the needed upper/lower limits of the HISTORY arrays */
%let beg_study=01jan2015;
%let end_study=01jan2021; /* Actually one day AFTER end of study */
data want (drop=_:);
/* For each ID, read WORK first, then VACATIONS*/
set work (in=inw) vacations (in=inv) ;
by id;
array _count_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} _temporary_;
array _label_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} $1 _temporary_;
format _min_date _max_date _d date9.;
if first.id then do;
call missing(of _count_history{*},of _label_history{*},_max_date);
_min_date=Start;
_initial_label='?';
end;
retain _min_date _max_date _initial_label;
if inw then do;
_count_per_day=counts/(1+End-Start);
_max_date=End;
do _d=Start to End;
_count_history{_d}=_count_per_day;
_label_history{_d}=_initial_label;
end;
if _initial_label='?' then _initial_label='!';
else _initial_label='?';
end;
else if inv then do _d=Start to End;
_label_history{_d}=Label;
end;
if last.id;
do Start=_min_date by 0 until (Start>_max_date);
label=_label_history{Start};
counts_resized=0;
do End=Start by 1 until (_label_history{End+1}^=label);
counts_resized=counts_resized+_count_history{End};
end;
if label='!' then label='?';
output;
Start=End+1;
end;
run;
This program assumes that the first WORK observation is the earliest, and the last WORK obs is the latest. It actually doesn't matter whether the intermediate WORK obs are also sorted by date. And it doesn't matter whether VACATIONS are sorted within an ID.
And if vacations cross work boundaries, this program keeps the vacation as a single time span.
Your want doesn't show an example of the "need to divide the values in the Count variable of the first data set by the days in the new interval".
I think that you need to confirm that this is correct in the Work set:
0003 19MAR2020 02JAN2020 765 0003 02JAN2020 31DEC2020 200
First the 02JAN2020 "end" date is considerably before the "start" which means the "interval" is a negative number of days.
Then the 02JAN2020 starts on the same day the previous "ended" but considerably overlaps the first interval for ID 0003
You have similar issues in the Vacation set where the "end" is before the "start" and then you have a very significant overlap from Jan2020 to Aug2020 that overlaps the first 2.
0003 20FEB2020 10FEB2020 C 0003 21MAR2020 21MAR2020 B 0003 04JAN2020 09AUG2020 A
You have to tell use how you intend/expect the "overlap" intervals to be compared and what the end before start (generally a bad thing) means and how that impacts this process.
What are we supposed to output for the ?
Your Want data set should address all the values of the example data. You have only done Id=0001
By "split dates", you apparently mean you want to create one observation per time period, where a period is either.
In particular you do not want to join the vacationless end of one work span with the vacationless start of the next work span.
You didn't say so, but I guess you want to allocate COUNTS value evenly for each day in the work span. And then, for the time periods above, add up those daily counts values.
I think the best strategy for this is to create a daily COUNTS history, in the form of an ARRAY, for the entire date range of the ID.
Also a daily LABELS history array, initialized with daily values of "?" for the first WORK observation, "!" for the second WORK observation, then back to "?" for the third etc. Then update the LABELS history with the labels from the VACATIONS dataset.
Once the histories are created, loop through them (from _min_date to _max_date) to create the new updated time spans.
In a test, I tweaked some of your data to consistently have start<=end. And to make vacations not overlap each other.
/* Use these macrovars to establish the needed upper/lower limits of the HISTORY arrays */
%let beg_study=01jan2015;
%let end_study=01jan2021; /* Actually one day AFTER end of study */
data want (drop=_:);
/* For each ID, read WORK first, then VACATIONS*/
set work (in=inw) vacations (in=inv) ;
by id;
array _count_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} _temporary_;
array _label_history{%sysevalf("&beg_study"d):%sysevalf("&end_study"d)} $1 _temporary_;
format _min_date _max_date _d date9.;
if first.id then do;
call missing(of _count_history{*},of _label_history{*},_max_date);
_min_date=Start;
_initial_label='?';
end;
retain _min_date _max_date _initial_label;
if inw then do;
_count_per_day=counts/(1+End-Start);
_max_date=End;
do _d=Start to End;
_count_history{_d}=_count_per_day;
_label_history{_d}=_initial_label;
end;
if _initial_label='?' then _initial_label='!';
else _initial_label='?';
end;
else if inv then do _d=Start to End;
_label_history{_d}=Label;
end;
if last.id;
do Start=_min_date by 0 until (Start>_max_date);
label=_label_history{Start};
counts_resized=0;
do End=Start by 1 until (_label_history{End+1}^=label);
counts_resized=counts_resized+_count_history{End};
end;
if label='!' then label='?';
output;
Start=End+1;
end;
run;
This program assumes that the first WORK observation is the earliest, and the last WORK obs is the latest. It actually doesn't matter whether the intermediate WORK obs are also sorted by date. And it doesn't matter whether VACATIONS are sorted within an ID.
And if vacations cross work boundaries, this program keeps the vacation as a single time span.
Yes, basically I have to assign counts only to days out of vacations, i.e. working days. I think you addressed the point. Thank you for your help!
data work; input ID :$20. Start :date9. End :date9. Counts ; count_per_day=counts/(1+End-Start); format start end yymmdd10.; cards; 0001 01JAN2015 18JAN2015 32499 0001 19JAN2015 30APR2015 23458 0001 01MAY2015 13JAN2016 987 0002 06FEB2019 15FEB2019 3567 0002 16FEB2019 31DEC2019 1254 0003 01JAN2020 18MAR2020 652 0003 19MAR2020 02APR2020 765 0003 03APR2020 31DEC2020 200 ; data vacations; input ID :$20. Start :date9. End :date9. Label :$20.; format start end yymmdd10.; cards; 0001 03JAN2015 10JAN2015 A 0001 21JAN2015 21JAN2015 B 0001 03MAY2015 09JAN2016 C 0002 08FEB2019 08FEB2019 B 0002 09FEB2019 15FEB2019 A 0003 20FEB2020 25FEB2020 C 0003 21MAR2020 21MAR2020 B 0003 04JAN2020 10JAN2020 A ; proc summary data=work; by id; var start end; output out=temp(drop=_:) min(start)=start max(end)=end; run; data all; set temp; do date=start to end; output; end; format date yymmdd10.; keep id date ; run; data count_per_day; set work; do date=start to end; output; end; format date yymmdd10.; keep id date count_per_day; run; data k(rename=(end=date)); set work(keep=id End); run; data key(index=(x=(id date))); set vacations; do date=start to end; output; end; format date yymmdd10.; keep id date Label; run; data all2; if _n_=1 then do; if 0 then set count_per_day; declare hash c(dataset:'count_per_day'); c.definekey('id','date'); c.definedata('count_per_day'); c.definedone(); if 0 then set k; declare hash h(dataset:'k'); h.definekey('id','date'); h.definedone(); end; merge all key; by id date; call missing(count_per_day); rc=c.find(); if h.check()=0 then label2='X'; drop rc; run; data all3; set all2; if id ne lag(id) or label ne lag(label) or lag(label2)='X' then group+1; run; proc sql; create table want as select id,min(date) as start format=date9.,max(date) as end format=date9.,max(count_per_day) as count_per_day,max(label) as label from all3 group by id,group; quit; data want; set want; if not missing(label) then count_per_day=0; run;
Save $250 on SAS Innovate and get a free advance copy of the new SAS For Dummies book! Use the code "SASforDummies" to register. Don't miss out, May 6-9, in Orlando, Florida.
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.