I have two datasets, each with an ID, START_TS, END_TS, and a Value. I want to merge these two and create a merged time intervals, and replace the original Value from dataset 1 with the Value from dataset 2. The logic should involve overlaying the values from dataset2
onto dataset1
where the time intervals from dataset2
exist, while maintaining the original dataset1
values for non-overlapping intervals.
data dataset1;
format START_TS END_TS datetime20.;
input OBS ID $ START_TS :datetime. END_TS :datetime. Value;
datalines;
1 A 01JAN2025:00:00:00 02JAN2025:00:00:00 10
2 A 02JAN2025:00:00:00 03JAN2025:00:00:00 20
3 A 03JAN2025:00:00:00 04JAN2025:00:00:00 10
4 A 04JAN2025:00:00:00 05JAN2025:00:00:00 20
5 A 05JAN2025:00:00:00 06JAN2025:00:00:00 10
6 A 06JAN2025:00:00:00 07JAN2025:00:00:00 20
;
run;
data dataset2;
format START_TS END_TS datetime20.;
input OBS PT $ START_TS :datetime. END_TS :datetime. Value;
datalines;
1 A 01JAN2025:00:00:00 01JAN2025:10:00:00 99
2 A 02JAN2025:08:00:00 02JAN2025:11:00:00 99
3 A 03JAN2025:22:00:00 04JAN2025:02:00:00 99
4 A 06JAN2025:03:00:00 06JAN2025:05:00:00 99
5 A 06JAN2025:23:00:00 07JAN2025:00:00:00 99
;
run;
Dataset wanted:
ID PT START_TS END_TS Updated_Value
1 A 01JAN2025:00:00:00 01JAN2025:10:00:00 99
2 A 01JAN2025:10:00:00 02JAN2025:00:00:00 10
3 A 02JAN2025:00:00:00 02JAN2025:08:00:00 20
4 A 02JAN2025:08:00:00 02JAN2025:11:00:00 99
5 A 02JAN2025:11:00:00 03JAN2025:00:00:00 20
6 A 03JAN2025:00:00:00 03JAN2025:22:00:00 10
7 A 03JAN2025:22:00:00 04JAN2025:00:00:00 99
8 A 04JAN2025:00:00:00 04JAN2025:02:00:00 99
9 A 04JAN2025:02:00:00 05JAN2025:00:00:00 20
10 A 05JAN2025:00:00:00 06JAN2025:00:00:00 10
11 A 06JAN2025:00:00:00 06JAN2025:03:00:00 20
12 A 06JAN2025:03:00:00 06JAN2025:05:00:00 99
13 A 06JAN2025:05:00:00 06JAN2025:23:00:00 20
14 A 06JAN2025:23:00:00 07JAN2025:00:00:00 99
My code is below. It mostly works, but does not handle multiple time intervals on the same day from dataset 2 (OBS 4 and 5) correctly. Please help!
CODE EDITED:
data dataset1; format START_TS END_TS datetime20.; input OBS ID $ START_TS :datetime. END_TS :datetime. Value; datalines; 1 A 01JAN2025:00:00:00 02JAN2025:00:00:00 10 2 A 02JAN2025:00:00:00 03JAN2025:00:00:00 20 3 A 03JAN2025:00:00:00 04JAN2025:00:00:00 10 4 A 04JAN2025:00:00:00 05JAN2025:00:00:00 20 5 A 05JAN2025:00:00:00 06JAN2025:00:00:00 10 6 A 06JAN2025:00:00:00 07JAN2025:00:00:00 20 ; run; data dataset2; format START_TS END_TS datetime20.; input OBS PT $ START_TS :datetime. END_TS :datetime. Value; datalines; 1 A 01JAN2025:00:00:00 01JAN2025:10:00:00 99 2 A 02JAN2025:08:00:00 02JAN2025:11:00:00 99 3 A 03JAN2025:22:00:00 04JAN2025:02:00:00 99 4 A 06JAN2025:03:00:00 06JAN2025:05:00:00 99 5 A 06JAN2025:23:00:00 07JAN2025:00:00:00 99 ; run; data temp1; set dataset1; do dt= START_TS to END_TS; output; end; keep id dt value; format dt datetime20.; run; data temp2; set dataset2; do dt= START_TS to END_TS; output; end; keep pt dt value; format dt datetime20.; run; data temp3; merge temp1 temp2(rename=(pt=id value=_value)); by id dt; new_value=coalesce(_value,value); run; proc summary data=temp3; by id new_value notsorted; var dt; output out=want(drop=_:) min=start_ts max=end_ts; run;
@Ksharp Thanks for your solution! Your code inspired me. The only issue is that I only showed a small sample dataset here. My real datasets have more than 20K days, so I believe SAS couldn't handle long formats in seconds (>20000x24x60x60) very well... Any thoughts?
Here is the third solution:
data dataset1; format START_TS END_TS datetime20.; input OBS ID $ START_TS :datetime. END_TS :datetime. Value; datalines; 1 A 01JAN2025:00:00:00 02JAN2025:00:00:00 10 2 A 02JAN2025:00:00:00 03JAN2025:00:00:00 20 3 A 03JAN2025:00:00:00 04JAN2025:00:00:00 10 4 A 04JAN2025:00:00:00 05JAN2025:00:00:00 20 5 A 05JAN2025:00:00:00 06JAN2025:00:00:00 10 6 A 06JAN2025:00:00:00 07JAN2025:00:00:00 20 ; run; data dataset2; format START_TS END_TS datetime20.; input OBS PT $ START_TS :datetime. END_TS :datetime. Value; datalines; 1 A 01JAN2025:00:00:00 01JAN2025:10:00:00 99 2 A 02JAN2025:08:00:00 02JAN2025:11:00:00 99 3 A 03JAN2025:22:00:00 04JAN2025:02:00:00 99 4 A 06JAN2025:03:00:00 06JAN2025:05:00:00 99 5 A 06JAN2025:23:00:00 07JAN2025:00:00:00 99 ; run; data temp1; set dataset1; do dt= START_TS to END_TS; output; end; keep id dt value; format dt datetime20.; run; data temp2; set dataset2; do dt= START_TS to END_TS; output; end; keep pt dt value; format dt datetime20.; run; data temp3; set temp1 temp2(rename=(pt=id)); by id dt; if last.dt; run; proc summary data=temp3; by id value notsorted; var dt; output out=want(drop=_:) min=start_ts max=end_ts; run;
This would be more manageable if you had the DATE of start_ts (which is a datetime value). Then it is more trivial to propagate value to the beginning or end of a date. So I made a VIEW of dataset1 and dataset2, with a new DATE variable.
Building NEED2, the view derived from DATASET2, each observation is not only copied, but all "holes" are also output, with appropriate START_DS and END_DS, and VALUE set to missing. "Filling holes" means not only between observations, but also start-of-day and end-of-day time spans not covered in the dataset2 observation.
data dataset1;
input OBS ID $ START_TS :datetime. END_TS :datetime. Value;
format start_ts end_ts datetime20.;
datalines;
1 A 01JAN2025:00:00:00 02JAN2025:00:00:00 10
2 A 02JAN2025:00:00:00 03JAN2025:00:00:00 20
3 A 03JAN2025:00:00:00 04JAN2025:00:00:00 10
4 A 04JAN2025:00:00:00 05JAN2025:00:00:00 20
5 A 05JAN2025:00:00:00 06JAN2025:00:00:00 10
6 A 06JAN2025:00:00:00 07JAN2025:00:00:00 20
run;
data dataset2;
input OBS ID $ START_TS :datetime. END_TS :datetime. Value;
format START_TS END_TS datetime20.;
datalines;
1 A 01JAN2025:00:00:00 01JAN2025:10:00:00 99
2 A 02JAN2025:08:00:00 02JAN2025:11:00:00 99
3 A 03JAN2025:22:00:00 04JAN2025:02:00:00 99
4 A 06JAN2025:03:00:00 06JAN2025:05:00:00 99
5 A 06JAN2025:23:00:00 07JAN2025:00:00:00 99
run;
data need2 (drop=_: obs) /view=need2 ;
set dataset2 ;
by ID;
/* When starting new day or new ID, check if dummy starting record is needed */
if first.ID=1 or dif(datepart(start_ts))>0 then do;
if timepart(start_ts) ^= '00:00:00't then do; /* Yes, a Dummy rec is needed*/
end_ts=start_ts;
start_ts=intnx('dtday',start_ts,0,'begin');
date=datepart(start_ts);
_value=value;
call missing(value);
output;
value=_value;
end;
end;
merge dataset2 /*Reread the same obs, to recover VALUE*/
dataset2 (firstobs=2 keep=id start_ts rename=(id=_nxt_id start_ts=_nxt_st)) ;
_orig_end=end_ts;
do until (start_ts=_orig_end);
date=datepart(start_ts);
end_ts=min(_orig_end,intnx('dtday',start_ts,1,'begin'));
output;
start_ts=end_ts;
end;
/* Most obs will need a trailing dummy record */
if timepart(end_ts)^='00:00:00't then do;
start_ts=end_ts;
end_ts=_nxt_st;
if last.ID or datepart(_nxt_st)^=datepart(start_ts) then end_ts=intnx('dtday',start_ts,1,'begin');
call missing(value);
date=datepart(start_ts);
output;
end;
format date date9. ;
run;
data need1 (drop=obs)/view=need1;
set dataset1;
date=datepart(start_ts);
format date date9. ;
run;
data want (drop=_:);
set need1 (in=in1) need2 (in=in2);
by ID date;
retain _value;
if first.date=1 and last.date=1 then output;
else if in1 then _value=value;
else if in2 then do;
value=coalesce(value,_value);
output;
end;
run;
No need to recreate every possible time slice. Using the existing ones should be enough.
This seems to work.
data DATASET1;
format START_TS END_TS datetime20.;
input OBS ID $ START_TS :datetime. END_TS :datetime. VALUE;
datalines;
1 A 01JAN2025:00:00:00 02JAN2025:00:00:00 10
2 A 02JAN2025:00:00:00 03JAN2025:00:00:00 20
3 A 03JAN2025:00:00:00 04JAN2025:00:00:00 10
4 A 04JAN2025:00:00:00 05JAN2025:00:00:00 20
5 A 05JAN2025:00:00:00 06JAN2025:00:00:00 10
6 A 06JAN2025:00:00:00 07JAN2025:00:00:00 20
run;
data DATASET2;
format START_TS END_TS datetime20.;
input OBS PT $ START_TS :datetime. END_TS :datetime. VALUE;
datalines;
1 A 01JAN2025:00:00:00 01JAN2025:10:00:00 99
2 A 02JAN2025:08:00:00 02JAN2025:11:00:00 99
3 A 03JAN2025:22:00:00 04JAN2025:02:00:00 99
4 A 06JAN2025:03:00:00 06JAN2025:05:00:00 99
5 A 06JAN2025:23:00:00 07JAN2025:00:00:00 99
run;
proc sql ;
create table PERIODS_START as
select unique START_TS , monotonic() as AA from
(select START_TS from DATASET2
union
select END_TS from DATASET2
union
select START_TS-timepart(START_TS) from DATASET2
union
select END_TS -timepart(END_TS ) from DATASET2
union
select intnx('dtday', START_TS, 1, 'b') from DATASET2
union
select intnx('dtday', END_TS , 1, 'b') from DATASET2)
order by START_TS;
quit;
data PERIODS_START_END;
merge PERIODS_START
PERIODS_START(firstobs=2 rename=(START_TS=END_TS));
if END_TS;
run;
proc sql;
select b.START_TS
, b.END_TS
, coalesce(c.VALUE, a.VALUE) as VALUE
from DATASET1 a
left join
PERIODS_START_END b
on a.START_TS <= b.START_TS < a.END_TS
left join
DATASET2 c
on b.START_TS between c.START_TS and c.END_TS
and b.END_TS between c.START_TS and c.END_TS
order by 1;
quit;
Thanks to all of you @Ksharp @ChrisNZ @mkeintz who provided suggestions! Appreciate your time!
I should have clarified that my real datasets do have HH:MM:SS values (in fact dataset2 TS rarely ends with 00:00:00) - I just didn't make those in my dummy ones. Also my dataset1 doesn't span days, but my dataset 2 potentially does.
I finally wrote up my own code with more complicated test datasets, with the help of ChatGPT... This is what I got. The code is long but it seems to work for now. If you have better solutions, let me know! Thank you!
/* Step 1: Create sample data */
data dataset1;
format START_TS END_TS datetime20.;
input ID $ START_TS :datetime. END_TS :datetime. Value;
datalines;
A 01JAN2025:00:00:00 01JAN2025:06:00:11 10
A 01JAN2025:06:00:11 01JAN2025:16:22:33 20
A 01JAN2025:16:22:33 02JAN2025:00:00:00 10
A 02JAN2025:00:00:00 03JAN2025:00:00:00 20
A 03JAN2025:00:00:00 04JAN2025:00:00:00 10
A 04JAN2025:00:00:00 05JAN2025:00:00:00 20
A 05JAN2025:00:00:00 06JAN2025:00:00:00 10
A 06JAN2025:00:00:00 07JAN2025:00:00:00 20
B 01JAN2025:00:00:00 01JAN2025:06:44:11 10
B 01JAN2025:14:33:22 01JAN2025:16:51:23 20
B 01JAN2025:18:18:18 02JAN2025:00:00:00 10
B 02JAN2025:00:00:00 02JAN2025:15:23:04 20
;
run;
data dataset2;
format START_TS END_TS datetime20.;
input ID $ START_TS :datetime. END_TS :datetime. Value;
datalines;
A 01JAN2025:00:00:00 01JAN2025:10:00:00 99
A 02JAN2025:08:00:00 04JAN2025:02:00:00 99
A 04JAN2025:15:00:00 04JAN2025:17:00:00 99
A 04JAN2025:22:00:00 04JAN2025:23:00:00 99
A 06JAN2025:03:00:00 06JAN2025:05:00:00 99
A 06JAN2025:23:00:00 07JAN2025:00:00:00 99
B 01JAN2025:00:00:00 01JAN2025:10:00:00 99
B 01JAN2025:15:00:00 01JAN2025:19:00:00 99
B 02JAN2025:08:00:00 04JAN2025:02:00:00 99
;
run;
/* Step 2: Adjust dataset1 intervals (if necessary) */
/*data adjusted_dataset1;*/
/* set dataset1;*/
/* format START_TS END_TS datetime20.;*/
/*run;*/
/* Step 3: Expand dataset2 into daily intervals */
data expanded_dataset2;
set dataset2;
format daily_start daily_end datetime20.;
/* Split dataset2 into daily intervals */
if not missing(START_TS) and not missing(END_TS) then do;
do date = DATEPART(START_TS) to DATEPART(END_TS);
daily_start = max(START_TS, dhms(date, 0, 0, 0));
daily_end = min(END_TS, dhms(date+1, 0, 0, 0));
output;
end;
end;
keep ID daily_start daily_end Value;
rename daily_start = START_TS daily_end = END_TS;
run;
/* Step 4: Merge and handle overlaps */
proc sql;
create table merged_intervals as
select a.ID,
a.START_TS as Base_START_TS,
a.END_TS as Base_END_TS,
b.START_TS as Overlay_START_TS,
b.END_TS as Overlay_END_TS,
a.Value as Base_Value,
b.Value as Overlay_Value
from dataset1 as a
left join expanded_dataset2 as b
on a.ID = b.ID and
b.END_TS > a.START_TS and
b.START_TS < a.END_TS
order by a.ID, a.START_TS, b.START_TS;
quit;
data merged_intervals1;
set merged_intervals;
Lag_Overlay_END_TS = lag(Overlay_END_TS);
Lag_ID1 = lag(ID);
If ID^= Lag_ID1 or DATEPART(Lag_Overlay_END_TS) ^= DATEPART(Overlay_START_TS) then Lag_Overlay_END_TS = .;
format Lag_Overlay_END_TS datetime20.;
run;
Proc sort data=merged_intervals1; by ID descending Base_START_TS descending Overlay_START_TS; run;
data merged_intervals2;
set merged_intervals1;
Lag_Overlay_START_TS = lag(Overlay_START_TS);
Lag_ID2 = lag(ID);
If ID^= Lag_ID2 or DATEPART(Lag_Overlay_START_TS) ^= DATEPART(Overlay_END_TS) then Lag_Overlay_START_TS = .;
format Lag_Overlay_START_TS datetime20.;
run;
Proc sort data=merged_intervals2; by ID Base_START_TS Overlay_START_TS; run;
/* Step 5: Split and prioritize intervals */
data final_intervals;
set merged_intervals2;
format START_TS END_TS datetime20.;
/* Interval 1: Before the overlap */
if Overlay_START_TS ^=. and Base_START_TS < Overlay_START_TS and Lag_Overlay_END_TS ^= . then do;
START_TS = max(Base_START_TS, Lag_Overlay_END_TS);
END_TS = min(Base_END_TS, Overlay_START_TS);
Value = Base_Value;
output;
end;
if Overlay_START_TS ^=. and Base_START_TS < Overlay_START_TS and Lag_Overlay_END_TS = . then do;
START_TS = Base_START_TS;
END_TS = min(Base_END_TS, Overlay_START_TS);
Value = Base_Value;
output;
end;
/* Interval 2: During the overlap */
if Overlay_START_TS ^=. then do;
START_TS = max(Base_START_TS, Overlay_START_TS);
END_TS = min(Base_END_TS, Overlay_END_TS);
Value = Overlay_Value;
output;
end;
/* Interval 3: After the overlap */
if Overlay_START_TS ^=. and Base_END_TS > Overlay_END_TS and Lag_Overlay_START_TS ^= . then do;
START_TS = Overlay_END_TS;
END_TS = min(Base_END_TS, Lag_Overlay_START_TS);
Value = Base_Value;
output;
end;
if Overlay_START_TS ^=. and Base_END_TS > Overlay_END_TS and Lag_Overlay_START_TS = . then do;
START_TS = Overlay_END_TS;
END_TS = Base_END_TS;
Value = Base_Value;
output;
end;
/* If no overlay exists, keep original interval */
if Overlay_START_TS =. then do;
START_TS = Base_START_TS;
END_TS = Base_END_TS;
Value = Base_Value;
output;
end;
run;
/* Step 6: Remove duplicates and sort */
proc sort data=final_intervals nodupkey;
by ID START_TS END_TS Value;
run;
/* Step 7: Print results */
proc print data=final_intervals(keep=ID START_TS END_TS Value) noobs;
format START_TS END_TS datetime20.;
run;
Output:ID | START_TS | END_TS | Value |
---|---|---|---|
A | 01JAN2025:00:00:00 | 01JAN2025:06:00:11 | 99 |
A | 01JAN2025:06:00:11 | 01JAN2025:10:00:00 | 99 |
A | 01JAN2025:10:00:00 | 01JAN2025:16:22:33 | 20 |
A | 01JAN2025:16:22:33 | 02JAN2025:00:00:00 | 10 |
A | 02JAN2025:00:00:00 | 02JAN2025:08:00:00 | 20 |
A | 02JAN2025:08:00:00 | 03JAN2025:00:00:00 | 99 |
A | 03JAN2025:00:00:00 | 04JAN2025:00:00:00 | 99 |
A | 04JAN2025:00:00:00 | 04JAN2025:02:00:00 | 99 |
A | 04JAN2025:02:00:00 | 04JAN2025:15:00:00 | 20 |
A | 04JAN2025:15:00:00 | 04JAN2025:17:00:00 | 99 |
A | 04JAN2025:17:00:00 | 04JAN2025:22:00:00 | 20 |
A | 04JAN2025:22:00:00 | 04JAN2025:23:00:00 | 99 |
A | 04JAN2025:23:00:00 | 05JAN2025:00:00:00 | 20 |
A | 05JAN2025:00:00:00 | 06JAN2025:00:00:00 | 10 |
A | 06JAN2025:00:00:00 | 06JAN2025:03:00:00 | 20 |
A | 06JAN2025:03:00:00 | 06JAN2025:05:00:00 | 99 |
A | 06JAN2025:05:00:00 | 06JAN2025:23:00:00 | 20 |
A | 06JAN2025:23:00:00 | 07JAN2025:00:00:00 | 99 |
B | 01JAN2025:00:00:00 | 01JAN2025:06:44:11 | 99 |
B | 01JAN2025:14:33:22 | 01JAN2025:15:00:00 | 20 |
B | 01JAN2025:15:00:00 | 01JAN2025:16:51:23 | 99 |
B | 01JAN2025:18:18:18 | 01JAN2025:19:00:00 | 99 |
B | 01JAN2025:19:00:00 | 02JAN2025:00:00:00 | 10 |
B | 02JAN2025:00:00:00 | 02JAN2025:08:00:00 | 20 |
B | 02JAN2025:08:00:00 | 02JAN2025:15:23:04 | 99 |
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.