I have the following dataset tthat I need to pull out or break overlap date/time
HAVE:
id | start_date_time | stop_date_time |
1 | 3/5/2020 0:00 | 3/5/2020 23:12 |
1 | 3/5/2020 9:20 | 3/5/2020 9:45 |
WANT:
id | start_date_time | stop_date_time |
1 | 3/5/2020 0:00 | 3/5/2020 9:20 |
1 | 3/5/2020 9:20 | 3/5/2020 9:45 |
1 | 3/5/2020 9:45 | 3/5/2020 23:12 |
I am not sure how to do this and I have been using this as a source but after writing the code, it didn't work. Any help would be greately appreciated! http://support.sas.com/resources/papers/proceedings09/079-2009.pdf
Hi @radhikaa4 ,
maybe like that (mind the example is on dates not timestamps, but you can adjust it):
data have;
input Start : date9. End : date9.;
format Start mmddyy10. End mmddyy10.;
cards;
30JUN2017 30JUN2018
01JAN2018 01JAN2019
30JUN2018 30JUN2019
01JAN2019 01JAN2020
;
run;
data have2;
set
have(keep =Start rename=(Start=date))
have(keep =End rename=(End =date))
;
run;
proc sort data = have2;
by Date;
run;
data want;
set have2;
lag_date = lag(date);
if _N_ > 1;
start = date;
end = lag_date;
format Start mmddyy10. End mmddyy10.;
keep start end;
run;
proc print;
run;
All the best
Bart
Hi @yabwon . Thanks. How do I include ID in it? I tried keep= subjectId
Hi @radhikaa4 ,
Sorry, I missed the ID, try this:
data have;
input ID Start : date9. End : date9.;
format Start mmddyy10. End mmddyy10.;
cards;
1 30JUN2017 30JUN2018
1 01JAN2018 01JAN2019
1 30JUN2018 30JUN2019
1 01JAN2019 01JAN2020
2 30JUN2017 30JUN2018
2 01JAN2018 01JAN2019
3 30JUN2018 30JUN2019
3 01JAN2019 01JAN2020
;
run;
data have2;
set
have(keep =Start ID rename=(Start=date))
have(keep =End ID rename=(End =date))
;
run;
proc sort data = have2;
by ID Date;
run;
data want;
set have2;
by ID;
lag_date = lag(date);
if not first.ID ;
start = date;
end = lag_date;
format Start mmddyy10. End mmddyy10.;
keep id start end;
run;
proc print;
run;
All the best
Bart
data have;
infile cards expandtabs;
input id start_date_time & anydtdtm. stop_date_time & anydtdtm.;
format start_date_time stop_date_time mdyampm.;
cards;
1 3/5/2020 0:00 3/5/2020 23:12
1 3/5/2020 9:20 3/5/2020 9:45
;
data temp;
set have;
n+1;
do temp=start_date_time to stop_date_time;
output;
end;
drop start_date_time stop_date_time;
run;
proc sort data=temp;
by id temp;
run;
data temp1;
do until(last.temp);
set temp;
by id temp;
length x $ 40;
x=cats(x,n);
end;
run;
data want;
set temp1;
by id x notsorted;
retain start;
if first.x then start=temp;
if last.x then do;end=temp;output;end;
format start end mdyampm.;
drop n x temp;
run;
I'm working on a medication data set with overlapped period and daily dose for each record
data test;
infile datalines delimiter = ',';
input solddt: date9. rxsup dose med_end_dt: date9. cluster ;
format solddt yymmdd10. med_end_dt yymmdd10.;
datalines;
10sep2009, 16, 200, 26sep2009, 1
02oct2009, 2, 45, 04oct2009, 1
06oct2009, 66, 90.91, 11dec2009, 1
12oct2009, 16, 200, 28oct2009, 1
21oct2009, 12, 62.5, 02nov2009, 1
03nov2009, 33, 181.82, 06dec2009, 1
12nov2009, 2, 45, 14nov2009, 1
12nov2009, 16, 200, 28nov2009, 1
04dec2009, 33, 181.82, 06jan2010, 1
14dec2009, 25, 128, 08jan2010, 1
17dec2009, 2, 45, 09dec2009, 1
15jan2010, 5, 45, 20jan2010, 2
18feb2010, 25, 90, 15mar2010, 3
;
run;
How would you implement a new variable newdailydose which is the sum of daily dose during the overlapped periods after the records are separated?
Also, please explain how x is concatenated with n and x in the do loop in your code
Hi,
do you mean something like this:
data test2;
set test;
do date = solddt to med_end_dt;
output;
end;
keep date dose cluster;
run;
proc sql;
create table test3 as
select cluster, date format yymmdd10., sum(dose) as sumDose
from test2
group by cluster, date
order by cluster, date
;
run;
data test4;
do until(last.sumDose);
set test3;
by cluster sumDose notsorted;
if first.sumDose then solddt = date;
if last.sumDose then
do;
med_end_dt = date;
dose = sumDose;
end;
end;
drop sumDose date;
format solddt yymmdd10. med_end_dt yymmdd10.;
run;
proc print;
run;
?
All the best
Bart
OK. Plz post the output you are looking for in order to better understand your question.
Where there was overlapped date , one date or obs correspond to many obs after separated it one by one
,like
cluster n temp
1 1 21oct2020
1 2 21oct2020
here N represent the row number(1 is the first row ,2 is the second row)
" please explain how x is concatenated with n and x in the do loop in your code"
do until(last.temp);
set temp;
by cluster temp;
length x $ 40;
x=cats(x,n);
end;
It process one group (cluster and temp are group variables) data. (e.g. do until(last.temp) )
x concated the row number within the same group.
firstly, x= the first n (1),since x is missing value.
secondly, x= the first and second n ( 12 ),since x is RETAIN in the same data step loop.
.........
until the last.temp ,and output the last group value .if data like above (only two obs), the output is
cluster n temp x
1 2 21oct2020 12
Here x identity which obs has 21oct2020 , (the first obs and second one)
Here x also identity the during of overlap date.
cluster n temp x
1 2 21oct2020 12
1 2 22oct2020 12
1 2 23oct2020 12
The last data step could get this:
cluster n temp x start end
1 2 23oct2020 12 21oct2020 23oct2020
data test;
infile datalines delimiter = ',';
input solddt: date9. rxsup dose med_end_dt: date9. cluster ;
format solddt yymmdd10. med_end_dt yymmdd10.;
datalines;
10sep2009, 16, 200, 26sep2009, 1
02oct2009, 2, 45, 04oct2009, 1
06oct2009, 66, 90.91, 11dec2009, 1
12oct2009, 16, 200, 28oct2009, 1
21oct2009, 12, 62.5, 02nov2009, 1
03nov2009, 33, 181.82, 06dec2009, 1
12nov2009, 2, 45, 14nov2009, 1
12nov2009, 16, 200, 28nov2009, 1
04dec2009, 33, 181.82, 06jan2010, 1
14dec2009, 25, 128, 08jan2010, 1
17dec2009, 2, 45, 09dec2009, 1
15jan2010, 5, 45, 20jan2010, 2
18feb2010, 25, 90, 15mar2010, 3
;
run;
data temp;
set test;
n+1;
do temp=solddt to med_end_dt;
output;
end;
drop solddt med_end_dt rxsup;
format temp date9.;
run;
proc sort data=temp;
by cluster temp;
run;
data temp1;
sum_dose=0;
do until(last.temp);
set temp;
by cluster temp;
length x $ 40;
x=cats(x,n);
sum_dose+dose;
end;
drop dose;
run;
data want;
set temp1;
by cluster x notsorted;
retain start;
if first.x then start=temp;
if last.x then do;end=temp;output;end;
format start end date9.;
drop n x temp;
run;
Hello! The output I want is
t0 | dose | t1 | New daily dose | Cluster |
10-Sep-09 | 200 | 9/26/2009 | 200 | 1 |
2-Oct-09 | 45 | 10/4/2009 | 45 | 1 |
6-Oct-09 | 90.91 | 10/11/2009 | 90.91 | 1 |
12-Oct-09 | 200 | 10/20/2009 | 290.91 | 1 |
21-Oct-09 | 62.5 | 10/28/2009 | 353.41 | 1 |
10/29/2009 | 11/2/2009 | 153.41 | 1 | |
3-Nov-09 | 181.82 | 11/11/2009 | 272.73 | 1 |
12-Nov-09 | 200 | 11/14/2009 | 335.91 | 1 |
11/15/2009 | 16 | 11/28/2009 | 290.91 | 1 |
11/29/2020 | 12/3/2009 | 272.73 | 1 | |
4-Dec-09 | 181.82 | 12/6/2009 | 454.55 | 1 |
12/7/2009 | 12/11/2009 | 272.73 | 1 | |
12/12/2009 | 12/13/2009 | 181.82 | 1 | |
14-Dec-09 | 25 | 12/16/2009 | 309.82 | 1 |
17-Dec-09 | 45 | 12/19/2009 | 354.82 | 1 |
20-Dec-09 | 1/6/2010 | 309.82 | 1 | |
1/7/2010 | 1/8/2010 | 128 | 1 | |
15-Jan-10 | 1/20/2010 | 45 | 2 | |
18-Feb-10 | 3/12/2010 | 90 | 3 |
@Ksharp Yes! Thank you for the explanation. It was very helpful in understanding how the program works. My concern my concern now is that this program is not robust for large data sets. I am possibly looking at millions of records with potentially long dispense duration periods for different subjects. Any ideas on how I can make the program robust?
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.