Hi guys,
I'd like to mark the patients who had medical insurance when he/she was diagnosed (date_diagnosis) with the disease under the study (KEEP2=YES). All patients have insurance on and off basis, however, some patients had no insurance coverage when the disease was diagnosed (KEEP2=NO). My data has date pairs for the intervals of insurance coverage., ie., start1-end1 and patient lost the insurance and got it back during start2-end2 ...so forth so on. Desired output is the KEEP_CORRECT.
This problem was resolved by @mkeintz and worked on the data I had back then. However, the algorithm makes wrong choice as I
have shown in the image attached. The red stars point out where date_diagnosis actually fell in the intervals of insurance coverage
that I would mark KEEP2=YES but algorithm creates KEEP2=NO.
Could you please help fix the algorithm? or suggest alternative solutions, if any?
Another wrinkle is the repeated data. The algorithm works well with unique rows. However, my actual data has duplicates. I think I should create another forum once algorithm works for the data with distinct rows.
DATA ELIGIBILITY;
INPUT ID start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH YEAR KEEP_CORRECT $;
Informat start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH anydtdte.;
format start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH date9.;
CARDS;
1 1-Feb-04 28-Feb-06 1-Apr-07 31-May-10 1-Nov-10 31-Aug-13 1-Oct-14 30-Sep-15 1-Mar-16 31-Dec-16 4-Jan-06 . 2006 YES
2 1-Jan-02 31-Dec-04 1-Feb-09 30-Apr-10 1-Jul-10 31-Oct-10 1-Nov-12 31-Dec-16 1-Jul-17 30-Nov-17 23-Feb-09 . 2009 YES
3 1-May-05 31-Jul-05 . . . . . . . . 1-Mar-05 16633 2005 NO
4 1-Sep-05 30-Sep-05 . . . . . . . . 29-Jun-05 16701 2005 NO
5 1-Apr-05 30-Apr-07 . . . . . . . . 3-Jan-05 18629 2005 NO
6 1-Feb-05 31-Oct-05 . . . . . . . . 3-Jan-05 . 2005 NO
7 1-Apr-04 31-Jul-04 1-Jan-05 30-Jun-05 1-Aug-05 31-Dec-06 . . . . 4-Jan-05 16901 2005 YES
8 1-Sep-05 30-Nov-05 . . . . . . . . 5-Jan-05 16716 2005 NO
9 1-Jul-05 31-Dec-05 . . . . . . . . 7-Jan-05 17332 2005 NO
10 1-Jun-04 30-Jun-05 . . . . . . . . 13-Jan-05 16827 2005 YES
;
data want (drop=_:);
set eligibility;
array st {*} start: ;
array en {*} end: ;
_np=n(of st{*}); /* N of start/end pairs */
_upper_date=min(mdy(12,31,year),date_death); /* Ins must cover through this date */
keep2='YES'; /* Assume eligiblity */
if st{1} > mdy(1,1,year) then keep2='NO '; /*minimum start is too late ...*/
else if en{_np} < _upper_date then keep2='NO '; /*max end precedes upper date*/
/* Check every gap (i.e. end1:start2, end2:start3) preceding upper date */
else if _np>1 then do _i=1 to _np-1 while (en{_i}<_upper_date and keep2='YES');
if year(en{_i}+1)=year then keep2='NO';
else if year(st{_i+1}-1)=year then keep2='NO';
end;
run;
PROC PRINT DATA=WANT; RUN;
When working with date ranges using SQL with data organized in a long format makes things often much simpler.
Here how this could look like.
DATA wide;
INPUT ID start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH YEAR KEEP_CORRECT $;
Informat start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH anydtdte.;
format start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH date9.;
CARDS;
1 1-Feb-04 28-Feb-06 1-Apr-07 31-May-10 1-Nov-10 31-Aug-13 1-Oct-14 30-Sep-15 1-Mar-16 31-Dec-16 4-Jan-06 . 2006 YES
2 1-Jan-02 31-Dec-04 1-Feb-09 30-Apr-10 1-Jul-10 31-Oct-10 1-Nov-12 31-Dec-16 1-Jul-17 30-Nov-17 23-Feb-09 . 2009 YES
3 1-May-05 31-Jul-05 . . . . . . . . 1-Mar-05 16633 2005 NO
4 1-Sep-05 30-Sep-05 . . . . . . . . 29-Jun-05 16701 2005 NO
5 1-Apr-05 30-Apr-07 . . . . . . . . 3-Jan-05 18629 2005 NO
6 1-Feb-05 31-Oct-05 . . . . . . . . 3-Jan-05 . 2005 NO
7 1-Apr-04 31-Jul-04 1-Jan-05 30-Jun-05 1-Aug-05 31-Dec-06 . . . . 4-Jan-05 16901 2005 YES
8 1-Sep-05 30-Nov-05 . . . . . . . . 5-Jan-05 16716 2005 NO
9 1-Jul-05 31-Dec-05 . . . . . . . . 7-Jan-05 17332 2005 NO
10 1-Jun-04 30-Jun-05 . . . . . . . . 13-Jan-05 16827 2005 YES
;
run;
data long(drop=_: start: end:);
set wide;
array _start {*} start:;
array _end {*} end:;
attrib insurance_start_dt insurance_end_dt format=date9.;
do _i=1 to dim(_start);
if missing(_start[_i]) then continue;
insurance_start_dt=_start[_i];
insurance_end_dt=_end[_i];
output;
end;
run;
proc sql;
/* create table want as*/
select
*,
case
when
(select count(*) from long i where i.id=o.id and i.DATE_DIAGNOSIS between i.insurance_start_dt and i.insurance_end_dt) >0 then 'YES'
else 'NO'
end length=3
as KEEP_CORRECT2,
case
when
(select count(*) from long i where i.id=o.id and i.DATE_DIAGNOSIS between o.insurance_start_dt and o.insurance_end_dt) >0 then '1'
else '0'
end length=1
as match_flag
from long o
;
quit;
N.B: A look-up implemented within a SQL Select clause is from a performance perspective not ideal and you will always have to find the right balance between performance and simple, maintainable code.
When working with date ranges using SQL with data organized in a long format makes things often much simpler.
Here how this could look like.
DATA wide;
INPUT ID start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH YEAR KEEP_CORRECT $;
Informat start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH anydtdte.;
format start1 end1 start2 end2 start3 end3 start4 end4 start5 end5 DATE_DIAGNOSIS DATE_DEATH date9.;
CARDS;
1 1-Feb-04 28-Feb-06 1-Apr-07 31-May-10 1-Nov-10 31-Aug-13 1-Oct-14 30-Sep-15 1-Mar-16 31-Dec-16 4-Jan-06 . 2006 YES
2 1-Jan-02 31-Dec-04 1-Feb-09 30-Apr-10 1-Jul-10 31-Oct-10 1-Nov-12 31-Dec-16 1-Jul-17 30-Nov-17 23-Feb-09 . 2009 YES
3 1-May-05 31-Jul-05 . . . . . . . . 1-Mar-05 16633 2005 NO
4 1-Sep-05 30-Sep-05 . . . . . . . . 29-Jun-05 16701 2005 NO
5 1-Apr-05 30-Apr-07 . . . . . . . . 3-Jan-05 18629 2005 NO
6 1-Feb-05 31-Oct-05 . . . . . . . . 3-Jan-05 . 2005 NO
7 1-Apr-04 31-Jul-04 1-Jan-05 30-Jun-05 1-Aug-05 31-Dec-06 . . . . 4-Jan-05 16901 2005 YES
8 1-Sep-05 30-Nov-05 . . . . . . . . 5-Jan-05 16716 2005 NO
9 1-Jul-05 31-Dec-05 . . . . . . . . 7-Jan-05 17332 2005 NO
10 1-Jun-04 30-Jun-05 . . . . . . . . 13-Jan-05 16827 2005 YES
;
run;
data long(drop=_: start: end:);
set wide;
array _start {*} start:;
array _end {*} end:;
attrib insurance_start_dt insurance_end_dt format=date9.;
do _i=1 to dim(_start);
if missing(_start[_i]) then continue;
insurance_start_dt=_start[_i];
insurance_end_dt=_end[_i];
output;
end;
run;
proc sql;
/* create table want as*/
select
*,
case
when
(select count(*) from long i where i.id=o.id and i.DATE_DIAGNOSIS between i.insurance_start_dt and i.insurance_end_dt) >0 then 'YES'
else 'NO'
end length=3
as KEEP_CORRECT2,
case
when
(select count(*) from long i where i.id=o.id and i.DATE_DIAGNOSIS between o.insurance_start_dt and o.insurance_end_dt) >0 then '1'
else '0'
end length=1
as match_flag
from long o
;
quit;
N.B: A look-up implemented within a SQL Select clause is from a performance perspective not ideal and you will always have to find the right balance between performance and simple, maintainable code.
@Patrick Thanks a lot. What is match_flag here? do we need it?
@Cruise wrote:
@Patrick Thanks a lot. What is match_flag here? do we need it?
I've just added match_flag to mark the row with the matching time period and also to demonstrate how a small change in the SQL logic will yield a different result. If you don't need this information then just remove the logic from the SQL.
Using @Patrick 's example data (your step as posted does not work because of tabs, see the log), tranpose and then it's a very simple data step:
proc transpose
data=wide
out=start (
rename=(col1=start)
where=(start ne .)
)
;
by id;
var start:;
run;
data start1;
set start;
period = input(substr(_name_,6),best.);
drop _name_;
run;
proc transpose
data=wide
out=end (
rename=(col1=end)
where=(end ne .)
)
;
by id;
var end:;
run;
data end1;
set end;
period = input(substr(_name_,4),best.);
drop _name_;
run;
data trans;
merge
start1
end1
;
by id period;
run;
data want;
merge
wide (drop=start: end:)
trans
;
by id;
if start <= date_diagnosis <= end;
run;
The data eligibility step contains tabs in the datalines, and that's a delimiter that does not work if it is not set in an infile statement.
Hint from me: disable the use of tabs in the Enhanced Editor, and also set it to convert tabs to blanks on saving and loading.
@Kurt_Bremser @Patrick both approaches worked on my analytical data just so perfectly and solutions agreed. Thank you so much for your help.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.