Hello,
I have a table with patientID, admit date and diagnosis code. If the code is null, we look back 2 days to see if there is a valid code within the last two days. If that's null too, we look forward two days, then 5 days back, 5 days forward, and finally 9 days back. We do this each time we encounter a null value for the diagnosis code. I have tried doing a proc sql self join and using lag functions to lag by 2,5, and 9 but it did not help as the dates are not consecutive. So, for a null code, the sequence is to check date-2, then date+2, date-5, date+5 and finally date-9 until a valid code is found. I am using SAS Eg 8.1 Any inputs are greatly appreciated. Thank you!
Have:
PatientID Admit_dt diag
101 20160201 0
101 20160210 190
101 20160215 0
101 20160217 29
101 20160330 23
101 20160405 32
101 20160410 0
101 20160510 12
101 20160510 0
Want:
PatientID Admit_dt diag diag_new
101 2016-02-01 0 0
101 2016-02-10 190 190
101 2016-02-15 0 29
101 2016-02-17 29 29
101 2016-03-30 23 23
101 2016-04-05 32 32
101 2016-04-10 0 32
101 2016-05-10 12 12
101 2016-05-10 0 12
You could create an array RANK_HIST analogous to the DIAG_HIST array.
But you would still have to change the logic of the loop. Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.
D_MIN starts out as missing. As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D. But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.
At the end of each range, if D_MIN is not missing, no more ranges are checked. And finally if D_MIN is not missing, assign a value to NEW_DIAG.
Something like this (totally untested):
new_diag =coalesce(diag_hist{date},diag);
array strt {3} _temporary_ (1,3,6);
array finish {3} _temporary_ (2,5,9);
d_min=.;
if new_diag^=0 then do r=1 to 3 until (d_min^=.);
do sign=-1,1 until (d_min^=.);
do d = sign*strt{r} to sign*finish{r} by sign;
if diag_hist{date+d}^=0 then do;
if d_min=. then d_min=d; else
if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
end;
end;
end;
end;
if d_min^=. then new_diag=diag_hist{date+d_min};
Note the first inner loop ("do sign=-1,1 …") provides a way to set up this sequence of 6 ranges:
First thing to clarify: are those dates real sas dates? If not: fix this first. Second thing to note: there is a huge difference between null and zero, in the data you have posted there are no null-values. Do you mean zero instead of null? What exactly do you mean by "date+2"? Using the date in the first row, do you want to look for with Admit_dt = 20160203 or do you want to look two observations ahead?
Apologies for the wrong terminology. The code is numeric, so if it's a 0 or missing value, then it is considered invalid. Those are dates that I read in as yymddn8. By date+2, I mean looking forward till two days. So, for the first row, since it's the beginning and there's nothing to look back, I start by looking forward till 2 days, i.e., 20160201 - 20160203 and look at claims between the two dates. If there are claims, and any of them have a valid code, I pick that and copy its code and other values (rank, SOI etc) to the first row. If there are no claims or claims with missing or 0 code, then I look for claims between 20160201 and 20160206(i.e., date+5) and repeat the same process.
Please note that I am not jumping backward and forward by the number of rows, rather by the number of days. So, if there are 10 rows between 20160201 and 20160203, then I look at all the 10 rows for my date+2 check. Hope this clarifies my question further. Thank you!
I have added conditions for only look forward, you can add the look backward conditions too.
data have;
input PatientID Admit_dt:yymmdd8. diag ;
format PatientID diag best12. Admit_dt yymmdd10.;
id=_n_;
cards;
101 20160201 0
101 20160210 190
101 20160215 0
101 20160217 29
101 20160330 23
101 20160405 32
101 20160410 0
101 20160510 12
101 20160510 0
;
run;
data test(drop=Admit_dt diag);
format Admit_dt1 yymmdd10.;
set have;
Admit_dt1=Admit_dt;
diag1=diag;
run;
proc sql;
create table want as
select a.PatientID,a.id,Admit_dt,Admit_dt1,diag,diag1
from have a, test b
where a.PatientID=b.PatientID and diag1 ne 0
order by PatientID, id, Admit_dt1;
quit;
data want1(drop=Admit_dt1 diag1 id);
set want;
retain diag_new;
by PatientID id;
if first.id then diag_new=diag;
if first.id and diag ne 0 then output;
if diag_new=0 then do;
if Admit_dt1=Admit_dt and diag1 ne 0 then do;
diag_new=diag1;
output;
end;
else if Admit_dt1=Admit_dt+2 and diag1 ne 0 then do;
diag_new=diag1;
output;
end;
else if Admit_dt1=Admit_dt+5 and diag1 ne 0 then do;
diag_new=diag1;
output;
end;
else if Admit_dt1=Admit_dt+9 and diag1 ne 0 then do;
diag_new=diag1;
output;
end;
end;
if last.id and diag_new = 0 then output;
run;
Please let us know if this helped.
Thank you for your reply. Your code worked for a sample of my data, but when I run it for the whole table (around 5 million rows), it delves into producing a cartesian product. I am trying to troubleshoot and will update as soon as I can.
Usually looking forward and backward are incremented symmetrically in 1-day intervals, I.e. look at offsets d=-1,d=1, d=-2,d=2, etc. which is easy to embed inside a DO D=1 to 9 loop, using a sign change inside.
But you want this sequence:
d=-1,d=-2, d=1,d=2, d=-3,d=-4,d=-5, d=3,d=4,d=5 ...
so the loop definition has to be more detailed as here:
data have;
input PatientID Admit_dt :yymmdd8. diag;
format admit_dt date9.;
datalines;
101 20160201 0
101 20160210 190
101 20160215 0
101 20160330 23
101 20160405 32
101 20160410 0
101 20160510 12
101 20160510 0
;
%let begdate=01jan2016;
%let enddate=31dec2016;
data want (drop=d);
set have (where=(diag^=0) in=in_diag)
have (in=in2);
by patientid;
array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;
if first.patientid then call missing(of diag_hist{*});
if in_diag then diag_hist{admit_dt}=diag;
if in2;
if diag=0 then do d=0 ,-1,-2,1,2 ,-3,-4,-5,3,4,5 ,-6,-7,-8,-9,6,7,8,9;
if diag_hist{admit_dt+d}^=. then do;
new_diag=diag_hist{admit_dt+d};
leave;
end;
end;
if new_diag=. then new_diag=diag;
run;
Edited additional note. I guess one could somehow set up the loop iteration values using macrovars - especially if you want to modify the date-search cutpoints, currently at {-2,2}, {-5,5}, and {-9,9}, but for now I think that goes beyond the proper scope of this topic.
Thank you for the explanation. I will admit that I am rusty with arrays, so I will adopt this logic for my data and update here.
You can replace the loop:
if ^DIAG then NEW_DIAG=coalesce(
diag_hist{admit_dt-1}
,diag_hist{admit_dt-2}
,diag_hist{admit_dt+1}
,diag_hist{admit_dt+2}
,diag_hist{admit_dt-3}
,diag_hist{admit_dt-4}
,diag_hist{admit_dt-5}
,diag_hist{admit_dt+3}
,diag_hist{admit_dt+4}
,diag_hist{admit_dt+5}
,diag_hist{admit_dt-6}
,diag_hist{admit_dt-7}
,diag_hist{admit_dt-8}
,diag_hist{admit_dt-9)
);
for, maybe, better speed or better clarity.
Chris:
One of the OP's data rows has DIAG=0 but has a duplicate date of another with DIAG^=0. I think you'll need to insert "diag_hist(admit_dt}" at the front of the coalesce arguments.
Hi @mkeintz ,
I adopted your code for my data and it fails for one patientID when tested on a sample. The diag is a character field so I have made changes accordingly. I am getting the 'Array subscript out of range' error message and trying to figure that out. Please let me know what I'm doing wrong. Thank you!
data have;
set nonspec5;
where PatientID in ('4000052628862:40000526288622'
,'4000052628862:40000526288627'
,'4000122628869:40001226288692'
,'4000322628869:40003226288692'
);
format clistart date9.;
keep PatientID clistart validDiag ;
run;
%let begdate=01jan2012;
%let enddate=31dec2012;
data want (drop=d);
set have (where=(validDiag^=' ') in=in_diag)
have (in=in2);
by patientid;
array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;
if first.patientid then call missing(of diag_hist{*});
if in_diag then diag_hist{clistart}=validDiag;
if in2;
if validDiag=' ' then do d=0,-2,2,-5,5,-9;
if diag_hist{clistart+d}^=' ' then do;
new_diag=diag_hist{clistart+d};
leave;
end;
end;
if new_diag=' ' then new_diag=validDiag;
run;
Here's my log:
28 data want (drop=d);
29 set have (where=(validDiag^=' ') in=in_diag)
30 have (in=in2);
31 by patientid;
32
33 array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;
34
35 if first.patientid then call missing(of diag_hist{*});
36 if in_diag then diag_hist{clistart}=validDiag;
37
38 if in2;
39 if validDiag=' ' then do d=0,-2,2,-5,5,-9;
40 if diag_hist{clistart+d}^=' ' then do;
41 new_diag=diag_hist{clistart+d};
42 leave;
43 end;
44 end;
45 if new_diag=' ' then new_diag=validDiag;
46 run;
ERROR: Array subscript out of range at line 40 column 8.
in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9
new_diag= _ERROR_=1 _N_=45
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 14 observations read from the data set WORK.HAVE.
WHERE validDiag not = ' ';
NOTE: There were 32 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 30 observations and 4 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
Read the log, it's all there! 🙂
ERROR: Array subscript out of range at line 40 column 8.
Line 40 is:
40 if diag_hist{clistart+d}^=' ' then do;
The values are:
in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9 new_diag= _ERROR_=1 _N_=45
06JAN2012 - 9 is outside of the array bounds.
Either have an array then spans more then a year, or test the index value before trying to read the array.
Which is exactly was the message reads: Array subscript out of range
You have an observation on the 6th of Jan in 2016. The loop searches up to 9 days before (ie. late December 2015). So either establish an earlier lower bound for the array, or test for date exceptions.
@mkeintz Thank you! Now I understand. One last question: Here, we are looking at a particular point of data in the array, how can I modify it to look at a range? For example, for date 15feb2016, at d=-5, we are looking at 10feb2016 in the array. Instead, how can I look at all the claims that exist between 15feb2016 and 10feb2016 (i.e., on 15,14,13,12,11, and 10feb2016)? I ask because for d=-5, I look at all claims within the 5 days and pick the one with the least Rank( a variable I pull from a different table). Please let me know if this is out of scope and I need to post this as a separate question. I'm sorry that my question was not clear from the get go, will do better henceforth.
You could create an array RANK_HIST analogous to the DIAG_HIST array.
But you would still have to change the logic of the loop. Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.
D_MIN starts out as missing. As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D. But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.
At the end of each range, if D_MIN is not missing, no more ranges are checked. And finally if D_MIN is not missing, assign a value to NEW_DIAG.
Something like this (totally untested):
new_diag =coalesce(diag_hist{date},diag);
array strt {3} _temporary_ (1,3,6);
array finish {3} _temporary_ (2,5,9);
d_min=.;
if new_diag^=0 then do r=1 to 3 until (d_min^=.);
do sign=-1,1 until (d_min^=.);
do d = sign*strt{r} to sign*finish{r} by sign;
if diag_hist{date+d}^=0 then do;
if d_min=. then d_min=d; else
if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
end;
end;
end;
end;
if d_min^=. then new_diag=diag_hist{date+d_min};
Note the first inner loop ("do sign=-1,1 …") provides a way to set up this sequence of 6 ranges:
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.