Solved: Replace null with value by looking back and ahead a certain number of ...

tk89 · Posted 02-20-2020 12:34 AM

Hello,

I have a table with patientID, admit date and diagnosis code. If the code is null, we look back 2 days to see if there is a valid code within the last two days. If that's null too, we look forward two days, then 5 days back, 5 days forward, and finally 9 days back. We do this each time we encounter a null value for the diagnosis code. I have tried doing a proc sql self join and using lag functions to lag by 2,5, and 9 but it did not help as the dates are not consecutive. So, for a null code, the sequence is to check date-2, then date+2, date-5, date+5 and finally date-9 until a valid code is found. I am using SAS Eg 8.1 Any inputs are greatly appreciated. Thank you!

Have:

PatientID Admit_dt diag
101 20160201 0

101 20160210 190

101 20160215 0

101 20160217 29

101 20160330 23

101 20160405 32

101 20160410 0

101 20160510 12

101 20160510 0

Want:

PatientID Admit_dt diag diag_new
101 2016-02-01 0 0

101 2016-02-10 190 190

101 2016-02-15 0 29

101 2016-02-17 29 29

101 2016-03-30 23 23

101 2016-04-05 32 32

101 2016-04-10 0 32

101 2016-05-10 12 12

101 2016-05-10 0 12

mkeintz · Posted 02-27-2020 05:43 AM

You could create an array RANK_HIST analogous to the DIAG_HIST array.

But you would still have to change the logic of the loop. Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.

D_MIN starts out as missing. As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D. But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.

At the end of each range, if D_MIN is not missing, no more ranges are checked. And finally if D_MIN is not missing, assign a value to NEW_DIAG.

Something like this (totally untested):

  new_diag =coalesce(diag_hist{date},diag);

  array strt   {3} _temporary_  (1,3,6);
  array finish {3} _temporary_  (2,5,9);


  d_min=.;

  if new_diag^=0 then do r=1 to 3 until (d_min^=.);   
    do sign=-1,1 until (d_min^=.);
      do d = sign*strt{r} to sign*finish{r} by sign;
        if diag_hist{date+d}^=0 then do;
          if d_min=. then d_min=d;   else
          if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
        end;
      end;
    end;
  end;
  if d_min^=. then new_diag=diag_hist{date+d_min};

Note the first inner loop ("do sign=-1,1 …") provides a way to set up this sequence of 6 ranges:

do d=-1 to -2 by -1
do d=1 to 2 by 1
do d=-3 to -5 by -1
do d=3 to 5 by 1
do d=-6 to -9 by -1
do d=6 to 9

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

andreas_lds · Posted 02-20-2020 02:16 AM

First thing to clarify: are those dates real sas dates? If not: fix this first. Second thing to note: there is a huge difference between null and zero, in the data you have posted there are no null-values. Do you mean zero instead of null? What exactly do you mean by "date+2"? Using the date in the first row, do you want to look for with Admit_dt = 20160203 or do you want to look two observations ahead?

tk89 · Posted 02-24-2020 11:58 PM

Apologies for the wrong terminology. The code is numeric, so if it's a 0 or missing value, then it is considered invalid. Those are dates that I read in as yymddn8. By date+2, I mean looking forward till two days. So, for the first row, since it's the beginning and there's nothing to look back, I start by looking forward till 2 days, i.e., 20160201 - 20160203 and look at claims between the two dates. If there are claims, and any of them have a valid code, I pick that and copy its code and other values (rank, SOI etc) to the first row. If there are no claims or claims with missing or 0 code, then I look for claims between 20160201 and 20160206(i.e., date+5) and repeat the same process.

Please note that I am not jumping backward and forward by the number of rows, rather by the number of days. So, if there are 10 rows between 20160201 and 20160203, then I look at all the 10 rows for my date+2 check. Hope this clarifies my question further. Thank you!

Satish_Parida · Posted 02-20-2020 02:50 AM

I have added conditions for only look forward, you can add the look backward conditions too.

data have;
input PatientID  Admit_dt:yymmdd8.   diag ;
format PatientID diag best12. Admit_dt yymmdd10.;
id=_n_;
cards;
101          20160201  0
101          20160210  190
101          20160215  0
101          20160217  29
101          20160330  23
101          20160405  32
101          20160410  0
101          20160510  12
101          20160510  0
;
run;

data test(drop=Admit_dt diag);
format Admit_dt1 yymmdd10.;
set have;
Admit_dt1=Admit_dt;
diag1=diag;
run;


proc sql;
create table want as
select a.PatientID,a.id,Admit_dt,Admit_dt1,diag,diag1
from have a, test b
where a.PatientID=b.PatientID and diag1 ne 0
order by PatientID, id, Admit_dt1;
quit;

data want1(drop=Admit_dt1 diag1 id);
set want;
retain diag_new;
by PatientID id;
if first.id then diag_new=diag;
if first.id and diag ne 0 then output;
if diag_new=0 then do;
	if Admit_dt1=Admit_dt and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+2 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+5 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+9 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
end;
if last.id and diag_new = 0 then output;
run;

Please let us know if this helped.

tk89 · Posted 02-25-2020 12:01 AM

Thank you for your reply. Your code worked for a sample of my data, but when I run it for the whole table (around 5 million rows), it delves into producing a cartesian product. I am trying to troubleshoot and will update as soon as I can.

mkeintz · Posted 02-20-2020 02:41 PM

Usually looking forward and backward are incremented symmetrically in 1-day intervals, I.e. look at offsets d=-1,d=1, d=-2,d=2, etc. which is easy to embed inside a DO D=1 to 9 loop, using a sign change inside.

But you want this sequence:

d=-1,d=-2, d=1,d=2, d=-3,d=-4,d=-5, d=3,d=4,d=5 ...

so the loop definition has to be more detailed as here:

data have;
  input PatientID  Admit_dt :yymmdd8.  diag;
  format admit_dt date9.;
datalines; 
101          20160201  0
101          20160210  190
101          20160215  0
101          20160330  23
101          20160405  32
101          20160410  0
101          20160510  12
101          20160510  0
;

%let begdate=01jan2016;
%let enddate=31dec2016;

data want (drop=d);
  set have (where=(diag^=0)  in=in_diag)
      have (in=in2);
  by patientid;

  array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;

  if first.patientid then call missing(of diag_hist{*});
  if in_diag then diag_hist{admit_dt}=diag;

  if in2;
  if diag=0 then do d=0 ,-1,-2,1,2  ,-3,-4,-5,3,4,5  ,-6,-7,-8,-9,6,7,8,9;
    if diag_hist{admit_dt+d}^=. then do; 
      new_diag=diag_hist{admit_dt+d};
      leave;
    end;
  end;
  if new_diag=. then new_diag=diag;
run;

The array DIAG_HIST array will contain all the non-zero DIAG codes in indexed by date, with upper and lower bounds of 01jan2016 and 31dec2016 respectively. It's a temporary array so it retains its value accorss multiple incoming observatinos.
The SET statement reads, for each ID, (a) first all the non-zero diag's and puts them in the array, (b) then all diag's and in the case of diag=0, searches the array for nearby diagnosis codes.
As I mentioned above note the definition of the DO loop. In addition to batching numbers -2,-1,1,2,... I start out with D=0. This is to first capture instances in which a date might have 2 records, in which one has a non-null diagnosis.
If you want to know which observations has been used to correct a null diagnosis, then eliminate the "drop=d" data set name parameter.
Also note I set two macrovars BEGDATE and ENDDATE, which are then used inside a %sysevalf macro functions to establish the bound of the diag_hist array.

Edited additional note. I guess one could somehow set up the loop iteration values using macrovars - especially if you want to modify the date-search cutpoints, currently at {-2,2}, {-5,5}, and {-9,9}, but for now I think that goes beyond the proper scope of this topic.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

tk89 · Posted 02-25-2020 12:03 AM

Thank you for the explanation. I will admit that I am rusty with arrays, so I will adopt this logic for my data and update here.

ChrisNZ · Posted 02-25-2020 02:16 AM

You can replace the loop:

if ^DIAG then NEW_DIAG=coalesce(
 diag_hist{admit_dt-1}
,diag_hist{admit_dt-2}
,diag_hist{admit_dt+1}
,diag_hist{admit_dt+2}
,diag_hist{admit_dt-3}
,diag_hist{admit_dt-4}
,diag_hist{admit_dt-5}
,diag_hist{admit_dt+3}
,diag_hist{admit_dt+4}
,diag_hist{admit_dt+5}
,diag_hist{admit_dt-6}
,diag_hist{admit_dt-7}
,diag_hist{admit_dt-8}
,diag_hist{admit_dt-9)
);

for, maybe, better speed or better clarity.

High-Performance SAS Coding - Third Edition

mkeintz · Posted 02-26-2020 10:02 AM

Chris:

One of the OP's data rows has DIAG=0 but has a duplicate date of another with DIAG^=0. I think you'll need to insert "diag_hist(admit_dt}" at the front of the coalesce arguments.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

tk89 · Posted 02-25-2020 08:10 PM

Hi @mkeintz ,

I adopted your code for my data and it fails for one patientID when tested on a sample. The diag is a character field so I have made changes accordingly. I am getting the 'Array subscript out of range' error message and trying to figure that out. Please let me know what I'm doing wrong. Thank you!

data have;
   set nonspec5;
   where PatientID in ('4000052628862:40000526288622'
                      ,'4000052628862:40000526288627'
                      ,'4000122628869:40001226288692'
                      ,'4000322628869:40003226288692'
);
   format clistart date9.;
   keep PatientID clistart validDiag ;
run;

%let begdate=01jan2012;
%let enddate=31dec2012;

data want (drop=d);
  set have (where=(validDiag^=' ')  in=in_diag)
      have (in=in2);
  by patientid;

  array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;

  if first.patientid then call missing(of diag_hist{*});
  if in_diag then diag_hist{clistart}=validDiag;

  if in2;
  if validDiag=' ' then do d=0,-2,2,-5,5,-9;
    if diag_hist{clistart+d}^=' ' then do; 
      new_diag=diag_hist{clistart+d};
      leave;
    end;
  end;
  if new_diag=' ' then new_diag=validDiag;
run;

Here's my log:

28 data want (drop=d);
29 set have (where=(validDiag^=' ') in=in_diag)
30 have (in=in2);
31 by patientid;
32
33 array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;
34
35 if first.patientid then call missing(of diag_hist{*});
36 if in_diag then diag_hist{clistart}=validDiag;
37
38 if in2;
39 if validDiag=' ' then do d=0,-2,2,-5,5,-9;
40 if diag_hist{clistart+d}^=' ' then do;
41 new_diag=diag_hist{clistart+d};
42 leave;
43 end;
44 end;
45 if new_diag=' ' then new_diag=validDiag;
46 run;

ERROR: Array subscript out of range at line 40 column 8.
in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9
new_diag= _ERROR_=1 _N_=45
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 14 observations read from the data set WORK.HAVE.
WHERE validDiag not = ' ';
NOTE: There were 32 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 30 observations and 4 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.

ChrisNZ · Posted 02-25-2020 09:19 PM

Read the log, it's all there! 🙂

ERROR: Array subscript out of range at line 40 column 8.

Line 40 is:

40 if diag_hist{clistart+d}^=' ' then do;

The values are:

in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9 new_diag= _ERROR_=1 _N_=45

06JAN2012 - 9 is outside of the array bounds.

Either have an array then spans more then a year, or test the index value before trying to read the array.

Which is exactly was the message reads: Array subscript out of range

High-Performance SAS Coding - Third Edition

mkeintz · Posted 02-26-2020 10:05 AM

You have an observation on the 6th of Jan in 2016. The loop searches up to 9 days before (ie. late December 2015). So either establish an earlier lower bound for the array, or test for date exceptions.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

tk89 · Posted 02-26-2020 03:53 PM

@mkeintz Thank you! Now I understand. One last question: Here, we are looking at a particular point of data in the array, how can I modify it to look at a range? For example, for date 15feb2016, at d=-5, we are looking at 10feb2016 in the array. Instead, how can I look at all the claims that exist between 15feb2016 and 10feb2016 (i.e., on 15,14,13,12,11, and 10feb2016)? I ask because for d=-5, I look at all claims within the 5 days and pick the one with the least Rank( a variable I pull from a different table). Please let me know if this is out of scope and I need to post this as a separate question. I'm sorry that my question was not clear from the get go, will do better henceforth.

mkeintz · Posted 02-27-2020 05:43 AM

You could create an array RANK_HIST analogous to the DIAG_HIST array.

But you would still have to change the logic of the loop. Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.

D_MIN starts out as missing. As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D. But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.

At the end of each range, if D_MIN is not missing, no more ranges are checked. And finally if D_MIN is not missing, assign a value to NEW_DIAG.

Something like this (totally untested):

  new_diag =coalesce(diag_hist{date},diag);

  array strt   {3} _temporary_  (1,3,6);
  array finish {3} _temporary_  (2,5,9);


  d_min=.;

  if new_diag^=0 then do r=1 to 3 until (d_min^=.);   
    do sign=-1,1 until (d_min^=.);
      do d = sign*strt{r} to sign*finish{r} by sign;
        if diag_hist{date+d}^=0 then do;
          if d_min=. then d_min=d;   else
          if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
        end;
      end;
    end;
  end;
  if d_min^=. then new_diag=diag_hist{date+d_min};

Note the first inner loop ("do sign=-1,1 …") provides a way to set up this sequence of 6 ranges:

do d=-1 to -2 by -1
do d=1 to 2 by 1
do d=-3 to -5 by -1
do d=3 to 5 by 1
do d=-6 to -9 by -1
do d=6 to 9

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

Re: Replace null with value by looking back and ahead a certain number of days

SAS Innovate 2025: Call for Content

Classroom Training Available!