BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tk89
Obsidian | Level 7

Hello,

 

I have a table with patientID, admit date and diagnosis code. If the code is null, we look back 2 days to see if there is a valid code within the last two days. If that's null too, we look forward two days, then 5 days back, 5 days forward, and finally 9 days back. We do this each time we encounter a null value for the diagnosis code. I have tried doing a proc sql self join and using lag functions to lag by 2,5, and 9 but it did not help as the dates are not consecutive. So, for a null code, the sequence is to check date-2, then date+2, date-5, date+5 and finally date-9 until a valid code is found. I am using SAS Eg 8.1 Any inputs are greatly appreciated. Thank you!

 

Have:

PatientID  Admit_dt   diag 
101          20160201  0

101          20160210  190

101          20160215  0

101          20160217  29

101          20160330  23

101          20160405  32

101          20160410  0

101          20160510  12

101          20160510  0

 

Want:

PatientID  Admit_dt       diag     diag_new
101          2016-02-01      0        0

101          2016-02-10      190    190

101          2016-02-15      0        29

101          2016-02-17      29      29

101          2016-03-30      23      23

101          2016-04-05      32      32

101          2016-04-10      0        32

101          2016-05-10      12      12

101          2016-05-10      0        12

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You could create an array RANK_HIST analogous to the DIAG_HIST array.

 

But you would still have to change the logic of the loop.  Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.

 

D_MIN starts out as missing.  As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D.  But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.

 

At the end of each range, if D_MIN is not missing, no more ranges are checked.  And finally if D_MIN is not missing, assign a value to NEW_DIAG.

 

Something like this (totally untested):

 

  new_diag =coalesce(diag_hist{date},diag);

  array strt   {3} _temporary_  (1,3,6);
  array finish {3} _temporary_  (2,5,9);


  d_min=.;

  if new_diag^=0 then do r=1 to 3 until (d_min^=.);   
    do sign=-1,1 until (d_min^=.);
      do d = sign*strt{r} to sign*finish{r} by sign;
        if diag_hist{date+d}^=0 then do;
          if d_min=. then d_min=d;   else
          if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
        end;
      end;
    end;
  end;
  if d_min^=. then new_diag=diag_hist{date+d_min};

 

Note the first inner loop ("do sign=-1,1 …")  provides a way to set up this sequence of 6 ranges:

  • do d=-1 to -2 by -1
  • do d=1 to 2 by 1
  • do d=-3 to -5 by -1
  • do d=3 to 5 by 1
  • do d=-6 to -9 by -1
  • do d=6 to 9

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

13 REPLIES 13
andreas_lds
Jade | Level 19

First thing to clarify: are those dates real sas dates? If not: fix this first. Second thing to note: there is a huge difference between null and zero, in the data you have posted there are no null-values. Do you mean zero instead of null? What exactly do you mean by "date+2"? Using the date in the first row, do you want to look for with Admit_dt = 20160203 or do you want to look two observations ahead?

tk89
Obsidian | Level 7

Apologies for the wrong terminology. The code is numeric, so if it's a 0 or missing value, then it is considered invalid. Those are dates that I read in as yymddn8. By date+2, I mean looking forward till two days. So, for the first row, since it's the beginning and there's nothing to look back, I start by looking forward till 2 days, i.e., 20160201 - 20160203 and look at claims between the two dates. If there are claims, and any of them have a valid code, I pick that and copy its code and other values (rank, SOI etc) to the first row. If there are no claims or claims with missing or 0 code, then I look for claims between 20160201 and 20160206(i.e., date+5) and repeat the same process.

 

Please note that I am not jumping backward and forward by the number of rows, rather by the number of days. So, if there are 10 rows between 20160201 and 20160203, then I look at all the 10 rows for my date+2 check. Hope this clarifies my question further. Thank you!

Satish_Parida
Lapis Lazuli | Level 10

I have added conditions for only look forward, you can add the look backward conditions too.

 

data have;
input PatientID  Admit_dt:yymmdd8.   diag ;
format PatientID diag best12. Admit_dt yymmdd10.;
id=_n_;
cards;
101          20160201  0
101          20160210  190
101          20160215  0
101          20160217  29
101          20160330  23
101          20160405  32
101          20160410  0
101          20160510  12
101          20160510  0
;
run;

data test(drop=Admit_dt diag);
format Admit_dt1 yymmdd10.;
set have;
Admit_dt1=Admit_dt;
diag1=diag;
run;


proc sql;
create table want as
select a.PatientID,a.id,Admit_dt,Admit_dt1,diag,diag1
from have a, test b
where a.PatientID=b.PatientID and diag1 ne 0
order by PatientID, id, Admit_dt1;
quit;

data want1(drop=Admit_dt1 diag1 id);
set want;
retain diag_new;
by PatientID id;
if first.id then diag_new=diag;
if first.id and diag ne 0 then output;
if diag_new=0 then do;
	if Admit_dt1=Admit_dt and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+2 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+5 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+9 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
end;
if last.id and diag_new = 0 then output;
run;

Please let us know if this helped.

 

tk89
Obsidian | Level 7

Thank you for your reply. Your code worked for a sample of my data, but when I run it for the whole table (around 5 million rows), it delves into producing a cartesian product. I am trying to troubleshoot and will update as soon as I can.

mkeintz
PROC Star

Usually looking forward and backward are incremented symmetrically in 1-day intervals,  I.e. look at offsets   d=-1,d=1,  d=-2,d=2, etc.  which is easy to embed inside a DO D=1 to 9 loop, using a sign change inside.

But you want this sequence:

    d=-1,d=-2,   d=1,d=2,    d=-3,d=-4,d=-5,  d=3,d=4,d=5 ...

so the loop definition has to be more detailed as here:

data have;
  input PatientID  Admit_dt :yymmdd8.  diag;
  format admit_dt date9.;
datalines; 
101          20160201  0
101          20160210  190
101          20160215  0
101          20160330  23
101          20160405  32
101          20160410  0
101          20160510  12
101          20160510  0
;

%let begdate=01jan2016;
%let enddate=31dec2016;

data want (drop=d);
  set have (where=(diag^=0)  in=in_diag)
      have (in=in2);
  by patientid;

  array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;

  if first.patientid then call missing(of diag_hist{*});
  if in_diag then diag_hist{admit_dt}=diag;

  if in2;
  if diag=0 then do d=0 ,-1,-2,1,2  ,-3,-4,-5,3,4,5  ,-6,-7,-8,-9,6,7,8,9;
    if diag_hist{admit_dt+d}^=. then do; 
      new_diag=diag_hist{admit_dt+d};
      leave;
    end;
  end;
  if new_diag=. then new_diag=diag;
run;
  1. The array DIAG_HIST array will contain all the non-zero DIAG codes in indexed by date, with upper and lower bounds of 01jan2016 and 31dec2016 respectively.  It's a temporary array so it retains its value accorss multiple incoming observatinos.

  2. The SET statement reads, for each ID,   (a) first all the non-zero diag's and puts them in the array, (b) then all diag's and in the case of diag=0, searches the array for nearby diagnosis codes.
  3. As I mentioned above note the definition of the DO loop.  In addition to batching numbers    -2,-1,1,2,... I start out with D=0.  This is to first capture instances in which a date might have 2 records, in which one has a non-null diagnosis.
  4. If you want to know which observations has been used to correct a null diagnosis, then eliminate the "drop=d" data set name parameter.
  5. Also note I set two macrovars BEGDATE and ENDDATE, which are then used inside a %sysevalf macro functions to establish the bound of the diag_hist array.

Edited additional note.  I guess one could somehow set up the loop iteration values using macrovars - especially if you want to modify the date-search cutpoints,  currently at {-2,2}, {-5,5},  and {-9,9}, but for now I think that goes beyond the proper scope of this topic.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tk89
Obsidian | Level 7

Thank you for the explanation. I will admit that I am rusty with arrays, so I will adopt this logic for my data and update here.

ChrisNZ
Tourmaline | Level 20

 

You can replace the loop:

if ^DIAG then NEW_DIAG=coalesce(
 diag_hist{admit_dt-1}
,diag_hist{admit_dt-2}
,diag_hist{admit_dt+1}
,diag_hist{admit_dt+2}
,diag_hist{admit_dt-3}
,diag_hist{admit_dt-4}
,diag_hist{admit_dt-5}
,diag_hist{admit_dt+3}
,diag_hist{admit_dt+4}
,diag_hist{admit_dt+5}
,diag_hist{admit_dt-6}
,diag_hist{admit_dt-7}
,diag_hist{admit_dt-8}
,diag_hist{admit_dt-9)
); 

for, maybe, better speed or better clarity.

 

mkeintz
PROC Star

Chris:

 

One of the OP's data rows has DIAG=0 but has a duplicate date of another with DIAG^=0.  I think you'll need to insert "diag_hist(admit_dt}" at the front of the coalesce arguments.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tk89
Obsidian | Level 7

Hi @mkeintz ,

 

I adopted your code for my data and it fails for one patientID when tested on a sample. The diag is a character field so I have made changes accordingly. I am getting the 'Array subscript out of range' error message and trying to figure that out. Please let me know what I'm doing wrong. Thank you!

data have;
   set nonspec5;
   where PatientID in ('4000052628862:40000526288622'
                      ,'4000052628862:40000526288627'
                      ,'4000122628869:40001226288692'
                      ,'4000322628869:40003226288692'
);
   format clistart date9.;
   keep PatientID clistart validDiag ;
run;

%let begdate=01jan2012;
%let enddate=31dec2012;

data want (drop=d);
  set have (where=(validDiag^=' ')  in=in_diag)
      have (in=in2);
  by patientid;

  array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;

  if first.patientid then call missing(of diag_hist{*});
  if in_diag then diag_hist{clistart}=validDiag;

  if in2;
  if validDiag=' ' then do d=0,-2,2,-5,5,-9;
    if diag_hist{clistart+d}^=' ' then do; 
      new_diag=diag_hist{clistart+d};
      leave;
    end;
  end;
  if new_diag=' ' then new_diag=validDiag;
run;

Here's my log:

 

28 data want (drop=d);
29 set have (where=(validDiag^=' ') in=in_diag)
30 have (in=in2);
31 by patientid;
32
33 array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;
34
35 if first.patientid then call missing(of diag_hist{*});
36 if in_diag then diag_hist{clistart}=validDiag;
37
38 if in2;
39 if validDiag=' ' then do d=0,-2,2,-5,5,-9;
40 if diag_hist{clistart+d}^=' ' then do;
41 new_diag=diag_hist{clistart+d};
42 leave;
43 end;
44 end;
45 if new_diag=' ' then new_diag=validDiag;
46 run;

ERROR: Array subscript out of range at line 40 column 8.
in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9
new_diag= _ERROR_=1 _N_=45
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 14 observations read from the data set WORK.HAVE.
WHERE validDiag not = ' ';
NOTE: There were 32 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 30 observations and 4 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.

ChrisNZ
Tourmaline | Level 20

Read the log, it's all there! 🙂

 

ERROR: Array subscript out of range at line 40 column 8.

 

Line 40 is: 

40 if diag_hist{clistart+d}^=' ' then do;

 

The values are:

in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9 new_diag= _ERROR_=1 _N_=45

 

06JAN2012 - 9 is outside of the array bounds.

 

Either have an array then spans more then a year, or test the index value before trying to read the array.

Which is exactly was the message reads: Array subscript out of range

 

mkeintz
PROC Star

You have an observation on the 6th of Jan in 2016.  The loop searches up to 9 days before (ie. late December 2015).  So either establish an earlier lower bound for the array, or test for date exceptions.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tk89
Obsidian | Level 7

@mkeintz Thank you! Now I understand. One last question: Here, we are looking at a particular point of data in the array, how can I modify it to look at a range? For example, for date 15feb2016, at d=-5, we are looking at 10feb2016 in the array. Instead, how can I look at all the claims that exist between 15feb2016 and 10feb2016 (i.e., on 15,14,13,12,11, and 10feb2016)? I ask because for d=-5, I look at all claims within the 5 days and pick the one with the least Rank( a variable I pull from a different table). Please let me know if this is out of scope and I need to post this as a separate question. I'm sorry that my question was not clear from the get go, will do better henceforth. 

mkeintz
PROC Star

You could create an array RANK_HIST analogous to the DIAG_HIST array.

 

But you would still have to change the logic of the loop.  Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.

 

D_MIN starts out as missing.  As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D.  But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.

 

At the end of each range, if D_MIN is not missing, no more ranges are checked.  And finally if D_MIN is not missing, assign a value to NEW_DIAG.

 

Something like this (totally untested):

 

  new_diag =coalesce(diag_hist{date},diag);

  array strt   {3} _temporary_  (1,3,6);
  array finish {3} _temporary_  (2,5,9);


  d_min=.;

  if new_diag^=0 then do r=1 to 3 until (d_min^=.);   
    do sign=-1,1 until (d_min^=.);
      do d = sign*strt{r} to sign*finish{r} by sign;
        if diag_hist{date+d}^=0 then do;
          if d_min=. then d_min=d;   else
          if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
        end;
      end;
    end;
  end;
  if d_min^=. then new_diag=diag_hist{date+d_min};

 

Note the first inner loop ("do sign=-1,1 …")  provides a way to set up this sequence of 6 ranges:

  • do d=-1 to -2 by -1
  • do d=1 to 2 by 1
  • do d=-3 to -5 by -1
  • do d=3 to 5 by 1
  • do d=-6 to -9 by -1
  • do d=6 to 9

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2990 views
  • 5 likes
  • 5 in conversation