BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
tk89
Obsidian | Level 7

Hello,

 

I have a table with patientID, admit date and diagnosis code. If the code is null, we look back 2 days to see if there is a valid code within the last two days. If that's null too, we look forward two days, then 5 days back, 5 days forward, and finally 9 days back. We do this each time we encounter a null value for the diagnosis code. I have tried doing a proc sql self join and using lag functions to lag by 2,5, and 9 but it did not help as the dates are not consecutive. So, for a null code, the sequence is to check date-2, then date+2, date-5, date+5 and finally date-9 until a valid code is found. I am using SAS Eg 8.1 Any inputs are greatly appreciated. Thank you!

 

Have:

PatientID  Admit_dt   diag 
101          20160201  0

101          20160210  190

101          20160215  0

101          20160217  29

101          20160330  23

101          20160405  32

101          20160410  0

101          20160510  12

101          20160510  0

 

Want:

PatientID  Admit_dt       diag     diag_new
101          2016-02-01      0        0

101          2016-02-10      190    190

101          2016-02-15      0        29

101          2016-02-17      29      29

101          2016-03-30      23      23

101          2016-04-05      32      32

101          2016-04-10      0        32

101          2016-05-10      12      12

101          2016-05-10      0        12

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

You could create an array RANK_HIST analogous to the DIAG_HIST array.

 

But you would still have to change the logic of the loop.  Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.

 

D_MIN starts out as missing.  As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D.  But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.

 

At the end of each range, if D_MIN is not missing, no more ranges are checked.  And finally if D_MIN is not missing, assign a value to NEW_DIAG.

 

Something like this (totally untested):

 

  new_diag =coalesce(diag_hist{date},diag);

  array strt   {3} _temporary_  (1,3,6);
  array finish {3} _temporary_  (2,5,9);


  d_min=.;

  if new_diag^=0 then do r=1 to 3 until (d_min^=.);   
    do sign=-1,1 until (d_min^=.);
      do d = sign*strt{r} to sign*finish{r} by sign;
        if diag_hist{date+d}^=0 then do;
          if d_min=. then d_min=d;   else
          if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
        end;
      end;
    end;
  end;
  if d_min^=. then new_diag=diag_hist{date+d_min};

 

Note the first inner loop ("do sign=-1,1 …")  provides a way to set up this sequence of 6 ranges:

  • do d=-1 to -2 by -1
  • do d=1 to 2 by 1
  • do d=-3 to -5 by -1
  • do d=3 to 5 by 1
  • do d=-6 to -9 by -1
  • do d=6 to 9

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

13 REPLIES 13
andreas_lds
Jade | Level 19

First thing to clarify: are those dates real sas dates? If not: fix this first. Second thing to note: there is a huge difference between null and zero, in the data you have posted there are no null-values. Do you mean zero instead of null? What exactly do you mean by "date+2"? Using the date in the first row, do you want to look for with Admit_dt = 20160203 or do you want to look two observations ahead?

tk89
Obsidian | Level 7

Apologies for the wrong terminology. The code is numeric, so if it's a 0 or missing value, then it is considered invalid. Those are dates that I read in as yymddn8. By date+2, I mean looking forward till two days. So, for the first row, since it's the beginning and there's nothing to look back, I start by looking forward till 2 days, i.e., 20160201 - 20160203 and look at claims between the two dates. If there are claims, and any of them have a valid code, I pick that and copy its code and other values (rank, SOI etc) to the first row. If there are no claims or claims with missing or 0 code, then I look for claims between 20160201 and 20160206(i.e., date+5) and repeat the same process.

 

Please note that I am not jumping backward and forward by the number of rows, rather by the number of days. So, if there are 10 rows between 20160201 and 20160203, then I look at all the 10 rows for my date+2 check. Hope this clarifies my question further. Thank you!

Satish_Parida
Lapis Lazuli | Level 10

I have added conditions for only look forward, you can add the look backward conditions too.

 

data have;
input PatientID  Admit_dt:yymmdd8.   diag ;
format PatientID diag best12. Admit_dt yymmdd10.;
id=_n_;
cards;
101          20160201  0
101          20160210  190
101          20160215  0
101          20160217  29
101          20160330  23
101          20160405  32
101          20160410  0
101          20160510  12
101          20160510  0
;
run;

data test(drop=Admit_dt diag);
format Admit_dt1 yymmdd10.;
set have;
Admit_dt1=Admit_dt;
diag1=diag;
run;


proc sql;
create table want as
select a.PatientID,a.id,Admit_dt,Admit_dt1,diag,diag1
from have a, test b
where a.PatientID=b.PatientID and diag1 ne 0
order by PatientID, id, Admit_dt1;
quit;

data want1(drop=Admit_dt1 diag1 id);
set want;
retain diag_new;
by PatientID id;
if first.id then diag_new=diag;
if first.id and diag ne 0 then output;
if diag_new=0 then do;
	if Admit_dt1=Admit_dt and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+2 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+5 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
	else if Admit_dt1=Admit_dt+9 and diag1 ne 0 then do;
		diag_new=diag1;
		output;
	end;
end;
if last.id and diag_new = 0 then output;
run;

Please let us know if this helped.

 

tk89
Obsidian | Level 7

Thank you for your reply. Your code worked for a sample of my data, but when I run it for the whole table (around 5 million rows), it delves into producing a cartesian product. I am trying to troubleshoot and will update as soon as I can.

mkeintz
PROC Star

Usually looking forward and backward are incremented symmetrically in 1-day intervals,  I.e. look at offsets   d=-1,d=1,  d=-2,d=2, etc.  which is easy to embed inside a DO D=1 to 9 loop, using a sign change inside.

But you want this sequence:

    d=-1,d=-2,   d=1,d=2,    d=-3,d=-4,d=-5,  d=3,d=4,d=5 ...

so the loop definition has to be more detailed as here:

data have;
  input PatientID  Admit_dt :yymmdd8.  diag;
  format admit_dt date9.;
datalines; 
101          20160201  0
101          20160210  190
101          20160215  0
101          20160330  23
101          20160405  32
101          20160410  0
101          20160510  12
101          20160510  0
;

%let begdate=01jan2016;
%let enddate=31dec2016;

data want (drop=d);
  set have (where=(diag^=0)  in=in_diag)
      have (in=in2);
  by patientid;

  array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} _temporary_;

  if first.patientid then call missing(of diag_hist{*});
  if in_diag then diag_hist{admit_dt}=diag;

  if in2;
  if diag=0 then do d=0 ,-1,-2,1,2  ,-3,-4,-5,3,4,5  ,-6,-7,-8,-9,6,7,8,9;
    if diag_hist{admit_dt+d}^=. then do; 
      new_diag=diag_hist{admit_dt+d};
      leave;
    end;
  end;
  if new_diag=. then new_diag=diag;
run;
  1. The array DIAG_HIST array will contain all the non-zero DIAG codes in indexed by date, with upper and lower bounds of 01jan2016 and 31dec2016 respectively.  It's a temporary array so it retains its value accorss multiple incoming observatinos.

  2. The SET statement reads, for each ID,   (a) first all the non-zero diag's and puts them in the array, (b) then all diag's and in the case of diag=0, searches the array for nearby diagnosis codes.
  3. As I mentioned above note the definition of the DO loop.  In addition to batching numbers    -2,-1,1,2,... I start out with D=0.  This is to first capture instances in which a date might have 2 records, in which one has a non-null diagnosis.
  4. If you want to know which observations has been used to correct a null diagnosis, then eliminate the "drop=d" data set name parameter.
  5. Also note I set two macrovars BEGDATE and ENDDATE, which are then used inside a %sysevalf macro functions to establish the bound of the diag_hist array.

Edited additional note.  I guess one could somehow set up the loop iteration values using macrovars - especially if you want to modify the date-search cutpoints,  currently at {-2,2}, {-5,5},  and {-9,9}, but for now I think that goes beyond the proper scope of this topic.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tk89
Obsidian | Level 7

Thank you for the explanation. I will admit that I am rusty with arrays, so I will adopt this logic for my data and update here.

ChrisNZ
Tourmaline | Level 20

 

You can replace the loop:

if ^DIAG then NEW_DIAG=coalesce(
 diag_hist{admit_dt-1}
,diag_hist{admit_dt-2}
,diag_hist{admit_dt+1}
,diag_hist{admit_dt+2}
,diag_hist{admit_dt-3}
,diag_hist{admit_dt-4}
,diag_hist{admit_dt-5}
,diag_hist{admit_dt+3}
,diag_hist{admit_dt+4}
,diag_hist{admit_dt+5}
,diag_hist{admit_dt-6}
,diag_hist{admit_dt-7}
,diag_hist{admit_dt-8}
,diag_hist{admit_dt-9)
); 

for, maybe, better speed or better clarity.

 

mkeintz
PROC Star

Chris:

 

One of the OP's data rows has DIAG=0 but has a duplicate date of another with DIAG^=0.  I think you'll need to insert "diag_hist(admit_dt}" at the front of the coalesce arguments.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tk89
Obsidian | Level 7

Hi @mkeintz ,

 

I adopted your code for my data and it fails for one patientID when tested on a sample. The diag is a character field so I have made changes accordingly. I am getting the 'Array subscript out of range' error message and trying to figure that out. Please let me know what I'm doing wrong. Thank you!

data have;
   set nonspec5;
   where PatientID in ('4000052628862:40000526288622'
                      ,'4000052628862:40000526288627'
                      ,'4000122628869:40001226288692'
                      ,'4000322628869:40003226288692'
);
   format clistart date9.;
   keep PatientID clistart validDiag ;
run;

%let begdate=01jan2012;
%let enddate=31dec2012;

data want (drop=d);
  set have (where=(validDiag^=' ')  in=in_diag)
      have (in=in2);
  by patientid;

  array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;

  if first.patientid then call missing(of diag_hist{*});
  if in_diag then diag_hist{clistart}=validDiag;

  if in2;
  if validDiag=' ' then do d=0,-2,2,-5,5,-9;
    if diag_hist{clistart+d}^=' ' then do; 
      new_diag=diag_hist{clistart+d};
      leave;
    end;
  end;
  if new_diag=' ' then new_diag=validDiag;
run;

Here's my log:

 

28 data want (drop=d);
29 set have (where=(validDiag^=' ') in=in_diag)
30 have (in=in2);
31 by patientid;
32
33 array diag_hist {%sysevalf("&begdate"d):%sysevalf("&enddate"d)} $ _temporary_;
34
35 if first.patientid then call missing(of diag_hist{*});
36 if in_diag then diag_hist{clistart}=validDiag;
37
38 if in2;
39 if validDiag=' ' then do d=0,-2,2,-5,5,-9;
40 if diag_hist{clistart+d}^=' ' then do;
41 new_diag=diag_hist{clistart+d};
42 leave;
43 end;
44 end;
45 if new_diag=' ' then new_diag=validDiag;
46 run;

ERROR: Array subscript out of range at line 40 column 8.
in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9
new_diag= _ERROR_=1 _N_=45
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 14 observations read from the data set WORK.HAVE.
WHERE validDiag not = ' ';
NOTE: There were 32 observations read from the data set WORK.HAVE.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 30 observations and 4 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.

ChrisNZ
Tourmaline | Level 20

Read the log, it's all there! 🙂

 

ERROR: Array subscript out of range at line 40 column 8.

 

Line 40 is: 

40 if diag_hist{clistart+d}^=' ' then do;

 

The values are:

in_diag=0 in2=1 PatientID=4000322628869:40003226288692 CliStart=06JAN2012 validDiag= FIRST.PatientID=0 LAST.PatientID=0 d=-9 new_diag= _ERROR_=1 _N_=45

 

06JAN2012 - 9 is outside of the array bounds.

 

Either have an array then spans more then a year, or test the index value before trying to read the array.

Which is exactly was the message reads: Array subscript out of range

 

mkeintz
PROC Star

You have an observation on the 6th of Jan in 2016.  The loop searches up to 9 days before (ie. late December 2015).  So either establish an earlier lower bound for the array, or test for date exceptions.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
tk89
Obsidian | Level 7

@mkeintz Thank you! Now I understand. One last question: Here, we are looking at a particular point of data in the array, how can I modify it to look at a range? For example, for date 15feb2016, at d=-5, we are looking at 10feb2016 in the array. Instead, how can I look at all the claims that exist between 15feb2016 and 10feb2016 (i.e., on 15,14,13,12,11, and 10feb2016)? I ask because for d=-5, I look at all claims within the 5 days and pick the one with the least Rank( a variable I pull from a different table). Please let me know if this is out of scope and I need to post this as a separate question. I'm sorry that my question was not clear from the get go, will do better henceforth. 

mkeintz
PROC Star

You could create an array RANK_HIST analogous to the DIAG_HIST array.

 

But you would still have to change the logic of the loop.  Instead of leaving the loop with the first non-zero DIAG_HIST you have to maintain another variable, say D_MIN, which is the value of D corresponding to the minimum RANK_HIST for non-zero DIAG_HIST.

 

D_MIN starts out as missing.  As you go over each range (i.e. the offset D goes from start-of-range to end-of-range), whenever you hit a non-zero DIAG_HIST, then if D_MIN is missing, set it to D.  But if D_MIN is not missing, compare the RANK_HIST for D to the same for D_MIN, and update D_MIN accordingly.

 

At the end of each range, if D_MIN is not missing, no more ranges are checked.  And finally if D_MIN is not missing, assign a value to NEW_DIAG.

 

Something like this (totally untested):

 

  new_diag =coalesce(diag_hist{date},diag);

  array strt   {3} _temporary_  (1,3,6);
  array finish {3} _temporary_  (2,5,9);


  d_min=.;

  if new_diag^=0 then do r=1 to 3 until (d_min^=.);   
    do sign=-1,1 until (d_min^=.);
      do d = sign*strt{r} to sign*finish{r} by sign;
        if diag_hist{date+d}^=0 then do;
          if d_min=. then d_min=d;   else
          if rank_hist{date+d} < rank_hist{date+d_min} then d_min=d;
        end;
      end;
    end;
  end;
  if d_min^=. then new_diag=diag_hist{date+d_min};

 

Note the first inner loop ("do sign=-1,1 …")  provides a way to set up this sequence of 6 ranges:

  • do d=-1 to -2 by -1
  • do d=1 to 2 by 1
  • do d=-3 to -5 by -1
  • do d=3 to 5 by 1
  • do d=-6 to -9 by -1
  • do d=6 to 9

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 13 replies
  • 2946 views
  • 5 likes
  • 5 in conversation