Re: Create a variable for before or after a date specific for each pat...

Nina4 · Posted 11-25-2022 12:22 PM

data example1;                      
   input Patient $ Testname $ Testdate MMDDYY8. claim $ Drugclaimdate MMDDYY8.;       
   cards;            

patienta a 03/01/20 ' ' . 
patienta ' ' . druga 03/02/20
patienta ' ' . druga 03/28/20
patienta ' ' . druga 09/15/20
patientb a 03/01/20 ' ' . 
patientb b 06/01/20 ' ' . 
patientb ' ' . druga 02/28/20
patientb ' ' . drugb 03/05/20
patientb ' ' . druga 04/25/20
patientb ' ' . drugb 06/17/20
patientb ' ' . druga 07/05/20
patientb ' ' . druga 09/18/20
patientc ' ' . druga 03/01/20
patientc ' ' . drugb 04/01/20
patientc ' ' . drugc 05/01/20

;    
run;

Hello dear SAS community

I need some help with the following question. I have a dataset with patients and their drug claims. Moreover, in this dataset are certain laboratory tests of the patients. I would like to make a variable interval30 with the values

Interval30=1 if the drug claim is within 30 days before a test of this patient

Interval30=2 if the drug claim is within 30 days after a test of this patient

Interval30=0 if no drug claim is within 30 days before or after a test of this patient.

One person can have only lab test, only drug claims or both (in any number).

The resulting dataset should look like the following.

Patient	Testname	Testdate ddmmyy	Drugclaim	Drugclaimdate ddmmyy	Interval30
patienta	a	01/03/20
patienta			Druga	02/03/20	2
patienta			Druga	28/03/20	2
patienta			Druga	15/09/20	0
patientb	a	01/03/20
patientb	b	01/06/20
patientb			Druga	28/02/20	1
patientb			drugb	05/03/20	2
patientb			Druga	25/04/20	0
patientb			drugb	17/06/20	2
patientb			Druga	05/07/20	0
patientb			Druga	18/09/20	0
patientc			Druga	01/03/20	0
patientc			Drugb	01/04/20	0
patientc			drugc	01/05/20	0

Thank you for your help!

FreelanceReinh · Posted 11-25-2022 03:07 PM

Hello @Nina4,

Let me first improve your data step creating the sample data:

data example1;
input Patient $ Testname $ Testdate :MMDDYY. claim $ Drugclaimdate :MMDDYY.;
format Testdate Drugclaimdate ddmmyy8.;
cards;
patienta a 03/01/20 . .
patienta . . druga 03/02/20
patienta . . druga 03/28/20
patienta . . druga 09/15/20
patientb a 03/01/20 . .
patientb b 06/01/20 . .
patientb . . druga 02/28/20
patientb . . drugb 03/05/20
patientb . . druga 04/25/20
patientb . . drugb 06/17/20
patientb . . druga 07/05/20
patientb . . druga 09/18/20
patientc . . druga 03/01/20
patientc . . drugb 04/01/20
patientc . . drugc 05/01/20
;

Variable interval30 could be derived from the elements of a temporary array which is indexed with SAS date values and populated with the values 1 or 2 depending on the test dates of the current patient. Example:

data want(drop=i);
set example1;
by patient;
array _t[0:%sysfunc(today())] _temporary_;
if first.patient then call missing(of _t[*]);
if n(testdate) then do i=0 to 30;
  _t[testdate-i]=1;
  _t[testdate+i]=2;
end;
if n(drugclaimdate) then interval30=max(_t[drugclaimdate],0);
run;

This works for your example data. The array _t above covers the dates between 0 = '01JAN1960'd and today. Instead of 0 and %sysfunc(today()) you can use expressions like %sysevalf('01JAN2019'd) to limit the array to a different date range. Obviously, the data step assumes that dataset example1 is sorted by patient and that the observations with the relevant test dates precede the observations with the drug claim dates.

You need to decide what interval30 should be if the criteria for values 1 and 2 are both met. For example, this occurs always (for a single test date) if drugclaimdate=testdate and the code above defines interval30=2 in this particular case, but this could be changed easily. In the case of overlapping 30-day ranges for different tests of the same patient, the suggested code overwrites the values of the patient's first test (in the dataset) with those of the subsequent tests (e.g., if test b of patientb occurred on 15 Apr 2020). Again, this could be changed to implement a different rule, e.g., always prioritizing value 2.

PaigeMiller · Posted 11-27-2022 10:34 AM

Why is %SYSFUNC needed in this code?

--
Paige Miller

Tom · Posted 11-27-2022 10:53 AM

@PaigeMiller wrote:

Why is %SYSFUNC needed in this code?

Because you need an actual number when specifying the dimensions of an array. You cannot have a function call there.

ErikLund_Jensen · Posted 11-27-2022 03:59 PM

Hi @Nina4

Here is a different approach. Test dates are transposed per patient and read into an array, and each claim is tested against the list of test dates for the current patient. The tests for days-before and days-after are stored i 2 different variables, and then the resulting Interval30 is computed.

In this code days-after (2) has priority, so 2 and 1 makes 2, but it is very easy to change, e.g. make a new category 3 if both days-before and days-after are true.

The first data step reproduces your wanted output, but I took the liberty to write an alternative data step containing claim-resords only with the test dates included, that caused the Interval30 variable to be set for the current claim as a suggestion, because I would prefer this structure as a base for further analysis.

data example1;
input Patient $ Testname $ Testdate :MMDDYY. claim $ Drugclaimdate :MMDDYY.;
format Testdate Drugclaimdate ddmmyy8.;
cards;
patienta a 03/01/20 . .
patienta . . druga 03/02/20
patienta . . druga 03/28/20
patienta . . druga 09/15/20
patientb a 03/01/20 . .
patientb b 06/01/20 . .
patientb . . druga 02/28/20
patientb . . drugb 03/05/20
patientb . . druga 04/25/20
patientb . . drugb 06/17/20
patientb . . druga 07/05/20
patientb . . druga 09/18/20
patientc . . druga 03/01/20
patientc . . drugb 04/01/20
patientc . . drugc 05/01/20
;

proc transpose 
    data=example1 (keep=Patient Testdate where=(not missing(testdate))) 
    out=test (drop=_name_);
  ; 
  by Patient;
run;

* Reproduces your wanted output 1:1;
data want (drop=i prev post COL:);
  merge example1 test; 
  by Patient;
  array test COL:;
  call sortn(of test{*});
  if missing(Testdate) then do i = 1 to dim(test);
    if not missing(test{i}) then do;
      if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then prev = 1;
      if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then post = 2;
     end;
  end;
  Interval30 = max(0, prev, post);
run;

* An alternative;
data want2 (drop=i testname testdate prev post COL:);
  merge example1 (where=(missing(Testdate))) test; 
  by Patient;
  format testdate_before testdate_after ddmmyy10.;
  array test COL:;
  call sortn(of test{*});
  do i = 1 to dim(test);
    if not missing(test{i}) then do;
      if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then do;
        testdate_before = test{i};
        prev = 1;
      end;
      if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then do;
        testdate_after = test{i};
        post = 2;
      end;
    end;
  end;
  Interval30 = max(0, prev, post);
run;

s_lassen · Posted 11-28-2022 05:04 AM

I think you get the simplest solution using SQL, e.g.

proc sql;
create table want as select
  *,
  case
    when Drugclaimdate is null then ' '
	when exists(select * from Example1 tests
	             where patient=full.patient
				   and testdate-full.drugclaimdate between -30 and 0) then '1'
	when exists(select * from Example1 tests
	             where patient=full.patient
				   and testdate-full.drugclaimdate between 1 and 30) then '2'
	else '0'
  end as Interval30
  from example1 full
  order by patient,drugclaimdate,testdate;

But if your input table is very large, some of the other solutions shown may work faster (perhaps, I have not tested).

You may want to adjust the interval specifications according to your exact demands (you do not specify how to classify records with a test on the same day as the drug claim).

ErikLund_Jensen · Posted 11-28-2022 06:46 AM

Hi @s_lassen

I was curious, so I did the test on a similar data set with 9999 patients. The solution with Proc Transpose + Data Step used 0.2 second in total (0.1 second CPU), while the Proc SQL solution used more than half an hour CPU time. So while SQL seems simpler to some, the costs can be prohibitive.

But in a choice between several working solutions there are 3 factors to balance against each other in the given situation:

Is the best solution the least ressource-consuming?
or the solution that is easiest to understand and manitain?
or the solution using tools that one is accustomed to and know by heart?

259  proc sql;
260  create table want as
261    select
262      *,
263      case
264        when Drugclaimdate is null then ' '
265          when exists(
266            select * from Example1 as tests
267              where
268                patient=full.patient
269                and
270                testdate - full.drugclaimdate between -30 and 0
271          ) then '1'
272          when exists(
273            select * from Example1 as tests
274              where
275                patient=full.patient
276                and
277                testdate - full.drugclaimdate between 1 and 30
278          ) then '2'
279        else '0'
280      end as Interval30
281    from example1 as full
282    order by patient, drugclaimdate, testdate;
NOTE: Table WORK.WANT created, with 149985 rows and 7 columns.

283    quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           34:17.53
      cpu time            32:49.67


284
285    proc transpose
286      data=example1 (keep=Patient Testdate where=(not missing(testdate)))
287      out=test (drop=_name_);
288    ;
289    by Patient;
290  run;

NOTE: There were 29997 observations read from the data set WORK.EXAMPLE1.
      WHERE not MISSING(testdate);
NOTE: The data set WORK.TEST has 19998 observations and 3 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           0.09 seconds
      cpu time            0.04 seconds


291
292  * Reproduces your wanted output 1:1;
293  data want (drop=i prev post COL:);
294    merge example1 test;
295    by Patient;
296    array test COL:;
297    call sortn(of test{*});
298    if missing(Testdate) then do i = 1 to dim(test);
299      if not missing(test{i}) then do;
300        if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then prev = 1;
301        if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then post = 2;
302       end;
303    end;
304    Interval30 = max(0, prev, post);
305  run;

NOTE: There were 149985 observations read from the data set WORK.EXAMPLE1.
NOTE: There were 19998 observations read from the data set WORK.TEST.
NOTE: The data set WORK.WANT has 149985 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.06 seconds

s_lassen · Posted 11-28-2022 07:43 AM

You may be able to speed up the SQL solution by indexing your data, e.g.

proc sql;
  create index patient on example1(patient):
quit;

or perhaps:

proc sql;
  create index test_idx on example1(patient,testdate):
quit;

With the latter index it may (depending on the savvy of the SQL interpreter), further improvements may also be possible by changing the EXISTS conditions to something like

where patient=full.patient
  and testdate between full.drugclaimdate+1 and  full.drugclaimdate+30

But if the performance is an issue, one of the other solutions may be better.

Create a variable for before or after a date specific for each patient

Re: Create a variable for before or after a date specific for each patient

Re: Create a variable for before or after a date specific for each patient

Re: Create a variable for before or after a date specific for each patient

Re: Create a variable for before or after a date specific for each patient

Re: Create a variable for before or after a date specific for each patient

Re: Create a variable for before or after a date specific for each patient

Re: Create a variable for before or after a date specific for each patient

SAS Innovate 2025: Call for Content

Classroom Training Available!