BookmarkSubscribeRSS Feed
Nina4
Obsidian | Level 7
data example1;                      
   input Patient $ Testname $ Testdate MMDDYY8. claim $ Drugclaimdate MMDDYY8.;       
   cards;            

patienta a 03/01/20 ' ' . 
patienta ' ' . druga 03/02/20
patienta ' ' . druga 03/28/20
patienta ' ' . druga 09/15/20
patientb a 03/01/20 ' ' . 
patientb b 06/01/20 ' ' . 
patientb ' ' . druga 02/28/20
patientb ' ' . drugb 03/05/20
patientb ' ' . druga 04/25/20
patientb ' ' . drugb 06/17/20
patientb ' ' . druga 07/05/20
patientb ' ' . druga 09/18/20
patientc ' ' . druga 03/01/20
patientc ' ' . drugb 04/01/20
patientc ' ' . drugc 05/01/20

;    
run; 

Hello dear SAS community 

 

I need some help with the following question. I have a dataset with patients and their drug claims. Moreover, in this dataset are certain laboratory tests of the patients. I would like to make a variable interval30 with the values 

 

Interval30=1 if the drug claim is within 30 days before a test of this patient

Interval30=2 if the drug claim is within 30 days after a test of this patient

Interval30=0 if no drug claim is within 30 days before or after a test of this patient.

 

One person can have only lab test, only drug claims or both (in any number).

 

The resulting dataset should look like the following. 

 

Patient

Testname

Testdate ddmmyy

Drugclaim

Drugclaimdate

ddmmyy

Interval30

patienta

a

01/03/20

 

 

 

patienta

 

 

Druga

02/03/20

2

patienta

 

 

Druga

28/03/20

2

patienta

 

 

Druga

15/09/20

0

patientb

a

01/03/20

 

 

 

patientb

b

01/06/20

 

 

 

patientb

 

 

Druga

28/02/20

1

patientb

 

 

drugb

05/03/20

2

patientb

 

 

Druga

25/04/20

0

patientb

 

 

drugb

17/06/20

2

patientb

 

 

Druga

05/07/20

0

patientb

 

 

Druga

18/09/20

0

patientc

 

 

Druga

01/03/20

0

patientc

 

 

Drugb

01/04/20

0

patientc

 

 

drugc

01/05/20

0

 

Thank you for your help!

7 REPLIES 7
FreelanceReinh
Jade | Level 19

Hello @Nina4,

 

Let me first improve your data step creating the sample data:

data example1;
input Patient $ Testname $ Testdate :MMDDYY. claim $ Drugclaimdate :MMDDYY.;
format Testdate Drugclaimdate ddmmyy8.;
cards;
patienta a 03/01/20 . .
patienta . . druga 03/02/20
patienta . . druga 03/28/20
patienta . . druga 09/15/20
patientb a 03/01/20 . .
patientb b 06/01/20 . .
patientb . . druga 02/28/20
patientb . . drugb 03/05/20
patientb . . druga 04/25/20
patientb . . drugb 06/17/20
patientb . . druga 07/05/20
patientb . . druga 09/18/20
patientc . . druga 03/01/20
patientc . . drugb 04/01/20
patientc . . drugc 05/01/20
;

 

Variable interval30 could be derived from the elements of a temporary array which is indexed with SAS date values and populated with the values 1 or 2 depending on the test dates of the current patient. Example:

data want(drop=i);
set example1;
by patient;
array _t[0:%sysfunc(today())] _temporary_;
if first.patient then call missing(of _t[*]);
if n(testdate) then do i=0 to 30;
  _t[testdate-i]=1;
  _t[testdate+i]=2;
end;
if n(drugclaimdate) then interval30=max(_t[drugclaimdate],0);
run;

This works for your example data. The array _t above covers the dates between 0 = '01JAN1960'd and today. Instead of 0 and %sysfunc(today()) you can use expressions like %sysevalf('01JAN2019'd) to limit the array to a different date range. Obviously, the data step assumes that dataset example1 is sorted by patient and that the observations with the relevant test dates precede the observations with the drug claim dates.

 

You need to decide what interval30 should be if the criteria for values 1 and 2 are both met. For example, this occurs always (for a single test date) if drugclaimdate=testdate and the code above defines interval30=2 in this particular case, but this could be changed easily. In the case of overlapping 30-day ranges for different tests of the same patient, the suggested code overwrites the values of the patient's first test (in the dataset) with those of the subsequent tests (e.g., if test b of patientb occurred on 15 Apr 2020). Again, this could be changed to implement a different rule, e.g., always prioritizing value 2.

PaigeMiller
Diamond | Level 26

Why is %SYSFUNC needed in this code?

--
Paige Miller
Tom
Super User Tom
Super User

@PaigeMiller wrote:

Why is %SYSFUNC needed in this code?


Because you need an actual number when specifying the dimensions of an array.  You cannot have a function call there.

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @Nina4 

 

Here is a different approach. Test dates are transposed per patient and read into an array, and each claim is tested against the list of test dates for the current patient. The tests for days-before and days-after are stored i 2 different variables, and then the resulting Interval30 is computed. 

 

In this code days-after (2) has priority, so 2 and 1 makes 2, but it is very easy to change, e.g. make a new category 3 if both days-before and days-after are true.

 

The first data step reproduces your wanted output, but I took the liberty to write an alternative data step containing claim-resords only with the test dates included, that caused the Interval30 variable to be set for the current claim as a suggestion, because I would prefer this structure as a base for further analysis.

 

data example1;
input Patient $ Testname $ Testdate :MMDDYY. claim $ Drugclaimdate :MMDDYY.;
format Testdate Drugclaimdate ddmmyy8.;
cards;
patienta a 03/01/20 . .
patienta . . druga 03/02/20
patienta . . druga 03/28/20
patienta . . druga 09/15/20
patientb a 03/01/20 . .
patientb b 06/01/20 . .
patientb . . druga 02/28/20
patientb . . drugb 03/05/20
patientb . . druga 04/25/20
patientb . . drugb 06/17/20
patientb . . druga 07/05/20
patientb . . druga 09/18/20
patientc . . druga 03/01/20
patientc . . drugb 04/01/20
patientc . . drugc 05/01/20
;

proc transpose 
    data=example1 (keep=Patient Testdate where=(not missing(testdate))) 
    out=test (drop=_name_);
  ; 
  by Patient;
run;

* Reproduces your wanted output 1:1;
data want (drop=i prev post COL:);
  merge example1 test; 
  by Patient;
  array test COL:;
  call sortn(of test{*});
  if missing(Testdate) then do i = 1 to dim(test);
    if not missing(test{i}) then do;
      if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then prev = 1;
      if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then post = 2;
     end;
  end;
  Interval30 = max(0, prev, post);
run;

* An alternative;
data want2 (drop=i testname testdate prev post COL:);
  merge example1 (where=(missing(Testdate))) test; 
  by Patient;
  format testdate_before testdate_after ddmmyy10.;
  array test COL:;
  call sortn(of test{*});
  do i = 1 to dim(test);
    if not missing(test{i}) then do;
      if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then do;
        testdate_before = test{i};
        prev = 1;
      end;
      if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then do;
        testdate_after = test{i};
        post = 2;
      end;
    end;
  end;
  Interval30 = max(0, prev, post);
run;

 

s_lassen
Meteorite | Level 14

I think you get the simplest solution using SQL, e.g.

proc sql;
create table want as select
  *,
  case
    when Drugclaimdate is null then ' '
	when exists(select * from Example1 tests
	             where patient=full.patient
				   and testdate-full.drugclaimdate between -30 and 0) then '1'
	when exists(select * from Example1 tests
	             where patient=full.patient
				   and testdate-full.drugclaimdate between 1 and 30) then '2'
	else '0'
  end as Interval30
  from example1 full
  order by patient,drugclaimdate,testdate;

But if your input table is very large, some of the other solutions shown may work faster (perhaps, I have not tested).

You may want to adjust the interval specifications according to your exact demands (you do not specify how to classify records with a test on the same day as the drug claim).

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @s_lassen 

 

I was curious, so I did the test on a similar data set with 9999 patients. The solution with Proc Transpose + Data Step used 0.2 second in total (0.1 second CPU), while the Proc SQL solution used more than half an hour CPU time. So while SQL seems simpler to some, the costs can be prohibitive.

 

But in a choice between several working solutions there are 3 factors to balance against each other in the given situation: 

  • Is the best solution the least ressource-consuming? 
  • or the solution that is easiest to understand and manitain?
  • or the solution using tools that one is accustomed to and know by heart?

 

259  proc sql;
260  create table want as
261    select
262      *,
263      case
264        when Drugclaimdate is null then ' '
265          when exists(
266            select * from Example1 as tests
267              where
268                patient=full.patient
269                and
270                testdate - full.drugclaimdate between -30 and 0
271          ) then '1'
272          when exists(
273            select * from Example1 as tests
274              where
275                patient=full.patient
276                and
277                testdate - full.drugclaimdate between 1 and 30
278          ) then '2'
279        else '0'
280      end as Interval30
281    from example1 as full
282    order by patient, drugclaimdate, testdate;
NOTE: Table WORK.WANT created, with 149985 rows and 7 columns.

283    quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           34:17.53
      cpu time            32:49.67


284
285    proc transpose
286      data=example1 (keep=Patient Testdate where=(not missing(testdate)))
287      out=test (drop=_name_);
288    ;
289    by Patient;
290  run;

NOTE: There were 29997 observations read from the data set WORK.EXAMPLE1.
      WHERE not MISSING(testdate);
NOTE: The data set WORK.TEST has 19998 observations and 3 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
      real time           0.09 seconds
      cpu time            0.04 seconds


291
292  * Reproduces your wanted output 1:1;
293  data want (drop=i prev post COL:);
294    merge example1 test;
295    by Patient;
296    array test COL:;
297    call sortn(of test{*});
298    if missing(Testdate) then do i = 1 to dim(test);
299      if not missing(test{i}) then do;
300        if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then prev = 1;
301        if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then post = 2;
302       end;
303    end;
304    Interval30 = max(0, prev, post);
305  run;

NOTE: There were 149985 observations read from the data set WORK.EXAMPLE1.
NOTE: There were 19998 observations read from the data set WORK.TEST.
NOTE: The data set WORK.WANT has 149985 observations and 7 variables.
NOTE: DATA statement used (Total process time):
      real time           0.09 seconds
      cpu time            0.06 seconds
s_lassen
Meteorite | Level 14

You may be able to speed up the SQL solution by indexing your data, e.g.

proc sql;
  create index patient on example1(patient):
quit;

or perhaps:

proc sql;
  create index test_idx on example1(patient,testdate):
quit;

With the latter index it may (depending on the savvy of the SQL interpreter), further improvements may also be possible by changing the EXISTS conditions to something like

where patient=full.patient
  and testdate between full.drugclaimdate+1 and  full.drugclaimdate+30

But if the performance is an issue, one of the other solutions may be better.

SAS Innovate 2025: Call for Content

Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!

Submit your idea!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 1259 views
  • 4 likes
  • 6 in conversation