data example1;
input Patient $ Testname $ Testdate MMDDYY8. claim $ Drugclaimdate MMDDYY8.;
cards;
patienta a 03/01/20 ' ' .
patienta ' ' . druga 03/02/20
patienta ' ' . druga 03/28/20
patienta ' ' . druga 09/15/20
patientb a 03/01/20 ' ' .
patientb b 06/01/20 ' ' .
patientb ' ' . druga 02/28/20
patientb ' ' . drugb 03/05/20
patientb ' ' . druga 04/25/20
patientb ' ' . drugb 06/17/20
patientb ' ' . druga 07/05/20
patientb ' ' . druga 09/18/20
patientc ' ' . druga 03/01/20
patientc ' ' . drugb 04/01/20
patientc ' ' . drugc 05/01/20
;
run;
Hello dear SAS community
I need some help with the following question. I have a dataset with patients and their drug claims. Moreover, in this dataset are certain laboratory tests of the patients. I would like to make a variable interval30 with the values
Interval30=1 if the drug claim is within 30 days before a test of this patient
Interval30=2 if the drug claim is within 30 days after a test of this patient
Interval30=0 if no drug claim is within 30 days before or after a test of this patient.
One person can have only lab test, only drug claims or both (in any number).
The resulting dataset should look like the following.
Patient | Testname | Testdate ddmmyy | Drugclaim | Drugclaimdate ddmmyy | Interval30 |
patienta | a | 01/03/20 |
|
|
|
patienta |
|
| Druga | 02/03/20 | 2 |
patienta |
|
| Druga | 28/03/20 | 2 |
patienta |
|
| Druga | 15/09/20 | 0 |
patientb | a | 01/03/20 |
|
|
|
patientb | b | 01/06/20 |
|
|
|
patientb |
|
| Druga | 28/02/20 | 1 |
patientb |
|
| drugb | 05/03/20 | 2 |
patientb |
|
| Druga | 25/04/20 | 0 |
patientb |
|
| drugb | 17/06/20 | 2 |
patientb |
|
| Druga | 05/07/20 | 0 |
patientb |
|
| Druga | 18/09/20 | 0 |
patientc |
|
| Druga | 01/03/20 | 0 |
patientc |
|
| Drugb | 01/04/20 | 0 |
patientc |
|
| drugc | 01/05/20 | 0 |
Thank you for your help!
Hello @Nina4,
Let me first improve your data step creating the sample data:
data example1;
input Patient $ Testname $ Testdate :MMDDYY. claim $ Drugclaimdate :MMDDYY.;
format Testdate Drugclaimdate ddmmyy8.;
cards;
patienta a 03/01/20 . .
patienta . . druga 03/02/20
patienta . . druga 03/28/20
patienta . . druga 09/15/20
patientb a 03/01/20 . .
patientb b 06/01/20 . .
patientb . . druga 02/28/20
patientb . . drugb 03/05/20
patientb . . druga 04/25/20
patientb . . drugb 06/17/20
patientb . . druga 07/05/20
patientb . . druga 09/18/20
patientc . . druga 03/01/20
patientc . . drugb 04/01/20
patientc . . drugc 05/01/20
;
Variable interval30 could be derived from the elements of a temporary array which is indexed with SAS date values and populated with the values 1 or 2 depending on the test dates of the current patient. Example:
data want(drop=i);
set example1;
by patient;
array _t[0:%sysfunc(today())] _temporary_;
if first.patient then call missing(of _t[*]);
if n(testdate) then do i=0 to 30;
_t[testdate-i]=1;
_t[testdate+i]=2;
end;
if n(drugclaimdate) then interval30=max(_t[drugclaimdate],0);
run;
This works for your example data. The array _t above covers the dates between 0 = '01JAN1960'd and today. Instead of 0 and %sysfunc(today()) you can use expressions like %sysevalf('01JAN2019'd) to limit the array to a different date range. Obviously, the data step assumes that dataset example1 is sorted by patient and that the observations with the relevant test dates precede the observations with the drug claim dates.
You need to decide what interval30 should be if the criteria for values 1 and 2 are both met. For example, this occurs always (for a single test date) if drugclaimdate=testdate and the code above defines interval30=2 in this particular case, but this could be changed easily. In the case of overlapping 30-day ranges for different tests of the same patient, the suggested code overwrites the values of the patient's first test (in the dataset) with those of the subsequent tests (e.g., if test b of patientb occurred on 15 Apr 2020). Again, this could be changed to implement a different rule, e.g., always prioritizing value 2.
Why is %SYSFUNC needed in this code?
@PaigeMiller wrote:
Why is %SYSFUNC needed in this code?
Because you need an actual number when specifying the dimensions of an array. You cannot have a function call there.
Hi @Nina4
Here is a different approach. Test dates are transposed per patient and read into an array, and each claim is tested against the list of test dates for the current patient. The tests for days-before and days-after are stored i 2 different variables, and then the resulting Interval30 is computed.
In this code days-after (2) has priority, so 2 and 1 makes 2, but it is very easy to change, e.g. make a new category 3 if both days-before and days-after are true.
The first data step reproduces your wanted output, but I took the liberty to write an alternative data step containing claim-resords only with the test dates included, that caused the Interval30 variable to be set for the current claim as a suggestion, because I would prefer this structure as a base for further analysis.
data example1;
input Patient $ Testname $ Testdate :MMDDYY. claim $ Drugclaimdate :MMDDYY.;
format Testdate Drugclaimdate ddmmyy8.;
cards;
patienta a 03/01/20 . .
patienta . . druga 03/02/20
patienta . . druga 03/28/20
patienta . . druga 09/15/20
patientb a 03/01/20 . .
patientb b 06/01/20 . .
patientb . . druga 02/28/20
patientb . . drugb 03/05/20
patientb . . druga 04/25/20
patientb . . drugb 06/17/20
patientb . . druga 07/05/20
patientb . . druga 09/18/20
patientc . . druga 03/01/20
patientc . . drugb 04/01/20
patientc . . drugc 05/01/20
;
proc transpose
data=example1 (keep=Patient Testdate where=(not missing(testdate)))
out=test (drop=_name_);
;
by Patient;
run;
* Reproduces your wanted output 1:1;
data want (drop=i prev post COL:);
merge example1 test;
by Patient;
array test COL:;
call sortn(of test{*});
if missing(Testdate) then do i = 1 to dim(test);
if not missing(test{i}) then do;
if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then prev = 1;
if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then post = 2;
end;
end;
Interval30 = max(0, prev, post);
run;
* An alternative;
data want2 (drop=i testname testdate prev post COL:);
merge example1 (where=(missing(Testdate))) test;
by Patient;
format testdate_before testdate_after ddmmyy10.;
array test COL:;
call sortn(of test{*});
do i = 1 to dim(test);
if not missing(test{i}) then do;
if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then do;
testdate_before = test{i};
prev = 1;
end;
if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then do;
testdate_after = test{i};
post = 2;
end;
end;
end;
Interval30 = max(0, prev, post);
run;
I think you get the simplest solution using SQL, e.g.
proc sql;
create table want as select
*,
case
when Drugclaimdate is null then ' '
when exists(select * from Example1 tests
where patient=full.patient
and testdate-full.drugclaimdate between -30 and 0) then '1'
when exists(select * from Example1 tests
where patient=full.patient
and testdate-full.drugclaimdate between 1 and 30) then '2'
else '0'
end as Interval30
from example1 full
order by patient,drugclaimdate,testdate;
But if your input table is very large, some of the other solutions shown may work faster (perhaps, I have not tested).
You may want to adjust the interval specifications according to your exact demands (you do not specify how to classify records with a test on the same day as the drug claim).
Hi @s_lassen
I was curious, so I did the test on a similar data set with 9999 patients. The solution with Proc Transpose + Data Step used 0.2 second in total (0.1 second CPU), while the Proc SQL solution used more than half an hour CPU time. So while SQL seems simpler to some, the costs can be prohibitive.
But in a choice between several working solutions there are 3 factors to balance against each other in the given situation:
259 proc sql; 260 create table want as 261 select 262 *, 263 case 264 when Drugclaimdate is null then ' ' 265 when exists( 266 select * from Example1 as tests 267 where 268 patient=full.patient 269 and 270 testdate - full.drugclaimdate between -30 and 0 271 ) then '1' 272 when exists( 273 select * from Example1 as tests 274 where 275 patient=full.patient 276 and 277 testdate - full.drugclaimdate between 1 and 30 278 ) then '2' 279 else '0' 280 end as Interval30 281 from example1 as full 282 order by patient, drugclaimdate, testdate; NOTE: Table WORK.WANT created, with 149985 rows and 7 columns. 283 quit; NOTE: PROCEDURE SQL used (Total process time): real time 34:17.53 cpu time 32:49.67 284 285 proc transpose 286 data=example1 (keep=Patient Testdate where=(not missing(testdate))) 287 out=test (drop=_name_); 288 ; 289 by Patient; 290 run; NOTE: There were 29997 observations read from the data set WORK.EXAMPLE1. WHERE not MISSING(testdate); NOTE: The data set WORK.TEST has 19998 observations and 3 variables. NOTE: PROCEDURE TRANSPOSE used (Total process time): real time 0.09 seconds cpu time 0.04 seconds 291 292 * Reproduces your wanted output 1:1; 293 data want (drop=i prev post COL:); 294 merge example1 test; 295 by Patient; 296 array test COL:; 297 call sortn(of test{*}); 298 if missing(Testdate) then do i = 1 to dim(test); 299 if not missing(test{i}) then do; 300 if Drugclaimdate >= test{i}-30 and Drugclaimdate <= test{i} then prev = 1; 301 if Drugclaimdate >= test{i} and Drugclaimdate <= test{i}+30 then post = 2; 302 end; 303 end; 304 Interval30 = max(0, prev, post); 305 run; NOTE: There were 149985 observations read from the data set WORK.EXAMPLE1. NOTE: There were 19998 observations read from the data set WORK.TEST. NOTE: The data set WORK.WANT has 149985 observations and 7 variables. NOTE: DATA statement used (Total process time): real time 0.09 seconds cpu time 0.06 seconds
You may be able to speed up the SQL solution by indexing your data, e.g.
proc sql;
create index patient on example1(patient):
quit;
or perhaps:
proc sql;
create index test_idx on example1(patient,testdate):
quit;
With the latter index it may (depending on the savvy of the SQL interpreter), further improvements may also be possible by changing the EXISTS conditions to something like
where patient=full.patient
and testdate between full.drugclaimdate+1 and full.drugclaimdate+30
But if the performance is an issue, one of the other solutions may be better.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.