Hi there,
I have two datasets:
1) all health care claims (drugs) of women who delivered during the study period (2015-2021)
2) pregnancy file
The two files are structured as follows:
Claims file:
ID date_claim claim_code
1 1JAN2015 ATC1
1 20FEB2015 ATC2
1 15JUL2016 ATC3
1 20SEP2017 ATC2
2 3JAN2017 ATC7
2 5FEB2018 ATC1
2 8MAR2019 ATC11
2 15AUG2020 ATC12
2 20DEC2021 ATC11
Pregnancy file:
ID pregnancy_start pregnancy_end
1 29JAN2017 28OCT2017
2 30JAN2018 15OCT2018
2 25JUN2020 15MAR2021
My goal is to 1) combine the two datasets and to 2) exclude all claims that were not billed during a pregnancy and to 3) create a new ID per pregnancy (bc one patient may have contributed several pregnancies - so the analytic unit will be pregnancy not ID).
I'm not sure how to go about this - I'm mainly struggling with step 2. My first attempt was to combine the two datasets with a set command and then to impute the pregnancy_start and pregnancy_end for all lines in the dataset using the retain fuction by ID. However, given that one patient can contribute several pregnancies that does not work. I'm a bit stuck right now, so any input is appreciated.
Thanks a lot in advance! Julia
This is untested, in the absence of sample data in the form of a working DATA step.
data want (drop=_:);
set pregnancy (keep=id pregnancy_start rename=(pregnancy_start=_refdate) in=inpreg)
claims (keep=id date_claim rename=(date_claim=_refdate) in=inclaim);
by id _refdate;
if inpreg then do;
if id=lag(id) then preg_seq+1;
else preg_seq=1;
set pregnancy;
end;
if inclaim=1 then do;
set claims;
if pregnancy_start<=date_claim<=pregnancy_end then output;
end;
if last.id then call missing(of _all_); *Added later, see note **;
run;
This assumes that dataset PREGNANCY is sorted by ID/PREGNANCY_START and CLAIMS is sorted by ID/DATE_CLAIM.
The first set statement establishes a sorted interleaving of the observations from the two data sets, but keeps only the two variables needed to validate the sort order.
If the two-variable obs-in-hand is a pregrancy obs, then SET the entire observation. The pregnancy variables will be retained across many CLAIMS obs until the next pregnancy obs. This is also an opportunity to make the new PREG_SEQ variable (for first pregnancy, second, .... etc.).
But if the two-variable obs-in-hand is a CLAIMS obs, then read in the whole obs, check its date against the pregnancy dates, and output accordingly.
NOTE: I added the "if last.id then call missing(of _all_);" to prevent a pre-first-pregnancy claim from a given ID from accidentally qualifying in the date range of the last pregnancy of the prior ID.
This is untested, in the absence of sample data in the form of a working DATA step.
data want (drop=_:);
set pregnancy (keep=id pregnancy_start rename=(pregnancy_start=_refdate) in=inpreg)
claims (keep=id date_claim rename=(date_claim=_refdate) in=inclaim);
by id _refdate;
if inpreg then do;
if id=lag(id) then preg_seq+1;
else preg_seq=1;
set pregnancy;
end;
if inclaim=1 then do;
set claims;
if pregnancy_start<=date_claim<=pregnancy_end then output;
end;
if last.id then call missing(of _all_); *Added later, see note **;
run;
This assumes that dataset PREGNANCY is sorted by ID/PREGNANCY_START and CLAIMS is sorted by ID/DATE_CLAIM.
The first set statement establishes a sorted interleaving of the observations from the two data sets, but keeps only the two variables needed to validate the sort order.
If the two-variable obs-in-hand is a pregrancy obs, then SET the entire observation. The pregnancy variables will be retained across many CLAIMS obs until the next pregnancy obs. This is also an opportunity to make the new PREG_SEQ variable (for first pregnancy, second, .... etc.).
But if the two-variable obs-in-hand is a CLAIMS obs, then read in the whole obs, check its date against the pregnancy dates, and output accordingly.
NOTE: I added the "if last.id then call missing(of _all_);" to prevent a pre-first-pregnancy claim from a given ID from accidentally qualifying in the date range of the last pregnancy of the prior ID.
Simple solution: first, create the sequence number, then join with SQL:
data claims;
input ID $ date_claim :date9. claim_code $;
format date_claim yymmdd10.;
datalines;
1 1JAN2015 ATC1
1 20FEB2015 ATC2
1 15JUL2016 ATC3
1 20SEP2017 ATC2
2 3JAN2017 ATC7
2 5FEB2018 ATC1
2 8MAR2019 ATC11
2 15AUG2020 ATC12
2 20DEC2021 ATC11
;
data pregnancy;
input ID $ (pregnancy_start pregnancy_end) (:date9.);
format pregnancy_start pregnancy_end yymmdd10.;
datalines;
1 29JAN2017 28OCT2017
2 30JAN2018 15OCT2018
2 25JUN2020 15MAR2021
;
data pregnancy_seq;
set pregnancy;
by id;
if first.id
then seq = 1;
else seq + 1;
run;
proc sql;
create table want as
select
t1.*,
t2.date_claim,
t2.claim_code
from pregnancy t1 left join claims t2
on t1.id = t2.id and t1.pregnancy_start le t2.date_claim le t1.pregnancy_end
;
quit;
I made a mistake when posting, the SQL needs to use the "sequenced" dataset:
proc sql;
create table want as
select
t1.*,
t2.date_claim,
t2.claim_code
from pregnancy_seq t1 left join claims t2
on t1.id = t2.id and t1.pregnancy_start le t2.date_claim le t1.pregnancy_end
;
quit;
Both queries create the same observations, just that the sequence number was missing in the previous code. But you could still identify pregnancies by start and end dates.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.