BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
jspoend
Obsidian | Level 7

Hi there, 

 

I have two datasets:

1) all health care claims (drugs) of women who delivered during the study period (2015-2021)

2) pregnancy file 

 

The two files are structured as follows: 

 

Claims file: 

ID date_claim claim_code 

1 1JAN2015 ATC1

1 20FEB2015 ATC2

1 15JUL2016 ATC3

1 20SEP2017 ATC2

2 3JAN2017 ATC7

2 5FEB2018 ATC1

2 8MAR2019 ATC11

2 15AUG2020 ATC12

2 20DEC2021 ATC11

 

Pregnancy file: 

ID pregnancy_start pregnancy_end

1 29JAN2017 28OCT2017

2 30JAN2018 15OCT2018

2 25JUN2020 15MAR2021

 

My goal is to 1) combine the two datasets and to 2) exclude all claims that were not billed during a pregnancy and to 3) create a new ID per pregnancy (bc one patient may have contributed several pregnancies - so the analytic unit will be pregnancy not ID).

 

I'm not sure how to go about this - I'm mainly struggling with step 2. My first attempt was to combine the two datasets with a set command and then to impute the pregnancy_start and pregnancy_end for all lines in the dataset using the retain fuction by ID. However, given that one patient can contribute several pregnancies that does not work. I'm a bit stuck right now, so any input is appreciated. 

 

Thanks a lot in advance! Julia

1 ACCEPTED SOLUTION

Accepted Solutions
mkeintz
PROC Star

This is untested, in the absence of sample data in the form of a working DATA step.

 

data want (drop=_:);
  set pregnancy (keep=id pregnancy_start rename=(pregnancy_start=_refdate) in=inpreg)
      claims    (keep=id date_claim      rename=(date_claim=_refdate)      in=inclaim);

  by id _refdate;
  if inpreg then do;
    if id=lag(id) then preg_seq+1;
    else preg_seq=1;
    set pregnancy;
  end;
  if inclaim=1 then do;
    set claims;
    if pregnancy_start<=date_claim<=pregnancy_end then output;
  end;
  if last.id then call missing(of _all_);    *Added later, see note **;
run;

This assumes that dataset PREGNANCY is sorted by ID/PREGNANCY_START and CLAIMS is sorted by ID/DATE_CLAIM.

 

The first set statement establishes a sorted interleaving of the observations from the two data sets, but keeps only the two variables needed to validate the sort order.

 

If the two-variable obs-in-hand is a pregrancy obs, then SET the entire observation.  The pregnancy variables will be retained across many CLAIMS obs until the next pregnancy obs.  This is also an opportunity to make the new PREG_SEQ variable (for first pregnancy, second, .... etc.).

 

But if the two-variable obs-in-hand is a CLAIMS obs, then read in the whole obs, check its date against the pregnancy dates, and output accordingly.

 

NOTE:  I added the "if last.id then call missing(of _all_);" to prevent a pre-first-pregnancy claim from a given ID from accidentally qualifying in the date range of the last pregnancy of the prior ID.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

7 REPLIES 7
mkeintz
PROC Star

This is untested, in the absence of sample data in the form of a working DATA step.

 

data want (drop=_:);
  set pregnancy (keep=id pregnancy_start rename=(pregnancy_start=_refdate) in=inpreg)
      claims    (keep=id date_claim      rename=(date_claim=_refdate)      in=inclaim);

  by id _refdate;
  if inpreg then do;
    if id=lag(id) then preg_seq+1;
    else preg_seq=1;
    set pregnancy;
  end;
  if inclaim=1 then do;
    set claims;
    if pregnancy_start<=date_claim<=pregnancy_end then output;
  end;
  if last.id then call missing(of _all_);    *Added later, see note **;
run;

This assumes that dataset PREGNANCY is sorted by ID/PREGNANCY_START and CLAIMS is sorted by ID/DATE_CLAIM.

 

The first set statement establishes a sorted interleaving of the observations from the two data sets, but keeps only the two variables needed to validate the sort order.

 

If the two-variable obs-in-hand is a pregrancy obs, then SET the entire observation.  The pregnancy variables will be retained across many CLAIMS obs until the next pregnancy obs.  This is also an opportunity to make the new PREG_SEQ variable (for first pregnancy, second, .... etc.).

 

But if the two-variable obs-in-hand is a CLAIMS obs, then read in the whole obs, check its date against the pregnancy dates, and output accordingly.

 

NOTE:  I added the "if last.id then call missing(of _all_);" to prevent a pre-first-pregnancy claim from a given ID from accidentally qualifying in the date range of the last pregnancy of the prior ID.  

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
jspoend
Obsidian | Level 7
Hi,

Thanks a lot for this, I just implemented the code and it seems to do what I want it to do. Many thanks! That made my life a lot easier:)
jspoend
Obsidian | Level 7
Thanks Garnet, for the later added line, that did in fact sovle a problem I found in my data!
Kurt_Bremser
Super User

Simple solution: first, create the sequence number, then join with SQL:

data claims;
input ID $ date_claim :date9. claim_code $;
format date_claim yymmdd10.;
datalines; 
1 1JAN2015 ATC1
1 20FEB2015 ATC2
1 15JUL2016 ATC3
1 20SEP2017 ATC2
2 3JAN2017 ATC7
2 5FEB2018 ATC1
2 8MAR2019 ATC11
2 15AUG2020 ATC12
2 20DEC2021 ATC11
;

data pregnancy;
input ID $ (pregnancy_start pregnancy_end) (:date9.);
format pregnancy_start pregnancy_end yymmdd10.;
datalines;
1 29JAN2017 28OCT2017
2 30JAN2018 15OCT2018
2 25JUN2020 15MAR2021
;

data pregnancy_seq;
set pregnancy;
by id;
if first.id
then seq = 1;
else seq + 1;
run;

proc sql;
create table want as
  select
    t1.*,
    t2.date_claim,
    t2.claim_code
  from pregnancy t1 left join claims t2
  on t1.id = t2.id and t1.pregnancy_start le t2.date_claim le t1.pregnancy_end
;
quit;
jspoend
Obsidian | Level 7
Hi Kurt,

Thanks a lot for the fast reply. I'm struggling to understand: how would I combine the two if I only create the sequence number in the pregnancy file? Then I could still not identify which claims from the claims file belong to this pregnancy, or do I misunderstand?
Kurt_Bremser
Super User

I made a mistake when posting, the SQL needs to use the "sequenced" dataset:

proc sql;
create table want as
  select
    t1.*,
    t2.date_claim,
    t2.claim_code
  from pregnancy_seq t1 left join claims t2
  on t1.id = t2.id and t1.pregnancy_start le t2.date_claim le t1.pregnancy_end
;
quit;

Both queries create the same observations, just that the sequence number was missing in the previous code. But you could still identify pregnancies by start and end dates.

jspoend
Obsidian | Level 7
Thanks a lot, I'll implement this tomorrow to check if it works in my dataset! Always great to learn something new!

SAS Innovate 2025: Register Now

Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 1188 views
  • 1 like
  • 3 in conversation