Solved: Claims file with several lines per ID: identify claims during pregnanc...

jspoend · Posted 04-13-2022 10:12 AM

Hi there,

I have two datasets:

1) all health care claims (drugs) of women who delivered during the study period (2015-2021)

2) pregnancy file

The two files are structured as follows:

Claims file:

ID date_claim claim_code

1 1JAN2015 ATC1

1 20FEB2015 ATC2

1 15JUL2016 ATC3

1 20SEP2017 ATC2

2 3JAN2017 ATC7

2 5FEB2018 ATC1

2 8MAR2019 ATC11

2 15AUG2020 ATC12

2 20DEC2021 ATC11

Pregnancy file:

ID pregnancy_start pregnancy_end

1 29JAN2017 28OCT2017

2 30JAN2018 15OCT2018

2 25JUN2020 15MAR2021

My goal is to 1) combine the two datasets and to 2) exclude all claims that were not billed during a pregnancy and to 3) create a new ID per pregnancy (bc one patient may have contributed several pregnancies - so the analytic unit will be pregnancy not ID).

I'm not sure how to go about this - I'm mainly struggling with step 2. My first attempt was to combine the two datasets with a set command and then to impute the pregnancy_start and pregnancy_end for all lines in the dataset using the retain fuction by ID. However, given that one patient can contribute several pregnancies that does not work. I'm a bit stuck right now, so any input is appreciated.

Thanks a lot in advance! Julia

mkeintz · Posted 04-13-2022 10:32 AM

This is untested, in the absence of sample data in the form of a working DATA step.

data want (drop=_:);
  set pregnancy (keep=id pregnancy_start rename=(pregnancy_start=_refdate) in=inpreg)
      claims    (keep=id date_claim      rename=(date_claim=_refdate)      in=inclaim);

  by id _refdate;
  if inpreg then do;
    if id=lag(id) then preg_seq+1;
    else preg_seq=1;
    set pregnancy;
  end;
  if inclaim=1 then do;
    set claims;
    if pregnancy_start<=date_claim<=pregnancy_end then output;
  end;
  if last.id then call missing(of _all_);    *Added later, see note **;
run;

This assumes that dataset PREGNANCY is sorted by ID/PREGNANCY_START and CLAIMS is sorted by ID/DATE_CLAIM.

The first set statement establishes a sorted interleaving of the observations from the two data sets, but keeps only the two variables needed to validate the sort order.

If the two-variable obs-in-hand is a pregrancy obs, then SET the entire observation. The pregnancy variables will be retained across many CLAIMS obs until the next pregnancy obs. This is also an opportunity to make the new PREG_SEQ variable (for first pregnancy, second, .... etc.).

But if the two-variable obs-in-hand is a CLAIMS obs, then read in the whole obs, check its date against the pregnancy dates, and output accordingly.

NOTE: I added the "if last.id then call missing(of _all_);" to prevent a pre-first-pregnancy claim from a given ID from accidentally qualifying in the date range of the last pregnancy of the prior ID.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 04-13-2022 10:32 AM

This is untested, in the absence of sample data in the form of a working DATA step.

data want (drop=_:);
  set pregnancy (keep=id pregnancy_start rename=(pregnancy_start=_refdate) in=inpreg)
      claims    (keep=id date_claim      rename=(date_claim=_refdate)      in=inclaim);

  by id _refdate;
  if inpreg then do;
    if id=lag(id) then preg_seq+1;
    else preg_seq=1;
    set pregnancy;
  end;
  if inclaim=1 then do;
    set claims;
    if pregnancy_start<=date_claim<=pregnancy_end then output;
  end;
  if last.id then call missing(of _all_);    *Added later, see note **;
run;

This assumes that dataset PREGNANCY is sorted by ID/PREGNANCY_START and CLAIMS is sorted by ID/DATE_CLAIM.

The first set statement establishes a sorted interleaving of the observations from the two data sets, but keeps only the two variables needed to validate the sort order.

If the two-variable obs-in-hand is a pregrancy obs, then SET the entire observation. The pregnancy variables will be retained across many CLAIMS obs until the next pregnancy obs. This is also an opportunity to make the new PREG_SEQ variable (for first pregnancy, second, .... etc.).

But if the two-variable obs-in-hand is a CLAIMS obs, then read in the whole obs, check its date against the pregnancy dates, and output accordingly.

NOTE: I added the "if last.id then call missing(of _all_);" to prevent a pre-first-pregnancy claim from a given ID from accidentally qualifying in the date range of the last pregnancy of the prior ID.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

jspoend · Posted 04-13-2022 11:18 AM

Hi,

Thanks a lot for this, I just implemented the code and it seems to do what I want it to do. Many thanks! That made my life a lot easier:)

jspoend · Posted 05-02-2022 06:37 AM

Thanks Garnet, for the later added line, that did in fact sovle a problem I found in my data!

Kurt_Bremser · Posted 04-13-2022 10:57 AM

Simple solution: first, create the sequence number, then join with SQL:

data claims;
input ID $ date_claim :date9. claim_code $;
format date_claim yymmdd10.;
datalines; 
1 1JAN2015 ATC1
1 20FEB2015 ATC2
1 15JUL2016 ATC3
1 20SEP2017 ATC2
2 3JAN2017 ATC7
2 5FEB2018 ATC1
2 8MAR2019 ATC11
2 15AUG2020 ATC12
2 20DEC2021 ATC11
;

data pregnancy;
input ID $ (pregnancy_start pregnancy_end) (:date9.);
format pregnancy_start pregnancy_end yymmdd10.;
datalines;
1 29JAN2017 28OCT2017
2 30JAN2018 15OCT2018
2 25JUN2020 15MAR2021
;

data pregnancy_seq;
set pregnancy;
by id;
if first.id
then seq = 1;
else seq + 1;
run;

proc sql;
create table want as
  select
    t1.*,
    t2.date_claim,
    t2.claim_code
  from pregnancy t1 left join claims t2
  on t1.id = t2.id and t1.pregnancy_start le t2.date_claim le t1.pregnancy_end
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

jspoend · Posted 04-13-2022 11:09 AM

Hi Kurt,

Thanks a lot for the fast reply. I'm struggling to understand: how would I combine the two if I only create the sequence number in the pregnancy file? Then I could still not identify which claims from the claims file belong to this pregnancy, or do I misunderstand?

Kurt_Bremser · Posted 04-13-2022 11:16 AM

I made a mistake when posting, the SQL needs to use the "sequenced" dataset:

proc sql;
create table want as
  select
    t1.*,
    t2.date_claim,
    t2.claim_code
  from pregnancy_seq t1 left join claims t2
  on t1.id = t2.id and t1.pregnancy_start le t2.date_claim le t1.pregnancy_end
;
quit;

Both queries create the same observations, just that the sequence number was missing in the previous code. But you could still identify pregnancies by start and end dates.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

jspoend · Posted 04-13-2022 11:20 AM

Thanks a lot, I'll implement this tomorrow to check if it works in my dataset! Always great to learn something new!

Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Re: Claims file with several lines per ID: identify claims during pregnancy defined in separate file

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away