Solved: Removing multiple duplicate observations based on values in earlier ob...

twerwath · Posted 05-10-2021 12:15 PM

Hello,

I am working with medical claims data. The dataset contains observations (claims, which represent an encounter with a medical professional) from two sources: "A" which are outpatient encounters (i.e., doctors visits) and "B" which are records of inpatient hospital admissions. The "A" outpatient dataset is further divided into type: "2" which are emergency room visits and "3" which are regular doctors visits (inpatient records are type "1").

The problem I am encountering is that each day of the hospital stay in the "B" inpatient data source is also listed as a outpatient doctor's visit in the "A" outpatient dataset. If the hospital admission included an emergency room visit, there is also a record in "A" for that. So, in that case there would be three records for the first day (one inpatient, one emergency, one outpatient) and one outpatient record for each subsequent day of the hospital stay. I want to keep the inpatient and emergency room records, while deleting the outpatient records associated with the hospital stay (but keeping other unrelated outpatient records).

I know this is complicated so here is an example:

data have;
input source id type begin_date end_date

datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29JAN2019 .
A 1 3 29JAN2019 .
A 1 3 30JAN2019 .
A 1 3 31JAN2019 .
A 1 3 01FEB2019 .
A 1 3 02FEB2019 .
A 1 3 03FEB2019 .
A 1 3 04FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
A 2 3 01FEB2019 .
A 2 3 02FEB2019 .
A 2 3 03FEB2019 .

;

run;

This is how I would want the end result to look:

data want;
input source id type begin_date end_date

datalines;

A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .

;

run;

I have tried for a long time to figure this out using the lag command, but without luck so far. Any help would be greatly appreciated! Thank you.

mklangley · Posted 05-10-2021 04:51 PM

Adding one line to what I originally sent you should do the trick.

proc sql;
    create table joined as
    select h.*
          ,o.begin_date as inpatient_begin_date
          ,o.end_date as inpatient_end_date
    from have h
    left join inpatient o
        on h.id = o.id
        and h.begin_date between o.begin_date and o.end_date
    order by h.id, h.begin_date, h.type;
quit;

Output:

View solution in original post

mklangley · Posted 05-10-2021 02:10 PM

The "B" outpatient dataset is further divided into type: "2" which are emergency room visits and "3" which are regular doctors visits.

Since your sample data has types 2 and 3 for A, I am assuming you swapped "A" and "B" in your data (so that "A" is outpatient and "B" is inpatient).

With that assumption, your question and data seem to make sense. See if this is what you're looking for:

data have;
input source $ id $ type $ begin_date :date9. end_date :date9.;
format begin_date end_date date9.;
datalines;
A 1 3 28JAN2019 . 
B 1 1 29JAN2019 04FEB2019 
A 1 2 29JAN2019 . 
A 1 3 29JAN2019 . 
A 1 3 30JAN2019 . 
A 1 3 31JAN2019 . 
A 1 3 01FEB2019 . 
A 1 3 02FEB2019 . 
A 1 3 03FEB2019 . 
A 1 3 04FEB2019 . 
A 1 3 05FEB2019 . 
A 2 3 28JAN2019 . 
A 2 3 30JAN2019 . 
A 2 3 31JAN2019 . 
B 2 1 01FEB2019 03FEB2019 
A 2 2 01FEB2019 . 
A 2 3 01FEB2019 . 
A 2 3 02FEB2019 . 
A 2 3 03FEB2019 . 
;
run;

data inpatient;
    set have (where=(source='B'));
run;

proc sql;
    create table joined as
    select h.*
          ,o.begin_date as inpatient_begin_date
          ,o.end_date as inpatient_end_date
    from have h
    left join inpatient o
        on h.id = o.id
    order by h.id, h.begin_date, h.type;
quit;

data want (drop=inpatient:);
    set joined;
    if not(source = 'A'
           and type = '3'
           and inpatient_begin_date le begin_date le inpatient_end_date)
        then output;
run;

twerwath · Posted 05-10-2021 03:21 PM

Thank you for the quick response.

Your assumption is correct, that was a typo. I have corrected it above.

I ran this and all it seemed to do was create duplicates of each observation. But thank you for trying! I wish I could explain why but I have a hard time understanding exactly what you did here.

mklangley · Posted 05-10-2021 03:30 PM

Where are you seeing duplicated observations?

Using the exact code I supplied above, if you run this:

proc print data=want; run;

It results in this, which appears to match your desired output:

twerwath · Posted 05-10-2021 04:26 PM

My apologies, I should have stated that yes, this works written exactly as above. My problem is that when I revise the code to fit my actual dataset, it does not work. The problem appears to be because a patient can have more than one inpatient stay (which wasn't indicated in my original post). This causes the inpatient_begin_date and inpatient_end_date columns to not function as intended. For example, if a patient has two hospital stays, the inpatient_begin_date and inpatient_end_date columns will have alternating values between the first stay and second stay. I hope this makes sense.

To better illustrate, here is a revised example:

data have;
input source id type begin_date end_date

datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29JAN2019 .
A 1 3 29JAN2019 .
A 1 3 30JAN2019 .
A 1 3 31JAN2019 .
A 1 3 01FEB2019 .
A 1 3 02FEB2019 .
A 1 3 03FEB2019 .
A 1 3 04FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
A 2 3 01FEB2019 .
A 2 3 02FEB2019 .
A 2 3 03FEB2019 .

B 2 1 05FEB2019 07FEB2019
A 2 2 05FEB2019 .
A 2 3 05FEB2019 .
A 2 3 06FEB2019 .
A 2 3 07FEB2019 .

;

run;

This is how I would want the end result to look:

data want;
input source id type begin_date end_date

datalines;

A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .

B 2 1 05FEB2019 07FEB2019
A 2 2 05FEB2019 .

;

run;

mklangley · Posted 05-10-2021 04:51 PM

Adding one line to what I originally sent you should do the trick.

proc sql;
    create table joined as
    select h.*
          ,o.begin_date as inpatient_begin_date
          ,o.end_date as inpatient_end_date
    from have h
    left join inpatient o
        on h.id = o.id
        and h.begin_date between o.begin_date and o.end_date
    order by h.id, h.begin_date, h.type;
quit;

Output:

mkeintz · Posted 05-10-2021 04:44 PM

Since your data are sorted by ID/BEGIN_DATE (but not neccessarily sorted by source within date), you can solve this in one data step:

Read the data set as two intereleaved subsets ( where=(source='B') followed by where=(source^='B'). They are interleaved by ID/BEGIN_DATE with the "B" always preceding any A's.
Keep a history record of all dates encountered prior to the current record.
output all 'B'. and all 'A' type 2 (emergency) and all other A's in which the date has no record yet.

data have;
  input source $ id $ type $ begin_date :date9. end_date :date9.;
  format begin_date end_date date9.;
datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29JAN2019 .
A 1 3 29JAN2019 . 
A 1 3 30JAN2019 . 
A 1 3 31JAN2019 . 
A 1 3 01FEB2019 . 
A 1 3 02FEB2019 . 
A 1 3 03FEB2019 . 
A 1 3 04FEB2019 . 
A 1 3 05FEB2019 .
A 2 3 28JAN2019 . 
A 2 3 30JAN2019 . 
A 2 3 31JAN2019 . 
B 2 1 01FEB2019 03FEB2019 
A 2 2 01FEB2019 . 
A 2 3 01FEB2019 . 
A 2 3 02FEB2019 . 
A 2 3 03FEB2019 . 
run;


data want (drop=_:);
  set have (where=(source='B'))
      have (where=(source^='B'));
  by id begin_date;

  array history {%sysevalf("01jan2018"d):%sysevalf("31dec2020"d)} _temporary_;
  if first.id then call missing(of history{*});

  if source='B' then do;
    output;
    do _d=begin_date to end_date; history{_d}=1; end;
  end;
  else if type='2' then output;
  else if history{begin_date}=. then output;
  history{begin_date}=1;
run;

BTW, if you are likely to have two A records (one regular followed by one emergency) that does not overlap a B record, this will produce two A records. If you want to force it to output only 1 A record in this case, change the SET statement to:

 set have (where=(source='B'))
     have (where=(source='A' and type='2'))
     have (where=(not(source='B' or (source='A' and type='2')))) ;

which will force, for any given id/date, the B record to be first, followed by the A2, followed by all other records.

Of course you can change the date range of the history array to accommodate your data.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

twerwath · Posted 05-11-2021 08:55 AM

I can confirm that both methods work! Thanks to you both for solving this issue!

Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Re: Removing multiple duplicate observations based on values in earlier observations

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away