Hello,
I am working with medical claims data. The dataset contains observations (claims, which represent an encounter with a medical professional) from two sources: "A" which are outpatient encounters (i.e., doctors visits) and "B" which are records of inpatient hospital admissions. The "A" outpatient dataset is further divided into type: "2" which are emergency room visits and "3" which are regular doctors visits (inpatient records are type "1").
The problem I am encountering is that each day of the hospital stay in the "B" inpatient data source is also listed as a outpatient doctor's visit in the "A" outpatient dataset. If the hospital admission included an emergency room visit, there is also a record in "A" for that. So, in that case there would be three records for the first day (one inpatient, one emergency, one outpatient) and one outpatient record for each subsequent day of the hospital stay. I want to keep the inpatient and emergency room records, while deleting the outpatient records associated with the hospital stay (but keeping other unrelated outpatient records).
I know this is complicated so here is an example:
data have;
input source id type begin_date end_date
datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29JAN2019 .
A 1 3 29JAN2019 .
A 1 3 30JAN2019 .
A 1 3 31JAN2019 .
A 1 3 01FEB2019 .
A 1 3 02FEB2019 .
A 1 3 03FEB2019 .
A 1 3 04FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
A 2 3 01FEB2019 .
A 2 3 02FEB2019 .
A 2 3 03FEB2019 .
;
run;
This is how I would want the end result to look:
data want;
input source id type begin_date end_date
datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
;
run;
I have tried for a long time to figure this out using the lag command, but without luck so far. Any help would be greatly appreciated! Thank you.
Adding one line to what I originally sent you should do the trick.
proc sql; create table joined as select h.* ,o.begin_date as inpatient_begin_date ,o.end_date as inpatient_end_date from have h left join inpatient o on h.id = o.id and h.begin_date between o.begin_date and o.end_date order by h.id, h.begin_date, h.type; quit;
Output:
The "B" outpatient dataset is further divided into type: "2" which are emergency room visits and "3" which are regular doctors visits.
Since your sample data has types 2 and 3 for A, I am assuming you swapped "A" and "B" in your data (so that "A" is outpatient and "B" is inpatient).
With that assumption, your question and data seem to make sense. See if this is what you're looking for:
data have;
input source $ id $ type $ begin_date :date9. end_date :date9.;
format begin_date end_date date9.;
datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29JAN2019 .
A 1 3 29JAN2019 .
A 1 3 30JAN2019 .
A 1 3 31JAN2019 .
A 1 3 01FEB2019 .
A 1 3 02FEB2019 .
A 1 3 03FEB2019 .
A 1 3 04FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
A 2 3 01FEB2019 .
A 2 3 02FEB2019 .
A 2 3 03FEB2019 .
;
run;
data inpatient;
set have (where=(source='B'));
run;
proc sql;
create table joined as
select h.*
,o.begin_date as inpatient_begin_date
,o.end_date as inpatient_end_date
from have h
left join inpatient o
on h.id = o.id
order by h.id, h.begin_date, h.type;
quit;
data want (drop=inpatient:);
set joined;
if not(source = 'A'
and type = '3'
and inpatient_begin_date le begin_date le inpatient_end_date)
then output;
run;
Thank you for the quick response.
Your assumption is correct, that was a typo. I have corrected it above.
I ran this and all it seemed to do was create duplicates of each observation. But thank you for trying! I wish I could explain why but I have a hard time understanding exactly what you did here.
Where are you seeing duplicated observations?
Using the exact code I supplied above, if you run this:
proc print data=want; run;
It results in this, which appears to match your desired output:
My apologies, I should have stated that yes, this works written exactly as above. My problem is that when I revise the code to fit my actual dataset, it does not work. The problem appears to be because a patient can have more than one inpatient stay (which wasn't indicated in my original post). This causes the inpatient_begin_date and inpatient_end_date columns to not function as intended. For example, if a patient has two hospital stays, the inpatient_begin_date and inpatient_end_date columns will have alternating values between the first stay and second stay. I hope this makes sense.
To better illustrate, here is a revised example:
data have;
input source id type begin_date end_date
datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29JAN2019 .
A 1 3 29JAN2019 .
A 1 3 30JAN2019 .
A 1 3 31JAN2019 .
A 1 3 01FEB2019 .
A 1 3 02FEB2019 .
A 1 3 03FEB2019 .
A 1 3 04FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
A 2 3 01FEB2019 .
A 2 3 02FEB2019 .
A 2 3 03FEB2019 .
B 2 1 05FEB2019 07FEB2019
A 2 2 05FEB2019 .
A 2 3 05FEB2019 .
A 2 3 06FEB2019 .
A 2 3 07FEB2019 .
;
run;
This is how I would want the end result to look:
data want;
input source id type begin_date end_date
datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
B 2 1 05FEB2019 07FEB2019
A 2 2 05FEB2019 .
;
run;
Adding one line to what I originally sent you should do the trick.
proc sql; create table joined as select h.* ,o.begin_date as inpatient_begin_date ,o.end_date as inpatient_end_date from have h left join inpatient o on h.id = o.id and h.begin_date between o.begin_date and o.end_date order by h.id, h.begin_date, h.type; quit;
Output:
Since your data are sorted by ID/BEGIN_DATE (but not neccessarily sorted by source within date), you can solve this in one data step:
data have;
input source $ id $ type $ begin_date :date9. end_date :date9.;
format begin_date end_date date9.;
datalines;
A 1 3 28JAN2019 .
B 1 1 29JAN2019 04FEB2019
A 1 2 29JAN2019 .
A 1 3 29JAN2019 .
A 1 3 30JAN2019 .
A 1 3 31JAN2019 .
A 1 3 01FEB2019 .
A 1 3 02FEB2019 .
A 1 3 03FEB2019 .
A 1 3 04FEB2019 .
A 1 3 05FEB2019 .
A 2 3 28JAN2019 .
A 2 3 30JAN2019 .
A 2 3 31JAN2019 .
B 2 1 01FEB2019 03FEB2019
A 2 2 01FEB2019 .
A 2 3 01FEB2019 .
A 2 3 02FEB2019 .
A 2 3 03FEB2019 .
run;
data want (drop=_:);
set have (where=(source='B'))
have (where=(source^='B'));
by id begin_date;
array history {%sysevalf("01jan2018"d):%sysevalf("31dec2020"d)} _temporary_;
if first.id then call missing(of history{*});
if source='B' then do;
output;
do _d=begin_date to end_date; history{_d}=1; end;
end;
else if type='2' then output;
else if history{begin_date}=. then output;
history{begin_date}=1;
run;
BTW, if you are likely to have two A records (one regular followed by one emergency) that does not overlap a B record, this will produce two A records. If you want to force it to output only 1 A record in this case, change the SET statement to:
set have (where=(source='B'))
have (where=(source='A' and type='2'))
have (where=(not(source='B' or (source='A' and type='2')))) ;
which will force, for any given id/date, the B record to be first, followed by the A2, followed by all other records.
Of course you can change the date range of the history array to accommodate your data.
I can confirm that both methods work! Thanks to you both for solving this issue!
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.