Hello!
I am working with a healthcare dataset looking at hospital inpatient stays. I am working on cleaning and deduplicating this dataset but run into an issue where I have some stays that occur during another longer stay for the same case. For the analysis I want to do I only want to include the initial stay and not the stays that occur within. Is there a way to either label or delete these stays that occur within a longer stay? I appreciate any help I can get. Thank you! Example below:
This is what my data looks like:
| ID | AdmitDate | DischargeDate |
| 1 | 1/1/2025 | 2/4/2025 |
| 1 | 1/3/2025 | 1/3/2025 |
| 1 | 1/17/2025 | 1/17/2025 |
| 2 | 3/4/2025 | 3/4/2025 |
| 3 | 6/7/2025 | 6/9/2025 |
| 3 | 6/8/2025 | 6/8/2025 |
And this is what I would ultimately like:
| ID | AdmitDate | DischargeDate | Delete |
| 1 | 1/1/2025 | 2/4/2025 | |
| 1 | 1/3/2025 | 1/3/2025 | 1 |
| 1 | 1/17/2025 | 1/17/2025 | 1 |
| 2 | 3/4/2025 | 3/4/2025 | |
| 3 | 6/7/2025 | 6/9/2025 | |
| 3 | 6/8/2025 | 6/8/2025 | 1 |
Dealing with such overlapping intervals you may have to provide some additional rules. Such as this case where a second starts within another interval but extends past the discharge of a previous interval.
| ID | AdmitDate | DischargeDate |
| 1 | 1/1/2025 | 2/4/2025 |
| 1 | 1/17/2025 | 3/23/2025 |
What would you do in this case?
If, and this is a pretty big if in some situations you actually do only want the "initial", defined as the earliest admit date then on approach might be to sort but ID and Admit date.
Then with the sorted data you might be able to use something like this:
Data want; set datasorted; by id admitdate; if first.id; run;
If the data is sorted by id and admitdate, assumes admitdate is a SAS date value numeric and hopefully with a date type format applied. The BY statement creates automatic variables for each variable that are referenced as First.variablename or Last.variable which are numeric 1/0 valued variables that SAS will treat as true/false to test if a value is the first or last of a given by group. So the IF First.ID, a subsetting IF, keeps only the observations with the first occurrence of an ID value.
However the above likely will not work if there are multiple admit sets, such as ID 1 having another admit 05/05/2025. So a more complete description of likely scenarios and a few more rules are needed.
Thank you for taking the time to respond to my question. In response to your example of date overlap, we would conduct a manual review to see what is going on with these two admissions, then decide which one to remove. What is happening in this dataset is there are claims that occur during an inpatient stay (for procedures, services, etc) and those are given different lines in the data on the day that they occur within the inpatient stay. Since we only want to count the inpatient stays and we're not yet looking at services, I wanted a way to flag and remove those lines. And you're correct, the same ID could have multiple inpatient stays that we would want to count so sorting by ID and admitdate would remove other inpatient stays we'd need.
If you know which records are the real ones why not just keep those?
I think the test is just to eliminate those that fall completely within a longer stay.
data have;
input ID (AdmitDate DischargeDate) (:mmddyy.);
format AdmitDate DischargeDate yymmdd10.;
cards;
1 1/1/2025 2/4/2025
1 1/3/2025 1/3/2025
1 1/17/2025 1/17/2025
2 3/4/2025 3/4/2025
3 6/7/2025 6/9/2025
3 6/8/2025 6/8/2025
;
proc print;
run;
proc sql;
create table want as
select * from have
except
(select a.* from have a
full join have b
on a.id=b.id
and (a.AdmitDate ne b.AdmitDate or a.DischargeDate ne B.DischargeDate)
where a.AdmitDate between b.AdmitDate and B.DischargeDate
and a.DischargeDate between b.AdmitDate and B.DischargeDate
)
;
quit;
proc print;
run;
But if this is claims data (or direct hospital records) then you probably cannot assume that there exists a single record with the FULL length of stay.
data have;
infile cards truncover;
input ID (AdmitDate DischargeDate) (: mmddyy10.);
format AdmitDate DischargeDate mmddyy10.;
cards;
1 1/1/2025 2/4/2025
1 1/3/2025 1/3/2025
1 1/17/2025 1/17/2025
2 3/4/2025 3/4/2025
3 6/7/2025 6/9/2025
3 6/8/2025 6/8/2025
;
data temp;
set have;
do date=AdmitDate to DischargeDate;
output;
end;
keep id date;
format date mmddyy10.;
run;
proc sort data=temp nodupkey;
by id date;
run;
data temp2;
set temp;
by id;
if first.id or dif(date) ne 1 then group+1;
run;
proc summary data=temp2;
by id group;
var date;
output out=want(drop=_: group) min=AdmitDate max=DischargeDate ;
run;
You can perform that algorithm in a single pass (or two passes if you need to first calculate a reasonable MIN/MAX values for the date ranges).
First let's make some data that has non nested intervals and also multiple intervals for the same ID.
data have;
input ID (AdmitDate DischargeDate) (:mmddyy.);
format AdmitDate DischargeDate yymmdd10.;
cards;
1 1/1/2025 2/4/2025
1 1/3/2025 1/3/2025
1 1/17/2025 1/17/2025
2 3/4/2025 3/4/2025
3 6/7/2025 6/9/2025
3 6/8/2025 6/8/2025
4 1/1/2025 2/4/2025
4 1/17/2025 3/23/2025
5 1/1/2025 2/4/2025
5 3/1/2025 4/1/2025
;
Now let's find the min and max dates to help set the possible date range.
proc sql noprint;
select min(min(AdmitDate,DischargeDate))
, max(max(AdmitDate,DischargeDate))
into :mind trimmed
, :maxd trimmed
from have
;
quit;
%let length=%eval(&maxd-&mind+1);
%put &=length (From %sysfunc(putn(&mind,yymmdd10.)) to %sysfunc(putn(&maxd,yymmdd10.)));
You can skip this step if you already know what values to use for MIND and LENGTH.
Now for each subject use your DAY loop to flag those days and then find the begin/end of each set of flagged days.
Here is an example using a simple character variable. Doing this makes it easy to use the CALL SCAN routine to find the periods.
data periods ;
length days $&length.;
do until(last.id);
set have;
by id;
do day=admitdate to dischargedate;
substr(days,day-&mind.+1,1)='1';
end;
end;
period=1;
position=1;
do until(position=0);
call scan(days,period,position,length,' ');
if position then do;
AdmitDate = &mind.+position-1;
DischargeDate = AdmitDate + length -1 ;
end;
if position or period=1 then output;
period+1;
end;
drop days day position length ;
run;
Results
Note this could get slow if the range of dates is very long since SAS processes very long character variables slowly. In that case you should probably use an ARRAY instead a character variable. You could use some DO loops to find the periods of flagged days instead of the CALL SCAN method.
Tom,
I just want to keep my code simple and readable.
I really don't care about the code is long or short .
I think the short code would bring side effect . The readable code is the most important thing.
@Ksharp wrote:
Tom,
I just want to keep my code simple and readable.
I really don't care about the code is long or short .
I think the short code would bring side effect . The readable code is the most important thing.
The point was not readability, but performance. Your code made multiple passes through the data.
You want to keep the obs which have the same AdmitDate?
data have; infile cards truncover; input ID (AdmitDate DischargeDate) (: mmddyy10.); format AdmitDate DischargeDate mmddyy10.; cards; 1 1/1/2025 2/4/2025 1 1/1/2025 4/4/2025 1 1/3/2025 1/3/2025 1 1/17/2025 1/17/2025 2 3/4/2025 3/4/2025 3 6/7/2025 6/9/2025 3 6/8/2025 6/8/2025 ; /*keep dates that are the same as the admit date*/ proc sql; create table dup as select * from have group by id,AdmitDate having count(*)>1; quit; data temp; set have; do date=AdmitDate to DischargeDate; output; end; keep id date; format date mmddyy10.; run; proc sort data=temp nodupkey; by id date; run; data temp2; set temp; by id; if first.id or dif(date) ne 1 then group+1; run; proc summary data=temp2; by id group; var date; output out=temp3(drop=_: group) min=AdmitDate max=DischargeDate ; run; /*combine them together*/ proc sql; create table want as select * from dup union select * from temp3; quit;
Dive into keynotes, announcements and breakthroughs on demand.
Explore Now →Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.