Re: How do I delete observations with dates occurring within another d...

aokolo

Hello!

I am working with a healthcare dataset looking at hospital inpatient stays. I am working on cleaning and deduplicating this dataset but run into an issue where I have some stays that occur during another longer stay for the same case. For the analysis I want to do I only want to include the initial stay and not the stays that occur within. Is there a way to either label or delete these stays that occur within a longer stay? I appreciate any help I can get. Thank you! Example below:

This is what my data looks like:

ID	AdmitDate	DischargeDate
1	1/1/2025	2/4/2025
1	1/3/2025	1/3/2025
1	1/17/2025	1/17/2025
2	3/4/2025	3/4/2025
3	6/7/2025	6/9/2025
3	6/8/2025	6/8/2025

And this is what I would ultimately like:

ID	AdmitDate	DischargeDate	Delete
1	1/1/2025	2/4/2025
1	1/3/2025	1/3/2025	1
1	1/17/2025	1/17/2025	1
2	3/4/2025	3/4/2025
3	6/7/2025	6/9/2025
3	6/8/2025	6/8/2025	1

ballardw

Dealing with such overlapping intervals you may have to provide some additional rules. Such as this case where a second starts within another interval but extends past the discharge of a previous interval.

ID	AdmitDate	DischargeDate
1	1/1/2025	2/4/2025
1	1/17/2025	3/23/2025

What would you do in this case?

If, and this is a pretty big if in some situations you actually do only want the "initial", defined as the earliest admit date then on approach might be to sort but ID and Admit date.

Then with the sorted data you might be able to use something like this:

Data want;
   set datasorted;
   by id admitdate;
   if first.id;
run;

If the data is sorted by id and admitdate, assumes admitdate is a SAS date value numeric and hopefully with a date type format applied. The BY statement creates automatic variables for each variable that are referenced as First.variablename or Last.variable which are numeric 1/0 valued variables that SAS will treat as true/false to test if a value is the first or last of a given by group. So the IF First.ID, a subsetting IF, keeps only the observations with the first occurrence of an ID value.

However the above likely will not work if there are multiple admit sets, such as ID 1 having another admit 05/05/2025. So a more complete description of likely scenarios and a few more rules are needed.

aokolo

Thank you for taking the time to respond to my question. In response to your example of date overlap, we would conduct a manual review to see what is going on with these two admissions, then decide which one to remove. What is happening in this dataset is there are claims that occur during an inpatient stay (for procedures, services, etc) and those are given different lines in the data on the day that they occur within the inpatient stay. Since we only want to count the inpatient stays and we're not yet looking at services, I wanted a way to flag and remove those lines. And you're correct, the same ID could have multiple inpatient stays that we would want to count so sorting by ID and admitdate would remove other inpatient stays we'd need.

Tom

If you know which records are the real ones why not just keep those?

I think the test is just to eliminate those that fall completely within a longer stay.

data have;
  input ID (AdmitDate DischargeDate) (:mmddyy.);
  format AdmitDate DischargeDate yymmdd10.;
cards;
1 1/1/2025  2/4/2025
1 1/3/2025  1/3/2025
1 1/17/2025 1/17/2025
2 3/4/2025  3/4/2025
3 6/7/2025  6/9/2025
3 6/8/2025  6/8/2025
;

proc print;
run;

proc sql;
create table want as 
select * from have
except 
  (select a.* from have a 
   full join have b 
   on a.id=b.id
   and (a.AdmitDate ne b.AdmitDate or a.DischargeDate ne B.DischargeDate)
   where a.AdmitDate between b.AdmitDate and B.DischargeDate
   and a.DischargeDate between b.AdmitDate and B.DischargeDate
   )
;
quit;

proc print;
run;

But if this is claims data (or direct hospital records) then you probably cannot assume that there exists a single record with the FULL length of stay.

Ksharp

data have;
infile cards truncover;
input ID	(AdmitDate	DischargeDate) (: mmddyy10.);
format AdmitDate	DischargeDate mmddyy10.;
cards;
1	1/1/2025	2/4/2025
1	1/3/2025	1/3/2025
1	1/17/2025	1/17/2025
2	3/4/2025	3/4/2025
3	6/7/2025	6/9/2025
3	6/8/2025	6/8/2025
;

data temp;
 set have;
 do date=AdmitDate	to DischargeDate;
  output;
 end;
 keep id date;
 format date mmddyy10.;
run;
proc sort data=temp nodupkey;
by id date;
run;
data temp2;
 set temp;
 by id;
 if first.id or dif(date) ne 1 then group+1;
run;
proc summary data=temp2;
by id group;
var date;
output out=want(drop=_: group) min=AdmitDate max=DischargeDate ;
run;

Tom

You can perform that algorithm in a single pass (or two passes if you need to first calculate a reasonable MIN/MAX values for the date ranges).

First let's make some data that has non nested intervals and also multiple intervals for the same ID.

data have;
  input ID (AdmitDate DischargeDate) (:mmddyy.);
  format AdmitDate  DischargeDate yymmdd10.;
cards;
1 1/1/2025  2/4/2025
1 1/3/2025  1/3/2025
1 1/17/2025 1/17/2025
2 3/4/2025  3/4/2025
3 6/7/2025  6/9/2025
3 6/8/2025  6/8/2025
4 1/1/2025  2/4/2025
4 1/17/2025 3/23/2025
5 1/1/2025  2/4/2025
5 3/1/2025  4/1/2025
;

Now let's find the min and max dates to help set the possible date range.

proc sql noprint;
select min(min(AdmitDate,DischargeDate))
     , max(max(AdmitDate,DischargeDate))
  into :mind trimmed
     , :maxd trimmed
  from have
;
quit;
%let length=%eval(&maxd-&mind+1);
%put &=length (From %sysfunc(putn(&mind,yymmdd10.)) to %sysfunc(putn(&maxd,yymmdd10.)));

You can skip this step if you already know what values to use for MIND and LENGTH.

Now for each subject use your DAY loop to flag those days and then find the begin/end of each set of flagged days.

Here is an example using a simple character variable. Doing this makes it easy to use the CALL SCAN routine to find the periods.

data periods ;
  length days $&length.;
  do until(last.id);
    set have;
    by id;
    do day=admitdate to dischargedate;
      substr(days,day-&mind.+1,1)='1';
    end;
  end;
  period=1;
  position=1;
  do until(position=0);
    call scan(days,period,position,length,' ');   
    if position then do;
      AdmitDate = &mind.+position-1;
      DischargeDate = AdmitDate + length -1 ;
    end;
    if position or period=1 then output;
    period+1;
  end;
  drop days day position length ;
run;

Results

Note this could get slow if the range of dates is very long since SAS processes very long character variables slowly. In that case you should probably use an ARRAY instead a character variable. You could use some DO loops to find the periods of flagged days instead of the CALL SCAN method.

Ksharp

Tom,
I just want to keep my code simple and readable.
I really don't care about the code is long or short .
I think the short code would bring side effect . The readable code is the most important thing.

Tom

@Ksharp wrote:

Tom,
I just want to keep my code simple and readable.
I really don't care about the code is long or short .
I think the short code would bring side effect . The readable code is the most important thing.

The point was not readability, but performance. Your code made multiple passes through the data.

Ksharp

Tom,
As I said , the most important thing I care about is readable for code, not length, not performance . not others ....
Unless the running time of code is WAY too much I can't accept, in that case I would consider about enhancing the performance of code .
Your code is not easy to read than mine, don't you think so ?

aokolo

Thank you so much! Question, if this is allowed, if I wanted to keep dates that are the same as the admit date and just remove those that occur within the date window, how would I change this code to reflect that? I tried your code on my dataset and it worked but it also ended up removing ED visits that happened on the same day and resulted in the inpatient stay and I would like to keep those.

Ksharp

You want to keep the obs which have the same AdmitDate?

data have;
infile cards truncover;
input ID	(AdmitDate	DischargeDate) (: mmddyy10.);
format AdmitDate	DischargeDate mmddyy10.;
cards;
1	1/1/2025	2/4/2025
1	1/1/2025	4/4/2025
1	1/3/2025	1/3/2025
1	1/17/2025	1/17/2025
2	3/4/2025	3/4/2025
3	6/7/2025	6/9/2025
3	6/8/2025	6/8/2025
;
/*keep dates that are the same as the admit date*/
proc sql;
create table dup as
select * from have 
 group by id,AdmitDate
  having count(*)>1;
quit;


data temp;
 set have;
 do date=AdmitDate	to DischargeDate;
  output;
 end;
 keep id date;
 format date mmddyy10.;
run;
proc sort data=temp nodupkey;
by id date;
run;
data temp2;
 set temp;
 by id;
 if first.id or dif(date) ne 1 then group+1;
run;
proc summary data=temp2;
by id group;
var date;
output out=temp3(drop=_: group) min=AdmitDate max=DischargeDate ;
run;


/*combine them together*/
proc sql;
create table want as
select * from dup
union
select * from temp3;
quit;

How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Re: How do I delete observations with dates occurring within another date for the same ID?

Catch up on SAS Innovate 2026

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away