DATA Step, Macro, Functions and more

Fast way to remove overlapping date spans

Accepted Solution Solved
Reply
Contributor
Posts: 30
Accepted Solution

Fast way to remove overlapping date spans

Hello,

 

I have several datasets containing various start and end date information and need to remove overlapping dates (see sample code below). For instance, in the below sample, person A had a span of Jan 1 - Nov 30, 2016, and exclude from July 1 - Dec 31, 2016. The resulting start should be Jan 1, 2016 and the end should be June 30, 2016, a day before the exclude start date. Person B had a span of Jan 1 - Dec 2016 and an exclude of Sept 1 - Sept 30, 2016. So they would have two lines for two new start and end dates, the first from Jan 1 - Aug 31, 2016, and another from Oct 1 - Dec 31 2016.

 

Currently, I am doing this by outputting one row per day and using PROC SQL to exclude where ID and date overlap, then I combine consecutive days back into one line. This works, but the problem is that my datasets are much larger--the largest data views I create have 4-6 billion observations and I need to repeat this procedure several times for different datasets.

 

My question is whether there is a simple(ish) way that doesn't involve creating a row for each day of enrollment?

 

data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
;run;

data want;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 12-31-2016
;run;

Thank you,

 

-Brian


Accepted Solutions
Solution
‎12-05-2017 09:00 AM
Respected Advisor
Posts: 3,866

Re: Fast way to remove overlapping date spans

This is similar to what you describe you are doing now.  I use 3 views and a PROC SUMMARY to output the ranges.  I tested with multiple exclude ranges for an ID but not when the exclude ranges overlap.  Requires that MASTER and EXCLUDE are ordered BY ID.  

 

data master;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
;
   run;
proc print;
   run;
data masterv / view=masterv;
   set master;
   do date = start to end;
      output;
      end;
   drop start end;
   format date mmddyy10.;
   run;

data exclude;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
;
   run;
proc print;
   run;
data excludeV / view=excludeV;
   set exclude;
   do date = start to end;
      output;
      end;
   drop start end;
   run;

data want0 / view=want0;
   merge masterV(in=in1) excludeV(in=in2);
   by id date;
   if in1 and not in2;
   d = dif(date);
   if first.id then do;
      d=1;
      r=1;
      end;
   if d ne 1 then r+1;
   drop d;
   run;
proc summary data=want0 nway;
   by id r;
   var date;
   output out=want1(drop=_type_) min=start max=end;
   run;
proc print;
   run;


data want;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 12-31-2016
;
   run;
proc print;
   run;

2017-11-29_14-31-31.png

View solution in original post


All Replies
Super User
Posts: 2,048

Re: Fast way to remove overlapping date spans

[ Edited ]

Hi Brian,

 

data master;

informat id $1. start end mmddyy10.;

format start end mmddyy10.;

input id $ start end;

datalines;

A 01-01-2016 11-30-2016

B 01-01-2016 12-31-2016

;

run;

 

data exclude;

informat id $1. start end mmddyy10.;

format start end mmddyy10.;

input id $ start end;

datalines;

A 07-01-2016 12-31-2016

B 09-01-2016 09-30-2016

;run;

 

 

data want;

set master;

set exclude(rename=(start=_start end=_end));

if _end>end then do;

__start=start;

__end=_start-1;

output;

end;

else if _end<end then do;

__start=start;

__end=_start-1;

output;

__start=_end+1;

__end=end;

output;

end  ;

format __start __end mmddyy10.;

keep id __:;

run;

 

My assumptions:

1. Your sample is representative of your data i.e with unique id's for each row

2. Master and exclude contains equal number of records

3. Also I used extra variables with prefix __ to help us understand what happens in PDV when we do one to one merge. Obviously, you could concise that if you your understanding of PDV is sound by just resetting the original variable values of start and end.

 

Should it not work, let me know where i am missing the logic

Contributor
Posts: 30

Re: Fast way to remove overlapping date spans

Posted in reply to novinosrin

Hi,

 

Thank you for your reply. Unfortunately I have multiple rows per ID, which can vary across source datasets; and the patterns of overlap could be anything. A better sample that's closer to my production data might be more along the lines of the following:

 

 

data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
;
run;

data want;
informat id $1. start end mmddyy10.;
format start end mmddyy10.;
input id $ start end;
datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 10-31-2016
B 12-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;


 

 

Solution
‎12-05-2017 09:00 AM
Respected Advisor
Posts: 3,866

Re: Fast way to remove overlapping date spans

This is similar to what you describe you are doing now.  I use 3 views and a PROC SUMMARY to output the ranges.  I tested with multiple exclude ranges for an ID but not when the exclude ranges overlap.  Requires that MASTER and EXCLUDE are ordered BY ID.  

 

data master;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
;
   run;
proc print;
   run;
data masterv / view=masterv;
   set master;
   do date = start to end;
      output;
      end;
   drop start end;
   format date mmddyy10.;
   run;

data exclude;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
;
   run;
proc print;
   run;
data excludeV / view=excludeV;
   set exclude;
   do date = start to end;
      output;
      end;
   drop start end;
   run;

data want0 / view=want0;
   merge masterV(in=in1) excludeV(in=in2);
   by id date;
   if in1 and not in2;
   d = dif(date);
   if first.id then do;
      d=1;
      r=1;
      end;
   if d ne 1 then r+1;
   drop d;
   run;
proc summary data=want0 nway;
   by id r;
   var date;
   output out=want1(drop=_type_) min=start max=end;
   run;
proc print;
   run;


data want;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 12-31-2016
;
   run;
proc print;
   run;

2017-11-29_14-31-31.png

Contributor
Posts: 30

Re: Fast way to remove overlapping date spans

Posted in reply to data_null__

Thanks, this is great. Although it's similar in nature to what I'm currently doing, it is much more concise and should save a couple steps.

Respected Advisor
Posts: 3,866

Re: Fast way to remove overlapping date spans


bstarr wrote:

Thanks, this is great. Although it's similar in nature to what I'm currently doing, it is much more concise and should save a couple steps.


I tested with your second post of data but it does not match your WANT.  I don't think WANT in is correct in the second post.

Contributor
Posts: 30

Re: Fast way to remove overlapping date spans

Posted in reply to data_null__

Ah yes. I think I meant to exclude November 2016 for B instead of December, but working with those inputs, WANT should be:

 

data want;
informat id $1. start end mmddyy10.;
format start end mmddyy10.;
input id $ start end;
datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 11-30-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;run;

And this is why we have SAS to check human error Smiley Happy Thank you.

PROC Star
Posts: 622

Re: Fast way to remove overlapping date spans

Hi,

 

Take a look at this code which is something close.

data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. ex_start ex_end mmddyy10.;
format ex_start ex_end mmddyy10.; 
input id $ ex_start ex_end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
;
run;

proc sort data=master;
by id;
proc sort data=exclude;
by id;
data master_exclude(keep=ID start1 end1 start2 end2);
format start end ex_start ex_end start1 end1 start2 end2 mmddyy10.;
merge master exclude;
by id;
if start<=ex_start<=end and start<=ex_end<=end then do;
					start1=start;
					end1=ex_start-1;
					start2=ex_end+1;
					end2=end;
					end;
else if ex_start<start and start<=ex_end<=end then do;
						start1=ex_end+1;
					end1=end;
					end;
else if start<=ex_start<=end and ex_end>end then do;
								start1=start;
								end1=ex_start-1;
								end;
if missing(start1) and missing(end1) then do;
				start1=start;
				end1=end;
			end; 
run;

proc sql;
create table want as
select id,start1 format=mmddyy10. as start, end1 format=mmddyy10. as end
	from master_exclude
	where start1 is not missing and end1 is not missing
union
select id,start2 format=mmddyy10. as start, end2 format=mmddyy10. as end
	from master_exclude
where start2 is not missing and end2 is not missing;
quit;
Thanks,
Suryakiran
Trusted Advisor
Posts: 1,387

Re: Fast way to remove overlapping date spans

This solution can also be solved via a date indexed array of one-byte characters, in which all dates marked with an 'X' are to be excluded.  After first building up all the exclusions array for an ID, just run through the master records for the same ID, and look for X's between start and  end:

 

data master;
format start end mmddyy10.; 
input id $1. start :mmddyy10. end :mmddyy10.;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
run;

data exclude;
format start end mmddyy10.; 
input id $1. start :mmddyy10. end :mmddyy10.;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
run;

%let datefirst=01jan2016;
%let datelast=31dec2016;

data want;
  array dates{%sysfunc(inputn(&datefirst,date9.)):%sysfunc(inputn(&datelast,date9.))} $1 _temporary_;

  set exclude (in=inex)  master (in=inmast rename=(start=_s end=_e));
  by id;
  if first.id then call missing(of dates{*});

  if inex then do d=start to end; dates{d}='X'; end;

  if inmast;

  do start=_s to _e;
    if dates{start}=' ' then do end=start to _e until (dates{end}='X');end;
    else continue;
    end=end-1;
    output;
    start=end;
  end;

run;
PROC Star
Posts: 274

Re: Fast way to remove overlapping date spans

[ Edited ]

The simplest solution I can think of is to line all the dates up with a switch variable to mark what happens:

data dates;
  set master(in=M) exclude;
  format date mmddyy10.;
  if M then do;
    Switch='M_on ';
    date=start;
    output;
    Switch='M_off';
    date=end;
    output;
    end;
  else do;
    Switch='E_on ';
    date=start;
    output;
    Switch='E_off';
    date=end;
    output;
    end;
  keep id date switch;
run;

proc sort;
  by id date switch; /* We put Exclude before Master on the same date */
run;

data want;
  set dates;
  by id;
  if first.id then do;
    Master=0;
    Exclude=0;
    end;
  select(switch);
    when('M_on') do;
      Master=1;
      start=date;
      end;
    when('M_off') do;
      Master=0;
      if not Exclude then do;
        end=date;
        if end>=start then
          output;
        end;
      end;
    when('E_on') do;
      Exclude=1;
      if Master then do;
        end=date-1;
        if end>=start then
          output;
        end;
      end;
    when('E_off') do;
      Exclude=0;
      if Master then
        start=date+1;
      end;
    end;
  retain start end Master Exclude;
  format start end mmddyy10.;
  keep id start end;
run;
Super User
Posts: 10,848

Re: Fast way to remove overlapping date spans

data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
;
data master1;
 set master;
 do day=start to end;
  output;
 end;
 format day mmddyy10.;
 drop start end;
run;
data exclude1;
 set exclude;
 do day=start to end;
  output;
 end;
 format day mmddyy10.;
 drop start end;
run;
data temp;
 merge master1 exclude1(in=inb);
 by id day;
 if inb then delete;
run;
data temp;
 set temp;
 by id;
 if first.id or dif(day) ne 1 then group+1;
run;
data want;
 set temp;
 by group;
 retain start;
 if first.group then start=day;
 if last.group then do;end=day;output;end;
 drop day group;
 format start end mmddyy10.;
run;
Contributor
Posts: 30

Re: Fast way to remove overlapping date spans

Thanks everyone for the insightful responses. As with many things SAS, there are multiple ways to accomplish any given task and the variety of solutions offered is enlightening. 

Super User
Posts: 2,048

Re: Fast way to remove overlapping date spans

[ Edited ]

Hi OP et al, It's greatto notice so many different ways to accomplish the task. I tried the below approach and I really would like to learn how this can tweaked to fit to your requirement in the objective of learning only(as i'm in school). I understand you might be too busy in a production environment, however should you find time to help me learn at your own convenient time with modification of code and feedback from you and other experts, I would most appreciate it. Thank you! and good day.

 

data master;

informat id $1. start end mmddyy10.;

format start end date9.;

input id $ start end;

datalines;

A 01-01-2016 11-30-2016

B 01-01-2016 12-31-2016

C 01-01-2016 05-31-2016

C 07-01-2016 12-31-2016

;

run;

 

data exclude;

informat id $1. start end mmddyy10.;

format start end date9.;

input id $ start end;

datalines;

A 07-01-2016 12-31-2016

B 09-01-2016 09-30-2016

B 12-01-2016 12-31-2016

D 01-01-2016 12-31-2016

;

run;

 

 

proc sort data=master;

by id start;

run;

 

proc sort data=exclude;

by id start;

run;

 

data want;

 if _N_ = 1 then do;

 if 0 then set exclude(rename=(start=ex_start end=ex_end));

     dcl hash h(dataset: 'exclude(rename=(start=ex_start end=ex_end))', ordered: 'a', multidata:'y');

     h.definekey('id');

     h.definedata('ex_start','ex_end');

     h.definedone();

  end;

set master(rename=(start=m_start end=m_end));

rc=h.find();

if rc=0 then

     do;

           start=m_start;

           do while(rc=0);

                if m_start<=ex_start<=m_end and m_start<ex_end<=m_end then

                     do;

                           end=ex_start-1;

                           output;

                           start=ex_end+1;

                           rc=h.find_next();

                     end;

                else if m_start<=ex_start<=m_end and not(m_start<ex_end<=m_end) then

                     do;

                           end=ex_start-1;

                           output;

                           rc=h.find_next();

                     end;

           end;

     end;      

else

     do;

           start=m_start;

           end=m_end;

           output;

     end;

format start end mmddyy10.;

keep id start end;

run;

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 12 replies
  • 633 views
  • 4 likes
  • 7 in conversation