BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bstarr
Quartz | Level 8

Hello,

 

I have several datasets containing various start and end date information and need to remove overlapping dates (see sample code below). For instance, in the below sample, person A had a span of Jan 1 - Nov 30, 2016, and exclude from July 1 - Dec 31, 2016. The resulting start should be Jan 1, 2016 and the end should be June 30, 2016, a day before the exclude start date. Person B had a span of Jan 1 - Dec 2016 and an exclude of Sept 1 - Sept 30, 2016. So they would have two lines for two new start and end dates, the first from Jan 1 - Aug 31, 2016, and another from Oct 1 - Dec 31 2016.

 

Currently, I am doing this by outputting one row per day and using PROC SQL to exclude where ID and date overlap, then I combine consecutive days back into one line. This works, but the problem is that my datasets are much larger--the largest data views I create have 4-6 billion observations and I need to repeat this procedure several times for different datasets.

 

My question is whether there is a simple(ish) way that doesn't involve creating a row for each day of enrollment?

 

data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
;run;

data want;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 12-31-2016
;run;

Thank you,

 

-Brian

1 ACCEPTED SOLUTION

Accepted Solutions
data_null__
Jade | Level 19

This is similar to what you describe you are doing now.  I use 3 views and a PROC SUMMARY to output the ranges.  I tested with multiple exclude ranges for an ID but not when the exclude ranges overlap.  Requires that MASTER and EXCLUDE are ordered BY ID.  

 

data master;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
;
   run;
proc print;
   run;
data masterv / view=masterv;
   set master;
   do date = start to end;
      output;
      end;
   drop start end;
   format date mmddyy10.;
   run;

data exclude;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
;
   run;
proc print;
   run;
data excludeV / view=excludeV;
   set exclude;
   do date = start to end;
      output;
      end;
   drop start end;
   run;

data want0 / view=want0;
   merge masterV(in=in1) excludeV(in=in2);
   by id date;
   if in1 and not in2;
   d = dif(date);
   if first.id then do;
      d=1;
      r=1;
      end;
   if d ne 1 then r+1;
   drop d;
   run;
proc summary data=want0 nway;
   by id r;
   var date;
   output out=want1(drop=_type_) min=start max=end;
   run;
proc print;
   run;


data want;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 12-31-2016
;
   run;
proc print;
   run;

2017-11-29_14-31-31.png

View solution in original post

12 REPLIES 12
novinosrin
Tourmaline | Level 20

Hi Brian,

 

data master;

informat id $1. start end mmddyy10.;

format start end mmddyy10.;

input id $ start end;

datalines;

A 01-01-2016 11-30-2016

B 01-01-2016 12-31-2016

;

run;

 

data exclude;

informat id $1. start end mmddyy10.;

format start end mmddyy10.;

input id $ start end;

datalines;

A 07-01-2016 12-31-2016

B 09-01-2016 09-30-2016

;run;

 

 

data want;

set master;

set exclude(rename=(start=_start end=_end));

if _end>end then do;

__start=start;

__end=_start-1;

output;

end;

else if _end<end then do;

__start=start;

__end=_start-1;

output;

__start=_end+1;

__end=end;

output;

end  ;

format __start __end mmddyy10.;

keep id __:;

run;

 

My assumptions:

1. Your sample is representative of your data i.e with unique id's for each row

2. Master and exclude contains equal number of records

3. Also I used extra variables with prefix __ to help us understand what happens in PDV when we do one to one merge. Obviously, you could concise that if you your understanding of PDV is sound by just resetting the original variable values of start and end.

 

Should it not work, let me know where i am missing the logic

bstarr
Quartz | Level 8

Hi,

 

Thank you for your reply. Unfortunately I have multiple rows per ID, which can vary across source datasets; and the patterns of overlap could be anything. A better sample that's closer to my production data might be more along the lines of the following:

 

 

data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
;
run;

data want;
informat id $1. start end mmddyy10.;
format start end mmddyy10.;
input id $ start end;
datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 10-31-2016
B 12-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;


 

 

data_null__
Jade | Level 19

This is similar to what you describe you are doing now.  I use 3 views and a PROC SUMMARY to output the ranges.  I tested with multiple exclude ranges for an ID but not when the exclude ranges overlap.  Requires that MASTER and EXCLUDE are ordered BY ID.  

 

data master;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
;
   run;
proc print;
   run;
data masterv / view=masterv;
   set master;
   do date = start to end;
      output;
      end;
   drop start end;
   format date mmddyy10.;
   run;

data exclude;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
;
   run;
proc print;
   run;
data excludeV / view=excludeV;
   set exclude;
   do date = start to end;
      output;
      end;
   drop start end;
   run;

data want0 / view=want0;
   merge masterV(in=in1) excludeV(in=in2);
   by id date;
   if in1 and not in2;
   d = dif(date);
   if first.id then do;
      d=1;
      r=1;
      end;
   if d ne 1 then r+1;
   drop d;
   run;
proc summary data=want0 nway;
   by id r;
   var date;
   output out=want1(drop=_type_) min=start max=end;
   run;
proc print;
   run;


data want;
   informat id $1. start end mmddyy10.;
   format start end mmddyy10.; 
   input id $ start end;
   datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 12-31-2016
;
   run;
proc print;
   run;

2017-11-29_14-31-31.png

bstarr
Quartz | Level 8

Thanks, this is great. Although it's similar in nature to what I'm currently doing, it is much more concise and should save a couple steps.

data_null__
Jade | Level 19

@bstarr wrote:

Thanks, this is great. Although it's similar in nature to what I'm currently doing, it is much more concise and should save a couple steps.


I tested with your second post of data but it does not match your WANT.  I don't think WANT in is correct in the second post.

bstarr
Quartz | Level 8

Ah yes. I think I meant to exclude November 2016 for B instead of December, but working with those inputs, WANT should be:

 

data want;
informat id $1. start end mmddyy10.;
format start end mmddyy10.;
input id $ start end;
datalines;
A 01-01-2016 06-30-2016
B 01-01-2016 08-31-2016
B 10-01-2016 11-30-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;run;

And this is why we have SAS to check human error 🙂 Thank you.

SuryaKiran
Meteorite | Level 14

Hi,

 

Take a look at this code which is something close.

data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. ex_start ex_end mmddyy10.;
format ex_start ex_end mmddyy10.; 
input id $ ex_start ex_end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
;
run;

proc sort data=master;
by id;
proc sort data=exclude;
by id;
data master_exclude(keep=ID start1 end1 start2 end2);
format start end ex_start ex_end start1 end1 start2 end2 mmddyy10.;
merge master exclude;
by id;
if start<=ex_start<=end and start<=ex_end<=end then do;
					start1=start;
					end1=ex_start-1;
					start2=ex_end+1;
					end2=end;
					end;
else if ex_start<start and start<=ex_end<=end then do;
						start1=ex_end+1;
					end1=end;
					end;
else if start<=ex_start<=end and ex_end>end then do;
								start1=start;
								end1=ex_start-1;
								end;
if missing(start1) and missing(end1) then do;
				start1=start;
				end1=end;
			end; 
run;

proc sql;
create table want as
select id,start1 format=mmddyy10. as start, end1 format=mmddyy10. as end
	from master_exclude
	where start1 is not missing and end1 is not missing
union
select id,start2 format=mmddyy10. as start, end2 format=mmddyy10. as end
	from master_exclude
where start2 is not missing and end2 is not missing;
quit;
Thanks,
Suryakiran
mkeintz
PROC Star

This solution can also be solved via a date indexed array of one-byte characters, in which all dates marked with an 'X' are to be excluded.  After first building up all the exclusions array for an ID, just run through the master records for the same ID, and look for X's between start and  end:

 

data master;
format start end mmddyy10.; 
input id $1. start :mmddyy10. end :mmddyy10.;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
run;

data exclude;
format start end mmddyy10.; 
input id $1. start :mmddyy10. end :mmddyy10.;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
run;

%let datefirst=01jan2016;
%let datelast=31dec2016;

data want;
  array dates{%sysfunc(inputn(&datefirst,date9.)):%sysfunc(inputn(&datelast,date9.))} $1 _temporary_;

  set exclude (in=inex)  master (in=inmast rename=(start=_s end=_e));
  by id;
  if first.id then call missing(of dates{*});

  if inex then do d=start to end; dates{d}='X'; end;

  if inmast;

  do start=_s to _e;
    if dates{start}=' ' then do end=start to _e until (dates{end}='X');end;
    else continue;
    end=end-1;
    output;
    start=end;
  end;

run;
--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
s_lassen
Meteorite | Level 14

The simplest solution I can think of is to line all the dates up with a switch variable to mark what happens:

data dates;
  set master(in=M) exclude;
  format date mmddyy10.;
  if M then do;
    Switch='M_on ';
    date=start;
    output;
    Switch='M_off';
    date=end;
    output;
    end;
  else do;
    Switch='E_on ';
    date=start;
    output;
    Switch='E_off';
    date=end;
    output;
    end;
  keep id date switch;
run;

proc sort;
  by id date switch; /* We put Exclude before Master on the same date */
run;

data want;
  set dates;
  by id;
  if first.id then do;
    Master=0;
    Exclude=0;
    end;
  select(switch);
    when('M_on') do;
      Master=1;
      start=date;
      end;
    when('M_off') do;
      Master=0;
      if not Exclude then do;
        end=date;
        if end>=start then
          output;
        end;
      end;
    when('E_on') do;
      Exclude=1;
      if Master then do;
        end=date-1;
        if end>=start then
          output;
        end;
      end;
    when('E_off') do;
      Exclude=0;
      if Master then
        start=date+1;
      end;
    end;
  retain start end Master Exclude;
  format start end mmddyy10.;
  keep id start end;
run;
Ksharp
Super User
data master;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 01-01-2016 11-30-2016
B 01-01-2016 12-31-2016
C 01-01-2016 05-31-2016
C 07-01-2016 12-31-2016
;
run;

data exclude;
informat id $1. start end mmddyy10.;
format start end mmddyy10.; 
input id $ start end;
datalines;
A 07-01-2016 12-31-2016
B 09-01-2016 09-30-2016
B 12-01-2016 12-31-2016
D 01-01-2016 12-31-2016
;
data master1;
 set master;
 do day=start to end;
  output;
 end;
 format day mmddyy10.;
 drop start end;
run;
data exclude1;
 set exclude;
 do day=start to end;
  output;
 end;
 format day mmddyy10.;
 drop start end;
run;
data temp;
 merge master1 exclude1(in=inb);
 by id day;
 if inb then delete;
run;
data temp;
 set temp;
 by id;
 if first.id or dif(day) ne 1 then group+1;
run;
data want;
 set temp;
 by group;
 retain start;
 if first.group then start=day;
 if last.group then do;end=day;output;end;
 drop day group;
 format start end mmddyy10.;
run;
bstarr
Quartz | Level 8

Thanks everyone for the insightful responses. As with many things SAS, there are multiple ways to accomplish any given task and the variety of solutions offered is enlightening. 

novinosrin
Tourmaline | Level 20

Hi OP et al, It's greatto notice so many different ways to accomplish the task. I tried the below approach and I really would like to learn how this can tweaked to fit to your requirement in the objective of learning only(as i'm in school). I understand you might be too busy in a production environment, however should you find time to help me learn at your own convenient time with modification of code and feedback from you and other experts, I would most appreciate it. Thank you! and good day.

 

data master;

informat id $1. start end mmddyy10.;

format start end date9.;

input id $ start end;

datalines;

A 01-01-2016 11-30-2016

B 01-01-2016 12-31-2016

C 01-01-2016 05-31-2016

C 07-01-2016 12-31-2016

;

run;

 

data exclude;

informat id $1. start end mmddyy10.;

format start end date9.;

input id $ start end;

datalines;

A 07-01-2016 12-31-2016

B 09-01-2016 09-30-2016

B 12-01-2016 12-31-2016

D 01-01-2016 12-31-2016

;

run;

 

 

proc sort data=master;

by id start;

run;

 

proc sort data=exclude;

by id start;

run;

 

data want;

 if _N_ = 1 then do;

 if 0 then set exclude(rename=(start=ex_start end=ex_end));

     dcl hash h(dataset: 'exclude(rename=(start=ex_start end=ex_end))', ordered: 'a', multidata:'y');

     h.definekey('id');

     h.definedata('ex_start','ex_end');

     h.definedone();

  end;

set master(rename=(start=m_start end=m_end));

rc=h.find();

if rc=0 then

     do;

           start=m_start;

           do while(rc=0);

                if m_start<=ex_start<=m_end and m_start<ex_end<=m_end then

                     do;

                           end=ex_start-1;

                           output;

                           start=ex_end+1;

                           rc=h.find_next();

                     end;

                else if m_start<=ex_start<=m_end and not(m_start<ex_end<=m_end) then

                     do;

                           end=ex_start-1;

                           output;

                           rc=h.find_next();

                     end;

           end;

     end;      

else

     do;

           start=m_start;

           end=m_end;

           output;

     end;

format start end mmddyy10.;

keep id start end;

run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 12 replies
  • 4139 views
  • 4 likes
  • 7 in conversation