Solved: Merge data by date range

Alice_SAS · Posted 10-01-2023 10:28 PM

I have the table like:

source	target	start	end
A	X	20050106	20070506
B	Y	20030601	20060704
B	Y	20060704	20080905
B	Y	20080905	20091211
B	P	20040704	20080905
B	P	20061001	20100806
C	L	20070911	20081120

I hope to merge the time range of the target corresponding to each source, and get this table:

Can you help me? Thank you!

source	target	start	end
A	X	20050106	20070506
B	Y	20030601	20091211
B	P	20040704	20100806
C	L	20070911	20081120

Kurt_Bremser · Posted 10-09-2023 06:37 AM

data have;
  input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
  format start YYMMDDn8. end YYMMDDn8.;
  datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;

data want;
set have end=done;
set
  have (
    firstobs=2
    keep=source target start
    rename=(source=_source target=_target start=_start)
  )
  have (
    obs=1
    keep=source target start
    rename=(source=_source target=_target start=_start)
  )
;
*by source target;
retain __start;
if _n_ = 1 then __start = start;
*if first.source then __start = start;
if
  source ne _source or target ne _target 
  or _start > end + 1
  or done
then do;
  start = __start;
  output;
  __start = _start;
end;
drop _:;
run;

proc print data=want noobs;
run;

The output:

source	target	start	end
A	X	20050106	20070506
B	Y	20030601	20091211
B	P	20040704	20100806
C	L	20070911	20081120
D	M	20030507	20040506
D	M	20050809	20071001

matches your "want" example.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

Kurt_Bremser · Posted 10-02-2023 01:43 AM

proc sql;
create table want as
  select
    source,
    target,
    min(start) as start,
    max(end) as end
  from have
  group by source, target
;
quit;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Alice_SAS · Posted 10-09-2023 02:04 AM

Thanks for your answer. But I forgot one situation. I don't want to merge non-overlapping times. For example, what should I do in the following situation?

source	target	start	end
D	M	20030507	20040506
D	M	20050809	20071001

In this situation, I don't want to do any merge, what should I do? Thank you!

andreas_lds · Posted 10-02-2023 02:27 AM

Could be solved by a data step, too, but i don't think, that it is better.

Just to clarify: you don want one obs for each source/target combination with the earliest start date and the latest end date?

Alice_SAS · Posted 10-09-2023 01:58 AM

For time ranges that are connected or overlap, I want to save the start and end times. For non-overlapping times, I don't want to merge them.

Astounding · Posted 10-02-2023 04:50 AM

Is it possible that a SOURCE/TARGET might contain two separate date ranges that do not overlap? If so, what should the result be?

Alice_SAS · Posted 10-09-2023 01:54 AM

Yes, that situation exists. If the two separate date ranges don't overlap, I prefer not to merge them and leave them as they are.

andreas_lds · Posted 10-09-2023 02:30 AM

@Alice_SAS wrote:

Yes, that situation exists. If the two separate date ranges don't overlap, I prefer not to merge them and leave them as they are.

Please post data containing this case and post the data in usable form.

Alice_SAS · Posted 10-09-2023 03:58 AM

I have the data:

data have;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;

I want to obtain the form:

data want;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20091211
B P 20040704 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;

Thank you!

Patrick · Posted 10-09-2023 04:40 AM

For the sample data you shared something like below should work.

data have;
  input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
  format start YYMMDDn8. end YYMMDDn8.;
  datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;

data want;
  input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
  format start YYMMDDn8. end YYMMDDn8.;
  datalines;
A X 20050106 20070506
B Y 20030601 20091211
B P 20040704 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;

proc sort data=have out=have_sorted;
  by source target start end;
run;

data want2;
  set have_sorted;
  by source target start end;

  _next_obs=min(_n_+1,_nobs);
  set have_sorted(keep=start rename=(start=_next_start)) point=_next_obs nobs=_nobs;
  if last.target or end<_next_start then output;
  drop _next_start;
run;

proc print data=want2;
run;

Alice_SAS · Posted 10-09-2023 05:59 AM

Thank you for your help. However, I found an error in the output of your code regarding the results of B Y 20030601 20091211 and B P 20040704 20100806 in the Want table I provided. How can I modify the code？

mkeintz · Posted 10-09-2023 02:26 PM

data want (drop=_:);
  set have (keep=source target);
  by source target  notsorted;

  merge have (rename=(start=_current_start))
        have (firstobs=2 keep=start rename=(start=_nxt_start));

  retain start ;  
  format start yymmddn8. ;

  if first.target=1 or _current_start-1>lag(end) then start=_current_start;
  if last.target=1  or end+1<_nxt_start;
run;

The SET and MERGE statement set up two streams of data, read in parallel.

The first SET is accompanied by the BY statement, allowing use of first.target and last.target dummies. The data are assumed to be grouped by source*target, and within each source*target group, to be sorted by start. (and also sorted by END within each group).

The MERGE statement has two "substreams" of data, also read in parallel, but because of the FIRSTOBS=2 option provides a way to look ahead at the upcoming START value (in variable _nxt_start), so upcoming date gaps can easily be found.

The first IF identifies the beginning of a continuous sequence of observations.

The second if identified the end of same.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Kurt_Bremser · Posted 10-09-2023 06:37 AM

data have;
  input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
  format start YYMMDDn8. end YYMMDDn8.;
  datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;

data want;
set have end=done;
set
  have (
    firstobs=2
    keep=source target start
    rename=(source=_source target=_target start=_start)
  )
  have (
    obs=1
    keep=source target start
    rename=(source=_source target=_target start=_start)
  )
;
*by source target;
retain __start;
if _n_ = 1 then __start = start;
*if first.source then __start = start;
if
  source ne _source or target ne _target 
  or _start > end + 1
  or done
then do;
  start = __start;
  output;
  __start = _start;
end;
drop _:;
run;

proc print data=want noobs;
run;

The output:

source	target	start	end
A	X	20050106	20070506
B	Y	20030601	20091211
B	P	20040704	20100806
C	L	20070911	20081120
D	M	20030507	20040506
D	M	20050809	20071001

matches your "want" example.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

Re: Merge data by date range

SAS Innovate 2025: Register Now

SAS Training: Just a Click Away