I have the table like:
source | target | start | end |
A | X | 20050106 | 20070506 |
B | Y | 20030601 | 20060704 |
B | Y | 20060704 | 20080905 |
B | Y | 20080905 | 20091211 |
B | P | 20040704 | 20080905 |
B | P | 20061001 | 20100806 |
C | L | 20070911 | 20081120 |
I hope to merge the time range of the target corresponding to each source, and get this table:
Can you help me? Thank you!
source | target | start | end |
A | X | 20050106 | 20070506 |
B | Y | 20030601 | 20091211 |
B | P | 20040704 | 20100806 |
C | L | 20070911 | 20081120 |
data have;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
data want;
set have end=done;
set
have (
firstobs=2
keep=source target start
rename=(source=_source target=_target start=_start)
)
have (
obs=1
keep=source target start
rename=(source=_source target=_target start=_start)
)
;
*by source target;
retain __start;
if _n_ = 1 then __start = start;
*if first.source then __start = start;
if
source ne _source or target ne _target
or _start > end + 1
or done
then do;
start = __start;
output;
__start = _start;
end;
drop _:;
run;
proc print data=want noobs;
run;
The output:
source target start end A X 20050106 20070506 B Y 20030601 20091211 B P 20040704 20100806 C L 20070911 20081120 D M 20030507 20040506 D M 20050809 20071001
matches your "want" example.
proc sql;
create table want as
select
source,
target,
min(start) as start,
max(end) as end
from have
group by source, target
;
quit;
Thanks for your answer. But I forgot one situation. I don't want to merge non-overlapping times. For example, what should I do in the following situation?
source | target | start | end |
D | M | 20030507 | 20040506 |
D | M | 20050809 | 20071001 |
In this situation, I don't want to do any merge, what should I do? Thank you!
Could be solved by a data step, too, but i don't think, that it is better.
Just to clarify: you don want one obs for each source/target combination with the earliest start date and the latest end date?
For time ranges that are connected or overlap, I want to save the start and end times. For non-overlapping times, I don't want to merge them.
Yes, that situation exists. If the two separate date ranges don't overlap, I prefer not to merge them and leave them as they are.
@Alice_SAS wrote:
Yes, that situation exists. If the two separate date ranges don't overlap, I prefer not to merge them and leave them as they are.
Please post data containing this case and post the data in usable form.
I have the data:
data have;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;
I want to obtain the form:
data want;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20091211
B P 20040704 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;
Thank you!
For the sample data you shared something like below should work.
data have;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;
data want;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20091211
B P 20040704 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
run;
proc sort data=have out=have_sorted;
by source target start end;
run;
data want2;
set have_sorted;
by source target start end;
_next_obs=min(_n_+1,_nobs);
set have_sorted(keep=start rename=(start=_next_start)) point=_next_obs nobs=_nobs;
if last.target or end<_next_start then output;
drop _next_start;
run;
proc print data=want2;
run;
Thank you for your help. However, I found an error in the output of your code regarding the results of B Y 20030601 20091211 and B P 20040704 20100806 in the Want table I provided. How can I modify the code?
data want (drop=_:);
set have (keep=source target);
by source target notsorted;
merge have (rename=(start=_current_start))
have (firstobs=2 keep=start rename=(start=_nxt_start));
retain start ;
format start yymmddn8. ;
if first.target=1 or _current_start-1>lag(end) then start=_current_start;
if last.target=1 or end+1<_nxt_start;
run;
The SET and MERGE statement set up two streams of data, read in parallel.
The first SET is accompanied by the BY statement, allowing use of first.target and last.target dummies. The data are assumed to be grouped by source*target, and within each source*target group, to be sorted by start. (and also sorted by END within each group).
The MERGE statement has two "substreams" of data, also read in parallel, but because of the FIRSTOBS=2 option provides a way to look ahead at the upcoming START value (in variable _nxt_start), so upcoming date gaps can easily be found.
The first IF identifies the beginning of a continuous sequence of observations.
The second if identified the end of same.
data have;
input source:$1. target:$1. start: YYMMDD8. end: YYMMDD8.;
format start YYMMDDn8. end YYMMDDn8.;
datalines;
A X 20050106 20070506
B Y 20030601 20060704
B Y 20060704 20080905
B Y 20080905 20091211
B P 20040704 20080905
B P 20061001 20100806
C L 20070911 20081120
D M 20030507 20040506
D M 20050809 20071001
;
data want;
set have end=done;
set
have (
firstobs=2
keep=source target start
rename=(source=_source target=_target start=_start)
)
have (
obs=1
keep=source target start
rename=(source=_source target=_target start=_start)
)
;
*by source target;
retain __start;
if _n_ = 1 then __start = start;
*if first.source then __start = start;
if
source ne _source or target ne _target
or _start > end + 1
or done
then do;
start = __start;
output;
__start = _start;
end;
drop _:;
run;
proc print data=want noobs;
run;
The output:
source target start end A X 20050106 20070506 B Y 20030601 20091211 B P 20040704 20100806 C L 20070911 20081120 D M 20030507 20040506 D M 20050809 20071001
matches your "want" example.
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.