Hello
I am stuck with a problem where i have to merge together two datasets. Both datasets contains a year variable and a ID variable. Furthermre Dataset A contains a Start and a End variable and Dataset B contains also a Start and End variable.
Dataset A: Dataset B
ID Year Start End Type ID Year Start END Type
1 2000 5000 5055 A 1 2000 5025 5100 B
1 2000 5125 5365 A 1 2000 5200 5300 B
1 2001 5366 5600 A 1 2001 5500 5800 B
Okay, so B is more "powerfull" than A, and this one takes over for A if there are some overlaps, and i should therefor end out with something like:
ID Year Start End Type
1 2000 5000 5024 A
1 2000 5025 5100 B
1 2000 5125 5199 A
1 2000 5200 5300 B
1 2000 5301 5365 A
1 2001 5366 5599 A
1 2001 5600 5800 B
I was thinking about doing a PROC SQL join, and then find some patterns, but i can't really do that (or i am not smart enough)
Hope there is someone who can help.
You could for example do:
data C;
array x(10000) $1;
do until (last.year);
set A B;
by id year;
do i = start to end;
x{i} = type;
end;
end;
start = 1;
do i = 2 to dim(x);
if x{i} ne x{i-1} then do;
type = x{i-1};
end = i-1;
if type in ("A", "B") then output;
start = i;
end;
end;
keep id year start end type;
run;
if start and end are ointegers always within the range 2-9999.
These tend to be really hard.
The best way I've found to approach them is to lengthen both datasets by creating one record for every unit of "Start-End". So for example, your first A record of:
ID Year Start End Type
1 2000 5000 5055 A
would become 56 records looking like this:
ID Year SE Type
1 2000 5000 A
1 2000 5001 A
1 2000 5002 A
1 2000 5053 A
1 2000 5054 A
1 2000 5055 A
Once that's done, it's very simple to merge on ID, Year, and SE, and if there's a B record Type becomes B, otherwise it's A.
Then a fairly simple data step can reconsolidate them.
The problem, of course, is that depending on your data volumes this will absolutely explode your file size. If the volumes aren't too bad, though, this is conceptually simple and works well.
Tom
Thanks for the answer, but unfortunately there is roughly 68 million observations in each dataset, so length them would be quite tough for the size of the file .
Under certain conditions, you can do this operation with arrays. Are start and end always integers? What are their min and max possible values?
You could for example do:
data C;
array x(10000) $1;
do until (last.year);
set A B;
by id year;
do i = start to end;
x{i} = type;
end;
end;
start = 1;
do i = 2 to dim(x);
if x{i} ne x{i-1} then do;
type = x{i-1};
end = i-1;
if type in ("A", "B") then output;
start = i;
end;
end;
keep id year start end type;
run;
if start and end are ointegers always within the range 2-9999.
data A;
input ID Year Start End Type $;
cards;
1 2000 5000 5055 A
1 2000 5125 5365 A
1 2001 5366 5600 A
;
run;
data B;
input ID Year Start End Type $;
cards;
1 2000 5025 5100 B
1 2000 5200 5300 B
1 2001 5500 5800 B
;
run;
data tempA;
set a;
do date=start to end;
output;
end;
drop start end;
run;
data tempB;
set b;
do date=start to end;
output;
end;
drop start end;
run;
data temp;
merge tempA tempB(rename=(type=_type));
by id year date;
new_type=coalescec(_type,type);
run;
data temp;
set temp;
by id year new_type notsorted;
if first.new_type then group+1;
run;
data want;
set temp;
by group;
retain start;
if first.group then start=date;
if last.group then do;end=date;output;end;
drop date group type _type;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.
Find more tutorials on the SAS Users YouTube channel.