Solved: Merging two interval datasets

malt0668 · Posted 03-10-2018 09:40 AM

Hello

I am stuck with a problem where i have to merge together two datasets. Both datasets contains a year variable and a ID variable. Furthermre Dataset A contains a Start and a End variable and Dataset B contains also a Start and End variable.

Dataset A: Dataset B

ID Year Start End Type ID Year Start END Type

1 2000 5000 5055 A 1 2000 5025 5100 B

1 2000 5125 5365 A 1 2000 5200 5300 B

1 2001 5366 5600 A 1 2001 5500 5800 B

Okay, so B is more "powerfull" than A, and this one takes over for A if there are some overlaps, and i should therefor end out with something like:

ID Year Start End Type

1 2000 5000 5024 A

1 2000 5025 5100 B

1 2000 5125 5199 A

1 2000 5200 5300 B

1 2000 5301 5365 A

1 2001 5366 5599 A

1 2001 5600 5800 B

I was thinking about doing a PROC SQL join, and then find some patterns, but i can't really do that (or i am not smart enough)

Hope there is someone who can help.

PGStats · Posted 03-11-2018 12:26 AM

You could for example do:

data C;
array x(10000) $1;
do until (last.year);
    set A B;
    by id year;
    do i = start to end;
        x{i} = type;
        end;
    end;
start = 1;
do i = 2 to dim(x);
    if x{i} ne x{i-1} then do;
        type = x{i-1};
        end = i-1;
        if type in ("A", "B") then output;
        start = i;
        end;
    end;
keep id year start end type;
run;

if start and end are ointegers always within the range 2-9999.

PG

View solution in original post

TomKari · Posted 03-10-2018 12:59 PM

These tend to be really hard.

The best way I've found to approach them is to lengthen both datasets by creating one record for every unit of "Start-End". So for example, your first A record of:

ID Year Start End Type
1 2000 5000 5055 A

would become 56 records looking like this:

ID Year SE Type
1 2000 5000 A
1 2000 5001 A
1 2000 5002 A
1 2000 5053 A
1 2000 5054 A
1 2000 5055 A

Once that's done, it's very simple to merge on ID, Year, and SE, and if there's a B record Type becomes B, otherwise it's A.

Then a fairly simple data step can reconsolidate them.

The problem, of course, is that depending on your data volumes this will absolutely explode your file size. If the volumes aren't too bad, though, this is conceptually simple and works well.

Tom

malt0668 · Posted 03-10-2018 01:30 PM

Thanks for the answer, but unfortunately there is roughly 68 million observations in each dataset, so length them would be quite tough for the size of the file .

PGStats · Posted 03-10-2018 11:28 PM

Under certain conditions, you can do this operation with arrays. Are start and end always integers? What are their min and max possible values?

PG

PGStats · Posted 03-11-2018 12:26 AM

You could for example do:

data C;
array x(10000) $1;
do until (last.year);
    set A B;
    by id year;
    do i = start to end;
        x{i} = type;
        end;
    end;
start = 1;
do i = 2 to dim(x);
    if x{i} ne x{i-1} then do;
        type = x{i-1};
        end = i-1;
        if type in ("A", "B") then output;
        start = i;
        end;
    end;
keep id year start end type;
run;

if start and end are ointegers always within the range 2-9999.

PG

Ksharp · Posted 03-11-2018 07:00 AM

data A;
input ID       Year        Start        End            Type $;
cards; 
1         2000       5000        5055                A         
1         2000       5125        5365                A         
1         2001       5366        5600                A     
;
run;

 

data B;
input ID       Year        Start        End       Type $;
cards;
 1            2000        5025        5100           B
 1          2000         5200        5300             B
 1           2001        5500        5800              B
;
run;
data tempA;
 set a;
 do date=start to end;
  output;
 end;
 drop start end;
run;
data tempB;
 set b;
 do date=start to end;
  output;
 end;
 drop start end;
run;
data temp;
 merge tempA tempB(rename=(type=_type));
 by id year date;
  new_type=coalescec(_type,type);
run;
data temp;
 set temp;
 by id year new_type notsorted;
 if first.new_type then group+1;
run;
data want;
 set temp;
 by group;
 retain start;
 if first.group then start=date;
 if last.group then do;end=date;output;end;
 drop date group type _type;
run;

Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Re: Merging two interval datasets

Ready to join fellow brilliant minds for the SAS Hackathon?