SAS Data Integration Studio, DataFlux Data Management Studio, SAS/ACCESS, SAS Data Loader for Hadoop and others

Merging two interval datasets

Accepted Solution Solved
Reply
New Contributor
Posts: 2
Accepted Solution

Merging two interval datasets

Hello 

I am stuck with a problem where i have to merge together two datasets. Both datasets contains a year variable and a ID variable. Furthermre Dataset A contains a Start and a End variable and Dataset B contains also a Start and End variable.

Dataset A:                                                            Dataset B

 

ID       Year        Start        End            Type             ID          Year        Start         END          Type

1         2000       5000        5055                A             1            2000        5025        5100           B

1         2000       5125        5365                A             1          2000         5200        5300             B

1         2001       5366        5600                A             1           2001        5500        5800              B

 

Okay, so B is more "powerfull" than A, and this one takes over for A if there are some overlaps, and i should therefor end out with something like:

 

ID       Year        Start        End           Type

1         2000       5000        5024              A      

1         2000       5025       5100               B             

1         2000       5125        5199               A

1         2000       5200       5300               B

1         2000       5301       5365               A

1         2001       5366       5599               A

1         2001       5600       5800              B

 

I was thinking about doing a PROC SQL join, and then find some patterns, but i can't really do that (or i am not smart enough)

 

Hope there is someone who can help.

  

 


Accepted Solutions
Solution
‎03-11-2018 05:20 AM
Esteemed Advisor
Posts: 5,474

Re: Merging two interval datasets

You could for example do:

 

data C;
array x(10000) $1;
do until (last.year);
    set A B;
    by id year;
    do i = start to end;
        x{i} = type;
        end;
    end;
start = 1;
do i = 2 to dim(x);
    if x{i} ne x{i-1} then do;
        type = x{i-1};
        end = i-1;
        if type in ("A", "B") then output;
        start = i;
        end;
    end;
keep id year start end type;
run;

if start and end are ointegers always within the range 2-9999.

 

PG

View solution in original post


All Replies
PROC Star
Posts: 1,288

Re: Merging two interval datasets

These tend to be really hard.

 

The best way I've found to approach them is to lengthen both datasets by creating one record for every unit of "Start-End". So for example, your first A record of:

 

ID Year Start End Type
1 2000 5000 5055 A

 

would become 56 records looking like this:

 

ID Year SE Type
1 2000 5000 A
1 2000 5001 A
1 2000 5002 A
1 2000 5053 A
1 2000 5054 A
1 2000 5055 A

 

Once that's done, it's very simple to merge on ID, Year, and SE, and if there's a B record Type becomes B, otherwise it's A.

Then a fairly simple data step can reconsolidate them.

 

The problem, of course, is that depending on your data volumes this will absolutely explode your file size. If the volumes aren't too bad, though, this is conceptually simple and works well.

 

Tom

New Contributor
Posts: 2

Re: Merging two interval datasets

Thanks for the answer, but unfortunately there is roughly 68 million observations in each dataset, so length them would be quite tough for the size of the file .

Esteemed Advisor
Posts: 5,474

Re: Merging two interval datasets

[ Edited ]

Under certain conditions, you can do this operation with arrays. Are start and end always integers? What are their min and max possible values?

PG
Solution
‎03-11-2018 05:20 AM
Esteemed Advisor
Posts: 5,474

Re: Merging two interval datasets

You could for example do:

 

data C;
array x(10000) $1;
do until (last.year);
    set A B;
    by id year;
    do i = start to end;
        x{i} = type;
        end;
    end;
start = 1;
do i = 2 to dim(x);
    if x{i} ne x{i-1} then do;
        type = x{i-1};
        end = i-1;
        if type in ("A", "B") then output;
        start = i;
        end;
    end;
keep id year start end type;
run;

if start and end are ointegers always within the range 2-9999.

 

PG
Super User
Posts: 10,681

Re: Merging two interval datasets

data A;
input ID       Year        Start        End            Type $;
cards; 
1         2000       5000        5055                A         
1         2000       5125        5365                A         
1         2001       5366        5600                A     
;
run;

 

data B;
input ID       Year        Start        End       Type $;
cards;
 1            2000        5025        5100           B
 1          2000         5200        5300             B
 1           2001        5500        5800              B
;
run;
data tempA;
 set a;
 do date=start to end;
  output;
 end;
 drop start end;
run;
data tempB;
 set b;
 do date=start to end;
  output;
 end;
 drop start end;
run;
data temp;
 merge tempA tempB(rename=(type=_type));
 by id year date;
  new_type=coalescec(_type,type);
run;
data temp;
 set temp;
 by id year new_type notsorted;
 if first.new_type then group+1;
run;
data want;
 set temp;
 by group;
 retain start;
 if first.group then start=date;
 if last.group then do;end=date;output;end;
 drop date group type _type;
run;
☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 5 replies
  • 168 views
  • 0 likes
  • 4 in conversation