BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
malt0668
Calcite | Level 5

Hello 

I am stuck with a problem where i have to merge together two datasets. Both datasets contains a year variable and a ID variable. Furthermre Dataset A contains a Start and a End variable and Dataset B contains also a Start and End variable.

Dataset A:                                                            Dataset B

 

ID       Year        Start        End            Type             ID          Year        Start         END          Type

1         2000       5000        5055                A             1            2000        5025        5100           B

1         2000       5125        5365                A             1          2000         5200        5300             B

1         2001       5366        5600                A             1           2001        5500        5800              B

 

Okay, so B is more "powerfull" than A, and this one takes over for A if there are some overlaps, and i should therefor end out with something like:

 

ID       Year        Start        End           Type

1         2000       5000        5024              A      

1         2000       5025       5100               B             

1         2000       5125        5199               A

1         2000       5200       5300               B

1         2000       5301       5365               A

1         2001       5366       5599               A

1         2001       5600       5800              B

 

I was thinking about doing a PROC SQL join, and then find some patterns, but i can't really do that (or i am not smart enough)

 

Hope there is someone who can help.

  

 

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

You could for example do:

 

data C;
array x(10000) $1;
do until (last.year);
    set A B;
    by id year;
    do i = start to end;
        x{i} = type;
        end;
    end;
start = 1;
do i = 2 to dim(x);
    if x{i} ne x{i-1} then do;
        type = x{i-1};
        end = i-1;
        if type in ("A", "B") then output;
        start = i;
        end;
    end;
keep id year start end type;
run;

if start and end are ointegers always within the range 2-9999.

 

PG

View solution in original post

5 REPLIES 5
TomKari
Onyx | Level 15

These tend to be really hard.

 

The best way I've found to approach them is to lengthen both datasets by creating one record for every unit of "Start-End". So for example, your first A record of:

 

ID Year Start End Type
1 2000 5000 5055 A

 

would become 56 records looking like this:

 

ID Year SE Type
1 2000 5000 A
1 2000 5001 A
1 2000 5002 A
1 2000 5053 A
1 2000 5054 A
1 2000 5055 A

 

Once that's done, it's very simple to merge on ID, Year, and SE, and if there's a B record Type becomes B, otherwise it's A.

Then a fairly simple data step can reconsolidate them.

 

The problem, of course, is that depending on your data volumes this will absolutely explode your file size. If the volumes aren't too bad, though, this is conceptually simple and works well.

 

Tom

malt0668
Calcite | Level 5

Thanks for the answer, but unfortunately there is roughly 68 million observations in each dataset, so length them would be quite tough for the size of the file .

PGStats
Opal | Level 21

Under certain conditions, you can do this operation with arrays. Are start and end always integers? What are their min and max possible values?

PG
PGStats
Opal | Level 21

You could for example do:

 

data C;
array x(10000) $1;
do until (last.year);
    set A B;
    by id year;
    do i = start to end;
        x{i} = type;
        end;
    end;
start = 1;
do i = 2 to dim(x);
    if x{i} ne x{i-1} then do;
        type = x{i-1};
        end = i-1;
        if type in ("A", "B") then output;
        start = i;
        end;
    end;
keep id year start end type;
run;

if start and end are ointegers always within the range 2-9999.

 

PG
Ksharp
Super User
data A;
input ID       Year        Start        End            Type $;
cards; 
1         2000       5000        5055                A         
1         2000       5125        5365                A         
1         2001       5366        5600                A     
;
run;

 

data B;
input ID       Year        Start        End       Type $;
cards;
 1            2000        5025        5100           B
 1          2000         5200        5300             B
 1           2001        5500        5800              B
;
run;
data tempA;
 set a;
 do date=start to end;
  output;
 end;
 drop start end;
run;
data tempB;
 set b;
 do date=start to end;
  output;
 end;
 drop start end;
run;
data temp;
 merge tempA tempB(rename=(type=_type));
 by id year date;
  new_type=coalescec(_type,type);
run;
data temp;
 set temp;
 by id year new_type notsorted;
 if first.new_type then group+1;
run;
data want;
 set temp;
 by group;
 retain start;
 if first.group then start=date;
 if last.group then do;end=date;output;end;
 drop date group type _type;
run;

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

How to connect to databases in SAS Viya

Need to connect to databases in SAS Viya? SAS’ David Ghan shows you two methods – via SAS/ACCESS LIBNAME and SAS Data Connector SASLIBS – in this video.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 5 replies
  • 964 views
  • 0 likes
  • 4 in conversation