09-16-2015 08:58 AM
I have two tables which I need to combine as sometimes some dates are found in table A and not in table B and vice versa. My desired result is that for those overlaps on consecutive days be combined.
How can I achieve this using proc SQL only?
id startdate enddate-------------------------- 101 12/28/2013 12/31/2013
id startdate enddate-------------------------- 101 12/15/2013 12/15/2013 101 12/16/2013 12/16/2013 101 12/28/2013 12/28/2013 101 12/29/2013 12/31/2013
id startdate enddate------------------------- 101 12/15/2013 12/16/2013 101 12/28/2013 12/31/2013
09-16-2015 11:25 AM
It appears that you are collapsing intervals from data set B. Are you doing the same from dataset A? You should include an example for the data set A that clearly isn't in B so we have some more idea what is going on as I can see the final result coming only from B.
And why a restriction for proc sql only? Since order is important that restriction may add complexity to code that could be very straight forward in data step code.
09-16-2015 01:13 PM - edited 09-16-2015 01:16 PM
Sometimes solving problems are all about choosing the right tools. Like @ballardw mentioned, data step is the right tool for this kind of task. That said, it still can be done using Proc SQL, but you may not like the complexity of the code:
proc sql; create table have as select * from table_a union corr select * from table_b ; quit; proc sql; create table want_sql as select distinct id, min(startdate, (select min(startdate) from have where id=a.id and (startdate between a.startdate-1 and a.enddate+1 or enddate between a.startdate and a.enddate or a.startdate between startdate-1 and enddate+1 or a.enddate between startdate and enddate))) as startdate format=date9., max(enddate, (select max(enddate) from have where id=a.id and (startdate between a.startdate-1 and a.enddate+1 or enddate between a.startdate and a.enddate or a.startdate between startdate-1 and enddate+1 or a.enddate between startdate and enddate))) as enddate format=date9. from have a ; quit;
The idea is simple: try to group the interval; while the implementaion is strenous. Good Luck!
09-16-2015 05:29 PM
Below is another pure PROC SQL approach. It uses a self-join twice to (detect earlier and later rows) and enables two rows being collapsed into one. However, it does not deal with collapsing three or more rows into one; not sure if that kind of scenario would happen in your data, based on the example you provided.
PROC SQL; CREATE TABLE have ( id num, startdate num format=date11., enddate num format=date11.); QUIT; PROC SQL; INSERT INTO have values (101, '15-Dec-2013'd, '15-Dec-2013'd) values (101, '16-Dec-2013'd, '16-Dec-2013'd) values (101, '28-Dec-2013'd, '28-Dec-2013'd) values (101, '29-Dec-2013'd, '31-Dec-2013'd) /* add some additional data for good measure ... */ /* note that there are 3 consecutive rows (1/15 to 1/28/2014) */ /* values (201, '15-Jan-2014'd, '17-Jan-2014'd) */ /* values (201, '18-Jan-2014'd, '26-Jan-2014'd) */ /* values (201, '27-Jan-2014'd, '28-Jan-2014'd) */ /* values (201, '01-Feb-2014'd, '01-Feb-2014'd) */ ; QUIT; PROC SQL; CREATE TABLE want AS SELECT id , startdate , coalesce(nextRow_enddate, enddate) format=date11. AS wanted_enddate /* subquery joins current row to earlier row and future row (if they exist) */ FROM (SELECT aa.id , aa.startdate , aa.enddate , monotonic() as seq_num , next.enddate as nextRow_enddate , prior.enddate as priorRow_enddate FROM have as aa LEFT JOIN have as next ON next.id = aa.id AND next.startdate = aa.enddate+1 LEFT JOIN have as prior ON prior.id = aa.id AND prior.enddate = aa.startdate-1 WHERE priorRow_enddate IS NULL) as qry; QUIT;