Re: Merge two datasets by ID and date range

ari · Posted 02-02-2017 06:26 PM

dataset 1:

ID	DATE
1	10/13/2010
1	10/20/2010
1	10/27/2010
1	11/3/2010
2	1/31/2011
2	1/31/2011
2	2/2/2011
2	2/7/2011
2	2/7/2011
3	7/26/2011
3	7/26/2011
3	7/28/2011
3	8/1/2011
3	8/2/2011
3	8/2/2011
3	8/4/2011
3	8/8/2011
3	8/9/2011

dataset 2:

ID	Count	start	end
1	1	10/18/2010	10/18/2010
2	1	2/1/2011	2/6/2011
3	1	7/27/2011	7/27/2011
3	3	8/2/2011

want:

ID	DATE	Count
1	10/13/2010
1	10/20/2010
1	10/27/2010
1	11/3/2010
2	1/31/2011
2	1/31/2011
2	2/2/2011	1
2	2/7/2011
2	2/7/2011
3	7/26/2011
3	7/26/2011
3	7/28/2011
3	8/1/2011
3	8/2/2011	3
3	8/2/2011	3
3	8/4/2011	3
3	8/8/2011	3
3	8/9/2011	3

I want to merge the dataset1 and dataset2 by id and date, if the date in dataset1 falls in between start and end date of dataset2. in case if the end date is missing in the dataset2 then the count value should be retained across the rest of the observations in dataset1.

Any help to achieve this in sas?

Thanks

anoopmohandas7 · Posted 02-02-2017 11:31 PM

Do you have a field that's unique for each date range (in dataset2). I mean in the data below you gave has ID that looks like the key but is having same value (3) for two different date range ?

PGStats · Posted 02-03-2017 12:05 AM

If I understand the question :

data d1;
input ID	DATE :mmddyy.;
format date yymmdd10.;
datalines;
1	10/13/2010
1	10/20/2010
1	10/27/2010
1	11/3/2010
2	1/31/2011
2	1/31/2011
2	2/2/2011
2	2/7/2011
2	2/7/2011
3	7/26/2011
3	7/26/2011
3	7/28/2011
3	8/1/2011
3	8/2/2011
3	8/2/2011
3	8/4/2011
3	8/8/2011
3	8/9/2011
;

data d2;
input ID	Count	(start 	end) (:mmddyy10.);
format start end yymmdd10.;
datalines;
1	1	10/18/2010	10/18/2010
2	1	2/1/2011	2/6/2011
3	1	7/27/2011	7/27/2011
3	3	8/2/2011	 .
;

data d3;
set d1;
n = _n_;
run;

proc sql;
create table want as
select d3.id, d3.date, 
    max(case when date >= start and date <= coalesce(end, '01jan2100'd) 
        then count
        else . end) as count
from d3 left join d2 on d3.id=d2.id
group by n, d3.id, date;
select * from want;
quit;

You could use SUM instead of MAX if you want the sum of counts when multiple ranges match for a given id.

PG

ari · Posted 02-03-2017 02:35 AM

@PGStats: thanks. when i run it on the complete dataset the following error occurs.

WARNING: A GROUP BY clause has been transformed into an ORDER BY clause because neither the
SELECT clause nor the optional HAVING clause of the associated table-expression
referenced a summary function

I have checked the date format of all variables in both input datasets but not sure why this error is occuring. Any ideas?

PGStats · Posted 02-03-2017 11:00 AM

This warning indicates that your query didn't include a max or sum summary function. Please post the code that you ran.

PG

Merge two datasets by ID and date range