Dates and Merging.

SamT · Posted 06-09-2011 01:46 AM

Hello,

I've got two data sets.

I'm trying to get these...

id,StartDate, Value,
001,01JUL2010, 6050
002,27MAY2010, 9050
003, 11DEC2010, 16500

into...

id, StartDate, EndDate, OtherValue
001, 01SEP2009,17DEC2009, 7500
001, 18DEC2009,29APR2010, 7000
001, 30APR2010,31DEC9999, 6000
002, 10MAY2010,16AUG2010, 11000
002, 17AUG2010,26NOV2010,10500
002, 27NOV2010,11APR2011,9500
002, 12APR2011,31DEC9999,9000
003, 01DEC2010,12JAN2011,20000
003, 13JAN2011, 25MAY2011,19500
003, 26MAY2011, 31DEC9999, 18990

and return the other value that would have been during that time...

id, StartDate, Value, OtherValue
001,01JUL2010, 6050, 6000
002,27MAY2010, 9050, 11000
003,11DEC2010, 16500, 20000

So I have a variable to merge on but the startdate variable in the first set needs to read in between the start and end date in the second one and then output the other value variable.

I'm not very experienced with any merging outside of indexing...

data name (index=());
set sample;
set sample2 key=/unique;
if _error_ then delete (or output...);
run;

Thanks in advance.

Ksharp · Posted 06-09-2011 02:46 AM

Or If you like, Hash Table also do it with more efficient.

[pre]

data temp1;
infile datalines dlm=' ,';
input id $ StartDate : date9. Value;
format StartDate date9.;
datalines;
001,01JUL2010, 6050
002,27MAY2010, 9050
003, 11DEC2010, 16500
;
run;
data temp2;
infile datalines dlm=' ,';
input id $ StartDate : date9. EndDate : date9. OtherValue;
datalines;
001, 01SEP2009,17DEC2009, 7500
001, 18DEC2009,29APR2010, 7000
001, 30APR2010,31DEC9999, 6000
002, 10MAY2010,16AUG2010, 11000
002, 17AUG2010,26NOV2010,10500
002, 27NOV2010,11APR2011,9500
002, 12APR2011,31DEC9999,9000
003, 01DEC2010,12JAN2011,20000
003, 13JAN2011, 25MAY2011,19500
003, 26MAY2011, 31DEC9999, 18990
;
run;
data want(drop=_StartDate EndDate);
merge temp1 temp2(rename=(StartDate=_StartDate));
by id;
if StartDate ge _StartDate and StartDate le EndDate ;
run;
[/pre]

Ksharp

SamT · Posted 06-13-2011 11:41 PM

Hi KSharp,

Ran you practise script and it worked a treat, applied it to what I was working on and it produced some kind of monster.

Reason I suspect i'm not obtaining the desired results would be that the ID number in the first set can appear multiple times, with a unique StartDate.

What other methods are there of merging?

sbb · Posted 06-14-2011 06:42 AM

Have a look at BY GROUP PROCESSING in the SAS.COM support site.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search argument, this topic / post:

data step by group processing site:sas.com

DF · Posted 06-14-2011 12:05 PM

I think KSharp's method requires that the ID in table temp1 is unique, otherwise a merge would start producing weird results.

I have recently had a similar issue in joining up large time series data. I found the notes on Fuzzy Merges in http://www2.sas.com/proceedings/sugi24/Advtutor/p46-24.pdf to be quite useful in working out a solution.

Assuming that your "lookup" table - Temp2 in KSharp's example code - has no overlapping time ranges, then this should work. I used this method over something like an SQL join because both of my datasets are huge (about 5 million in Temp1 and about 100 million in Temp2).

By using an interleave, I reckon you're only really doing the equivalent of processing both data sets in full, with a single parse.

proc sort data=temp1;
by id startdate;
run;

proc sort data=temp2;
by id startdate;
run;

data want (drop=EndDate OtherValue);
set
temp2 (in=T2)
temp1 (in=T1);
by id startdate;

format Last_StartDate date9.;
format Last_EndDate date9.;
format Last_OtherValue best12.0;

retain Last_StartDate;
retain Last_EndDate;
retain Last_OtherValue;

/*if lookup dataset, then copy the values*/
if t2 then do;
Last_StartDate = StartDate;
Last_EndDate = EndDate;
Last_OtherValue = OtherValue;
end;
/*if the first instance of ID is in table 1, or T2's range has been past, then clear the values*/
if (t1 and (first.ID or Last_EndDate < StartDate)) then do;
Last_StartDate = .;
Last_EndDate = .;
Last_OtherValue = .;
end;

if t1 then output;

run;

Ksharp · Posted 06-15-2011 11:13 PM

Yes. I suppose there is unique id in temp1.
If you have duplicate ids ,then sql 's Cartesian Product is good choice.
[pre]

data temp1;
infile datalines dlm=' ,';
input id $ StartDate : date9. Value;
format StartDate date9.;
datalines;
001,01JUL2010, 6050
001,01OCT2009, 6050
002,27MAY2010, 9050
003, 11DEC2010, 16500
;
run;
data temp2;
infile datalines dlm=' ,';
input id $ StartDate : date9. EndDate : date9. OtherValue;
datalines;
001, 01SEP2009,17DEC2009, 7500
001, 18DEC2009,29APR2010, 7000
001, 30APR2010,31DEC2010, 6000
002, 10MAY2010,16AUG2010, 11000
002, 17AUG2010,26NOV2010,10500
002, 27NOV2010,11APR2011,9500
002, 12APR2011,31DEC2010,9000
003, 01DEC2010,12JAN2011,20000
003, 13JAN2011, 25MAY2011,19500
003, 26MAY2011, 31DEC9999, 18990
;
run;

proc sql;
create table want as
select a.id as id,b.startdate as startdate,value,othervalue
from temp2 as a,temp1 as b
where a.id = b.id and b.startdate between a.startdate and a.enddate
;
quit;
[/pre]

Ksharp

Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

Re: Dates and Merging.

SAS Innovate 2025: Call for Content

Click image to register for webinar

Classroom Training Available!