BookmarkSubscribeRSS Feed
SamT
Calcite | Level 5
Hello,

I've got two data sets.

I'm trying to get these...

id,StartDate, Value,
001,01JUL2010, 6050
002,27MAY2010, 9050
003, 11DEC2010, 16500

into...

id, StartDate, EndDate, OtherValue
001, 01SEP2009,17DEC2009, 7500
001, 18DEC2009,29APR2010, 7000
001, 30APR2010,31DEC9999, 6000
002, 10MAY2010,16AUG2010, 11000
002, 17AUG2010,26NOV2010,10500
002, 27NOV2010,11APR2011,9500
002, 12APR2011,31DEC9999,9000
003, 01DEC2010,12JAN2011,20000
003, 13JAN2011, 25MAY2011,19500
003, 26MAY2011, 31DEC9999, 18990

and return the other value that would have been during that time...

id, StartDate, Value, OtherValue
001,01JUL2010, 6050, 6000
002,27MAY2010, 9050, 11000
003,11DEC2010, 16500, 20000

So I have a variable to merge on but the startdate variable in the first set needs to read in between the start and end date in the second one and then output the other value variable.

I'm not very experienced with any merging outside of indexing...

data name (index=());
set sample;
set sample2 key=/unique;
if _error_ then delete (or output...);
run;

Thanks in advance.
5 REPLIES 5
Ksharp
Super User
Or If you like, Hash Table also do it with more efficient.


[pre]



data temp1;
infile datalines dlm=' ,';
input id $ StartDate : date9. Value;
format StartDate date9.;
datalines;
001,01JUL2010, 6050
002,27MAY2010, 9050
003, 11DEC2010, 16500
;
run;
data temp2;
infile datalines dlm=' ,';
input id $ StartDate : date9. EndDate : date9. OtherValue;
datalines;
001, 01SEP2009,17DEC2009, 7500
001, 18DEC2009,29APR2010, 7000
001, 30APR2010,31DEC9999, 6000
002, 10MAY2010,16AUG2010, 11000
002, 17AUG2010,26NOV2010,10500
002, 27NOV2010,11APR2011,9500
002, 12APR2011,31DEC9999,9000
003, 01DEC2010,12JAN2011,20000
003, 13JAN2011, 25MAY2011,19500
003, 26MAY2011, 31DEC9999, 18990
;
run;
data want(drop=_StartDate EndDate);
merge temp1 temp2(rename=(StartDate=_StartDate));
by id;
if StartDate ge _StartDate and StartDate le EndDate ;
run;
[/pre]




Ksharp
SamT
Calcite | Level 5
Hi KSharp,

Ran you practise script and it worked a treat, applied it to what I was working on and it produced some kind of monster.

Reason I suspect i'm not obtaining the desired results would be that the ID number in the first set can appear multiple times, with a unique StartDate.

What other methods are there of merging?
sbb
Lapis Lazuli | Level 10 sbb
Lapis Lazuli | Level 10
Have a look at BY GROUP PROCESSING in the SAS.COM support site.

Scott Barry
SBBWorks, Inc.

Suggested Google advanced search argument, this topic / post:

data step by group processing site:sas.com
DF
Fluorite | Level 6 DF
Fluorite | Level 6
I think KSharp's method requires that the ID in table temp1 is unique, otherwise a merge would start producing weird results.

I have recently had a similar issue in joining up large time series data. I found the notes on Fuzzy Merges in http://www2.sas.com/proceedings/sugi24/Advtutor/p46-24.pdf to be quite useful in working out a solution.

Assuming that your "lookup" table - Temp2 in KSharp's example code - has no overlapping time ranges, then this should work. I used this method over something like an SQL join because both of my datasets are huge (about 5 million in Temp1 and about 100 million in Temp2).

By using an interleave, I reckon you're only really doing the equivalent of processing both data sets in full, with a single parse.


proc sort data=temp1;
by id startdate;
run;

proc sort data=temp2;
by id startdate;
run;

data want (drop=EndDate OtherValue);
set
temp2 (in=T2)
temp1 (in=T1);
by id startdate;

format Last_StartDate date9.;
format Last_EndDate date9.;
format Last_OtherValue best12.0;

retain Last_StartDate;
retain Last_EndDate;
retain Last_OtherValue;

/*if lookup dataset, then copy the values*/
if t2 then do;
Last_StartDate = StartDate;
Last_EndDate = EndDate;
Last_OtherValue = OtherValue;
end;
/*if the first instance of ID is in table 1, or T2's range has been past, then clear the values*/
if (t1 and (first.ID or Last_EndDate < StartDate)) then do;
Last_StartDate = .;
Last_EndDate = .;
Last_OtherValue = .;
end;

if t1 then output;

run;
Ksharp
Super User
Yes. I suppose there is unique id in temp1.
If you have duplicate ids ,then sql 's Cartesian Product is good choice.
[pre]




data temp1;
infile datalines dlm=' ,';
input id $ StartDate : date9. Value;
format StartDate date9.;
datalines;
001,01JUL2010, 6050
001,01OCT2009, 6050
002,27MAY2010, 9050
003, 11DEC2010, 16500
;
run;
data temp2;
infile datalines dlm=' ,';
input id $ StartDate : date9. EndDate : date9. OtherValue;
datalines;
001, 01SEP2009,17DEC2009, 7500
001, 18DEC2009,29APR2010, 7000
001, 30APR2010,31DEC2010, 6000
002, 10MAY2010,16AUG2010, 11000
002, 17AUG2010,26NOV2010,10500
002, 27NOV2010,11APR2011,9500
002, 12APR2011,31DEC2010,9000
003, 01DEC2010,12JAN2011,20000
003, 13JAN2011, 25MAY2011,19500
003, 26MAY2011, 31DEC9999, 18990
;
run;

proc sql;
create table want as
select a.id as id,b.startdate as startdate,value,othervalue
from temp2 as a,temp1 as b
where a.id = b.id and b.startdate between a.startdate and a.enddate
;
quit;
[/pre]


Ksharp

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 854 views
  • 0 likes
  • 4 in conversation