Solved: How to perform a complex merge/matching procedure from two datasets - Page 2

machete · Posted 12-17-2012 04:40 PM

Ksharp, It works Thank you so much Below I have a small set of questions that are related on how to improve the code you provided and how to do a similar merge. Hope you will be able to support me on this.

I would like to make a similar merge of the datasets as the one described before. This time instead of exact time matching, I would like to “locate” the value in dataset “Chfreuters3” which is 60 seconds or e.g 1800 seconds after datetime in Chfdaily2. Any suggestions on how I should change the code you provided? Also I would like to do the opposite, locate the value which is 60 seconds before datetime in Chfdaily2. The difference from the exercise from before is that in the output table I would like to add only a specified number of variables, e.g datetime and midquote and not all from chfreuters3.
Going back to the code you provided: How datetime can be included in the final output table for dataset Chfreuters3, since datetime of dataset Chfdaily2 might be different of datetime of dataset Chfreuters3 in the output table. This is the case when we do not have an exact time match from the two sets and I would like to retain both new dateformats.
Can you have a look at the result I uploaded under filename want? Why do you think row 5 gives such a result?
For info here is the code I used – I only updated yours by using different datasets and library locations

libname neo v9 'd:\phd thesis\sas files';

data neo.c2;

set neo.Chfdaily2;

datetime=input(catx(':',value_date,newtime),anydtdtm20.);

format datetime datetime.;

run;

data neo.c3;

set neo.Chfr;

datetime=dhms(datepart(Date_G_),0,0,int(Time_G_));

format datetime datetime.;

run;

data neo.want;

if _n_ eq 1 then do;

if 0 then set c3;

declare hash ha(hashexp:16,dataset:'c3');

ha.definekey('_ric2','datetime');

ha.definedata(all:'Y');

ha.definedone();

end;

call missing(of _all_);

set c2;

_ric2 =cur1cur2;

datepart=datepart(datetime);

do while(ha.find() ne 0);

datetime+1;

if datepart ne datepart(datetime) then leave;

end;

drop datepart LOC Price Volume;

run;

Ksharp · Posted 12-17-2012 09:42 PM

OK. Let's look at it.

"I would like to “locate” the value in dataset “Chfreuters3” which is 60 seconds or e.g 1800 seconds after datetime in Chfdaily2.

I would like to add only a specified number of variables, e.g datetime and midquote and not all from chfreuters3"

That would be more simple than one before. See the code below.

"How datetime can be included in the final output table for dataset Chfreuters3, since datetime of dataset Chfdaily2 might be different of datetime of dataset Chfreuters3 in the output table."

Actually the datetime doesn't matter. I use datetime to locate the recode in Chfreuters3. The value of datetime will change during the data step. You can ignore it, just see the value_date,newtime in Chfdaily2 .

"Can you have a look at the result I uploaded under filename want? Why do you think row 5 gives such a result?"

You can find all the value of variables from Chfreuters3 are missing. That usually means we can't find a matched record from Chfreuters3. variable datetime will dynamically change during data step, therefore its value is not actually display its original value. You can ignore it , just see value_date,newtime in Chfdaily2 or Date_G_,Time_G_ in Chfreuters3 to identify the record .

For row 5:

datetime value_date newtime

21APR12:00:00:00 20/04/2012 22:02:19

We will start to locate the record from newtime=22:02:19 , and will find next newtime=22:02:20 and next and next again , until it cross the line of next day(datetime=21APR12:00:00:00 ). You demand locating the record in the same day, therefore we stop locating when it reach the tomorrow.

Was it more clear ?

libname x v9 'd:\software';
data  c2;
 set x.Chfdaily2;
 datetime=input(catx(':',value_date,newtime),anydtdtm20.);
format datetime datetime.;
run;
data c3;
 set x.Chfreuters3;
 datetime=dhms(datepart(Date_G_),0,0,int(Time_G_));
 format datetime datetime.;
run;
data want;
 if _n_ eq 1 then do;
 if 0 then set c3(keep=_ric2 datetime midquote);
  declare hash ha(hashexp:16,dataset:'c3');
   ha.definekey('_ric2','datetime');
   ha.definedata('datetime','midquote');/*add the variables you need to merge in*/
   ha.definedone();
 end;
call missing(of _all_);
set c2;
_ric2 =cur1cur2;
/*locate the value which is 60 seconds after datetime in Chfdaily2.
  if you want before ,change it to -60*/
datetime=datetime+60;
rc=ha.find();
drop rc _ric2;
run;

Ksharp

Message was edited by: xia keshan

machete · Posted 12-22-2012 07:28 AM

Hi Ksharp,

We are on the right way, I have however some more comments/questions:

1. I apologize, I did not explain explicitly what I wanted to do. I would like to do the same as the first matching we did this time having +60 seconds of newtime as the basis. That is locate the value in time_g_ which is +60sec and if there is no exact match to take the next closest to +60 seconds(e.g +60sec+1?). I tried to adjust the code as below but it did not work…any idea where the mistake might lie? I think the change should be added somewhere in this part:

set c2;

_ric2 =cur1cur2;

datepart=datepart(datetime+60);

do while(ha.find() ne 0);

datetime+1;

if datepart ne datepart(datetime+60) then leave;

end;

drop datepart LOC Price Volume;

run;

2. you wrote 'You can find all the value of variables from Chfreuters3 are missing. That usually means we can't find a matched record from Chfreuters3. variable datetime will dynamically change during data step, therefore its value is not actually display its original value. You can ignore it , just see value_date,newtime in Chfdaily2 or Date_G_,Time_G_ in Chfreuters3 to identify the record .

For row 5:

datetime value_date newtime

21APR12:00:00:00 20/04/2012 22:02:19

We will start to locate the record from newtime=22:02:19 , and will find next newtime=22:02:20 and next and next again , until it cross the line of next day(datetime=21APR12:00:00:00 ). You demand locating the record in the same day, therefore we stop locating when it reach the tomorrow.

Was it more clear ?'

Neo - However there is value that can be matched at time_g_ 22:02:45 in chfreuters3 but it is not matched, since the row/record is empty in our output table. why it is so?

3. I also wanted to ask whether this hash table solution will have any drawbacks in my further analysis. Will it create any limitations:

a. When I run this command for matching in the total datasets which are 2mln and 64mln observations (in our examples we dealt only with one currency and one day, my original dataset has 10 currencies and 80 days)

b. When I use the final table and start running my regressions/analysis

I am simply asking in order to have an idea in advance if I need to change smth to avoid such issues

Thanks and merry christmas to everybody

Neo

art297 · Posted 12-22-2012 01:23 PM

Your example datasets and desired output don't match. One data set has records from January 2012, while the other only has records from April 2012. And, since you refer to variables (sometimes) as Variable A, B, etc., rather than the actual variable names, there is room for confusion as to what you really want.

However, that said, your example big dataset doesn't have entries at specific intervals. A large number have records that are 59 seconds after the previous records, an equally large number have records that are 61 seconds after the previous record, and a small number have larger gaps between the times shown in the records.

You never said (or at least I can't find your response) whether you simply want to match the closest record, regardless of whether it comes before or after the record being matched, or if you are ONLY interest in the exact match or time that comes closest AFTER the datetime of the record from the smaller dataset. To the contrary, I'm seeing what appear to be confusing directions.

Please restate your question using the variable names that actually exist in the datasets and indicate what the real rules are. I think, trying to provide an explanation, you may have confused many of us.

Is the larger dataset already in a sorted order and, if so, in what order?

Ksharp · Posted 12-23-2012 09:09 PM

Oh, That is easy.

You just need to turn datetime+1; into datetime+60;. That is all. You don't need to change datepart which I used to justice whether it cross the line of tomorrow.

"Neo - However there is value that can be matched at time_g_ 22:02:45 in chfreuters3 but it is not matched"

I doubted that _ric2 not equal cur1cur2 . Check it more.

"3. I also wanted to ask whether this hash table solution will have any drawbacks in my further analysis. Will it create any limitations:"

The only limitation is computer memory. your dataset is too large to execute Hash Table.

"I am simply asking in order to have an idea in advance if I need to change smth to avoid such issues"

Can you elaborate it more ?

Ksharp

machete · Posted 01-05-2013 11:07 AM

Hi Ksharp,

Thanks for the feedback. Some comments from my side:

Regarding the new matching process, if I change datetime+60 only as you suggested I do not get the results I want. I want datetime +60 to replace datetime in chfdaily2. In more detail:

_ric2 equal cur1cur2 (as before) AND datetime in chfdaily equals datetime+60 in chfreuters. Also as before if datetime is not equal to datetime+60 then take exactly the next record e.g. datetime+60+1. In simple words I want to match the record in chfdaily2 with the record in chfreuters which is datetime+60seconds and if no exact match exists the record which is AFTER datetime+60, e.g datetime+61 or datetime+62 etc until one record is found.

Is it easy to update the code to perform the above matching process?

2. I checked what you mentioned: "Neo - However there is value that can be matched at time_g_ 22:02:45 in chfreuters3 but it is not matched" - Ksharp: I doubted that _ric2 not equal cur1cur2 . Check it more

Neo - You are right, there are no more records in date 20.04 to be matched, that is why the records stay empty.

3. I asked :"3. I also wanted to ask whether this hash table solution will have any drawbacks in my further analysis. Will it create any limitations:"

You answered: The only limitation is computer memory. your dataset is too large to execute Hash Table.

Neo - What do you suggest as a solution? to execute the code for each currency separately? That is to break down the large dataset into smaller parts?

Best

Neo

machete · Posted 02-05-2013 04:04 PM

Dear Ksharp

Are you able to respond to my remaining open questions 1 and 3 from below? Any answers to those questions are very much appreciated as they are vital in order to continue with a major part of my analysis.

Regarding the new matching process, if I change datetime+60 only as you suggested I do not get the results I want. I want datetime +60 to replace datetime in chfdaily2. In more detail:

_ric2 equal cur1cur2 (as before) AND datetime in chfdaily equals datetime+60 in chfreuters. Also as before if datetime is not equal to datetime+60 then take exactly the next record e.g. datetime+60+1. In simple words I want to match the record in chfdaily2 with the record in chfreuters which is datetime+60seconds and if no exact match exists the record which is AFTER datetime+60, e.g datetime+61 or datetime+62 etc until one record is found.

Is it easy to update the code you provided in the correct answer in order to perform the above matching process?

3. I also had the question :"3. I also wanted to ask whether this hash table solution will have any drawbacks in my further analysis. Will it create any limitations:"

You answered: The only limitation is computer memory. your dataset is too large to execute Hash Table.

Neo - What do you suggest as a solution? to execute the code for each currency separately? That is to break down the large dataset into smaller parts?

Thanks

Neo

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Re: How to perform a complex merge/matching procedure from two datasets

Classroom Training Available!