art297
Opal | Level 21

I was suggesting making all of the times relative to the closest matching time, based on your comment that "that means the next closest observation might be after one second, but for currencies of small countries the next closest trade might be in 2 mins as these currencies are not traded so often".

That is why I suggested making the categories relative to the first match.  However, I'm a Psychologist, so this is NOT really even close to my areas of expertise.  And, besides, you (not me) are the decision maker.

machete
Calcite | Level 5

Hi Art,

I think this will be easier to solve if we treat it as two sub-problems. The first problem is matching each datetime (in dataset A) to the closest datetime (in dataset B) that is greater than or equal to it. The second problem is then matching each datetime (in dataset A) against the dataset B datetime "+30 seconds / 1 min / 30 mins / 60 mins / 24 hrs as discussed in earlier posts", which will give us 5 more variables in the output.
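For illustration only, the two sub-problems can be sketched in Python on sorted timestamps (this is a hypothetical sketch, not the SAS solution; the function name, window list, and sample data are all made up here):

```python
import bisect
from datetime import datetime, timedelta

# Window offsets as described above: 30 s, 1 min, 30 min, 60 min, 24 h.
WINDOWS = [timedelta(seconds=30), timedelta(minutes=1),
           timedelta(minutes=30), timedelta(minutes=60),
           timedelta(hours=24)]

def match_windows(a_time, b_times):
    """b_times must be sorted ascending. Returns the first B record at or
    after a_time (sub-problem 1), then the first B record at or after each
    window offset (sub-problem 2); None where no B record qualifies."""
    out = []
    for offset in [timedelta(0)] + WINDOWS:
        i = bisect.bisect_left(b_times, a_time + offset)
        out.append(b_times[i] if i < len(b_times) else None)
    return out

b = [datetime(2013, 1, 1, 12, 0, s) for s in (5, 20, 50)]
print(match_windows(datetime(2013, 1, 1, 12, 0, 0), b))
```

With both datasets sorted by datetime, each lookup is a binary search, so the whole pass stays cheap even on millions of rows.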


If you already have code in mind, I would suggest posting it here so that I can test it; I think it will be easier to finalize the solution in small iterations.


Neo

art297
Opal | Level 21

Your call!  In your original post, you had mentioned that some code that had been offered to solve a similar problem might be just what you were looking for.  I still don't understand what you want for the initial match, but here is a modified version of the code that PGStats had offered.  It simply gets the first Reuters record that is greater than or equal to the chfdaily record.  It assumes that the Reuters data is sorted in datetime order:

libname art "c:\art";

data chfdaily2 (drop=value_date newtime);
  length Cur1Cur2 $7;
  format Cur1Cur2 $7.;
  informat Cur1Cur2 $7.;
  set art.chfdaily2;
  format dt1 datetime19.;
  dt1=dhms(input(value_date,ddmmyy10.),
     hour(input(newtime,time8.)),
     minute(input(newtime,time8.)),
     second(input(newtime,time8.)));
run;

proc sort data=chfdaily2;
  by dt1;
run;

data chfreuters3 (drop=date_G_ time_G_);
  set art.chfreuters3;
  format dt2 datetime19.;
  dt2=dhms(datepart(date_G_),hour(time_G_),minute(time_G_),second(time_G_));
run;

data want;
  retain start_rpointer;
  set chfdaily2;
  if _n_ eq 1 then start_rpointer=1;
  current_rpointer=start_rpointer;
  do while (dt1 ge dt2 and not have2end);
    set chfreuters3 point=current_rpointer end=have2end;
    if dt2 ge dt1 and Cur1Cur2 eq _ric2 then do;
      output;
      start_rpointer=current_rpointer;
    end;
    else do;
      current_rpointer+1;
    end;
  end;
run;
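For readers less familiar with the DATA step, the forward-pointer idea it uses can be sketched roughly in Python (an illustration under the same assumptions, not a line-by-line translation; the data shown is hypothetical):

```python
# Rough sketch of the forward-pointer scan: for each A record, advance a
# saved pointer through B until a record at or after the A time with a
# matching currency is found. Both lists must be sorted by time, so the
# pointer never needs to move backwards.
def first_match(a_records, b_records):
    matches = []
    start = 0  # retained across A records, like start_rpointer
    for a_time, a_cur in a_records:
        i = start
        while i < len(b_records):
            b_time, b_cur = b_records[i]
            if b_time >= a_time and b_cur == a_cur:
                matches.append((a_time, b_time))
                start = i  # later A times cannot match earlier B records
                break
            i += 1
    return matches

a = [(10, "CHF=")]
b = [(5, "CHF="), (12, "EUR="), (15, "CHF=")]
print(first_match(a, b))  # expect [(10, 15)]
```

Because the saved pointer only ever moves forward, the scan over B is amortized linear rather than quadratic, which is what makes the approach feasible on large files.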

machete
Calcite | Level 5

Dear Art,

Thanks for the reply. This code worked for doing the first match. However, I am facing two problems:

1. When I try to run this on a larger sample - one currency (14,000 observations in the first dataset to be matched against 1.5 million from Reuters) for the whole period (68 days) - the system crashes after two hours of processing. I tried this a couple of times with other combinations of currencies, but it still crashes. Is there a way we can adjust the code to execute stepwise, day by day (which would mean breaking the dataset into 68 days automatically and then re-merging it)? Breaking down the dataset manually into days and currencies would probably take me days to do. Do you have any other possible solutions?

2. Can you provide an update solving my second problem as well - matching the Reuters records which are 30, 60, 720, etc. seconds further in time than the record in Dataset A? Or shall we find a stable solution for point no. 1 above first and then deal with this one?

Thanks

Neo

art297
Opal | Level 21

Without knowing why it crashed, or actually seeing the data, it is impossible to say.  Given that the code is rather straightforward, my guess would be that the Reuters data is not in datetime order or, possibly, that it is in received-datetime order but has to be adjusted for timezones in order for the datetimes to be sequentially ordered.  Have you manually looked at a section of the data (say, possibly, a one-hour span in the middle of some weekday)?

From what you would observe, the solution may be obvious.

machete
Calcite | Level 5

Hi Art,

The Reuters data are arranged by currency and then by date; however, I had already created a new file containing only one currency when running the code, thus with ordered dates. The date format is standard, the total Reuters dataset is based on GMT, and on the random days I checked, records are ordered by time, beginning at 00:01 and ending at 23:59.

Below are the error messages I received when executing the code (I think also when executing one additional command). Do you have an idea what might be the problem?

WARNING: Multiple lengths were specified for the variable Cur1Cur2 by input data set(s). This
         may cause truncation of data.
NOTE: There were 16529 observations read from the data set NEO.CHFDAILY.
NOTE: The data set WORK.CHFDAILY has 16529 observations and 42 variables.
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be
      shifted by the "BEST" format.
NOTE: DATA statement used (Total process time):
      real time           0.55 seconds
      cpu time            0.59 seconds

389
390  proc sort data=chfdaily;
391    by dt1;
392  run;

NOTE: There were 16529 observations read from the data set WORK.CHFDAILY.
NOTE: The data set WORK.CHFDAILY has 16529 observations and 42 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.40 seconds
      cpu time            0.45 seconds

393
394  data chfreuters (drop=date_G_ time_G_);
395    set neo.chfreuters;
396    format dt2 datetime19.;
397    dt2=dhms(datepart(date_G_),hour(time_G_),minute(time_G_),second(time_G_));
398  run;

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      4126234 at 397:7    4126234 at 397:12
NOTE: There were 4126234 observations read from the data set NEO.CHFREUTERS.
NOTE: The data set WORK.CHFREUTERS has 4126234 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           35.60 seconds
      cpu time            6.52 seconds

399
400  data want;
401    retain start_rpointer;
402    set chfdaily;
403    if _n_ eq 1 then start_rpointer=1;
404    current_rpointer=start_rpointer;
405    do while (dt1 ge dt2 and not have2end);
406      set chfreuters point=current_rpointer end=have2end;
407      if dt2 ge dt1 and Cur1Cur2 eq _ric2 then do;
408        output;
409        start_rpointer=current_rpointer;
410      end;
411      else do;
412        current_rpointer+1;
413      end;
414    end;
415  run;

NOTE: The DATA step has been abnormally terminated.
NOTE: There were 1 observations read from the data set WORK.CHFDAILY.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
         observations and 63 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           1:45:49.23
      cpu time            1:44:46.60

416
417
418  PROC SORT
419  data=neo.cobasetneo NODUP;
420  BY _all_;
421  RUN;

ERROR: No disk space is available for the write operation.  Filename =
       C:\Users\Ali\AppData\Local\Temp\SAS Temporary
       Files\SAS_util00010000218C_Ali-PC\ut218C000002.utl.
ERROR: Failure while attempting to write page 910 of sorted run 120.
ERROR: Failure while attempting to write page 268660 to utility file 1.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: Sort execution failure.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1890001 observations read from the data set NEO.COBASETNEO.
WARNING: The data set NEO.COBASETNEO may be incomplete. When this step was stopped there were
         0 observations and 43 variables.
WARNING: Data set NEO.COBASETNEO was not replaced because this step was stopped.
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be
      shifted by the "BEST" format.
NOTE: PROCEDURE SORT used (Total process time):
      real time           28:12.22
      cpu time            3:02.64

Do you think that making the datasets shorter (a smaller number of variables) for the matching procedure and then re-merging, e.g. based on the deal_Id, would also be an option?
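The slim-then-re-merge idea can be sketched in Python for illustration (all field names here, including deal_id, dt, cur, and amount, are hypothetical stand-ins for the real variables):

```python
# Hypothetical sketch: run the matching on a slim copy carrying only the
# key fields, then join the full records back by deal_id afterwards.
full_a = [
    {"deal_id": 1, "dt": 100, "cur": "CHF=", "amount": 5.0},
    {"deal_id": 2, "dt": 200, "cur": "CHF=", "amount": 7.5},
]

# Step 1: slim copy with only what the matching step needs.
slim = [{"deal_id": r["deal_id"], "dt": r["dt"], "cur": r["cur"]}
        for r in full_a]

# ... matching runs on `slim`, producing (deal_id, matched_dt) pairs ...
matched = [(1, 105), (2, 203)]

# Step 2: re-merge the matched times onto the full records by deal_id.
by_id = {r["deal_id"]: r for r in full_a}
result = [dict(by_id[d], matched_dt=m) for d, m in matched]
print(result[0]["matched_dt"])  # 105
```

Dropping unneeded variables before the match shrinks every intermediate file the step has to write, which is exactly where the disk-space pressure comes from.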

Let me know what you think

Thanks

Neo

art297
Opal | Level 21

Neo,

There are so many errors there it is hard to say where to begin.

The length error on the daily file may not be relevant, but you would have to check the file to ensure that the currency information wasn't lost by changing the length to 7.

None of the Reuters datetime info was created, thus I have to think that you used a file with a different structure than the one you originally posted.  I don't have time to repeat the process but, when I originally tested the code, I recall that the conversion to dt worked.  Your conversion obviously didn't, and that would be critical to the effort.

The out of space errors are indicating that you don't have enough disk space to accomplish the task.  That would be a serious problem.

Art

machete
Calcite | Level 5

Hi Art,

When I tried the code in the sample dataset it worked.

I tried once again on a bigger set.

It looks like disk space is the problem; the procedure stopped when the file reached 52 GB. Something must be wrong in the code (??), as both datasets used for solving this problem are around 1 GB.

NOTE: Libref NEO was successfully assigned as follows:
      Engine:        V9
      Physical Name: d:\phd thesis\sas files

ERROR: Insufficient space in file NEO.WANTTEST.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 479 observations read from the data set WORK.CHFDAILY.
WARNING: The data set NEO.WANTTEST may be incomplete.  When this step was stopped there were
         6919415 observations and 62 variables.
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be
      shifted by the "BEST" format.
NOTE: DATA statement used (Total process time):
      real time           9:43:56.14
      cpu time            2:46.42


Any suggestions?


Neo

art297
Opal | Level 21

If the format of the Reuters file is the same as the test data, then why is the datetime variable set to missing for every record?  From your log:

394  data chfreuters (drop=date_G_ time_G_);
395    set neo.chfreuters;
396    format dt2 datetime19.;
397    dt2=dhms(datepart(date_G_),hour(time_G_),minute(time_G_),second(time_G_));
398  run;

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      4126234 at 397:7    4126234 at 397:12
NOTE: There were 4126234 observations read from the data set NEO.CHFREUTERS.
NOTE: The data set WORK.CHFREUTERS has 4126234 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           35.60 seconds
      cpu time            6.52 seconds

I.e., every single record was assigned a missing value.


machete
Calcite | Level 5

Hi,

I checked the Reuters source file again; there are no dates missing. The missing values were in the test file, and the errors shown came from two subsequent estimations in SAS, so please ignore them.  I think the issue here is disk space and how to go about it - see my last post.

Neo

LinusH
Tourmaline | Level 20

I agree with Art; it's hard for us to evaluate whether your errors could impact your problem or not.

I haven't gone through every post here, but I wonder about the BY _all_;

Depending on the number of columns, this could be very tough to sort...

Do you really need to remove duplicates this way (are you sure that they exist)?

If yes, you need to find some more resources on disk. If you can assign more memory (MEMSIZE, SORTSIZE), it could lift some burden off I/O constraints.

Data never sleeps
art297
Opal | Level 21

Neo,

Check the Reuters file, again, but this time check for missing times.  In one of your recent notes you indicated that times were only randomly assigned.  Since the dt assignment formula had missing values assigned FOR EVERY RECORD, something is either missing or not of the form the code is expecting.

Regardless, if there are any missing times, that would likely invalidate the approach.

machete
Calcite | Level 5

Dear Art,

I am coming back once again to the open question from this post:

https://communities.sas.com/message/154166#154166

I used the code from the link below, which shortened my Reuters dataset substantially - up to 5 times fewer observations.

https://communities.sas.com/message/155689

Then I tried to re-run your solution (this post), but my PC was executing for a couple of hours before I cancelled the statements. Last time the problem was disk space; this time the execution time was simply too long.

Any ideas how I could solve this issue? Maybe I should break down both datasets into one currency per dataset? Is there code that will de-merge and re-merge the results? Otherwise it will take me ages to do that.

Plus, I am not sure whether this will solve my execution problem.

Let me know what you think

Thanks & have a nice day

Neo

machete
Calcite | Level 5

Hello you both,

Thanks for your comments. To avoid confusion:

The reported errors for missing values and the duplicates procedure above are not related to the current problem we are discussing.

When I copy-pasted the errors from SAS related to the problem we are discussing (matching), I included the above errors as well by mistake - they were part of other operations I executed in SAS.

That is why I confirm once again that the final Reuters dataset has no missing values (I cross-checked), and the only reason the code provided as a solution did not execute was limited disk space - the operation stopped when the file was 50 GB, although the datasets used to solve the problem are many times smaller.

Neo


Discussion stats
  • 28 replies
  • 1776 views
  • 3 likes
  • 3 in conversation