art297
Opal | Level 21

I was suggesting making all of the times relative to the closest matching time, based on your comment that "that means the next closest observation might be after one second, but for currencies of small countries the next closest trade might be in 2 mins as these currencies are not traded so often".

That is why I suggested making the categories relative to the first match.  However, I'm a Psychologist, so this is NOT really even close to my areas of expertise.  And, besides, you (not me) are the decision maker.

machete
Calcite | Level 5

Hi Art,

I think this will be easier to solve if we treat it as two sub-problems. The first problem is matching each datetime (in dataset A) to the closest datetime (in dataset B) that is greater than or equal to it. The second problem is then matching each datetime (in dataset A) against the dataset B datetime "+30 seconds / 1 min / 30 mins / 60 mins / 24 hrs as discussed in earlier posts", which will give us 5 more variables in the output.
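For illustration only, the two sub-problems can be sketched in Python on sorted timestamps (this is a hypothetical sketch, not the SAS solution; the function name, window list, and sample data are all made up here):

```python
import bisect
from datetime import datetime, timedelta

# Window offsets as described above: 30 s, 1 min, 30 min, 60 min, 24 h.
WINDOWS = [timedelta(seconds=30), timedelta(minutes=1),
           timedelta(minutes=30), timedelta(minutes=60),
           timedelta(hours=24)]

def match_windows(a_time, b_times):
    """b_times must be sorted ascending. Returns the first B record at or
    after a_time (sub-problem 1), then the first B record at or after each
    window offset (sub-problem 2); None where no B record qualifies."""
    out = []
    for offset in [timedelta(0)] + WINDOWS:
        i = bisect.bisect_left(b_times, a_time + offset)
        out.append(b_times[i] if i < len(b_times) else None)
    return out

b = [datetime(2013, 1, 1, 12, 0, s) for s in (5, 20, 50)]
print(match_windows(datetime(2013, 1, 1, 12, 0, 0), b))
```

With both datasets sorted by datetime, each lookup is a binary search, so the whole pass stays cheap even on millions of rows.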


If you already have code in mind, I would suggest posting it here so that I can test it; I think it will be easier to finalize the solution in small iterations.


Neo

art297
Opal | Level 21

Your call!  In your original post, you had mentioned that some code that had been offered to solve a similar problem might be just what you were looking for.  I still don't understand what you want for the initial match, but here is a modified version of the code that PGStats had offered.  It simply gets the first Reuters record that is greater than or equal to the chfdaily record.  It assumes that the Reuters data is sorted in datetime order:

libname art "c:\art";

data chfdaily2 (drop=value_date newtime);
  length Cur1Cur2 $7;
  format Cur1Cur2 $7.;
  informat Cur1Cur2 $7.;
  set art.chfdaily2;
  format dt1 datetime19.;
  dt1=dhms(input(value_date,ddmmyy10.),
     hour(input(newtime,time8.)),
     minute(input(newtime,time8.)),
     second(input(newtime,time8.)));
run;

proc sort data=chfdaily2;
  by dt1;
run;

data chfreuters3 (drop=date_G_ time_G_);
  set art.chfreuters3;
  format dt2 datetime19.;
  dt2=dhms(datepart(date_G_),hour(time_G_),minute(time_G_),second(time_G_));
run;

data want;
  retain start_rpointer;
  set chfdaily2;
  if _n_ eq 1 then start_rpointer=1;
  current_rpointer=start_rpointer;
  do while (dt1 ge dt2 and not have2end);
    set chfreuters3 point=current_rpointer end=have2end;
    if dt2 ge dt1 and Cur1Cur2 eq _ric2 then do;
      output;
      start_rpointer=current_rpointer;
    end;
    else do;
      current_rpointer+1;
    end;
  end;
run;
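For readers less familiar with the DATA step, the forward-pointer idea it uses can be sketched roughly in Python (an illustration under the same assumptions, not a line-by-line translation; the data shown is hypothetical):

```python
# Rough sketch of the forward-pointer scan: for each A record, advance a
# saved pointer through B until a record at or after the A time with a
# matching currency is found. Both lists must be sorted by time, so the
# pointer never needs to move backwards.
def first_match(a_records, b_records):
    matches = []
    start = 0  # retained across A records, like start_rpointer
    for a_time, a_cur in a_records:
        i = start
        while i < len(b_records):
            b_time, b_cur = b_records[i]
            if b_time >= a_time and b_cur == a_cur:
                matches.append((a_time, b_time))
                start = i  # later A times cannot match earlier B records
                break
            i += 1
    return matches

a = [(10, "CHF=")]
b = [(5, "CHF="), (12, "EUR="), (15, "CHF=")]
print(first_match(a, b))  # expect [(10, 15)]
```

Because the saved pointer only ever moves forward, the scan over B is amortized linear rather than quadratic, which is what makes the approach feasible on large files.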

machete
Calcite | Level 5

Dear Art,

Thanks for the reply. This code worked for doing the first match. However, I am facing two problems:

1. When I try to run this on a larger sample - one currency (14,000 observations in the first dataset to be matched against 1.5 million from Reuters) for the whole period (68 days) - the system crashes after two hours of processing. I tried this a couple of times with other combinations of currencies, but it still crashes. Is there a way we can adjust the code to execute stepwise, day by day (which would mean breaking the dataset into 68 days automatically and then re-merging it)? Breaking down the dataset manually into days and currencies would probably take me days to do. Do you have any other possible solutions?

2. Can you provide an update solving my second problem as well - matching the Reuters records which are 30, 60, 720, etc. seconds further in time than the record in Dataset A? Or shall we find a stable solution for point no. 1 above first and then deal with this one?

Thanks

Neo

art297
Opal | Level 21

Without knowing why it crashed, or actually seeing the data, it is impossible to say.  Given that the code is rather straightforward, my guess would be that the Reuters data is not in datetime order or, possibly, that it is in received-datetime order but has to be adjusted for timezones in order for the datetimes to be sequentially ordered.  Have you manually looked at a section of the data (say, possibly, a one-hour span in the middle of some weekday)?

From what you would observe, the solution may be obvious.

machete
Calcite | Level 5

Hi Art,

The Reuters data are arranged by currency and then by date; however, I had already created a new file containing only one currency when running the code, thus with ordered dates. The date format is standard, the total Reuters dataset is based on GMT, and on the random days I checked, records are ordered by time, beginning at 00:01 and ending at 23:59.

Below are the error messages I received when executing the code (I think also when executing one additional command). Do you have an idea what might be the problem?

WARNING: Multiple lengths were specified for the variable Cur1Cur2 by input data set(s). This
         may cause truncation of data.
NOTE: There were 16529 observations read from the data set NEO.CHFDAILY.
NOTE: The data set WORK.CHFDAILY has 16529 observations and 42 variables.
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be
      shifted by the "BEST" format.
NOTE: DATA statement used (Total process time):
      real time           0.55 seconds
      cpu time            0.59 seconds

389
390  proc sort data=chfdaily;
391    by dt1;
392  run;

NOTE: There were 16529 observations read from the data set WORK.CHFDAILY.
NOTE: The data set WORK.CHFDAILY has 16529 observations and 42 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           0.40 seconds
      cpu time            0.45 seconds

393
394  data chfreuters (drop=date_G_ time_G_);
395    set neo.chfreuters;
396    format dt2 datetime19.;
397    dt2=dhms(datepart(date_G_),hour(time_G_),minute(time_G_),second(time_G_));
398  run;

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      4126234 at 397:7    4126234 at 397:12
NOTE: There were 4126234 observations read from the data set NEO.CHFREUTERS.
NOTE: The data set WORK.CHFREUTERS has 4126234 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           35.60 seconds
      cpu time            6.52 seconds

399
400  data want;
401    retain start_rpointer;
402    set chfdaily;
403    if _n_ eq 1 then start_rpointer=1;
404    current_rpointer=start_rpointer;
405    do while (dt1 ge dt2 and not have2end);
406      set chfreuters point=current_rpointer end=have2end;
407      if dt2 ge dt1 and Cur1Cur2 eq _ric2 then do;
408        output;
409        start_rpointer=current_rpointer;
410      end;
411      else do;
412        current_rpointer+1;
413      end;
414    end;
415  run;

NOTE: The DATA step has been abnormally terminated.
NOTE: There were 1 observations read from the data set WORK.CHFDAILY.
WARNING: The data set WORK.WANT may be incomplete. When this step was stopped there were 0
         observations and 63 variables.
WARNING: Data set WORK.WANT was not replaced because this step was stopped.
NOTE: DATA statement used (Total process time):
      real time           1:45:49.23
      cpu time            1:44:46.60

416
417
418  PROC SORT
419  data=neo.cobasetneo NODUP;
420  BY _all_;
421  RUN;

ERROR: No disk space is available for the write operation.  Filename =
       C:\Users\Ali\AppData\Local\Temp\SAS Temporary
       Files\SAS_util00010000218C_Ali-PC\ut218C000002.utl.
ERROR: Failure while attempting to write page 910 of sorted run 120.
ERROR: Failure while attempting to write page 268660 to utility file 1.
ERROR: Failure encountered while creating initial set of sorted runs.
ERROR: Failure encountered during external sort.
ERROR: Sort execution failure.
NOTE: The SAS System stopped processing this step because of errors.
NOTE: There were 1890001 observations read from the data set NEO.COBASETNEO.
WARNING: The data set NEO.COBASETNEO may be incomplete. When this step was stopped there were
         0 observations and 43 variables.
WARNING: Data set NEO.COBASETNEO was not replaced because this step was stopped.
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be
      shifted by the "BEST" format.
NOTE: PROCEDURE SORT used (Total process time):
      real time           28:12.22
      cpu time            3:02.64

Do you think that making the datasets shorter (a smaller number of variables) for the matching procedure and then re-merging, e.g. based on the deal_Id, would also be an option?
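The slim-then-re-merge idea can be sketched in Python for illustration (all field names here, including deal_id, dt, cur, and amount, are hypothetical stand-ins for the real variables):

```python
# Hypothetical sketch: run the matching on a slim copy carrying only the
# key fields, then join the full records back by deal_id afterwards.
full_a = [
    {"deal_id": 1, "dt": 100, "cur": "CHF=", "amount": 5.0},
    {"deal_id": 2, "dt": 200, "cur": "CHF=", "amount": 7.5},
]

# Step 1: slim copy with only what the matching step needs.
slim = [{"deal_id": r["deal_id"], "dt": r["dt"], "cur": r["cur"]}
        for r in full_a]

# ... matching runs on `slim`, producing (deal_id, matched_dt) pairs ...
matched = [(1, 105), (2, 203)]

# Step 2: re-merge the matched times onto the full records by deal_id.
by_id = {r["deal_id"]: r for r in full_a}
result = [dict(by_id[d], matched_dt=m) for d, m in matched]
print(result[0]["matched_dt"])  # 105
```

Dropping unneeded variables before the match shrinks every intermediate file the step has to write, which is exactly where the disk-space pressure comes from.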

Let me know what you think

Thanks

Neo

art297
Opal | Level 21

Neo,

There are so many errors there it is hard to say where to begin.

The length error on the daily file may not be relevant, but you would have to check the file to ensure that the currency information wasn't lost by changing the length to 7.

None of the Reuters datetime info was created, thus I have to think that you used a file with a different structure than the one you originally posted.  I don't have time to repeat the process but, when I originally tested the code, I recall that the conversion to dt worked.  Your conversion obviously didn't, and that would be critical to the effort.

The out of space errors are indicating that you don't have enough disk space to accomplish the task.  That would be a serious problem.

Art

machete
Calcite | Level 5

Hi Art,

When I tried the code in the sample dataset it worked.

I tried once again on a bigger set.

It looks like disk space is the problem; the procedure stopped when the file reached 52 GB. Something must be wrong in the code (??), as both datasets used for solving this problem are around 1 GB.

NOTE: Libref NEO was successfully assigned as follows:
      Engine:        V9
      Physical Name: d:\phd thesis\sas files

ERROR: Insufficient space in file NEO.WANTTEST.DATA.
NOTE: The DATA step has been abnormally terminated.
NOTE: There were 479 observations read from the data set WORK.CHFDAILY.
WARNING: The data set NEO.WANTTEST may be incomplete.  When this step was stopped there were
         6919415 observations and 62 variables.
NOTE: At least one W.D format was too small for the number to be printed. The decimal may be
      shifted by the "BEST" format.
NOTE: DATA statement used (Total process time):
      real time           9:43:56.14
      cpu time            2:46.42


Any suggestions?


Neo

art297
Opal | Level 21

If the format of the Reuters file is the same as the test data, then why is the datetime variable set to missing for every record?  From your log:

394  data chfreuters (drop=date_G_ time_G_);
395    set neo.chfreuters;
396    format dt2 datetime19.;
397    dt2=dhms(datepart(date_G_),hour(time_G_),minute(time_G_),second(time_G_));
398  run;

NOTE: Missing values were generated as a result of performing an operation on missing values.
      Each place is given by: (Number of times) at (Line):(Column).
      4126234 at 397:7    4126234 at 397:12
NOTE: There were 4126234 observations read from the data set NEO.CHFREUTERS.
NOTE: The data set WORK.CHFREUTERS has 4126234 observations and 21 variables.
NOTE: DATA statement used (Total process time):
      real time           35.60 seconds
      cpu time            6.52 seconds

I.e., every single record was assigned a missing value.


machete
Calcite | Level 5

Hi,

I checked the Reuters source file again; there are no dates missing. The missing values were in the test file, and the errors shown came from two subsequent estimations in SAS, so please ignore them.  I think the issue here is disk space and how to go about it - see my last post.

Neo

LinusH
Tourmaline | Level 20

I agree with Art; it's hard for us to evaluate whether your errors could impact your problem or not.

I haven't gone through every post here, but I wonder about the BY _all_;

Depending on the number of columns, this could be very tough to sort...

Do you really need to remove duplicates this way (are you sure that they exist)?

If yes, you need to find some more resources on disk. If you can assign more memory (MEMSIZE, SORTSIZE), it could lift some burden off I/O constraints.

Data never sleeps
art297
Opal | Level 21

Neo,

Check the Reuters file, again, but this time check for missing times.  In one of your recent notes you indicated that times were only randomly assigned.  Since the dt assignment formula had missing values assigned FOR EVERY RECORD, something is either missing or not of the form the code is expecting.

Regardless, if there are any missing times, that would likely invalidate the approach.

machete
Calcite | Level 5

Dear Art,

I am coming back once again to the open question from this post:

https://communities.sas.com/message/154166#154166

I used the code from the link below, which shortened my Reuters dataset substantially - up to 5 times fewer observations.

https://communities.sas.com/message/155689

Then I tried to re-run your solution (this post), but my PC was executing for a couple of hours before I cancelled the statements. Last time the problem was disk space; this time the execution time was simply too long.

Any ideas how I could solve this issue? Maybe I should break down both datasets into one currency per dataset? Is there code that will de-merge and re-merge the results? Otherwise it will take me ages to do that.

Plus, I am not sure whether this will solve my execution problem.

Let me know what you think

Thanks & have a nice day

Neo

machete
Calcite | Level 5

Hello you both,

Thanks for your comments. To avoid confusion:

The reported errors for missing values and the duplicates procedure above are not related to the current problem we are discussing.

When I copy-pasted the errors from SAS related to the problem we are discussing (matching), I included the above errors as well by mistake - they were part of other operations I executed in SAS.

That is why I confirm once again that the final Reuters dataset has no missing values (I cross-checked), and the only reason the code provided as a solution did not execute was limited disk space - the operation stopped when the file was 50 GB, although the datasets used to solve the problem are many times smaller.

Neo


Discussion stats
  • 28 replies
  • 1776 views
  • 3 likes
  • 3 in conversation