Solved: Re: Working with Two Datasets

AbuChowdhury · Posted 11-15-2016 04:04 PM

Dear Experts,

I have two datasets which I attached: one is trade_data and another is announce_data. I need to creat a new dataset as following: if the tradedate of a cusip (i.e. firm) in the "trade_data" is within ±1 day of the announce date in the "announce_data" file, then the observation of that tradeday of that firm will have to be removed. For example, a firm has an announce date on 16 november 2004 in "announce_data". If that firm has a trade date either on 15 november 2004 or 16 november 2004 or 17 november 2004 then the observation of that trade date will have to be removed.

ChrisNZ · Posted 11-15-2016 04:31 PM

Like this?

data ANNOUNCE;
input CUSIP ANNOUNCE_DATE : ddmmyy10.;
cards;
00036110 08/01/2004
00036110 29/06/2004
00036110 22/09/2004
00036110 17/12/2004
00036110 16/03/2005
run;

data TRADE;
input TRADE_DATE ddmmyy10. YEAR CUSIP ;
cards;
05/01/2004 2004 00036110
05/01/2004 2004 00036110
06/01/2004 2004 00036110
07/01/2004 2004 00036110
23/01/2004 2004 00036110
run;

data WANT;
  set TRADE;
  if _N_=1 then do;
    dcl hash ANNOUNCE(dataset:'ANNOUNCE');
    ANNOUNCE.definekey('CUSIP','ANNOUNCE_DATE');
    ANNOUNCE.definedone();
    call missing (ANNOUNCE_DATE);
  end;
  if ANNOUNCE.check(key:CUSIP, key:TRADE_DATE-1) =0 
   | ANNOUNCE.check(key:CUSIP, key:TRADE_DATE  ) =0 
   | ANNOUNCE.check(key:CUSIP, key:TRADE_DATE+1) =0 then delete;
run;

High-Performance SAS Coding - Third Edition

View solution in original post

ChrisNZ · Posted 11-15-2016 04:31 PM

Like this?

data ANNOUNCE;
input CUSIP ANNOUNCE_DATE : ddmmyy10.;
cards;
00036110 08/01/2004
00036110 29/06/2004
00036110 22/09/2004
00036110 17/12/2004
00036110 16/03/2005
run;

data TRADE;
input TRADE_DATE ddmmyy10. YEAR CUSIP ;
cards;
05/01/2004 2004 00036110
05/01/2004 2004 00036110
06/01/2004 2004 00036110
07/01/2004 2004 00036110
23/01/2004 2004 00036110
run;

data WANT;
  set TRADE;
  if _N_=1 then do;
    dcl hash ANNOUNCE(dataset:'ANNOUNCE');
    ANNOUNCE.definekey('CUSIP','ANNOUNCE_DATE');
    ANNOUNCE.definedone();
    call missing (ANNOUNCE_DATE);
  end;
  if ANNOUNCE.check(key:CUSIP, key:TRADE_DATE-1) =0 
   | ANNOUNCE.check(key:CUSIP, key:TRADE_DATE  ) =0 
   | ANNOUNCE.check(key:CUSIP, key:TRADE_DATE+1) =0 then delete;
run;

High-Performance SAS Coding - Third Edition

AbuChowdhury · Posted 11-16-2016 05:40 AM

Thanks a lot. I think it works. By the by, is it possible to solve by using proc sql?

ChrisNZ · Posted 11-17-2016 12:05 AM

Like this?


proc sql;
  select *    
  from TRADE
  where not exists(select * 
                   from ANNOUNCE 
                   where TRADE.CUSIP = ANNOUNCE.CUSIP
                     and TRADE_DATE-1 <= ANNOUNCE_DATE <= TRADE_DATE+1 );
quit;

High-Performance SAS Coding - Third Edition

AbuChowdhury · Posted 11-17-2016 10:46 AM

Hi,

SAS takes very long time if I use the proc sql. Even I needed to terminate the current process because SAS stopped working.

ChrisNZ · Posted 11-17-2016 03:52 PM

Yes the exists function is really slow in SAS, and it shows on large tables.

Hash tables are much faster.

So have your requirements changed? Is it now trading days and not calendar days?

If so you need a list of tradings days, your TRADE table doesn't have it.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 11-17-2016 06:08 PM

A left join should be faster than exists()

proc sql;
select TRADE.CUSIP, TRADE_DATE
from TRADE
    left join
       ANNOUNCE
    on TRADE.CUSIP = ANNOUNCE.CUSIP
group by 1, 2
having min(abs(sum(TRADE_DATE,-ANNOUNCE_DATE))) >1;
quit;

High-Performance SAS Coding - Third Edition

AbuChowdhury · Posted 11-18-2016 02:28 AM

My trade table has trading days (the variable tradedate), not calendar days. So are your codes based on calendar days? My requirement was: if the tradedate in the Trade table is within ±1 calendar days of the announcement date in the Announce table then that traddate in the Trade table will have to removed.

mkeintz · Posted 11-16-2016 09:26 PM

It's written as two steps, but it's really only one pass of the data since ANN_WINDOW is a data seet VIEW, not a data set FILE:

/* Make a data set view of trade dates that should be excluded*/
data ann_window/view=ann_window;
  set ann;
  do trade_date=announce_date-1 to announce_date+1;
    output;
  end;
run;

data want;
  merge trades ann_window (in=inw);
  by cusip trade_date;
  if inw=0;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

AbuChowdhury · Posted 11-17-2016 10:44 AM

Hi,

For the second datastep in your first reply, merging does not work since the two datasets need to be sorted at first. Since ann_window is a view table it cannot be sorted. So what can be done?

mkeintz · Posted 11-17-2016 11:28 AM

Your annoncements data set is not likely to be large, so sort it before making the windows. I assume your trades dataset is already sorted by cusip/date.

MK

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

mkeintz · Posted 11-17-2016 08:41 AM

Neither the PRoC SQL nor the hash solution took advantage of the sorted order of your data sets, which is why I offered the solution using the data set view.

However, a more challenging (and common problem for financial researchers) is to elminate a three-trading-windows centered on the announcment date, rather than a three calendar-day solutions. this wold make the sql and hash a good deal more complicated, but the data-set-view technique holds up. The main change below is the second SET statement in the DATA WANT step. It has the FIRSTOBS=2 parement, used to look ahead one record, which in turn allows determination of whether the record-in-hand is one tradyiing day prior to an announcement.

data ta / view=ta;
  merge Trades  Announce (in=ina rename=(announce_date=trade_date));
  by cusip trade_date;
  in_a=ina;
run;

data want;
  set ta;
  by cusip;

  /* read ahead one record */
  if eof=0 then set ta (firstobs=2 keep=in_a rename=(in_a=next_in_a)) end=eof;
  else next_in_a=0;

  if in_a or
     (lag(in_a) and not(first.cusip)) or 
     (next_in_a and not(last.cusip))
  then delete;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

AbuChowdhury · Posted 11-17-2016 11:34 AM

Following errors are shown for those codes:

ERROR 22-7: Invalid option name END.

ERROR 6-185: Missing ')' parenthesis for data set option list.

ERROR 79-322: Expecting a ).

mkeintz · Posted 11-17-2016 11:44 AM

Forgot a closing parens. I've edited the program.Try it again.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

AbuChowdhury · Posted 11-17-2016 12:00 PM

Now it shows the following error:

ERROR: File WORK.VTEMP.DATA does not exist.

Registration is open

SAS Training: Just a Click Away