BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
MadQuant
Obsidian | Level 7

Hi all,

 

I am new to SAS.

My sample data looks like below. (I couldn't figure out the data step to re-create the sample data, so refer to the screenshot please)

 

data testdata;
infile datalines dsd;
input DATE TIME_M EX SYM_ROOT SYM_SUFFIX TR_SCOND SIZE PRICE TR_STOPIND TR_CORR TR_SEQNUM TR_SOURCE TR_RF ;
datalines;
20141224, 9:30:01.006, N, A, , O ,11203, 41.4200, N, 00, 1112, C,
20141224, 9:30:02.844, N, A, , I , 17, 41.4300, N, 00, 1564, C,
20141224, 9:30:05.418, D, A, , @ ,100, 41.4300, N, 00, 1770, C, T
20141224, 9:30:05.419, D, A, , @ ,100, 41.4100, N, 00, 1771, C, T
20141224, 9:30:05.424, D, A, , @ ,100, 41.4100, N, 00, 1772, C, T
20141224, 9:30:05.449, D, A, , @ ,100, 41.4200, N, 00, 1780, C, T
20141224, 9:30:05.565, D, A, , @ ,100, 41.4100, N, 00, 1794, C, T
20141224, 9:30:12.057, N, A, , FI, 4, 41.5000, N, 00, 2148, C,
20141224, 9:30:14.066, D, A, , I , 33, 41.4201, N, 00, 2202, C, T
20141224, 9:30:37.812, N, A, , FI, 37, 41.5000, N, 00, 3019, C,
20141224, 9:31:00.744, T, A, , @ ,100, 41.5000, N, 00, 3900, C,
20141224, 9:31:00.744, T, A, , @ ,100, 41.5000, N, 00, 3901, C,
20141224, 9:31:00.744, T, A, , Q ,100, 41.5000, N, 00, 3902, C,
20141224, 9:31:00.744, K, A, , F ,100, 41.5000, N, 00, 3903, C,
20141224, 9:31:00.744, Z, A, , F ,100, 41.5000, N, 00, 3904, C,
20141224, 9:31:00.809, T, A, , FI, 15, 41.5300, N, 00, 3905, C,
20141224, 9:31:00.853, D, A, , @ ,100, 41.5000, N, 00, 3908, C, T
20141224, 9:31:41.154, P, A, , Q ,150, 41.5250, N, 00, 6211, C,
20141224, 9:31:41.154, P, A, , O ,150, 41.5250, N, 00, 6212, C,
20141224, 9:31:41.154, P, A, , F ,150, 41.5250, N, 00, 6213, C,
20141224, 9:31:41.154, T, A, , F ,100, 41.5200, N, 00, 6214, C,
20141224, 9:31:41.154, T, A, , FI, 50, 41.5200, N, 00, 6215, C,
20141224, 9:31:41.155, T, A, , FI, 50, 41.5200, N, 00, 6216, C,
20141224, 9:31:41.155, P, A, , F ,100, 41.5200, N, 00, 6217, C,
20141224, 9:31:41.155, T, A, , FI, 37, 41.5200, N, 00, 6218, C,
20141224, 9:31:41.155, K, A, , F ,100, 41.5200, N, 00, 6219, C,
20141224, 9:31:41.155, P, A, , F ,100, 41.5200, N, 00, 6220, C,
20141224, 9:31:41.155, P, A, , FI, 50, 41.5200, N, 00, 6221, C,
20141224, 9:31:41.155, P, A, , FI, 50, 41.5200, N, 00, 6222, C,
20141224, 9:31:41.155, T, A, , I , 63, 41.5200, N, 00, 6223, C,
20141224, 9:31:41.155, T, A, , @ ,137, 41.5200, N, 00, 6224, C,
20141224, 9:31:41.155, T, A, , FI, 78, 41.5200, N, 00, 6225, C,
20141224, 9:31:41.155, P, A, , @ ,100, 41.5100, N, 00, 6226, C,
20141224, 9:31:41.155, P, A, , @ ,100, 41.5100, N, 00, 6227, C,
20141224, 9:31:41.155, P, A, , @ ,100, 41.5100, N, 00, 6228, C,
20141224, 9:31:41.156, K, A, , F ,100, 41.5100, N, 00, 6229, C,
20141224, 9:31:41.156, T, A, , F ,100, 41.5100, N, 00, 6230, C,
20141224, 9:31:41.156, P, A, , FI, 22, 41.5000, N, 00, 6231, C,
20141224, 9:31:41.157, K, A, , FI, 22, 41.5000, N, 00, 6232, C,
20141224, 9:31:41.158, P, A, , FI, 78, 41.5000, N, 00, 6233, C,
20141224, 9:31:41.158, P, A, , F ,100, 41.5000, N, 00, 6234, C,
20141224, 9:31:41.158, T, A, , @ ,100, 41.5000, N, 00, 6235, C,
20141224, 9:31:41.158, T, A, , @ ,100, 41.5000, N, 00, 6236, C,
20141224, 9:31:41.158, T, A, , F ,100, 41.5000, N, 00, 6237, C,
20141224, 9:31:41.158, K, A, , FI, 78, 41.5000, N, 00, 6238, C,
20141224, 9:31:41.158, K, A, , F ,100, 41.5000, N, 00, 6239, C,
20141224, 9:31:41.201, D, A, , @ ,100, 41.5000, N, 00, 6240, C, T
20141224, 9:31:41.229, P, A, , F ,100, 41.4800, N, 00, 6242, C,
20141224, 9:31:41.231, N, A, , F ,100, 41.4600, N, 00, 6244, C,
20141224, 9:31:41.231, T, A, , F ,100, 41.4700, N, 00, 6245, C,
;

 

Trade dataTrade data

 

This is part of WRDS TAQ (Trade and Qoute) data. The entire data set is enormous. For each day, the number of obs is of order of 10 million. So, even working on WRDS cloud server, which is presumably much faster than my local machine, splitting the dataset is mandatory to do any kind of analysis.

 

I am planning to split the dataset in 40 smaller datasets by ticker symbols of stocks which are SYM_ROOT in this data set.

Now I have a master file for each day that contains distinct SYM_ROOT values. Please see below.

 

data test_master;
infile datalines dsd;
input DATE CUSIP SYMBOL_ROOT SYMBOL_SUFFIX LISTED_MARKET TAPE SEC_DESC;
datalines;
DATE CUSIP SYMBOL_ROOT SYMBOL_SUFFIX LISTED_MARKET TAPE SEC_DESC
20141224, 00846U101, A , , N, A, AGILENT TECHNOLOGIES INC.
20141224, 013817101, AA , , N, A, ALCOA INC
20141224, 013817200, AA , PR , A, B, ALCOA INC. $3.75 PREFERRED STOCK
20141224, 013817309, AA , PRB, N, A, ALCOA INC.
20141224, 000307108, AAC , , N, A, AAC HOLDINGS INC.
20141224, 00768Y206, AADR, , P, B, WCM / BNY MELLON FOCUSED GROWTH ADR ETF
20141224, 46429B374, AAIT, , T, C, ISHARES MSCI ALL COUNTRY ASIA INFORMATION TECHNOLOGY ETF
20141224, 02376R102, AAL , , T, C, AMERICAN AIRLINES GROUP INC.
20141224, 02153X108, AAMC, , A, B, ALTISOURCE ASSET MGMT CORP
20141224, 048209100, AAME, , T, C, ATLANTIC AMERICAN CORP
20141224, 002535300, AAN , , N, A, AARON'S INC.
20141224, 03823U102, AAOI, , T, C, APPLIED OPTOELECTRONICS INC.
20141224, 000360206, AAON, , T, C, AAON INC
20141224, 00751Y106, AAP , , N, A, ADVANCE AUTO PARTS INC
20141224, 037833100, AAPL, , T, C, APPLE INC.
20141224, 024013104, AAT , , N, A, AMERICAN ASSETS TRUST INC.
20141224, 020283107, AAU , , A, B, ALMADEN MINERALS LTD.
20141224, 00765F101, AAV , , N, A, ADVANTAGE OIL & GAS LTD.
20141224, 05337G107, AAVL, , T, C, AVALANCHE BIOTECHNOLOGIES, INC
20141224, 049164205, AAWW, , T, C, ATLAS AIR WORLDWIDE HOLDINGS INC.
20141224, 464288182, AAXJ, , T, C, ISHARES MSCI ALL COUNTRY ASIA EX JAPAN ETF
20141224, 01881G106, AB , , N, A, ALLIANCEBERNSTEIN HOLDING L.P.
20141224, G0404E104, ABAC, , T, C, AOXIN TIANLI GROUP INC.
20141224, 002567105, ABAX, , T, C, ABAXIS INC
20141224, 000375204, ABB , , N, A, ABB LTD.
20141224, 00287Y109, ABBV, , N, A, ABBVIE INC.
20141224, 03073E105, ABC , , N, A, AMERISOURCEBERGEN CORP.
20141224, 03076K108, ABCB, , T, C, AMERIS BANCORP
20141224, 13201A107, ABCD, , T, C, CAMBIUM LEARNING GROUP INC.
20141224, 00762W107, ABCO, , T, C, ADVISORY BOARD CO/THE
20141224, 03283P106, ABCW, , T, C, ANCHOR BANCORP WISCONSIN INC
20141224, 01374T102, ABDC, , T, C, ALCENTRA CAPITAL CORP
20141224, 02319V103, ABEV, , N, A, AMBEV S.A.
20141224, 043436104, ABG , , N, A, ASBURY AUTO GROUP AB ADS
20141224, 00289R201, ABGB, , T, C, ABENGOA, S.A. AMERICAN DEPOSITARY SHARES
20141224, 00211Y209, ABIO, , T, C, ARCA BIOPHARMA INC.
20141224, 000957100, ABM , , N, A, ABM INDUSTRIES INC.
20141224, 003654100, ABMD, , T, C, ABIOMED INC
20141224, 038923108, ABR , , N, A, ARBOR REALTY TRUST INC.
20141224, 038923603, ABR , PRA, N, A, ARBOR REALTY TRUST INC.
20141224, 038923702, ABR , PRB, N, A, ARBOR REALTY TRUST INC.
20141224, 038923801, ABR , PRC, N, A, ARBOR REALTY TRUST INC.
20141224, 038923884, ABRN, , N, A, ARBOR REALTY TRUST INC
20141224, 002824100, ABT , , N, A, ABBOTT LABORATORIES
20141224, 05275N205, ABTL, , T, C, AUTOBYTEL INC
20141224, 067901108, ABX , , N, A, BARRICK GOLD CORP.
20141224, G00349103, ABY , , T, C, ABENGOA YIELD PLC
20141224, 004225108, ACAD, , T, C, ACADIA PHARMACEUTICALS INC.
20141224, 02503Y103, ACAS, , T, C, AMERICAN CAPITAL LTD
20141224, 039670104, ACAT, , T, C, ARCTIC CAT INC
;

 

MasterMaster

 

So my plan is as follows.

1. Get total number of SYM_ROOT variable from the master and then group into 40 mutually exclusive groups.

2. Based on those groups, implement analysis on the Trade (and Quote) data, 40 times.

3. Then, combine those results.

 

Below is apparently the 1st step. I am only this far... Oh, and I did not know how to split in 40 groups, so I just tried to use arbitrary cutoff=100 to make the groups.

 

proc sql noprint;
select count(SYMBOL_ROOT)
into :N_STOCK
from test_master
where LISTED_MARKET in ('A' 'N' 'T' 'Q') /* AMEX, NYSE, NASDAQ */
and TAPE = 'A' /* Common stock */
; quit;

%let CUTOFF = 100;
%let N_GROUP = %sysfunc(ceil(&N_stock/&CUTOFF));
%let i=1;
%macro Test;
%do i=1 %to &N_GROUP;
proc sql noprint;
select SYMBOL_ROOT
into :STOCKLIST&i
separated by '" "'
from test_master
where LISTED_MARKET in ('A' 'N' 'T' 'Q') /* AMEX, NYSE, NASDAQ */
and TAPE = 'A' /* Common stock */
and monotonic() between &cutoff*(&i-1)+1 and &cutoff*&i
; quit;
%end;
%mend Test;
%Test;

 

I use SAS 9.4 / Windows 10 Version 6.2.9200.

 

Any inputs or comments are appreciated. I'd be truly grateful for any feedbacks.

If any of you think you need more details on the data to help me, please let me know.

Thank you.

 

1 ACCEPTED SOLUTION

Accepted Solutions
ChrisNZ
Tourmaline | Level 20

To create the 40 datasets you want, you'll have to read the whole big date 40 times. And then write the data. And that's before you do any analysis. Unless your analysis requires sorted data, you may not want to do this (and if it does see my previous point and use SPDE).

 

Instead just run the analysis on the whole dataset using a where clause. A very simple (and perhaps sufficient) way to do this would be to take the first letter of SYM_ROOT.

 

where SYM_ROOT=:'A'

where SYM_ROOT=:'B'

etc

to get 26 subsets of varying sizes.

 

This way, an index on SYM_ROOT, if there is one, can be used for small subsets.

 

Note that 

where SYM_ROOT =: 'A'

and

where SYM_ROOT like 'A%'

can use indexes but

where first(SYM_ROOT)='A'

can't.

View solution in original post

17 REPLIES 17
ChrisNZ
Tourmaline | Level 20

I would not create new datasets. Copying data is very wasteful. 

If you must analyse a subset, just point to the original data set with a where clause.

I suspect you may not even need to do that, depending on the analysis, and a class statement might even do the job, allowing you to process the data set in one pass.

 

If you do start copying data around, and if this is a SAS data set, use this data shuffling cost to recreate the dataset in a sorted order, and maybe indexed. Subsequent accesses of subsets will be much faster.

 

If you must create data subsets, I strongly suggest that you create a SPDE library 

libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary;

and that you use this as your work library.

 

 

 

MadQuant
Obsidian | Level 7

Thank you for the comment.

Upon reading your comment, now I see I have not written my point clearly.

So my goal is to do the same analyses on a huge dataset(A) separately for subsets of that dataset(A), due to the size and capacity.

I tried to the analysis on the whole dataset(A), but the WRDS server disconnected me after couple of hours, and I heard that I have to do it separately.

 

RahulG
Barite | Level 11

I have created a macro to split data. It takes symbol root as an input and creates dataset with name as "Data_AA". Here AA is your symbol root. It will create separate dataset for every symbol root.

 

I have assume column name as SYMBOL_ROOT in testdata. 

 

 

 

 

%macro split_data(_sym_root);
%put symroot val = &_sym_root;
proc sql;
create table Data_&_sym_root. as
select *
from testdata
where SYM_ROOT ="&_sym_root";
quit;

%mend;
filename Split temp;
data _null_;
set  test_master;
file split;
put "%" "split_data(" SYMBOL_ROOT ");";
run;
%inc Split;

 

Just for your knowledge Split file would have call to the macro something like this.

%split_data(A );
%split_data(C );
%split_data(B );
 
 
You can add your analysis in %split macro or may be you may need to create separate macro with same kind of approach.
Reeza
Super User

Spend some time researching BY group processing, otherwise the rest of your project will end up as macros to process each of your smaller data set. I believe the WRDS server is well powered to handle these large data sets - it's designed specifically for that after all.

MadQuant
Obsidian | Level 7
Thank you for the comment.
I tried to the analysis on the whole dataset, but the WRDS server disconnected me after couple of hours, and I heard that I have to do it separately.
So that's why I am trying this way.
I will try to see if BY grouping works.
mkeintz
PROC Star

@MadQuant

 

Question.

  1. What analysis are you trying to do with the trade data?   Which is another way of asking whether you really need to make intermediate data sets.  Or if you do need to make intermediates, you should be sure to create any new variables at the same time.

  2. What is the purpose of the subgroups?  Is it to manage data set size, or is it for analysis purposes?  It appears to be a data set size issue, since your are not using any particular criterion to specify your groups.  I ask this because there may not be any particular reason to divide the data sets.

  3. How are your 40 portfolios being constructed.  Are they simply divided alphabetically by SYM_ROOT/SYM_SUFFIX?   If so you might be able to use BY-group processing in the analysis - alleviating the need to run the analysis 40 times.  Instead one analysis will be run iteratively over 40 by-groups.

  4. I would NOT, repeat NOT, use proc sql (for constructing the 40 subgroups) if you care about the time-ordering of trades within stocks.  PROC SQL does not guarantee preservation of record order (the incoming TAQ data are sorted by SYM_ROOT/SYM_SUFFIX/time).  And yes, you could put an ORDER BY expression in proc sql, but it will require at least one extra unnecessary pass through the data.  Instead you should be using a DATA step.

 

So please provide a more complete description of your research task.  I'm sure we can come up with a relatively efficient solution.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
MadQuant
Obsidian | Level 7

Thank you for the comment.
So my goal is to do the one (or more) analysis (e.g. calculate orderflows and so on) on a huge dataset(A). I tried to the calculation on the whole dataset(A), but the WRDS server disconnected me after couple of hours, and I heard that I have to do it separately. Also, the space assigned to individuals on the cloud is only a few hundered megas or at most a few gigas.

 

I was trying to put SYM_ROOT variables separately in the WHERE clause of the main analysis code, which I posted below. The order or dividing method doesn't really matter as long as it covers the whole data.

 

    data DailyTrade;

        /* Enter Trade file names in YYYYMMDD format for the same dates */
        set ct.ctm_20141224;

		/* Enter the same company tickers as above */
        where /* sym_root in ('AAPL','IBM') and */

        /* Retrieve trades during normal market hours */
        (("9:30:00.000000000"t) <= time_m <= ("16:00:00.000000000"t));
        type='T';
        format date date9.;
        format time_m part_time trf_time TIME20.9;
    run;

 

Thanks for the comment on PROC SQL. I wasn't aware of that.

I am relatively more familiar to SQL so that is the reason I was using it...

mkeintz
PROC Star

TAQ doesn't have order-book data, so I assume that by "order flows" you mean trade-volume over rolling windows of some time length.  Is this correct?  If so, there is no need to make intermediate data sets with all trades for a subset of stock symbols.  One can read in all the CT data, and create (say) 1-minute or 5-minute trade volume values, creating a much smaller data set that the original CTM datasets (which currently time-stamps to the nano-second, although the 2014 data might be millisecond or microsecond). 

 

Regarding being cut off from WRDS:  I take it that you are using pc sas/connect to run you program on WRDS.  If you are getting disconnected, it's probably because some network monitor between WRDS and your PC (probably your school) detects no network traffic in your connection.  That monitor likely decided you've abandoned the session and disconnects your machine from WRDS.  If you're using a machine on your campus network, I'd suggest contacting your network support folks to see about a work-around.

 

Also, (assuming again you are using pc sas/connect) if you have multiple DATA steps and PROC steps between an RSUBMIT and ENDRSUBMIT, then block each single step with RSUBMIT/ENDRSUBMIT.  This will cause more frequent traffic on your sas log (proving you haven't abandoned the connections).  And you can select all the rsubmit/endrsubmit blocks for simultaneous submission, so no additional baby sitting is required vs a single multi-step rsubmit/endrsubmit.

 

If none of this work to keep you connected, consider using sas/studio on WRDS.  Perhaps your network monitor will not apply the same criteria for disconnection.

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
MadQuant
Obsidian | Level 7

For order flows, they are net of buy and sell trading volume, so the first step is to identify the sign (plus or minus) of each trade, which is not offered by TAQ. So at least once, I have to do some kind of manipulations at each trade level to assign the signs. So I think that gets back to applying subset of stock symbols right?

 

Regarding being cut off from WRDS, you are correct. Wrapping with RSUBMIT/ENDRSUBMIT seems to be a nice idea. Also, I did not know that SAS/Studio has different criteria for this issue. 

 

Thank you again for your time.

mkeintz
PROC Star

What criteria will you be using to mark a trade as a buyer-driven or seller-driven transaction?

 

I am aware that one can interleave the quotes file with the trade file, then compare the trade price to some lagged (by a short time interval) BID and ASK prices from the corresponding quotes file.  You should note that the quotes file is a lot larger than the trades file, a gift of high frequency trading techniques.  In the case of 20141224, there are 245 million trades, and about 19.7 billion quotes.

 

Or do you intend to assign buy/sell trades based solely on rolling micro-series of trade prices?

 

Regardless,  I suggest you make a sample of, say, 50 stocks, and then develop your entire sequence of operations.  I mean, after you assign buy/sell designations, then what will you be doing?  Aggregating over time periods?   I raise this issue because you can do a sequence of operations in a single DATA step, possibly yielding a much smaller working data set.

 

Then you can start worrying about whether and how to divide a large data set.

 

 

 

You should just take a sample of, say, 100 stocks, and develop a program t

 

 

with TAQ data is to use the trade file in conjunction with the same-day quote file

 

 

You can not mark trades as buys

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
MadQuant
Obsidian | Level 7

Thanks again for your interest.

I am using Lee and Ready (1991), which is common in academia. The basic idea is that if the trade price is larger (smaller) than the midpoint of bid-ask quote, then it is classified as buy (sell) initiated trade.

 

So, yes, I also need quote data to do this. WRDS offers NBBO (National Best Bid and Offer) data, which also needs to be complemented by quote data, to do this. If you are interested in the details, please look at the link below.

http://www.excelmodeling.com/Holden-and-Jacobsen-Daily-TAQ-and-Monthly-TAQ-Code-2017-01-02.zip

 

So, the code for this job is already there, although some customizing is required. 

The biggest problem is implementing this wisely so that WRDS server does not cut me off and individual cloud storage size issue when the result comes out. That's why I am focusing on doing the job on 40-ish subsets and then aggregating later.

 

So far, my conclusion is to use splitted SYM_ROOT to run the same code multiple times.

Patrick
Opal | Level 21

@MadQuant

If your'e using SAS Studio and timeout is the actual issue then read this post (all of it, not only the solution). 

https://communities.sas.com/t5/SAS-Studio/SAS-Studio-Time-out/td-p/257411 

 

With SAS EG (not sure about SAS Studio) there is also a keepalive parameter which prevents network timeouts.

https://communities.sas.com/t5/Administration-and-Deployment/How-Where-do-I-set-the-keepalive-parame...

 

 

MadQuant
Obsidian | Level 7
I am using Base SAS, but if using SAS Studio helps, I would look into it. Thanks.
ChrisNZ
Tourmaline | Level 20

To create the 40 datasets you want, you'll have to read the whole big date 40 times. And then write the data. And that's before you do any analysis. Unless your analysis requires sorted data, you may not want to do this (and if it does see my previous point and use SPDE).

 

Instead just run the analysis on the whole dataset using a where clause. A very simple (and perhaps sufficient) way to do this would be to take the first letter of SYM_ROOT.

 

where SYM_ROOT=:'A'

where SYM_ROOT=:'B'

etc

to get 26 subsets of varying sizes.

 

This way, an index on SYM_ROOT, if there is one, can be used for small subsets.

 

Note that 

where SYM_ROOT =: 'A'

and

where SYM_ROOT like 'A%'

can use indexes but

where first(SYM_ROOT)='A'

can't.

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 17 replies
  • 7663 views
  • 12 likes
  • 6 in conversation