<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Splitting dataset dynamically using macro in SAS Programming</title>
    <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407507#M99321</link>
    <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/124038"&gt;@MadQuant&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;If you're using SAS Studio and the timeout is the actual issue, then read this post (all of it, not only the solution).&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://communities.sas.com/t5/SAS-Studio/SAS-Studio-Time-out/td-p/257411" target="_blank"&gt;https://communities.sas.com/t5/SAS-Studio/SAS-Studio-Time-out/td-p/257411&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With SAS EG (not sure about SAS Studio) there is also a keepalive parameter which prevents network timeouts.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://communities.sas.com/t5/Administration-and-Deployment/How-Where-do-I-set-the-keepalive-parameter-to-prevent-network/td-p/235250" target="_blank"&gt;https://communities.sas.com/t5/Administration-and-Deployment/How-Where-do-I-set-the-keepalive-parameter-to-prevent-network/td-p/235250&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 26 Oct 2017 00:00:13 GMT</pubDate>
    <dc:creator>Patrick</dc:creator>
    <dc:date>2017-10-26T00:00:13Z</dc:date>
    <item>
      <title>Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406791#M99096</link>
      <description>&lt;P&gt;Hi all,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am new to SAS.&lt;/P&gt;&lt;P&gt;My sample data looks like below. (I couldn't figure out the data step to re-create the sample data, so refer to the screenshot please)&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;data testdata;&lt;BR /&gt;infile datalines dsd;&lt;BR /&gt;input DATE TIME_M EX SYM_ROOT SYM_SUFFIX TR_SCOND SIZE PRICE TR_STOPIND TR_CORR TR_SEQNUM TR_SOURCE TR_RF ;&lt;BR /&gt;datalines;&lt;BR /&gt;20141224, 9:30:01.006, N, A, , O ,11203, 41.4200, N, 00, 1112, C,&lt;BR /&gt;20141224, 9:30:02.844, N, A, , I , 17, 41.4300, N, 00, 1564, C,&lt;BR /&gt;20141224, 9:30:05.418, D, A, , @ ,100, 41.4300, N, 00, 1770, C, T&lt;BR /&gt;20141224, 9:30:05.419, D, A, , @ ,100, 41.4100, N, 00, 1771, C, T&lt;BR /&gt;20141224, 9:30:05.424, D, A, , @ ,100, 41.4100, N, 00, 1772, C, T&lt;BR /&gt;20141224, 9:30:05.449, D, A, , @ ,100, 41.4200, N, 00, 1780, C, T&lt;BR /&gt;20141224, 9:30:05.565, D, A, , @ ,100, 41.4100, N, 00, 1794, C, T&lt;BR /&gt;20141224, 9:30:12.057, N, A, , FI, 4, 41.5000, N, 00, 2148, C,&lt;BR /&gt;20141224, 9:30:14.066, D, A, , I , 33, 41.4201, N, 00, 2202, C, T&lt;BR /&gt;20141224, 9:30:37.812, N, A, , FI, 37, 41.5000, N, 00, 3019, C,&lt;BR /&gt;20141224, 9:31:00.744, T, A, , @ ,100, 41.5000, N, 00, 3900, C,&lt;BR /&gt;20141224, 9:31:00.744, T, A, , @ ,100, 41.5000, N, 00, 3901, C,&lt;BR /&gt;20141224, 9:31:00.744, T, A, , Q ,100, 41.5000, N, 00, 3902, C,&lt;BR /&gt;20141224, 9:31:00.744, K, A, , F ,100, 41.5000, N, 00, 3903, C,&lt;BR /&gt;20141224, 9:31:00.744, Z, A, , F ,100, 41.5000, N, 00, 3904, C,&lt;BR /&gt;20141224, 9:31:00.809, T, A, , FI, 15, 41.5300, N, 00, 3905, C,&lt;BR /&gt;20141224, 9:31:00.853, D, A, , @ ,100, 41.5000, N, 00, 3908, C, T&lt;BR /&gt;20141224, 9:31:41.154, P, A, , Q ,150, 41.5250, N, 00, 6211, C,&lt;BR /&gt;20141224, 9:31:41.154, P, A, , O ,150, 41.5250, N, 00, 6212, C,&lt;BR /&gt;20141224, 9:31:41.154, P, A, , F ,150, 41.5250, N, 00, 
6213, C,&lt;BR /&gt;20141224, 9:31:41.154, T, A, , F ,100, 41.5200, N, 00, 6214, C,&lt;BR /&gt;20141224, 9:31:41.154, T, A, , FI, 50, 41.5200, N, 00, 6215, C,&lt;BR /&gt;20141224, 9:31:41.155, T, A, , FI, 50, 41.5200, N, 00, 6216, C,&lt;BR /&gt;20141224, 9:31:41.155, P, A, , F ,100, 41.5200, N, 00, 6217, C,&lt;BR /&gt;20141224, 9:31:41.155, T, A, , FI, 37, 41.5200, N, 00, 6218, C,&lt;BR /&gt;20141224, 9:31:41.155, K, A, , F ,100, 41.5200, N, 00, 6219, C,&lt;BR /&gt;20141224, 9:31:41.155, P, A, , F ,100, 41.5200, N, 00, 6220, C,&lt;BR /&gt;20141224, 9:31:41.155, P, A, , FI, 50, 41.5200, N, 00, 6221, C,&lt;BR /&gt;20141224, 9:31:41.155, P, A, , FI, 50, 41.5200, N, 00, 6222, C,&lt;BR /&gt;20141224, 9:31:41.155, T, A, , I , 63, 41.5200, N, 00, 6223, C,&lt;BR /&gt;20141224, 9:31:41.155, T, A, , @ ,137, 41.5200, N, 00, 6224, C,&lt;BR /&gt;20141224, 9:31:41.155, T, A, , FI, 78, 41.5200, N, 00, 6225, C,&lt;BR /&gt;20141224, 9:31:41.155, P, A, , @ ,100, 41.5100, N, 00, 6226, C,&lt;BR /&gt;20141224, 9:31:41.155, P, A, , @ ,100, 41.5100, N, 00, 6227, C,&lt;BR /&gt;20141224, 9:31:41.155, P, A, , @ ,100, 41.5100, N, 00, 6228, C,&lt;BR /&gt;20141224, 9:31:41.156, K, A, , F ,100, 41.5100, N, 00, 6229, C,&lt;BR /&gt;20141224, 9:31:41.156, T, A, , F ,100, 41.5100, N, 00, 6230, C,&lt;BR /&gt;20141224, 9:31:41.156, P, A, , FI, 22, 41.5000, N, 00, 6231, C,&lt;BR /&gt;20141224, 9:31:41.157, K, A, , FI, 22, 41.5000, N, 00, 6232, C,&lt;BR /&gt;20141224, 9:31:41.158, P, A, , FI, 78, 41.5000, N, 00, 6233, C,&lt;BR /&gt;20141224, 9:31:41.158, P, A, , F ,100, 41.5000, N, 00, 6234, C,&lt;BR /&gt;20141224, 9:31:41.158, T, A, , @ ,100, 41.5000, N, 00, 6235, C,&lt;BR /&gt;20141224, 9:31:41.158, T, A, , @ ,100, 41.5000, N, 00, 6236, C,&lt;BR /&gt;20141224, 9:31:41.158, T, A, , F ,100, 41.5000, N, 00, 6237, C,&lt;BR /&gt;20141224, 9:31:41.158, K, A, , FI, 78, 41.5000, N, 00, 6238, C,&lt;BR /&gt;20141224, 9:31:41.158, K, A, , F ,100, 41.5000, N, 00, 6239, C,&lt;BR /&gt;20141224, 9:31:41.201, D, A, 
, @ ,100, 41.5000, N, 00, 6240, C, T&lt;BR /&gt;20141224, 9:31:41.229, P, A, , F ,100, 41.4800, N, 00, 6242, C,&lt;BR /&gt;20141224, 9:31:41.231, N, A, , F ,100, 41.4600, N, 00, 6244, C,&lt;BR /&gt;20141224, 9:31:41.231, T, A, , F ,100, 41.4700, N, 00, 6245, C,&lt;BR /&gt;;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Trade data" style="width: 600px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/16168iA7209F4E73AE5EC1/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="Trade data" /&gt;&lt;span class="lia-inline-image-caption" onclick="event.preventDefault();"&gt;Trade data&lt;/span&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;This is part of WRDS TAQ (Trade and Quote) data. The entire data set is enormous. For each day, the number of obs is on the order of 10 million. So, even working on the WRDS cloud server, which is presumably much faster than my local machine, splitting the dataset is mandatory to do any kind of analysis.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am planning to split the dataset into 40 smaller datasets by ticker symbols of stocks, which are SYM_ROOT in this data set.&lt;/P&gt;&lt;P&gt;Now I have a master file for each day that contains distinct SYM_ROOT values. Please see below.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;data test_master;&lt;BR /&gt;infile datalines dsd;&lt;BR /&gt;input DATE CUSIP SYMBOL_ROOT SYMBOL_SUFFIX LISTED_MARKET TAPE SEC_DESC;&lt;BR /&gt;datalines;&lt;BR /&gt;DATE CUSIP SYMBOL_ROOT SYMBOL_SUFFIX LISTED_MARKET TAPE SEC_DESC&lt;BR /&gt;20141224, 00846U101, A , , N, A, AGILENT TECHNOLOGIES INC.&lt;BR /&gt;20141224, 013817101, AA , , N, A, ALCOA INC&lt;BR /&gt;20141224, 013817200, AA , PR , A, B, ALCOA INC. 
$3.75 PREFERRED STOCK&lt;BR /&gt;20141224, 013817309, AA , PRB, N, A, ALCOA INC.&lt;BR /&gt;20141224, 000307108, AAC , , N, A, AAC HOLDINGS INC.&lt;BR /&gt;20141224, 00768Y206, AADR, , P, B, WCM / BNY MELLON FOCUSED GROWTH ADR ETF&lt;BR /&gt;20141224, 46429B374, AAIT, , T, C, ISHARES MSCI ALL COUNTRY ASIA INFORMATION TECHNOLOGY ETF&lt;BR /&gt;20141224, 02376R102, AAL , , T, C, AMERICAN AIRLINES GROUP INC.&lt;BR /&gt;20141224, 02153X108, AAMC, , A, B, ALTISOURCE ASSET MGMT CORP&lt;BR /&gt;20141224, 048209100, AAME, , T, C, ATLANTIC AMERICAN CORP&lt;BR /&gt;20141224, 002535300, AAN , , N, A, AARON'S INC.&lt;BR /&gt;20141224, 03823U102, AAOI, , T, C, APPLIED OPTOELECTRONICS INC.&lt;BR /&gt;20141224, 000360206, AAON, , T, C, AAON INC&lt;BR /&gt;20141224, 00751Y106, AAP , , N, A, ADVANCE AUTO PARTS INC&lt;BR /&gt;20141224, 037833100, AAPL, , T, C, APPLE INC.&lt;BR /&gt;20141224, 024013104, AAT , , N, A, AMERICAN ASSETS TRUST INC.&lt;BR /&gt;20141224, 020283107, AAU , , A, B, ALMADEN MINERALS LTD.&lt;BR /&gt;20141224, 00765F101, AAV , , N, A, ADVANTAGE OIL &amp;amp; GAS LTD.&lt;BR /&gt;20141224, 05337G107, AAVL, , T, C, AVALANCHE BIOTECHNOLOGIES, INC&lt;BR /&gt;20141224, 049164205, AAWW, , T, C, ATLAS AIR WORLDWIDE HOLDINGS INC.&lt;BR /&gt;20141224, 464288182, AAXJ, , T, C, ISHARES MSCI ALL COUNTRY ASIA EX JAPAN ETF&lt;BR /&gt;20141224, 01881G106, AB , , N, A, ALLIANCEBERNSTEIN HOLDING L.P.&lt;BR /&gt;20141224, G0404E104, ABAC, , T, C, AOXIN TIANLI GROUP INC.&lt;BR /&gt;20141224, 002567105, ABAX, , T, C, ABAXIS INC&lt;BR /&gt;20141224, 000375204, ABB , , N, A, ABB LTD.&lt;BR /&gt;20141224, 00287Y109, ABBV, , N, A, ABBVIE INC.&lt;BR /&gt;20141224, 03073E105, ABC , , N, A, AMERISOURCEBERGEN CORP.&lt;BR /&gt;20141224, 03076K108, ABCB, , T, C, AMERIS BANCORP&lt;BR /&gt;20141224, 13201A107, ABCD, , T, C, CAMBIUM LEARNING GROUP INC.&lt;BR /&gt;20141224, 00762W107, ABCO, , T, C, ADVISORY BOARD CO/THE&lt;BR /&gt;20141224, 03283P106, ABCW, , T, C, ANCHOR BANCORP WISCONSIN 
INC&lt;BR /&gt;20141224, 01374T102, ABDC, , T, C, ALCENTRA CAPITAL CORP&lt;BR /&gt;20141224, 02319V103, ABEV, , N, A, AMBEV S.A.&lt;BR /&gt;20141224, 043436104, ABG , , N, A, ASBURY AUTO GROUP AB ADS&lt;BR /&gt;20141224, 00289R201, ABGB, , T, C, ABENGOA, S.A. AMERICAN DEPOSITARY SHARES&lt;BR /&gt;20141224, 00211Y209, ABIO, , T, C, ARCA BIOPHARMA INC.&lt;BR /&gt;20141224, 000957100, ABM , , N, A, ABM INDUSTRIES INC.&lt;BR /&gt;20141224, 003654100, ABMD, , T, C, ABIOMED INC&lt;BR /&gt;20141224, 038923108, ABR , , N, A, ARBOR REALTY TRUST INC.&lt;BR /&gt;20141224, 038923603, ABR , PRA, N, A, ARBOR REALTY TRUST INC.&lt;BR /&gt;20141224, 038923702, ABR , PRB, N, A, ARBOR REALTY TRUST INC.&lt;BR /&gt;20141224, 038923801, ABR , PRC, N, A, ARBOR REALTY TRUST INC.&lt;BR /&gt;20141224, 038923884, ABRN, , N, A, ARBOR REALTY TRUST INC&lt;BR /&gt;20141224, 002824100, ABT , , N, A, ABBOTT LABORATORIES&lt;BR /&gt;20141224, 05275N205, ABTL, , T, C, AUTOBYTEL INC&lt;BR /&gt;20141224, 067901108, ABX , , N, A, BARRICK GOLD CORP.&lt;BR /&gt;20141224, G00349103, ABY , , T, C, ABENGOA YIELD PLC&lt;BR /&gt;20141224, 004225108, ACAD, , T, C, ACADIA PHARMACEUTICALS INC.&lt;BR /&gt;20141224, 02503Y103, ACAS, , T, C, AMERICAN CAPITAL LTD&lt;BR /&gt;20141224, 039670104, ACAT, , T, C, ARCTIC CAT INC&lt;BR /&gt;;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Master" style="width: 600px;"&gt;&lt;img src="https://communities.sas.com/t5/image/serverpage/image-id/16169i22D7B3F1D61963FC/image-size/large?v=v2&amp;amp;px=999" role="button" title="image.png" alt="Master" /&gt;&lt;span class="lia-inline-image-caption" onclick="event.preventDefault();"&gt;Master&lt;/span&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So my plan is as follows.&lt;/P&gt;&lt;P&gt;1. Get total number of SYM_ROOT variable from the master and then group into 40 mutually exclusive groups.&lt;/P&gt;&lt;P&gt;2. 
Based on&amp;nbsp;those groups, implement analysis on the Trade (and Quote) data, 40 times.&lt;/P&gt;&lt;P&gt;3. Then, combine those results.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Below is apparently the&amp;nbsp;1st step. I am only this far... Oh, and I did not know how to split into 40 groups, so I just tried an arbitrary cutoff of 100 to make the groups.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;proc sql noprint;&lt;BR /&gt;select count(SYMBOL_ROOT)&lt;BR /&gt;into :N_STOCK&lt;BR /&gt;from&amp;nbsp;test_master&lt;BR /&gt;where LISTED_MARKET in ('A' 'N' 'T' 'Q') /* AMEX, NYSE, NASDAQ */&lt;BR /&gt;and TAPE = 'A' /* Common stock */&lt;BR /&gt;; quit;&lt;/P&gt;&lt;P&gt;%let CUTOFF = 100;&lt;BR /&gt;%let N_GROUP = %sysfunc(ceil(&amp;amp;N_stock/&amp;amp;CUTOFF));&lt;BR /&gt;%let i=1;&lt;BR /&gt;%macro Test;&lt;BR /&gt;%do i=1 %to &amp;amp;N_GROUP;&lt;BR /&gt;proc sql noprint;&lt;BR /&gt;select SYMBOL_ROOT&lt;BR /&gt;into :STOCKLIST&amp;amp;i&lt;BR /&gt;separated by '" "'&lt;BR /&gt;from test_master&lt;BR /&gt;where LISTED_MARKET in ('A' 'N' 'T' 'Q') /* AMEX, NYSE, NASDAQ */&lt;BR /&gt;and TAPE = 'A' /* Common stock */&lt;BR /&gt;and monotonic() between &amp;amp;cutoff*(&amp;amp;i-1)+1 and &amp;amp;cutoff*&amp;amp;i&lt;BR /&gt;; quit;&lt;BR /&gt;%end;&lt;BR /&gt;%mend Test;&lt;BR /&gt;%Test;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I use SAS 9.4 / Windows 10 Version 6.2.9200.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Any inputs or comments are appreciated. I'd be truly grateful for any feedback.&lt;/P&gt;&lt;P&gt;If any of you think you need more details on the data to help me, please let me know.&lt;/P&gt;&lt;P&gt;Thank you.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 04:14:04 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406791#M99096</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-24T04:14:04Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406799#M99100</link>
      <description>&lt;P&gt;I would not create new datasets. Copying data is very wasteful.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you must analyse a subset, just point to the original data set with a where clause.&lt;/P&gt;
&lt;P&gt;I suspect you may not even need to do that, depending on the analysis, and a class statement might even do the job, allowing you to process the data set in one pass.&lt;/P&gt;
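&lt;P&gt;For example (a sketch only - the PROC, the filter, and the output statistics are illustrative, not taken from your post):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Subset in place with a WHERE clause; CLASS summarises per symbol in one pass */
proc means data=testdata noprint;
  where tr_corr = '00';          /* illustrative filter */
  class sym_root;                /* one result row per symbol, single pass */
  var size price;
  output out=by_symbol sum(size)=total_volume mean(price)=avg_price;
run;&lt;/CODE&gt;&lt;/PRE&gt;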
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you do start copying data around, and if this is a SAS data set, use the cost of this data shuffling to recreate the data set in sorted order, and maybe indexed. Subsequent accesses of subsets will be much faster.&lt;/P&gt;
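&lt;P&gt;Something like this (the data set and BY variables are illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Pay the copy cost once: sorted output plus an index on SYM_ROOT */
proc sort data=testdata out=taq_sorted (index=(sym_root));
  by sym_root date time_m;
run;&lt;/CODE&gt;&lt;/PRE&gt;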
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If you must create data subsets, I strongly suggest that you create a SPDE library&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;libname SPEEDY spde "%sysfunc(pathname(WORK))" compress=binary;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;and that you use this as your work library.&lt;/P&gt;
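&lt;P&gt;For instance, the USER= system option makes one-level data set names resolve to that library:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;options user=SPEEDY;  /* "data xyz;" now creates SPEEDY.xyz instead of WORK.xyz */&lt;/CODE&gt;&lt;/PRE&gt;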
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 05:55:30 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406799#M99100</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2017-10-24T05:55:30Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406800#M99101</link>
      <description>&lt;P&gt;I have created a macro to split data. It takes symbol root as an input and creates dataset with name as "Data_AA". Here AA is your symbol root. It will create separate dataset for every symbol root.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I have assumed the column is named SYM_ROOT in testdata and SYMBOL_ROOT in test_master.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%macro split_data(_sym_root);
%put symroot val = &amp;amp;_sym_root;
proc sql;
create table Data_&amp;amp;_sym_root. as
select *
from testdata
where SYM_ROOT ="&amp;amp;_sym_root";
quit;

%mend;
filename Split temp;
data _null_;
set  test_master;
file split;
put "%" "split_data(" SYMBOL_ROOT ");";
run;
%inc Split;&lt;/CODE&gt;&lt;/PRE&gt;
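&lt;P&gt;An equivalent without the temporary file, using CALL EXECUTE (a sketch; STRIP removes trailing blanks from the symbol value):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;data _null_;
  set test_master;
  /* %NRSTR delays macro execution until after the DATA step finishes */
  call execute(cats('%nrstr(%split_data)(', strip(SYMBOL_ROOT), ');'));
run;&lt;/CODE&gt;&lt;/PRE&gt;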
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Just for your knowledge, the Split file would contain calls to the macro, something like this:&lt;/P&gt;
&lt;DIV class="sasSource"&gt;%split_data(A );&lt;/DIV&gt;
&lt;DIV class="sasSource"&gt;%split_data(C );&lt;/DIV&gt;
&lt;DIV class="sasSource"&gt;%split_data(B );&lt;/DIV&gt;
&lt;DIV class="sasSource"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="sasSource"&gt;&amp;nbsp;&lt;/DIV&gt;
&lt;DIV class="sasSource"&gt;You can&amp;nbsp;add your analysis in %split macro or may be you may need to create separate macro with same kind of approach.&lt;/DIV&gt;</description>
      <pubDate>Tue, 24 Oct 2017 05:55:50 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406800#M99101</guid>
      <dc:creator>RahulG</dc:creator>
      <dc:date>2017-10-24T05:55:50Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406957#M99148</link>
      <description>&lt;P&gt;Spend some time researching BY group processing, otherwise the rest of your project will end up as macros to process each of your smaller data set. I believe the WRDS server is well powered to handle these large data sets - it's designed specifically for that after all.&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 14:21:48 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/406957#M99148</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-10-24T14:21:48Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407008#M99151</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/124038"&gt;@MadQuant&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Question.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;What analysis are you trying to do with the trade data?&amp;nbsp;&amp;nbsp; Which is another way of asking whether you really need to make intermediate data sets.&amp;nbsp; Or if you do need to make intermediates, you should be sure to create any new variables at the same time.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;What is the purpose of the subgroups?&amp;nbsp; Is it to manage data set size, or is it for analysis purposes?&amp;nbsp; It appears to be a data set size issue, since you are not using any particular criterion to specify your groups.&amp;nbsp; I ask this because there may not be any particular reason to divide the data sets.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;How are your 40 portfolios being constructed?&amp;nbsp; Are they simply divided alphabetically by SYM_ROOT/SYM_SUFFIX?&amp;nbsp;&amp;nbsp; If so, you might be able to use BY-group processing in the analysis, eliminating the need to run the analysis 40 times.&amp;nbsp; Instead, one analysis will be run iteratively over 40 BY-groups.&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;LI&gt;I would NOT, repeat NOT, use proc sql (for constructing the 40 subgroups)&amp;nbsp;if you care about the time-ordering of trades within stocks.&amp;nbsp;&amp;nbsp;PROC SQL does not guarantee preservation of record order (the incoming TAQ data are sorted by SYM_ROOT/SYM_SUFFIX/time).&amp;nbsp; And yes, you could put an ORDER BY expression in proc sql, but it will require at least one extra unnecessary pass through the data.&amp;nbsp; Instead you should be using a DATA step.&lt;/LI&gt;
&lt;/OL&gt;
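&lt;P&gt;A sketch of the DATA-step alternative for point 4 (the group size of 100 symbols is illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* One pass, incoming order preserved - unlike an unordered SQL result */
data taq_grouped;
  set testdata;
  by sym_root;                       /* input already sorted this way */
  if first.sym_root then sym_n + 1;  /* running count of distinct symbols */
  group = ceil(sym_n / 100);         /* 100 symbols per group */
run;&lt;/CODE&gt;&lt;/PRE&gt;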
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;So please provide a more complete description of your research task.&amp;nbsp; I'm sure we can come up with a relatively efficient solution.&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 16:56:52 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407008#M99151</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2017-10-24T16:56:52Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407019#M99153</link>
      <description>&lt;P&gt;Thank you for the comment.&lt;/P&gt;&lt;P&gt;Upon reading your comment, now I see I have not written my point clearly.&lt;/P&gt;&lt;P&gt;So my goal is to do the same analyses on a huge dataset(A) separately for subsets of that dataset(A), due to the size and capacity.&lt;/P&gt;&lt;P&gt;I tried to the analysis on the whole dataset(A), but the WRDS server disconnected me after couple of hours, and I heard that I have to do it separately.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 17:38:49 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407019#M99153</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-24T17:38:49Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407021#M99154</link>
      <description>Thank you for the comment.&lt;BR /&gt;I tried to the analysis on the whole dataset, but the WRDS server disconnected me after couple of hours, and I heard that I have to do it separately.&lt;BR /&gt;So that's why I am trying this way.&lt;BR /&gt;I will try to see if BY grouping works.</description>
      <pubDate>Tue, 24 Oct 2017 17:40:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407021#M99154</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-24T17:40:57Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407023#M99155</link>
      <description>&lt;P&gt;Thank you for the comment.&lt;BR /&gt;So my goal is to do the&amp;nbsp;one (or more)&amp;nbsp;analysis (e.g. calculate orderflows and so on) on a huge dataset(A). I tried to the&amp;nbsp;calculation on the whole dataset(A), but the WRDS server disconnected me after couple of hours, and I heard that I have to do it separately. Also, the space assigned to individuals on the cloud&amp;nbsp;is only a few hundered megas or at most a few gigas.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I was trying to put SYM_ROOT variables separately in the WHERE clause of the main analysis code, which I posted below. The order or dividing method doesn't really matter as long as it covers the whole data.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;    data DailyTrade;

        /* Enter Trade file names in YYYYMMDD format for the same dates */
        set ct.ctm_20141224;

		/* Enter the same company tickers as above */
        where /* sym_root in ('AAPL','IBM') and */

        /* Retrieve trades during normal market hours */
        (("9:30:00.000000000"t) &amp;lt;= time_m &amp;lt;= ("16:00:00.000000000"t));
        type='T';
        format date date9.;
        format time_m part_time trf_time TIME20.9;
run;&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for the comment on PROC SQL. I wasn't aware of that.&lt;/P&gt;&lt;P&gt;I am relatively more familiar with SQL, which is why I was using it.&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 17:51:59 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407023#M99155</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-24T17:51:59Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407040#M99156</link>
      <description>&lt;P&gt;TAQ doesn't have order-book data, so I assume that by "order flows" you mean trade-volume over rolling windows of some time length.&amp;nbsp; Is this correct?&amp;nbsp; If so,&amp;nbsp;there is no need to make intermediate data sets with all trades for a subset of stock symbols.&amp;nbsp; One can read in all the CT data, and create (say) 1-minute or 5-minute trade volume values, creating a much smaller data set that the original CTM datasets (which currently time-stamps to the nano-second, although the 2014 data might be millisecond or microsecond).&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regarding being cut off from WRDS:&amp;nbsp;&amp;nbsp;I take it that you are using PC SAS/CONNECT to run your program on WRDS.&amp;nbsp; If you are getting disconnected, it's probably because some network monitor between WRDS and your PC (probably your school's) detects no network traffic in your connection.&amp;nbsp; That monitor likely&amp;nbsp;decides you've abandoned the session and disconnects your machine from WRDS.&amp;nbsp; If you're using a machine on your campus network, I'd suggest contacting your network support folks to see about a work-around.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Also (assuming again you are using PC SAS/CONNECT), if you have multiple DATA steps and PROC steps between an RSUBMIT and ENDRSUBMIT, then block each single step with RSUBMIT/ENDRSUBMIT.&amp;nbsp; This will cause more frequent traffic on your SAS log (proving you haven't abandoned the connection).&amp;nbsp; And you can select all the RSUBMIT/ENDRSUBMIT blocks for simultaneous submission, so no additional babysitting is required versus a single multi-step RSUBMIT/ENDRSUBMIT.&lt;/P&gt;
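&lt;P&gt;That is (the step contents here are illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;rsubmit;   /* each step in its own block generates log traffic */
  proc sort data=rwork.trades; by sym_root time_m; run;
endrsubmit;

rsubmit;
  data rwork.flagged; set rwork.trades; /* ... */ run;
endrsubmit;&lt;/CODE&gt;&lt;/PRE&gt;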
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;If none of this works to keep you connected, consider using SAS Studio on WRDS.&amp;nbsp; Perhaps your network monitor will not apply the same criteria for disconnection.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 19:14:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407040#M99156</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2017-10-24T19:14:05Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407060#M99161</link>
      <description>&lt;P&gt;For order flows, they are net of buy and sell trading volume, so the first step is to identify the sign (plus or minus) of each trade, which is not offered by TAQ. So at least once, I have to do some kind of manipulations at each trade level to assign the signs. So I think that gets back to applying subset of stock symbols right?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regarding being cut off from WRDS, you are correct. Wrapping with RSUBMIT/ENDRSUBMIT seems to be a&amp;nbsp;nice idea.&amp;nbsp;Also, I did not know that SAS/Studio has different criteria for this issue.&amp;nbsp;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you again for your time.&lt;/P&gt;</description>
      <pubDate>Tue, 24 Oct 2017 20:13:22 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407060#M99161</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-24T20:13:22Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407110#M99182</link>
      <description>&lt;P&gt;What criteria will you be using to mark a trade as&amp;nbsp;a buyer-driven or seller-driven transaction?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am aware that one can interleave the quotes file with the trade file, then compare the trade price to some lagged (by&amp;nbsp;a short time interval) BID and ASK prices from the corresponding quotes file.&amp;nbsp; You should note that the quotes file is a lot larger than the trades file, a gift of high frequency trading techniques.&amp;nbsp; In the case of 20141224, there are 245 million trades, and about 19.7 billion quotes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Or do you intend to assign buy/sell trades based solely on rolling micro-series of trade prices?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Regardless, &amp;nbsp;I suggest you make a sample of, say, 50 stocks, and then develop your entire sequence of operations.&amp;nbsp; I mean, after you assign buy/sell designations, then what will you be doing?&amp;nbsp; Aggregating over time periods?&amp;nbsp;&amp;nbsp; I raise this issue because you can do a sequence of operations in a single DATA step, possibly yielding a much smaller working data set.&lt;/P&gt;
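&lt;P&gt;One way to draw such a development sample (a sketch; the data set names are illustrative):&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;/* Random sample of 50 symbols to develop the full pipeline against */
proc surveyselect data=test_master out=dev_syms n=50 method=srs seed=20141224;
run;&lt;/CODE&gt;&lt;/PRE&gt;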
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Then you can start worrying about whether and how to divide a large data set.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
</description>
      <pubDate>Tue, 24 Oct 2017 23:56:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407110#M99182</guid>
      <dc:creator>mkeintz</dc:creator>
      <dc:date>2017-10-24T23:56:36Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407494#M99315</link>
      <description>&lt;P&gt;Thanks again for your interest.&lt;/P&gt;&lt;P&gt;I am using Lee and Ready (1991), which is common in academia. The basic idea is that if the trade price is larger (smaller) than the midpoint of bid-ask quote, then it is classified as buy (sell) initiated trade.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, yes, I also need quote data to do this. WRDS offers NBBO (National Best Bid and Offer) data, which also needs to be complemented by quote data, to do this. If you are interested in the details, please look at the link below.&lt;/P&gt;&lt;P&gt;&lt;A title="Holden and Jacobsen TAQ Code" href="http://www.excelmodeling.com/Holden-and-Jacobsen-Daily-TAQ-and-Monthly-TAQ-Code-2017-01-02.zip" target="_blank"&gt;http://www.excelmodeling.com/Holden-and-Jacobsen-Daily-TAQ-and-Monthly-TAQ-Code-2017-01-02.zip&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So, the code&amp;nbsp;for this&amp;nbsp;job is already there, although some customizing is required.&amp;nbsp;&lt;/P&gt;&lt;P&gt;The biggest problem is implementing this wisely so that WRDS server does not cut me off and individual cloud storage size issue when the result comes out. That's why I am focusing on doing the job&amp;nbsp;on 40-ish subsets and then aggregating later.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So far, my conclusion is to use splitted&amp;nbsp;SYM_ROOT to run the same code multiple times.&lt;/P&gt;</description>
      <pubDate>Wed, 25 Oct 2017 22:36:57 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407494#M99315</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-25T22:36:57Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407507#M99321</link>
      <description>&lt;P&gt;&lt;a href="https://communities.sas.com/t5/user/viewprofilepage/user-id/124038"&gt;@MadQuant&lt;/a&gt;&lt;/P&gt;
&lt;P&gt;If you're using SAS Studio and a timeout is the actual issue, then read this post (all of it, not only the accepted solution).&lt;/P&gt;
&lt;P&gt;&lt;A href="https://communities.sas.com/t5/SAS-Studio/SAS-Studio-Time-out/td-p/257411&amp;nbsp;" target="_blank"&gt;https://communities.sas.com/t5/SAS-Studio/SAS-Studio-Time-out/td-p/257411&amp;nbsp;&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With SAS EG (not sure about SAS Studio) there is also a keepalive parameter which prevents network timeouts.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://communities.sas.com/t5/Administration-and-Deployment/How-Where-do-I-set-the-keepalive-parameter-to-prevent-network/td-p/235250" target="_blank"&gt;https://communities.sas.com/t5/Administration-and-Deployment/How-Where-do-I-set-the-keepalive-parameter-to-prevent-network/td-p/235250&lt;/A&gt;&lt;/P&gt;
</description>
      <pubDate>Thu, 26 Oct 2017 00:00:13 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407507#M99321</guid>
      <dc:creator>Patrick</dc:creator>
      <dc:date>2017-10-26T00:00:13Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407536#M99331</link>
      <description>&lt;P&gt;To create the 40 datasets you want, you'll have to read the whole big dataset 40 times. And then write the data. And that's before you do any analysis. Unless your analysis requires sorted data, you may not want to do this (and if it does, see my previous point and use SPDE).&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Instead, just run the analysis on the whole dataset using a WHERE clause.&amp;nbsp;A very simple (and perhaps sufficient) way to do this would be to take the first letter of&amp;nbsp;&lt;SPAN&gt;SYM_ROOT.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;where&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SYM_ROOT=:'A'&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;where&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SYM_ROOT=:'B'&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;etc&lt;/P&gt;
&lt;P&gt;to get 26 subsets of varying sizes.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This way, an index on&amp;nbsp;&lt;SPAN&gt;SYM_ROOT, if there is one,&amp;nbsp;can be used for small subsets.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
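To make this concrete, here is a minimal sketch of one such pass. WORK.TRADES is a hypothetical dataset name, and PROC MEANS merely stands in for whatever the real analysis is:

```sas
/* Sketch only: WORK.TRADES is a hypothetical dataset name, and PROC MEANS
   stands in for the real analysis. Repeat with 'B', 'C', ... for 26 passes. */
proc means data=work.trades noprint;
  where SYM_ROOT =: 'A';   /* "begins with" comparison; can use an index */
  var PRICE SIZE;
  output out=work.stats_a mean= / autoname;
run;
```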
&lt;P&gt;&lt;SPAN&gt;Note that&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;where&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SYM_ROOT =: 'A'&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;and&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN style="font-family: 'courier new', courier;"&gt;where&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-family: 'courier new', courier;"&gt;SYM_ROOT like 'A%'&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;can use indexes but&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;where first(&lt;/SPAN&gt;&lt;SPAN&gt;SYM_ROOT)='A'&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;
&lt;P&gt;&lt;SPAN&gt;can't.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Oct 2017 04:34:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/407536#M99331</guid>
      <dc:creator>ChrisNZ</dc:creator>
      <dc:date>2017-10-26T04:34:10Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408017#M99495</link>
      <description>I am using Base SAS, but if using SAS Studio helps, I would look into it. Thanks.</description>
      <pubDate>Fri, 27 Oct 2017 13:26:20 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408017#M99495</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-27T13:26:20Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408182#M99567</link>
      <description>&lt;P&gt;Ah, this is it!&lt;/P&gt;&lt;P&gt;Now I should change my question.&lt;/P&gt;&lt;P&gt;How do I dynamically assign the 4,000-ish SYM_ROOTs into, say, 40 equal-sized macro variables to use in a WHERE clause?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;So instead of using&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;where&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SYM_ROOT=:'A'&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;&lt;FONT face="courier new,courier"&gt;&lt;SPAN&gt;where&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;SYM_ROOT=:'B'&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/P&gt;&lt;P&gt;etc.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;which would produce subsets of varying size, I would like equal-sized subsets of SYM_ROOT to put in the WHERE clause of my main analysis code. Each day would be different, so it has to depend on the total number of obs per SYM_ROOT on a daily basis.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I searched a few postings, but mostly they rely on additional group identifiers. In my case there aren't any.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Oct 2017 21:42:10 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408182#M99567</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-27T21:42:10Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408184#M99569</link>
      <description>&lt;P&gt;Take the total number of obs from sashelp.vtable, divide it by 40, and split your dataset into 40 subsets by record count.&lt;/P&gt;
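As a sketch (WORK.BIG is a hypothetical name for the large table), the DATA step's NOBS= option can supply the row count directly, as an alternative to querying sashelp.vtable:

```sas
/* Sketch only: tag each row with a chunk number 1-40 by record count.
   WORK.BIG is a hypothetical name for the large table. */
data work.grouped;
  set work.big nobs=total;           /* NOBS= is set before execution starts */
  group = ceil(_n_ / (total / 40));  /* roughly equal-sized groups 1..40 */
run;
```

The analysis can then run once per chunk with, e.g., where group = 1;.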
&lt;P&gt;If you need to keep the groups together, then consider seeing if you can tweak a PROC SURVEYSELECT to do it for you.&lt;/P&gt;</description>
      <pubDate>Fri, 27 Oct 2017 21:49:47 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408184#M99569</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2017-10-27T21:49:47Z</dc:date>
    </item>
    <item>
      <title>Re: Splitting dataset dynamically using macro</title>
      <link>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408187#M99572</link>
      <description>Thanks. I wasn't aware of PROC SURVEYSELECT, but I will look into it.</description>
      <pubDate>Fri, 27 Oct 2017 21:51:46 GMT</pubDate>
      <guid>https://communities.sas.com/t5/SAS-Programming/Splitting-dataset-dynamically-using-macro/m-p/408187#M99572</guid>
      <dc:creator>MadQuant</dc:creator>
      <dc:date>2017-10-27T21:51:46Z</dc:date>
    </item>
  </channel>
</rss>

