Hi,
I have a large dataset of trading data that looks like this:
data have;
   infile datalines truncover dlm=',' dsd;
   input cusip:$char9. trade_date:mmddyy10. execution_time:time8. buy_sell:$char1. (quantity execution_price trade_id) (:best32.);
   format execution_time time8. trade_date mmddyy10.;
datalines;
010831BW4,09/12/2013,12:30:44,B,45,103.133,1325534595
010831BW4,09/12/2013,14:56:13,B,500,103.303,1325536968
010831BW4,09/17/2013,13:14:35,S,25,105.244,1326014939
010831BW4,09/17/2013,15:40:12,S,50,105.369,1326017794
010831BW4,09/17/2013,16:04:21,S,25,105.244,1326018222
010831BW4,09/18/2013,10:12:59,S,50,105.369,1326121154
010831BW4,09/18/2013,11:50:21,S,350,105.015,1326123005
010831BW4,09/20/2013,11:00:41,B,50,106.308,1326341521
010831BW4,09/20/2013,11:00:41,S,50,106.308,1326341522
010831BY0,09/12/2013,12:30:44,B,1075,99.592,1325534596
010831BY0,09/12/2013,16:51:56,B,350,99.777,1325547689
010831BY0,09/13/2013,11:01:07,S,20,100.25,1325642219
010831BY0,09/13/2013,11:03:05,S,330,100.44,1325642259
010831BY0,10/02/2013,11:07:37,B,50,101.117,1327522086
010831BY0,10/03/2013,11:54:05,S,50,104.218,1327632245
;
run;
I am trying to match the buy observations with the sell observations (distinguished by the buy_sell variable) within each CUSIP using a last-in, first-out (LIFO) methodology. In other words, each time there is a sale in a CUSIP, I want the quantity most recently purchased in that CUSIP matched to that sale. If more is bought than sold (or sold than bought) in a CUSIP, I want the extra quantity discarded once the other side runs out, and the program should then move on to the next CUSIP. I want my final table to look like this:
CUSIP | Quantity | b_id | bdate | btime | bprice | s_id | sdate | stime | sprice |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326018222 | 17-Sep-13 | 16:04:21 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326121154 | 18-Sep-13 | 10:12:59 | 105.369 |
010831BW4 | 350 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326123005 | 18-Sep-13 | 11:50:21 | 105.015 |
010831BW4 | 50 | 1326341521 | 20-Sep-13 | 11:00:41 | 106.308 | 1326341522 | 20-Sep-13 | 11:00:41 | 106.308 |
010831BY0 | 20 | 1325547689 | 12-Sep-13 | 16:51:56 | 99.777 | 1325642219 | 13-Sep-13 | 11:01:07 | 100.25 |
010831BY0 | 330 | 1325547689 | 12-Sep-13 | 16:51:56 | 99.777 | 1325642259 | 13-Sep-13 | 11:03:05 | 100.44 |
010831BY0 | 50 | 1327522086 | 2-Oct-13 | 11:07:37 | 101.117 | 1327632245 | 3-Oct-13 | 11:54:05 | 104.218 |
To achieve this, I am using a hash of hashes that changes for each CUSIP and is made up of hash tables and hash iterators for the buy and sell observations. For each new CUSIP, the two hash tables take in all of the buy and sell observations and output them matched together, based on the order in which they entered the hash tables and on their quantities. Once either the buy or the sell observations in a CUSIP run out, the remaining observations are discarded and the program moves on to the next CUSIP. Here is my current code:
data want;
   set have;
   if _n_=1 then do;
      declare hash hoh ();
      hoh.definekey('cusip');
      hoh.definedata('cusip','buys','sells','buy','sell');
      hoh.definedone();
      declare hash buys;
      declare hash sells;
      declare hiter buy;
      declare hiter sell;
   end;
   if hoh.find() ne 0 then do;
      length tid 8 b_id bdate btime bprice bquantity quantity s_id sdate stime sprice squantity 8;
      buys=_new_ hash(ordered:'a');
      buys.defineKey('tid');
      buys.defineData('bdate','btime','bprice','bquantity','b_id','b_account','b_source','b_cusip','in14');
      buys.defineDone();
      sells=_new_ hash(ordered:'a');
      sells.defineKey('tid');
      sells.defineData('sdate','stime','sprice','squantity','s_id','s_account','s_cusip');
      sells.defineDone();
      buy=_new_ hiter('buys');
      sell=_new_ hiter('sells');
      inventory=0;
   end;
   tid+1;
   if buy_sell="B" then do;
      bdate=trade_date;
      btime=execution_time;
      bprice=execution_price;
      bquantity=quantity;
      b_id=trade_id;
      b_account=account_number2;
      b_source=source;
      b_cusip=cusip;
      in14=in14;
      inventory+quantity;
      buys.add();
   end;
   else do;
      if inventory=0 then delete;
      if quantity>inventory then quantity=inventory;
      sdate=trade_date;
      stime=execution_time;
      sprice=execution_price;
      squantity=quantity;
      s_id=trade_id;
      s_account=account_number2;
      s_cusip=cusip;
      inventory+(-1*quantity);
      sells.add();
   end;
   keep bdate b_account b_source b_cusip b_id bdate btime bprice quantity sdate stime sprice s_id s_account s_cusip in14;
   format bdate sdate date9. bprice sprice dollar8.4 btime stime time12.;
   brc=buy.last();
   src=sell.first();
   do while(brc=0 and src=0);
      select;
         when (bquantity < squantity) do;
            temp=squantity-bquantity;
            quantity=bquantity;
            output;
            squantity=temp;
            brc=buy.prev();
         end;
         when (bquantity=squantity) do;
            quantity=bquantity;
            output;
            brc=buy.prev();
            src=sell.next();
         end;
         when (bquantity > squantity) do;
            temp=bquantity-squantity;
            quantity=squantity;
            output;
            bquantity=temp;
            src=sell.next();
         end;
         otherwise;
      end;
   end;
   rc=hoh.replace();
run;
When I run this code, the output is all over the place. It usually begins correctly in each CUSIP and gets further off as it goes along, and it gets especially bad when more is bought in a CUSIP than sold, so that some quantity needs to be discarded. The output also contains many observations that are full duplicates of previous observations, and it keeps using observations after their quantity has been used up, when the hash iterator was supposed to move on to the next observation. For the dataset above, this is the output my code gives:
CUSIP | Quantity | b_id | bdate | btime | bprice | s_id | sdate | stime | sprice |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326018222 | 17-Sep-13 | 16:04:21 | 105.244 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326018222 | 17-Sep-13 | 16:04:21 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326121154 | 18-Sep-13 | 10:12:59 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326018222 | 17-Sep-13 | 16:04:21 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326121154 | 18-Sep-13 | 10:12:59 | 105.369 |
010831BW4 | 350 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326123005 | 18-Sep-13 | 11:50:21 | 105.015 |
010831BW4 | 25 | 1326341521 | 20-Sep-13 | 11:00:41 | 106.308 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 25 | 1326341521 | 20-Sep-13 | 11:00:41 | 106.308 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326018222 | 17-Sep-13 | 16:04:21 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326121154 | 18-Sep-13 | 10:12:59 | 105.369 |
010831BW4 | 350 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326123005 | 18-Sep-13 | 11:50:21 | 105.015 |
010831BW4 | 25 | 1326341521 | 20-Sep-13 | 11:00:41 | 106.308 | 1326014939 | 17-Sep-13 | 13:14:35 | 105.244 |
010831BW4 | 25 | 1326341521 | 20-Sep-13 | 11:00:41 | 106.308 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326017794 | 17-Sep-13 | 15:40:12 | 105.369 |
010831BW4 | 25 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326018222 | 17-Sep-13 | 16:04:21 | 105.244 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326121154 | 18-Sep-13 | 10:12:59 | 105.369 |
010831BW4 | 350 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326123005 | 18-Sep-13 | 11:50:21 | 105.015 |
010831BW4 | 50 | 1325536968 | 12-Sep-13 | 14:56:13 | 103.303 | 1326341522 | 20-Sep-13 | 11:00:41 | 106.308 |
010831BY0 | 20 | 1325547689 | 12-Sep-13 | 16:51:56 | $99.78 | 1325642219 | 13-Sep-13 | 11:01:07 | 100.25 |
010831BY0 | 20 | 1325547689 | 12-Sep-13 | 16:51:56 | $99.78 | 1325642219 | 13-Sep-13 | 11:01:07 | 100.25 |
010831BY0 | 330 | 1325547689 | 12-Sep-13 | 16:51:56 | $99.78 | 1325642259 | 13-Sep-13 | 11:03:05 | 100.44 |
010831BY0 | 20 | 1327522086 | 2-Oct-13 | 11:07:37 | 101.117 | 1325642219 | 13-Sep-13 | 11:01:07 | 100.25 |
010831BY0 | 30 | 1327522086 | 2-Oct-13 | 11:07:37 | 101.117 | 1325642259 | 13-Sep-13 | 11:03:05 | 100.44 |
010831BY0 | 300 | 1325547689 | 12-Sep-13 | 16:51:56 | $99.78 | 1325642259 | 13-Sep-13 | 11:03:05 | 100.44 |
010831BY0 | 20 | 1327522086 | 2-Oct-13 | 11:07:37 | 101.117 | 1325642219 | 13-Sep-13 | 11:01:07 | 100.25 |
010831BY0 | 30 | 1327522086 | 2-Oct-13 | 11:07:37 | 101.117 | 1325642259 | 13-Sep-13 | 11:03:05 | 100.44 |
010831BY0 | 300 | 1325547689 | 12-Sep-13 | 16:51:56 | $99.78 | 1325642259 | 13-Sep-13 | 11:03:05 | 100.44 |
010831BY0 | 50 | 1325547689 | 12-Sep-13 | 16:51:56 | $99.78 | 1327632245 | 3-Oct-13 | 11:54:05 | 104.218 |
As you can see, it has many more observations than it should: some are full duplicates of previous observations, and others repeat previous values on only the buy or the sell side. I think the problem is somewhere in how I iterate through my hash tables, but I have been unable to find the bug. Thank you for reading, and please let me know if you can figure out what I am doing wrong!
You can use the same program I provided in your thread Calculating FIFO Profit using Hash of Hashes. And just like in that topic, I would suggest not using the hash-of-hashes approach. Instead, use a pair of hashes: one holding a profile of each CUSIP, and the other tracking the stock lots.
I suspect these revisions would be sufficient for LIFO.
Instead of
do rc=ilots.setcur(key:cusip,key:1) by 0 until(total_shrs_to_sell=0 or lot_cusip^=cusip or rc^=0);
which finds the oldest lot of stocks, you can use
do rc=ilots.setcur(key:cusip,key:n_buys) by 0 until(total_shrs_to_sell=0 or lot_cusip^=cusip or rc^=0);
which finds the newest.
At the bottom of the loop, replace
rc=ilots.next();
with
rc=ilots.prev();
Also, to accommodate partial lot sales, instead of
b_num=b_num+.01; /*Create a "sublot" (=lot+.01) with remaining available shrs*/
you can use
b_num=b_num-.01; /*Create a "sublot" (=lot-.01) with remaining available shrs*/
In the LIFO case, this has the counterintuitive property of, say, lot 3 having sublots 2.99, 2.98, etc. But that seems an acceptable price for keeping the same program structure.
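For reference, here is a minimal self-contained sketch of the same pair-of-hashes idea for LIFO. It is not the program from the FIFO thread, and the names (prof, lots, n_buys, b_num, bqty, to_sell) are just illustrative. Since LIFO always consumes the newest lot first, the lots hash can be treated as a per-CUSIP stack, which avoids the iterator and the sublot device entirely:
/* Minimal LIFO sketch (illustrative names, not the FIFO-thread program):
   prof holds one row per cusip with n_buys, the count of open buy lots;
   lots holds one row per open lot, keyed by (cusip, lot number).
   Sales pop shares off the newest lot first, so no iterator is needed. */
data lifo_matches;
   if _n_=1 then do;
      declare hash prof();
      prof.definekey('cusip');
      prof.definedata('n_buys');
      prof.definedone();
      declare hash lots();
      lots.definekey('lot_cusip','b_num');
      lots.definedata('b_id','bdate','btime','bprice','bqty');
      lots.definedone();
   end;
   set have;
   length lot_cusip $9 n_buys b_num b_id bdate btime bprice bqty
          s_id sdate stime sprice to_sell 8;
   if buy_sell='B' then do;                /* push a new lot on the stack */
      if prof.find() ne 0 then n_buys=0;   /* first trade in this cusip */
      n_buys=n_buys+1;
      prof.replace();
      lot_cusip=cusip; b_num=n_buys;
      b_id=trade_id; bdate=trade_date; btime=execution_time;
      bprice=execution_price; bqty=quantity;
      rc=lots.replace();
   end;
   else do;                                /* consume lots newest-first */
      if prof.find() ne 0 then delete;     /* sale with no prior buys */
      to_sell=quantity;
      s_id=trade_id; sdate=trade_date; stime=execution_time;
      sprice=execution_price;
      do while (to_sell>0 and n_buys>0);
         lot_cusip=cusip; b_num=n_buys;
         rc=lots.find();                   /* retrieve the newest open lot */
         quantity=min(to_sell,bqty);       /* shares matched to this lot */
         output;
         to_sell=to_sell-quantity;
         bqty=bqty-quantity;
         if bqty=0 then do;                /* lot exhausted: pop it */
            rc=lots.remove();
            n_buys=n_buys-1;
         end;
         else rc=lots.replace();           /* partial fill: update shares */
      end;
      prof.replace();                      /* leftover sale qty is discarded */
   end;
   keep cusip quantity b_id bdate btime bprice s_id sdate stime sprice;
   format bdate sdate date9. btime stime time8.;
run;
If I have traced it correctly, running this against the sample data above reproduces the desired table, with the leftover 45-share and 1075-share opening lots discarded.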
Edited additional note: Of course, if your dataset is small enough, you can use the other program without any changes by pre-sorting the input dataset by descending date (it doesn't need to also be sorted by CUSIP).
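For that pre-sort alternative, something like the following should do (the descending execution_time is my addition, on the assumption that same-day trades should also be in reverse time order):
/* Sketch of the pre-sort alternative: reverse the time order so the
   unchanged FIFO program consumes the newest lots first.
   descending execution_time is an assumption for same-day trades. */
proc sort data=have out=have_desc;
   by descending trade_date descending execution_time;
run;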
Thank you so much! I had come up with solutions for FIFO (less efficient than yours, and I wasn't confident about them), but I could not figure out LIFO. You've saved me a huge amount of time!