I am trying to replicate data cleaing steps, which are only available as SAS code and I am trying to understand, what the code actually does. This is part of code, which was used to clean TRACE data and was published by Jens **bleep**-Nielsen on SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2337908
I am struggling to understand the following part of the code, which are on p. 15-16 in the pdf. It basically is a merge operation and I would like to know, what the commands (in=qqq) and (in=qq) do. Are they some kind of internal ranking system, like Row_Number() in TSQL or what do they do.
* Take out reversals into a dataset;
data reversal temp_raw3;
set temp_raw2;
N=_N_;
if asof_cd=’R’ then output reversal;
else output temp_raw3;
run;
* Sorting the data so that it can be merged;
proc sort data=reversal (drop = N) nodupkey; by trd_exctn_dt
cusip_id trd_exctn_tm rptd_pr entrd_vol_qt rpt_side_cd cntra_mp_id
trd_rpt_dt trd_rpt_tm MSG_SEQ_NB; run;
proc sort data=temp_raw3; by trd_exctn_dt cusip_id trd_exctn_tm
rptd_pr entrd_vol_qt rpt_side_cd cntra_mp_id; run;
* Merges reversals back on and selects matching observations;
data reversal2;
merge temp_raw3 (in=qqq) reversal (in=qq) ;
by trd_exctn_dt cusip_id trd_exctn_tm rptd_pr
entrd_vol_qt rpt_side_cd cntra_mp_id;
if qq=1;
if qqq=1;
* Reversal have to be on a later date
15
* (or else it would not be a reversal);
* i.e. we do not delete potential as_of trades
* from a later date that may match;
if trd_exctn_dt < trd_rpt_dt;
run;
* Selects only 1 matching reversal (and keeps the rest);
proc sort data=reversal2 nodupkey; by trd_exctn_dt bond_sym_id
trd_exctn_tm rptd_pr entrd_vol_qt; run;
proc sort data=reversal2; by N; run;
proc sort data=temp_raw3; by N; run;
* Deletes the macthing reversals;
data temp_raw4;
merge reversal2 (in=qq) temp_raw3;
by N;
if qq=0;
run;
* Ends the filter for PRE-change data;
I hope this is pretty easy to solve for anyone who has a certain knowledge of SAS and sorry if this might be off-topic.
Thanks a lot.
The dataset option in= creates an alias flag indicating whether data from that particular observation appeared in the alias dataset. So a simple example:
data table1; id=1; val=2; output id=2; val=5; output; run; data table2; id=1; val2=7; output; id=3; val2=3; output; run; data table3; merge table1 (in=a) table2 (in=b); by id; run;
Will give (and note that the alias, like any temporary variable is dropped before writing out the dataset):
id val1 val2 a b
1 2 7 1 1
2 5 1 0
3 3 0 1
So the second obs only appears in table 1, so the flag for alias b is set to 0 and a to 1, to indicate data only taken from the first table. The reverse is true for the third observation. In the first observation, data appears in both tables so both alias flags are set to 1.
The dataset option in= creates an alias flag indicating whether data from that particular observation appeared in the alias dataset. So a simple example:
data table1; id=1; val=2; output id=2; val=5; output; run; data table2; id=1; val2=7; output; id=3; val2=3; output; run; data table3; merge table1 (in=a) table2 (in=b); by id; run;
Will give (and note that the alias, like any temporary variable is dropped before writing out the dataset):
id val1 val2 a b
1 2 7 1 1
2 5 1 0
3 3 0 1
So the second obs only appears in table 1, so the flag for alias b is set to 0 and a to 1, to indicate data only taken from the first table. The reverse is true for the third observation. In the first observation, data appears in both tables so both alias flags are set to 1.
* Merges reversals back on and selects matching observations; data reversal2; merge temp_raw3 (in=qqq) reversal (in=qq) ; by trd_exctn_dt cusip_id trd_exctn_tm rptd_pr entrd_vol_qt rpt_side_cd cntra_mp_id; if qq=1; if qqq=1;
So this is basically just an inner join, by only selecting rows, which are part of both datasets, correct?
Thanks a lot for your help.
Yes, it will only keep when in both incoming datasets.
Instead of three subsetting ifs in the data reversal2 step, it can all be put into one, and you can make use of the fact that numerical missing and zero are false, everything else is true:
data
reversal
temp_raw3
;
set temp_raw2;
N = _N_;
if asof_cd = ’R’
then output reversal;
else output temp_raw3;
run;
* Sorting the data so that it can be merged;
proc sort data=reversal (drop = N) nodupkey;
by
trd_exctn_dt cusip_id trd_exctn_tm rptd_pr
entrd_vol_qt rpt_side_cd cntra_mp_id
trd_rpt_dt trd_rpt_tm MSG_SEQ_NB
;
run;
proc sort data=temp_raw3;
by
trd_exctn_dt cusip_id trd_exctn_tm rptd_pr
entrd_vol_qt rpt_side_cd cntra_mp_id
;
run;
* Merges reversals back on and selects matching observations;
data reversal2;
merge
temp_raw3 (in=qqq)
reversal (in=qq)
;
by
trd_exctn_dt cusip_id trd_exctn_tm rptd_pr
entrd_vol_qt rpt_side_cd cntra_mp_id
;
if qq and qqq and trd_exctn_dt < trd_rpt_dt; /* see here! */
* Reversal have to be on a later date
* (or else it would not be a reversal);
* i.e. we do not delete potential as_of trades
* from a later date that may match;
run;
* Selects only 1 matching reversal (and keeps the rest);
proc sort data=reversal2 nodupkey;
by trd_exctn_dt bond_sym_id trd_exctn_tm rptd_pr entrd_vol_qt;
run;
proc sort data=reversal2;
by N;
run;
proc sort data=temp_raw3;
by N;
run;
* Deletes the macthing reversals;
data temp_raw4;
merge
reversal2 (in=qq)
temp_raw3
;
by N;
if not qq;
run;
Note how visual formatting goes a long way towards making code intelligible.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.