Hi I've encountered the same problem recently and I've found several ways in doing that. They are all effective but the most efficient one I found is as follow. It generates correct results for billions of transactions within minutes. The PROC SQL is effective but it takes enormous time to execute should you have a huge dataset. So I figure out a way to do it only using DATA step. The problem is to find the nearest midpoint after 5min for every transaction. Suppose you have the following dataset: Data transactions;
input PRICE MIDPT TIME : anydttme.;
format time time.;
cards;
18.2 18.43 10:00:53
18.41 18.405 10:07:05
18.38 18.375 10:10:14
18.42 18.395 10:13:53
;
run; The code below will generate a dataset named FinalMatchedResult where the nearest MIDPT after 5min for each transaction is added as a new variable MIDPT_5. /*
The transactions are sorted according to TIME.
Add an ID to each transaction.
*/
DATA WORK.transactions;
set WORK.transactions;
ID + 1;
RUN;
/*
Create a new dataset where each transaction is delayed by 5min (300s).
Keep record of the ID in the original dataset.
*/
DATA WORK.transactions2;
set WORK.transactions;
TIME = TIME + 300;
rename ID = OriginalID;
RUN;
/*
Stack the two datasets together and sort them by TIME, ID and OriginalID.
After these two steps, you will have a dataset where the nearst MIDPT after 5min
for each transaction is simply the first obs that has non-missing OriginalID.
This finding is especially important, and you should notice such result in the
output WORK.transaction3 sorted.
*/
DATA WORK.transactions3;
set WORK.transactions WORK.transactions2;
RUN;
PROC sort data= WORK.transactions3;
by TIME ID OriginalID;
RUN;
/*
Based on previous result, delete all irrelavent transactions.
*/
DATA WORK.transactions3_matched (drop=lastObs);
set transactions3;
lastObs = lag(OriginalID);
if OriginalID =. and lastObs =. then delete;
RUN;
/*
One MIDPT maybe matched to several past transactions 5min ago.
Luckily these transactions' positions in the dataset are known, start to _N_ -1,
where _N_ is the row number of current MIDPT.
The matched results are saved in WORK.transactions3_matched.
*/
DATA WORK.matched_result;
set WORK.transactions3_matched;
retain start 1;
if OriginalID =. then do;
MIDPT_5 = MIDPT;
do i = start to _N_ - 1;
set transactions3_matched point= i;
output;
end;
start = _N_ + 1;
end;
RUN;
/*
WORK.FinalMatchedResult saves all nearest MIDPT 5min later for every transaction.
You could double check the dataset to make sure you have the result right.
*/
DATA WORK.FinalMatchedResult(drop=ID OriginalID start);
set WORK.matched_result;
TIME = TIME - 300;
RUN; The output dataset looks like this: PRICE MIDPT TIME MIDPT_5 18.2 18.43 10:00:53 18.405 18.41 18.405 10:07:05 18.395 Hope it helps! Adrian
... View more