Right now the best thing I can find is reversing the order of my dataset and using LAG. Is there an alternative to this?
There're many. IMO, the simplest (which I don't find in the link by @Reeza, nor in the old link within the link) is reading the Kth record downstream from the current using SET POINT=K.
Here's an example for the most common scenario of creating K=1 lead values in every BY group:
data have ;
input id var ;
cards ;
1 1
2 2
2 3
3 4
3 5
3 6
;
run ;
data lead ;
set have ;
by id ;
_n_ + 1 ;
if not last.id then set have (keep=var rename=var=lead_var) point = _n_ ;
else call missing (lead_var) ;
run ;
Result:
id var lead_var --------------------- 1 1 . 2 2 3 2 3 . 3 4 5 3 5 6 3 6 .
If the leads are to be created throughout the file with no regard to BY grouping, the idea is the same, and the code is virtually identical:
data lead ;
set have end = z ;
_n_ + 1 ;
if not z then set have (keep=var rename=var=lead_var) point = _n_ ;
else call missing (lead_var) ;
run ;
Kind regards
Paul D.
To me the simplest (and fastest) is simply to use :
data T2;
merge T1
T1 (firstobs=2 keep=VAR rename=(VAR=LEAD_VAR) );
run;
It's simplest and fastest in this simple case. In the more common BY case it gets less simple since the records in each BY group have to be enumerated first, i.e., for example:
data have ;
input id var ;
cards ;
1 1
2 2
2 3
3 4
3 5
3 6
;
run ;
data v (keep=id var seq) / view = v ;
set have ;
by id ;
if first.id then seq = 1 ;
else seq + 1 ;
run ;
data lead (drop = seq) ;
merge have v (rename=var=lead_var where=(seq > 1)) ;
by id ;
if last.id then call missing (lead_var) ;
run ;
As far as the fastest goes, with this method the file ends up being read twice even when generating a lead is needed only for certain records upon a specific condition. E.g., picture a situation where one needs to replace VAR with its lead only when VAR is missing and such records constitute but a tiny fraction of the whole file. In such a case, it would be faster to grab the needed leads via POINT= only for those specific records, would it not?
Kind regards
Paul D.
@hashman For a BY group, I'd probably do something like this:
data LEAD2;
set HAVE;
by ID;
if ^LASTOBS then
set HAVE(firstobs=2 keep=VAR rename=(VAR=LEAD_VAR)) end=LASTOBS;
if last.ID then call missing (LEAD_VAR) ;
run;
If a small fraction of the values has to be retrieved, the losses due to direct POINT= access would be small, and as you point out this method would probably be preferable.
In any case, since we are reading the next observation, losses of either method are probably limited since we are always reading from the cache. One of the reads will be made from RAM, but . I random access logic has some overheads that sequential read does not.
data HAVE ; do VAR=1 to 1e7; output; end;
data WANT;
merge HAVE
HAVE (firstobs=2 keep=VAR rename=(VAR=LEAD_VAR) );
run;
/* real time 2.48 seconds
user cpu time 1.95 seconds */
data lead ;
set have end = z ;
_n_ + 1 ;
if not z then set have (keep=var rename=var=lead_var) point = _n_ ;
else call missing (lead_var) ;
run ;
/* real time 3.54 seconds
user cpu time 2.71 seconds */
@ChrisNZ: Amen to all that.
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.