Suppose there is a panel data set indexed by "i" and "t," respectively.
data temp;
call streaminit(1);
do i=1 to 500;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
Assume that one picks up every first, second, and third observations for all "i" using FIRST and LAG.
data temp1;
set temp;
by i t;
if first.i or lag1(first.i) or lag2(first.i);
run;
Can one pick up every last, second last, and third last observations in a similar way? Though LAST is available for all the last observations, the second and third last observations are not easy.
data temp2;
set temp;
by i t;
if last.i;
run;
Is PROC SORT plus DESCENDING the only option here?
P.S. Though LAG7 and LAG8 are possible for this case, please assume a general unbalanced panel data set.
POINT= seems to work great for this problem.
But a double DOW seems to be a little faster.
Using extra SET statements to look ahead seems to be slowest.
For 5,000,000 observations:
320 data temp2; 321 set temp1; 322 by I ; 323 if first.i then start = _n_; 324 retain start; 325 if last.i then do point = max(start, _n_ - 2) to _n_; 326 set temp1 point=point; 327 output; 328 end; 329 drop start; 330 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.57 seconds cpu time 0.60 seconds 331 332 data TEMP3; 333 do _n=1 by 1 until(last.i); 334 set temp1; 335 by i; 336 end; 337 do _n2=1 to _n; 338 set temp1; 339 if _n2 >= _n-2 then output; 340 end; 341 drop _n _n2; 342 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.48 seconds cpu time 0.48 seconds 343 344 345 data TEMP4; 346 set TEMP1 end=eof1 ; 347 by I; 348 if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2; 349 by i2 ; 350 if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3; 351 by i3 ; 352 if last.i or (last.i2 and i=i2) or (last.i3 and i=i3); 353 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 4999999 observations read from the data set WORK.TEMP1. NOTE: There were 4999998 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1.20 seconds cpu time 1.17 seconds
Reading ahead works to:
data TEMP2;
set TEMP nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1;
run;
Reading 3 observations ahead allows to mark the first 3 of the shifted observations. These match with the last 3 of the original observations .
Random access keeps I/O to a minimum:
proc sort data=sashelp.class out=have; by sex name; run;
data last3;
set have; by sex;
if first.sex then start = _n_;
retain start;
if last.sex then do point = max(start, _n_ - 2) to _n_;
set have point=point;
output;
end;
drop start;
run;
EDIT Added retain statement as suggested by @Tom .
POINT= is extremely slow. It is best avoided unless the number of records retrieved is low.
That's true in general @ChrisNZ , but here, the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.
> the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.
@PGStats Fair enough.
And if groups can have fewer than 3 observations, additional logic is required.
Variable START in my code accounts for that possibility.
POINT= seems to work great for this problem.
But a double DOW seems to be a little faster.
Using extra SET statements to look ahead seems to be slowest.
For 5,000,000 observations:
320 data temp2; 321 set temp1; 322 by I ; 323 if first.i then start = _n_; 324 retain start; 325 if last.i then do point = max(start, _n_ - 2) to _n_; 326 set temp1 point=point; 327 output; 328 end; 329 drop start; 330 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.57 seconds cpu time 0.60 seconds 331 332 data TEMP3; 333 do _n=1 by 1 until(last.i); 334 set temp1; 335 by i; 336 end; 337 do _n2=1 to _n; 338 set temp1; 339 if _n2 >= _n-2 then output; 340 end; 341 drop _n _n2; 342 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.48 seconds cpu time 0.48 seconds 343 344 345 data TEMP4; 346 set TEMP1 end=eof1 ; 347 by I; 348 if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2; 349 by i2 ; 350 if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3; 351 by i3 ; 352 if last.i or (last.i2 and i=i2) or (last.i3 and i=i3); 353 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 4999999 observations read from the data set WORK.TEMP1. NOTE: There were 4999998 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1.20 seconds cpu time 1.17 seconds
@Tom For speed, my tests confirm your findings: DOW > read-ahead > POINT=
Having only one BY statement is probably the reason. BY statements really hinder speed.
@Tom A DOW with no BY seems ever so slightly faster
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT= ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;
%* read ahead ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1; if keep;
run;
%* DOW + BY ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
@Tom You use three SET BY, hence the poor speed, since BY adds so much overhead. They are not needed.
@Tom See my first post
Here is my benchmark:
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT= ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;
%* read ahead ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1; if keep;
run;
%* DOW + BY ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
@ChrisNZ But it doesn't get the same output dataset.
data temp1;
call streaminit(1);
do i=1 to 5000;
do t=1 to ceil(i/10);
y=rand("normal");
output;
end;
end;
run;
data TEMP2;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1;
if keep;
run;
The COMPARE Procedure Comparison of WORK.TEMP2 with WORK.TEMP3 (Method=EXACT) Observation Summary Observation Base Compare ID First Obs 1 . i=1 t=1 First Match 4 1 i=4 t=1 Last Obs 14970 14967 i=5000 t=500 Number of Observations in Common: 14967. Number of Observations in WORK.TEMP2 but not in WORK.TEMP3: 3. Total Number of Observations Read from WORK.TEMP2: 14970. Total Number of Observations Read from WORK.TEMP3: 14967. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 14967. NOTE: No unequal values were found. All values compared are exactly equal.
Catch the best of SAS Innovate 2025 — anytime, anywhere. Stream powerful keynotes, real-world demos, and game-changing insights from the world’s leading data and AI minds.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.