Suppose there is a panel data set indexed by "i" and "t," respectively.
data temp;
call streaminit(1);
do i=1 to 500;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
Assume that one picks up every first, second, and third observations for all "i" using FIRST and LAG.
data temp1;
set temp;
by i t;
if first.i or lag1(first.i) or lag2(first.i);
run;
Can one pick up every last, second last, and third last observations in a similar way? Though LAST is available for all the last observations, the second and third last observations are not easy.
data temp2;
set temp;
by i t;
if last.i;
run;
Is PROC SORT plus DESCENDING the only option here?
P.S. Though LAG7 and LAG8 are possible for this case, please assume a general unbalanced panel data set.
POINT= seems to work great for this problem.
But a double DOW seems to be a little faster.
Using extra SET statements to look ahead seems to be slowest.
For 5,000,000 observations:
320 data temp2; 321 set temp1; 322 by I ; 323 if first.i then start = _n_; 324 retain start; 325 if last.i then do point = max(start, _n_ - 2) to _n_; 326 set temp1 point=point; 327 output; 328 end; 329 drop start; 330 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.57 seconds cpu time 0.60 seconds 331 332 data TEMP3; 333 do _n=1 by 1 until(last.i); 334 set temp1; 335 by i; 336 end; 337 do _n2=1 to _n; 338 set temp1; 339 if _n2 >= _n-2 then output; 340 end; 341 drop _n _n2; 342 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.48 seconds cpu time 0.48 seconds 343 344 345 data TEMP4; 346 set TEMP1 end=eof1 ; 347 by I; 348 if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2; 349 by i2 ; 350 if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3; 351 by i3 ; 352 if last.i or (last.i2 and i=i2) or (last.i3 and i=i3); 353 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 4999999 observations read from the data set WORK.TEMP1. NOTE: There were 4999998 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1.20 seconds cpu time 1.17 seconds
Reading ahead works to:
data TEMP2;
set TEMP nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1;
run;
Reading 3 observations ahead allows to mark the first 3 of the shifted observations. These match with the last 3 of the original observations .
Random access keeps I/O to a minimum:
proc sort data=sashelp.class out=have; by sex name; run;
data last3;
set have; by sex;
if first.sex then start = _n_;
retain start;
if last.sex then do point = max(start, _n_ - 2) to _n_;
set have point=point;
output;
end;
drop start;
run;
EDIT Added retain statement as suggested by @Tom .
POINT= is extremely slow. It is best avoided unless the number of records retrieved is low.
That's true in general @ChrisNZ , but here, the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.
> the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.
@PGStats Fair enough.
And if groups can have fewer than 3 observations, additional logic is required.
Variable START in my code accounts for that possibility.
POINT= seems to work great for this problem.
But a double DOW seems to be a little faster.
Using extra SET statements to look ahead seems to be slowest.
For 5,000,000 observations:
320 data temp2; 321 set temp1; 322 by I ; 323 if first.i then start = _n_; 324 retain start; 325 if last.i then do point = max(start, _n_ - 2) to _n_; 326 set temp1 point=point; 327 output; 328 end; 329 drop start; 330 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.57 seconds cpu time 0.60 seconds 331 332 data TEMP3; 333 do _n=1 by 1 until(last.i); 334 set temp1; 335 by i; 336 end; 337 do _n2=1 to _n; 338 set temp1; 339 if _n2 >= _n-2 then output; 340 end; 341 drop _n _n2; 342 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.48 seconds cpu time 0.48 seconds 343 344 345 data TEMP4; 346 set TEMP1 end=eof1 ; 347 by I; 348 if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2; 349 by i2 ; 350 if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3; 351 by i3 ; 352 if last.i or (last.i2 and i=i2) or (last.i3 and i=i3); 353 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 4999999 observations read from the data set WORK.TEMP1. NOTE: There were 4999998 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1.20 seconds cpu time 1.17 seconds
@Tom For speed, my tests confirm your findings: DOW > read-ahead > POINT=
Having only one BY statement is probably the reason. BY statements really hinder speed.
@Tom A DOW with no BY seems ever so slightly faster
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT= ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;
%* read ahead ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1; if keep;
run;
%* DOW + BY ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
@Tom You use three SET BY, hence the poor speed, since BY adds so much overhead. They are not needed.
@Tom See my first post
Here is my benchmark:
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT= ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;
%* read ahead ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1; if keep;
run;
%* DOW + BY ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
@ChrisNZ But it doesn't get the same output dataset.
data temp1;
call streaminit(1);
do i=1 to 5000;
do t=1 to ceil(i/10);
y=rand("normal");
output;
end;
end;
run;
data TEMP2;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1;
if keep;
run;
The COMPARE Procedure Comparison of WORK.TEMP2 with WORK.TEMP3 (Method=EXACT) Observation Summary Observation Base Compare ID First Obs 1 . i=1 t=1 First Match 4 1 i=4 t=1 Last Obs 14970 14967 i=5000 t=500 Number of Observations in Common: 14967. Number of Observations in WORK.TEMP2 but not in WORK.TEMP3: 3. Total Number of Observations Read from WORK.TEMP2: 14970. Total Number of Observations Read from WORK.TEMP3: 14967. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 14967. NOTE: No unequal values were found. All values compared are exactly equal.
Are you ready for the spotlight? We're accepting content ideas for SAS Innovate 2025 to be held May 6-9 in Orlando, FL. The call is open until September 25. Read more here about why you should contribute and what is in it for you!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.