Suppose there is a panel data set indexed by "i" and "t," respectively.
data temp;
call streaminit(1);
do i=1 to 500;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;Assume that one picks up every first, second, and third observations for all "i" using FIRST and LAG.
data temp1;
set temp;
by i t;
if first.i or lag1(first.i) or lag2(first.i);
run;Can one pick up every last, second last, and third last observations in a similar way? Though LAST is available for all the last observations, the second and third last observations are not easy.
data temp2;
set temp;
by i t;
if last.i;
run;Is PROC SORT plus DESCENDING the only option here?
P.S. Though LAG7 and LAG8 are possible for this case, please assume a general unbalanced panel data set.
POINT= seems to work great for this problem.
But a double DOW seems to be a little faster.
Using extra SET statements to look ahead seems to be slowest.
For 5,000,000 observations:
320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.57 seconds
      cpu time            0.60 seconds
331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.48 seconds
      cpu time            0.48 seconds
343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1.20 seconds
      cpu time            1.17 seconds
Reading ahead works to:
data TEMP2;
  set TEMP nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;
run;Reading 3 observations ahead allows to mark the first 3 of the shifted observations. These match with the last 3 of the original observations .
Random access keeps I/O to a minimum:
proc sort data=sashelp.class out=have; by sex name; run;
data last3;
set have; by sex;
if first.sex then start = _n_;
retain start;
if last.sex then do point = max(start, _n_ - 2) to _n_;
    set have point=point;
    output;
    end;
drop start;
run;EDIT Added retain statement as suggested by @Tom .
POINT= is extremely slow. It is best avoided unless the number of records retrieved is low.
That's true in general @ChrisNZ , but here, the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.
> the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.
@PGStats Fair enough.
And if groups can have fewer than 3 observations, additional logic is required.
Variable START in my code accounts for that possibility.
POINT= seems to work great for this problem.
But a double DOW seems to be a little faster.
Using extra SET statements to look ahead seems to be slowest.
For 5,000,000 observations:
320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.57 seconds
      cpu time            0.60 seconds
331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.48 seconds
      cpu time            0.48 seconds
343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1.20 seconds
      cpu time            1.17 seconds
@Tom For speed, my tests confirm your findings: DOW > read-ahead > POINT=
Having only one BY statement is probably the reason. BY statements really hinder speed.
@Tom A DOW with no BY seems ever so slightly faster
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
  set temp1;
  by I ;
  if first.i then start = _n_;
  retain start;
  if last.i then do point = max(start, _n_ - 2) to _n_;
    set temp1 point=point;
    output;
  end;
  drop start;
run;
%* read ahead   ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;
%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
  set temp1;     
end;        
do _n2=1 to _n;
  set temp1 nobs=NOBS;
  if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
@Tom You use three SET BY, hence the poor speed, since BY adds so much overhead. They are not needed.
@Tom See my first post
Here is my benchmark:
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
  set temp1;
  by I ;
  if first.i then start = _n_;
  retain start;
  if last.i then do point = max(start, _n_ - 2) to _n_;
    set temp1 point=point;
    output;
  end;
  drop start;
run;
%* read ahead   ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;
%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
  set temp1;     
end;        
do _n2=1 to _n;
  set temp1 nobs=NOBS;
  if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
@ChrisNZ But it doesn't get the same output dataset.
data temp1;
call streaminit(1);
do i=1 to 5000;
do t=1 to ceil(i/10);
y=rand("normal");
output;
end;
end;
run;
data TEMP2;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        
  if keep;
run;The COMPARE Procedure                                                                                      
Comparison of WORK.TEMP2 with WORK.TEMP3                                                                   
(Method=EXACT)                                                                                             
                                                                                                           
Observation Summary                                                                                        
                                                                                                           
Observation      Base  Compare  ID                                                                         
                                                                                                           
First Obs           1        .  i=1 t=1                                                                    
First Match         4        1  i=4 t=1                                                                    
Last  Obs       14970    14967  i=5000 t=500                                                               
                                                                                                           
Number of Observations in Common: 14967.                                                                   
Number of Observations in WORK.TEMP2 but not in WORK.TEMP3: 3.                                             
Total Number of Observations Read from WORK.TEMP2: 14970.                                                  
Total Number of Observations Read from WORK.TEMP3: 14967.                                                  
                                                                                                           
Number of Observations with Some Compared Variables Unequal: 0.                                            
Number of Observations with All Compared Variables Equal: 14967.                                           
                                                                                                           
NOTE: No unequal values were found. All values compared are exactly equal.  
					
				
			
			
				
			
			
			
			
			
			
			
		It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.
