Pyrite | Level 9

## Picking up Every Second Last Observation

Suppose there is a panel data set indexed by "i" and "t," respectively.

``````data temp;
call streaminit(1);
do i=1 to 500;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;``````

Assume that one picks up every first, second, and third observations for all "i" using FIRST and LAG.

``````data temp1;
set temp;
by i t;
if first.i or lag1(first.i) or lag2(first.i);
run;``````

Can one pick up every last, second last, and third last observations in a similar way? Though LAST is available for all the last observations, the second and third last observations are not easy.

``````data temp2;
set temp;
by i t;
if last.i;
run;``````

Is PROC SORT plus DESCENDING the only option here?

P.S. Though LAG7 and LAG8 are possible for this case, please assume a general unbalanced panel data set.

1 ACCEPTED SOLUTION

Accepted Solutions
Super User

## Re: Picking up Every Second Last Observation

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

```320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time           0.57 seconds
cpu time            0.60 seconds

331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time           0.48 seconds
cpu time            0.48 seconds

343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time           1.20 seconds
cpu time            1.17 seconds
```

22 REPLIES 22
PROC Star

## Re: Picking up Every Second Last Observation

Make two passes through the data. For example:

data want;
tot_count=0;
do until (last.id) ;
set have;
by id;
tot_count + 1;
end;
K=0;
do until (last.id) ;
set have;
by id;
k + 1;
if k in (1, 2, 3, tot_count, tot_count-1, tot_count-2) then output;
end;
drop k tot_count;
run;
Tourmaline | Level 20

## Re: Picking up Every Second Last Observation

``````data TEMP2;
set TEMP nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3  then KEEP=1;
run;``````

Reading 3 observations ahead allows to mark the first 3 of the shifted observations. These match with the last 3 of the original observations .

Opal | Level 21

## Re: Picking up Every Second Last Observation

Random access keeps I/O to a minimum:

``````proc sort data=sashelp.class out=have; by sex name; run;

data last3;
set have; by sex;
if first.sex then start = _n_;
retain start;
if last.sex then do point = max(start, _n_ - 2) to _n_;
set have point=point;
output;
end;
drop start;
run;``````

EDIT Added retain statement as suggested by @Tom .

PG
Tourmaline | Level 20

## Re: Picking up Every Second Last Observation

POINT= is extremely slow. It is best avoided unless the number of records retrieved is low.

Opal | Level 21

## Re: Picking up Every Second Last Observation

That's true in general @ChrisNZ , but here, the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.

PG
Tourmaline | Level 20

## Re: Picking up Every Second Last Observation

> the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.

@PGStats Fair enough.

And if groups can have fewer than 3 observations, additional logic is required.

Opal | Level 21

## Re: Picking up Every Second Last Observation

Variable START in my code accounts for that possibility.

PG
Super User

## Re: Picking up Every Second Last Observation

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

```320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time           0.57 seconds
cpu time            0.60 seconds

331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time           0.48 seconds
cpu time            0.48 seconds

343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time           1.20 seconds
cpu time            1.17 seconds
```

Tourmaline | Level 20

## Re: Picking up Every Second Last Observation

Having only one BY statement is probably the reason. BY statements really hinder speed.

Tourmaline | Level 20

## Re: Picking up Every Second Last Observation

@Tom A DOW with no BY seems ever so slightly faster

``````data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;

data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;

%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
``````

Tourmaline | Level 20

## Re: Picking up Every Second Last Observation

@Tom You use three SET BY, hence the poor speed, since BY adds so much overhead. They are not needed.

Super User

## Re: Picking up Every Second Last Observation

How can you find the 3rd and 2nd to last without the BY statements?
Tourmaline | Level 20

## Re: Picking up Every Second Last Observation

@Tom See my first post

Here is my benchmark:

``````data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;

data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;

%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
``````

Super User

## Re: Picking up Every Second Last Observation

@ChrisNZ  But it doesn't get the same output dataset.

``````data temp1;
call streaminit(1);
do i=1 to 5000;
do t=1 to ceil(i/10);
y=rand("normal");
output;
end;
end;
run;

data TEMP2;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3  then KEEP=1;
if keep;
run;``````
```The COMPARE Procedure
Comparison of WORK.TEMP2 with WORK.TEMP3
(Method=EXACT)

Observation Summary

Observation      Base  Compare  ID

First Obs           1        .  i=1 t=1
First Match         4        1  i=4 t=1
Last  Obs       14970    14967  i=5000 t=500

Number of Observations in Common: 14967.
Number of Observations in WORK.TEMP2 but not in WORK.TEMP3: 3.
Total Number of Observations Read from WORK.TEMP2: 14970.
Total Number of Observations Read from WORK.TEMP3: 14967.

Number of Observations with Some Compared Variables Unequal: 0.
Number of Observations with All Compared Variables Equal: 14967.

NOTE: No unequal values were found. All values compared are exactly equal.  ```
Discussion stats
• 22 replies
• 10439 views
• 6 likes
• 8 in conversation