Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- Programming
- /
- Re: Picking up Every Second Last Observation

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

🔒 This topic is **solved** and **locked**.
Need further help from the community? Please
sign in and ask a **new** question.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 04-07-2019 07:39 PM
(10449 views)

Suppose there is a panel data set indexed by "i" and "t," respectively.

```
data temp;
call streaminit(1);
do i=1 to 500;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
```

Assume that one picks up every first, second, and third observations for all "i" using FIRST and LAG.

```
data temp1;
set temp;
by i t;
if first.i or lag1(first.i) or lag2(first.i);
run;
```

Can one pick up every last, second last, and third last observations in a similar way? Though LAST is available for all the last observations, the second and third last observations are not easy.

```
data temp2;
set temp;
by i t;
if last.i;
run;
```

Is PROC SORT plus DESCENDING the only option here?

P.S. Though LAG7 and LAG8 are possible for this case, please assume a general unbalanced panel data set.

1 ACCEPTED SOLUTION

Accepted Solutions

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

320 data temp2; 321 set temp1; 322 by I ; 323 if first.i then start = _n_; 324 retain start; 325 if last.i then do point = max(start, _n_ - 2) to _n_; 326 set temp1 point=point; 327 output; 328 end; 329 drop start; 330 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.57 seconds cpu time 0.60 seconds 331 332 data TEMP3; 333 do _n=1 by 1 until(last.i); 334 set temp1; 335 by i; 336 end; 337 do _n2=1 to _n; 338 set temp1; 339 if _n2 >= _n-2 then output; 340 end; 341 drop _n _n2; 342 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.48 seconds cpu time 0.48 seconds 343 344 345 data TEMP4; 346 set TEMP1 end=eof1 ; 347 by I; 348 if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2; 349 by i2 ; 350 if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3; 351 by i3 ; 352 if last.i or (last.i2 and i=i2) or (last.i3 and i=i3); 353 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 4999999 observations read from the data set WORK.TEMP1. NOTE: There were 4999998 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1.20 seconds cpu time 1.17 seconds

22 REPLIES 22

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Make two passes through the data. For example:

data want;

tot_count=0;

do until (last.id) ;

set have;

by id;

tot_count + 1;

end;

K=0;

do until (last.id) ;

set have;

by id;

k + 1;

if k in (1, 2, 3, tot_count, tot_count-1, tot_count-2) then output;

end;

drop k tot_count;

run;

data want;

tot_count=0;

do until (last.id) ;

set have;

by id;

tot_count + 1;

end;

K=0;

do until (last.id) ;

set have;

by id;

k + 1;

if k in (1, 2, 3, tot_count, tot_count-1, tot_count-2) then output;

end;

drop k tot_count;

run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Reading ahead works to:

```
data TEMP2;
set TEMP nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1;
run;
```

Reading 3 observations ahead allows to mark the first 3 of the shifted observations. These match with the last 3 of the original observations .

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Random access keeps I/O to a minimum:

```
proc sort data=sashelp.class out=have; by sex name; run;
data last3;
set have; by sex;
if first.sex then start = _n_;
retain start;
if last.sex then do point = max(start, _n_ - 2) to _n_;
set have point=point;
output;
end;
drop start;
run;
```

EDIT Added retain statement as suggested by @Tom .

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

POINT= is extremely slow. It is best avoided unless the number of records retrieved is low.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

That's true in general @ChrisNZ , but here, the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

*> the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.*

@PGStats Fair enough.

And if groups can have fewer than 3 observations, additional logic is required.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Variable START in my code accounts for that possibility.

PG

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

320 data temp2; 321 set temp1; 322 by I ; 323 if first.i then start = _n_; 324 retain start; 325 if last.i then do point = max(start, _n_ - 2) to _n_; 326 set temp1 point=point; 327 output; 328 end; 329 drop start; 330 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.57 seconds cpu time 0.60 seconds 331 332 data TEMP3; 333 do _n=1 by 1 until(last.i); 334 set temp1; 335 by i; 336 end; 337 do _n2=1 to _n; 338 set temp1; 339 if _n2 >= _n-2 then output; 340 end; 341 drop _n _n2; 342 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables. NOTE: DATA statement used (Total process time): real time 0.48 seconds cpu time 0.48 seconds 343 344 345 data TEMP4; 346 set TEMP1 end=eof1 ; 347 by I; 348 if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2; 349 by i2 ; 350 if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3; 351 by i3 ; 352 if last.i or (last.i2 and i=i2) or (last.i3 and i=i3); 353 run; NOTE: There were 5000000 observations read from the data set WORK.TEMP1. NOTE: There were 4999999 observations read from the data set WORK.TEMP1. NOTE: There were 4999998 observations read from the data set WORK.TEMP1. NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 1.20 seconds cpu time 1.17 seconds

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Tom For speed, my tests confirm your findings: DOW > read-ahead > POINT=

Having only one BY statement is probably the reason. BY statements really hinder speed.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Tom A DOW with no BY seems ever so slightly faster

```
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT= ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;
%* read ahead ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1; if keep;
run;
%* DOW + BY ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

How can you find the 3rd and 2nd to last without the BY statements?

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@Tom See my first post

Here is my benchmark:

```
data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;
%* POINT= ==> 6.99 real/6.3 CPU seconds;
data temp2;
set temp1;
by I ;
if first.i then start = _n_;
retain start;
if last.i then do point = max(start, _n_ - 2) to _n_;
set temp1 point=point;
output;
end;
drop start;
run;
%* read ahead ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1; if keep;
run;
%* DOW + BY ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
%* DOW no BY ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
set temp1;
end;
do _n2=1 to _n;
set temp1 nobs=NOBS;
if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;
```

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

@ChrisNZ But it doesn't get the same output dataset.

```
data temp1;
call streaminit(1);
do i=1 to 5000;
do t=1 to ceil(i/10);
y=rand("normal");
output;
end;
end;
run;
data TEMP2;
do _n=1 by 1 until(last.i);
set temp1;
by i;
end;
do _n2=1 to _n;
set temp1;
if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;
data TEMP3;
set TEMP1 nobs=NOBS;
by I T;
if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
by I1 ;
if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
if _N_ > NOBS-3 then KEEP=1;
if keep;
run;
```

The COMPARE Procedure Comparison of WORK.TEMP2 with WORK.TEMP3 (Method=EXACT) Observation Summary Observation Base Compare ID First Obs 1 . i=1 t=1 First Match 4 1 i=4 t=1 Last Obs 14970 14967 i=5000 t=500 Number of Observations in Common: 14967. Number of Observations in WORK.TEMP2 but not in WORK.TEMP3: 3. Total Number of Observations Read from WORK.TEMP2: 14970. Total Number of Observations Read from WORK.TEMP3: 14967. Number of Observations with Some Compared Variables Unequal: 0. Number of Observations with All Compared Variables Equal: 14967. NOTE: No unequal values were found. All values compared are exactly equal.

**Available on demand!**

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.