Solved: Re: Picking up Every Second Last Observation

Junyong · Posted 04-07-2019 07:39 PM

Suppose there is a panel data set indexed by "i" and "t," respectively.

data temp;
call streaminit(1);
do i=1 to 500;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

Assume that one picks up every first, second, and third observations for all "i" using FIRST and LAG.

data temp1;
set temp;
by i t;
if first.i or lag1(first.i) or lag2(first.i);
run;

Can one pick up every last, second last, and third last observations in a similar way? Though LAST is available for all the last observations, the second and third last observations are not easy.

data temp2;
set temp;
by i t;
if last.i;
run;

Is PROC SORT plus DESCENDING the only option here?

P.S. Though LAG7 and LAG8 are possible for this case, please assume a general unbalanced panel data set.

Tom · Posted 04-08-2019 12:07 AM

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.57 seconds
      cpu time            0.60 seconds


331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.48 seconds
      cpu time            0.48 seconds


343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1.20 seconds
      cpu time            1.17 seconds

View solution in original post

Astounding · Posted 04-07-2019 10:07 PM

Make two passes through the data. For example:

data want;
tot_count=0;
do until (last.id) ;
set have;
by id;
tot_count + 1;
end;
K=0;
do until (last.id) ;
set have;
by id;
k + 1;
if k in (1, 2, 3, tot_count, tot_count-1, tot_count-2) then output;
end;
drop k tot_count;
run;

ChrisNZ · Posted 04-07-2019 11:00 PM

Reading ahead works to:

data TEMP2;
  set TEMP nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;
run;

Reading 3 observations ahead allows to mark the first 3 of the shifted observations. These match with the last 3 of the original observations .

High-Performance SAS Coding - Third Edition

PGStats · Posted 04-07-2019 11:20 PM

Random access keeps I/O to a minimum:

proc sort data=sashelp.class out=have; by sex name; run;

data last3;
set have; by sex;
if first.sex then start = _n_;
retain start;
if last.sex then do point = max(start, _n_ - 2) to _n_;
    set have point=point;
    output;
    end;
drop start;
run;

EDIT Added retain statement as suggested by @Tom .

PG

ChrisNZ · Posted 04-07-2019 11:41 PM

@PGStats

POINT= is extremely slow. It is best avoided unless the number of records retrieved is low.

High-Performance SAS Coding - Third Edition

PGStats · Posted 04-07-2019 11:52 PM

That's true in general @ChrisNZ , but here, the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.

PG

ChrisNZ · Posted 04-07-2019 11:59 PM

> the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.

@PGStats Fair enough.

And if groups can have fewer than 3 observations, additional logic is required.

High-Performance SAS Coding - Third Edition

PGStats · Posted 04-08-2019 12:02 AM

Variable START in my code accounts for that possibility.

PG

Tom · Posted 04-08-2019 12:07 AM

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.57 seconds
      cpu time            0.60 seconds


331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.48 seconds
      cpu time            0.48 seconds


343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1.20 seconds
      cpu time            1.17 seconds

ChrisNZ · Posted 04-08-2019 12:32 AM

@Tom For speed, my tests confirm your findings: DOW > read-ahead > POINT=

Having only one BY statement is probably the reason. BY statements really hinder speed.

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-08-2019 12:52 AM

@Tom A DOW with no BY seems ever so slightly faster

data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
  set temp1;
  by I ;
  if first.i then start = _n_;
  retain start;
  if last.i then do point = max(start, _n_ - 2) to _n_;
    set temp1 point=point;
    output;
  end;
  drop start;
run;

%* read ahead   ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;

%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
  set temp1;     
end;        
do _n2=1 to _n;
  set temp1 nobs=NOBS;
  if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;

High-Performance SAS Coding - Third Edition

ChrisNZ · Posted 04-08-2019 05:36 PM

@Tom You use three SET BY, hence the poor speed, since BY adds so much overhead. They are not needed.

High-Performance SAS Coding - Third Edition

Tom · Posted 04-08-2019 05:39 PM

How can you find the 3rd and 2nd to last without the BY statements?

ChrisNZ · Posted 04-08-2019 05:58 PM

@Tom See my first post

Here is my benchmark:

data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
  set temp1;
  by I ;
  if first.i then start = _n_;
  retain start;
  if last.i then do point = max(start, _n_ - 2) to _n_;
    set temp1 point=point;
    output;
  end;
  drop start;
run;

%* read ahead   ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;

%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
  set temp1;     
end;        
do _n2=1 to _n;
  set temp1 nobs=NOBS;
  if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;

High-Performance SAS Coding - Third Edition

Tom · Posted 04-08-2019 08:02 PM

@ChrisNZ But it doesn't get the same output dataset.

data temp1;
call streaminit(1);
do i=1 to 5000;
do t=1 to ceil(i/10);
y=rand("normal");
output;
end;
end;
run;

data TEMP2;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        
  if keep;
run;

The COMPARE Procedure                                                                                      
Comparison of WORK.TEMP2 with WORK.TEMP3                                                                   
(Method=EXACT)                                                                                             
                                                                                                           
Observation Summary                                                                                        
                                                                                                           
Observation      Base  Compare  ID                                                                         
                                                                                                           
First Obs           1        .  i=1 t=1                                                                    
First Match         4        1  i=4 t=1                                                                    
Last  Obs       14970    14967  i=5000 t=500                                                               
                                                                                                           
Number of Observations in Common: 14967.                                                                   
Number of Observations in WORK.TEMP2 but not in WORK.TEMP3: 3.                                             
Total Number of Observations Read from WORK.TEMP2: 14970.                                                  
Total Number of Observations Read from WORK.TEMP3: 14967.                                                  
                                                                                                           
Number of Observations with Some Compared Variables Unequal: 0.                                            
Number of Observations with All Compared Variables Equal: 14967.                                           
                                                                                                           
NOTE: No unequal values were found. All values compared are exactly equal.

Classroom Training Available!