BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Junyong
Pyrite | Level 9

Suppose there is a panel data set indexed by "i" and "t," respectively.

data temp;
call streaminit(1);
do i=1 to 500;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

Assume that one picks up every first, second, and third observations for all "i" using FIRST and LAG.

data temp1;
set temp;
by i t;
if first.i or lag1(first.i) or lag2(first.i);
run;

Can one pick up every last, second last, and third last observations in a similar way? Though LAST is available for all the last observations, the second and third last observations are not easy.

data temp2;
set temp;
by i t;
if last.i;
run;

Is PROC SORT plus DESCENDING the only option here?

P.S. Though LAG7 and LAG8 are possible for this case, please assume a general unbalanced panel data set.

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.57 seconds
      cpu time            0.60 seconds


331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.48 seconds
      cpu time            0.48 seconds


343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1.20 seconds
      cpu time            1.17 seconds

 

View solution in original post

22 REPLIES 22
Astounding
PROC Star
Make two passes through the data. For example:

data want;
tot_count=0;
do until (last.id) ;
set have;
by id;
tot_count + 1;
end;
K=0;
do until (last.id) ;
set have;
by id;
k + 1;
if k in (1, 2, 3, tot_count, tot_count-1, tot_count-2) then output;
end;
drop k tot_count;
run;
ChrisNZ
Tourmaline | Level 20

Reading ahead works to:

data TEMP2;
  set TEMP nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;
run;

Reading 3 observations ahead allows to mark the first 3 of the shifted observations. These match with the last 3 of the original observations .

PGStats
Opal | Level 21

Random access keeps I/O to a minimum:

 

proc sort data=sashelp.class out=have; by sex name; run;

data last3;
set have; by sex;
if first.sex then start = _n_;
retain start;
if last.sex then do point = max(start, _n_ - 2) to _n_;
    set have point=point;
    output;
    end;
drop start;
run;

EDIT Added retain statement as suggested by @Tom .

PG
ChrisNZ
Tourmaline | Level 20

@PGStats 

POINT= is extremely slow. It is best avoided unless the number of records retrieved is low.

PGStats
Opal | Level 21

That's true in general @ChrisNZ , but here, the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.

PG
ChrisNZ
Tourmaline | Level 20

> the records accessed at random have just been read sequentially, and there are only three of them per by-group. So, random access should be at least considered.

@PGStats Fair enough.

And if groups can have fewer than 3 observations, additional logic is required.

PGStats
Opal | Level 21

Variable START in my code accounts for that possibility.

PG
Tom
Super User Tom
Super User

POINT= seems to work great for this problem.

But a double DOW seems to be a little faster.

Using extra SET statements to look ahead seems to be slowest.

For 5,000,000 observations:

320   data temp2;
321     set temp1;
322     by I ;
323     if first.i then start = _n_;
324     retain start;
325     if last.i then do point = max(start, _n_ - 2) to _n_;
326       set temp1 point=point;
327       output;
328     end;
329     drop start;
330   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP2 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.57 seconds
      cpu time            0.60 seconds


331
332   data TEMP3;
333   do _n=1 by 1 until(last.i);
334     set temp1;
335     by i;
336   end;
337   do _n2=1 to _n;
338     set temp1;
339     if _n2 >= _n-2 then output;
340   end;
341   drop _n _n2;
342   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP3 has 1500000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           0.48 seconds
      cpu time            0.48 seconds


343
344
345   data TEMP4;
346     set TEMP1 end=eof1 ;
347     by I;
348     if not eof1 then set temp1(firstobs=2 keep=I rename=(i=i2)) end=eof2;
349     by i2 ;
350     if not eof2 then set temp1(firstobs=3 keep=I rename=(i=i3)) end=eof3;
351     by i3 ;
352     if last.i or (last.i2 and i=i2) or (last.i3 and i=i3);
353   run;

NOTE: There were 5000000 observations read from the data set WORK.TEMP1.
NOTE: There were 4999999 observations read from the data set WORK.TEMP1.
NOTE: There were 4999998 observations read from the data set WORK.TEMP1.
NOTE: The data set WORK.TEMP4 has 1500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           1.20 seconds
      cpu time            1.17 seconds

 

ChrisNZ
Tourmaline | Level 20

@Tom  For speed, my tests confirm your findings: DOW  > read-ahead > POINT=

 

 Having only one BY statement is probably the reason. BY statements really hinder speed.

ChrisNZ
Tourmaline | Level 20

@Tom A DOW with no BY seems ever so slightly faster

data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
  set temp1;
  by I ;
  if first.i then start = _n_;
  retain start;
  if last.i then do point = max(start, _n_ - 2) to _n_;
    set temp1 point=point;
    output;
  end;
  drop start;
run;

%* read ahead   ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;

%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
  set temp1;     
end;        
do _n2=1 to _n;
  set temp1 nobs=NOBS;
  if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;

 

ChrisNZ
Tourmaline | Level 20

@Tom You use three SET BY, hence the poor speed, since BY adds so much overhead. They are not needed.

Tom
Super User Tom
Super User
How can you find the 3rd and 2nd to last without the BY statements?
ChrisNZ
Tourmaline | Level 20

@Tom See my first post

 

Here is my benchmark:

data temp1;
call streaminit(1);
do i=1 to 5e5;
do t=1 to 10;
y=rand("normal");
output;
end;
end;
run;

%* POINT=   ==> 6.99 real/6.3 CPU seconds;
data temp2;
  set temp1;
  by I ;
  if first.i then start = _n_;
  retain start;
  if last.i then do point = max(start, _n_ - 2) to _n_;
    set temp1 point=point;
    output;
  end;
  drop start;
run;

%* read ahead   ==> 4.74 real/3.82 CPU seconds;
data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        if keep;
run;

%* DOW + BY   ==> 3.12 real/2.62 CPU seconds;
data TEMP4;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

%* DOW no BY   ==> 3.09 real/2.45 CPU seconds;
data TEMP5;
do _n=1 by 1 until(_N >1 and I ne lag(I));
  set temp1;     
end;        
do _n2=1 to _n;
  set temp1 nobs=NOBS;
  if _N > _n2 >= _n-3 | _N2>=NOBS-3 then output;
end;
drop _n _n2;
run;

 

Tom
Super User Tom
Super User

@ChrisNZ  But it doesn't get the same output dataset.

data temp1;
call streaminit(1);
do i=1 to 5000;
do t=1 to ceil(i/10);
y=rand("normal");
output;
end;
end;
run;

data TEMP2;
do _n=1 by 1 until(last.i);
  set temp1;
  by i;
end;
do _n2=1 to _n;
  set temp1;
  if _n2 >= _n-2 then output;
end;
drop _n _n2;
run;

data TEMP3;
  set TEMP1 nobs=NOBS;
  by I T; 
  if ^LASTOBS then set TEMP1(firstobs=4 keep=I rename=(I=I1)) end=LASTOBS;
  by I1 ;
  if _N_ > 3 then if first.I1 or lag1(first.I1) or lag2(first.I1) then KEEP=1;
  if _N_ > NOBS-3  then KEEP=1;                        
  if keep;
run;
The COMPARE Procedure                                                                                      
Comparison of WORK.TEMP2 with WORK.TEMP3                                                                   
(Method=EXACT)                                                                                             
                                                                                                           
Observation Summary                                                                                        
                                                                                                           
Observation      Base  Compare  ID                                                                         
                                                                                                           
First Obs           1        .  i=1 t=1                                                                    
First Match         4        1  i=4 t=1                                                                    
Last  Obs       14970    14967  i=5000 t=500                                                               
                                                                                                           
Number of Observations in Common: 14967.                                                                   
Number of Observations in WORK.TEMP2 but not in WORK.TEMP3: 3.                                             
Total Number of Observations Read from WORK.TEMP2: 14970.                                                  
Total Number of Observations Read from WORK.TEMP3: 14967.                                                  
                                                                                                           
Number of Observations with Some Compared Variables Unequal: 0.                                            
Number of Observations with All Compared Variables Equal: 14967.                                           
                                                                                                           
NOTE: No unequal values were found. All values compared are exactly equal.  

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 22 replies
  • 10489 views
  • 6 likes
  • 8 in conversation