Hi, I was just reading about the end= option in the SET statement of the DATA step in this thread.
I don't think the OP of that thread ever made another one asking whether a start= option exists.
When I type start= or begin= in the SET statement, there is no pop-up with links to the doc, so I assume a start= option does not actually exist in the SET statement of the DATA step?
Hi @Nietzsche
Sorry, my text seems to have disappeared.
There is no start= option. The end=xxx option sets the variable xxx to 1 (true) when the current observation is the last observation read into the program data vector. The automatic variable _N_ holds the number of the current observation read into the program data vector, so the condition _N_ = 1 is true for the first observation read. Neither the variable created by the end= option nor the automatic variable _N_ is written to the output data set.
Note that the data set options firstobs= and obs= control which observations are read into the program data vector and are applied first, so end= and _N_ work on the resulting subset. Try the code in the previous post and see what happens.
What would a START= option do? Would it signal the beginning of a data step (similar to END= signalling the end of a data step)? If that's what you want, you can use
if _n_=1 then do;
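For what it's worth, here is a minimal sketch of that pattern side by side with END=. It assumes SASHELP.CLASS as input; total_height and eof are just names made up for the illustration. Keep in mind that if the input data set is empty, the step stops at the SET statement, so nothing placed after SET runs, not even the _N_=1 block. Put the start-up code before SET if it has to run regardless.

```sas
/* Sketch: start-up code on the first iteration, wrap-up code on the last. */
/* total_height and eof are hypothetical names chosen for this example.    */
data _null_;
  set sashelp.class end=eof;     /* eof becomes 1 when the last observation is read */

  if _n_ = 1 then do;            /* first iteration of the implied DATA step loop */
    put "Start of data";
  end;

  total_height + height;         /* sum statement: accumulator is retained across iterations */

  if eof then do;                /* last observation has just been read */
    put "End of data: " total_height=;
  end;
run;
```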
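/* Test data: 5 observations */
data a;
  do obs = 1 to 5;
    output;
  end;
run;
/* Read all of A: the flags mark observations 1 and 5 */
data b;
  set a end=eof;
  if _N_ = 1 then firstobs = 1;
  if eof then lastobs = 1;
run;
/* Read only observations 2 to 4: the flags now mark observations 2 and 4 */
data c;
  set a (firstobs=2 obs=4) end=eof;
  if _N_ = 1 then firstobs = 1;
  if eof then lastobs = 1;
run;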
@ErikLund_Jensen, if I may, let me continue your thread and add something more.
The firstobs= and obs= options work before end= and _N_, but we have to be careful when we use them in combination with the WHERE statement:
data have;
  do x = 1 to 3;
    output;
  end;
run;
data want;
  set have(firstobs=2);
  where x > 1;
run;
In this case the WHERE statement first removes the observation with x = 1 from the input data set, and then firstobs=2 skips the first of the observations that remain after filtering (x = 2), so only x = 3 is left.
And a note about "start=": one approach is to test "_N_=1", but when, for example, we are reading several data sets with a single SET statement, we can use the CUROBS= option to find out which observation we are reading into the PDV, e.g.
data A B C;
  do x = 1 to 3;
    output;
  end;
run;
data ABC;
  set A B C curobs=curobs;
  if curobs=1 then output;
run;
So "curobs=1" tests whether we are reading the first observation from a given data set (provided, of course, that the data set has a first observation, which is not always the case).
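To see how this differs from _N_, a small sketch along the same lines may help. It prints both counters for every observation; INDSNAME= is added only to show which data set each observation came from, and the variable names curobs and source are arbitrary choices for this illustration. In the log, _N_ should run from 1 to 9 across the whole step, while curobs restarts at 1 for each of A, B and C.

```sas
/* Three identical data sets with 3 observations each (same as above) */
data A B C;
  do x = 1 to 3;
    output;
  end;
run;

data _null_;
  /* curobs=   : observation number within the data set being read */
  /* indsname= : name of the data set the observation came from    */
  set A B C curobs=curobs indsname=source;
  put _n_= curobs= source=;
run;
```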
Bart
Let me correct your language. _N_ does not count observations. It counts iterations of the data step. The confusion arises because in the normal simple data step:
data new;
set old;
run;
they amount to the same thing.
But once you get more complicated, say by using a DOW loop, they diverge. For example, in this data step the value of _N_ can be seen as a count of the number of ID values seen.
data want;
  do until(last.id);
    set old;
    by id;
    total=sum(total,amount);
  end;
  keep id total;
run;
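A self-contained sketch of the same idea, in case someone wants to run it: the contents of OLD below are made-up test values, and nobs and iteration are hypothetical helper variables added only to make the two counts visible. With this input, the group id=1 should come out with total=30, nobs=2 and iteration=1, and the group id=2 with total=45, nobs=3 and iteration=2, so _N_ follows the BY groups while five observations were read.

```sas
/* Made-up test data: two BY groups, five observations, sorted by id */
data old;
  input id amount;
  datalines;
1 10
1 20
2 5
2 15
2 25
;
run;

data want;
  /* nobs counts the observations read within the current BY group */
  do nobs = 1 by 1 until (last.id);
    set old;
    by id;
    total = sum(total, amount);
  end;
  iteration = _n_;   /* copy _N_ so it can be kept: one iteration per id value */
  keep id total nobs iteration;
run;

proc print data=want noobs;
run;
```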
But even in the simple data step you can see that the value of _N_ is different from "the number of observations read in". The most obvious case is that it increments beyond the number of observations in the source data set, since such a data step ends at the SET statement and not at the RUN statement.
2327 data want;
2328 put _n_= eof= ;
2329 set sashelp.class(obs=2) end=eof;
2330 run;
_N_=1 eof=0
_N_=2 eof=0
_N_=3 eof=1
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.WANT has 2 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
I would go even one step further.
"_N_ does not count observations. It counts iterations of the data step." In fact, _N_ does not itself count iterations; it is a placeholder for the value of an internal iteration counter. Indeed, you can modify it, and at the beginning of the next iteration it is automatically updated with the current iteration number:
data have;
  do x = "A", "B", "C";
    output;
  end;
run;
data _null_;
  put "1)" _all_;
  set have;
  put "2)" _all_;
  do _N_ = 1 to 5;
    put _N_= @;
  end;
  put;
  put "3)" _all_;
  put;
run;
Log:
1 data have;
2 do x = "A", "B", "C";
3 output;
4 end;
5 run;
NOTE: The data set WORK.HAVE has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
6
7 data _null_;
8 put "1)" _all_;
9 set have;
10 put "2)" _all_;
11
12 do _N_ = 1 to 5;
13 put _N_= @;
14 end;
15 put;
16
17 put "3)" _all_;
18 put;
19 run;
1)x= _ERROR_=0 _N_=1
2)x=A _ERROR_=0 _N_=1
_N_=1 _N_=2 _N_=3 _N_=4 _N_=5
3)x=A _ERROR_=0 _N_=6
1)x=A _ERROR_=0 _N_=2
2)x=B _ERROR_=0 _N_=2
_N_=1 _N_=2 _N_=3 _N_=4 _N_=5
3)x=B _ERROR_=0 _N_=6
1)x=B _ERROR_=0 _N_=3
2)x=C _ERROR_=0 _N_=3
_N_=1 _N_=2 _N_=3 _N_=4 _N_=5
3)x=C _ERROR_=0 _N_=6
1)x=C _ERROR_=0 _N_=4
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
A very good read about looping is the article "The Magnificent DO" by Paul Dorfman ( @hashman ); the link is here: https://support.sas.com/resources/papers/proceedings13/126-2013.pdf
Bart
Hi Bart,
You're exactly right. Methinks the necessary and sufficient definition could be this:
Regardless of its current value, _N_ is assigned the next consecutive natural number every time program control is passed to the top of the DATA step (by the action of the implied loop).
Thus, since at first program control is at the top of the implied loop, 1 is moved to _N_. The next time program control is passed to the top of the implied loop, 2 is moved to _N_, and so forth. Hence, as you have indicated, the program can assign any numeric value to _N_ between two consecutive returns of program control to the top of the DATA step, yet it has no effect on the new value moved to _N_ at the top of the DATA step from the independent internal counter.
Perhaps one could say that an internal equivalent of the statement:
_N_ = monotonic() ;
is executed at the top of the implied loop.
Thanks for the plug 😉.
Regards,
Paul D.