Solved: Does the start= option inside of a data step actually exist?

Nietzsche · Posted 03-19-2023 05:18 AM

Hi, I was just reading about the end= option in the SET statement of the DATA step in this thread.

I don't think the OP of that thread ever made another post about whether a start= option exists.

When I type start= or begin= in the SET statement, no pop-up with links to the documentation appears, so I assume a start= option does not actually exist in the SET statement of the DATA step?

SAS Base Programming (2022 Dec), Preparing for SAS Advanced Programming (Cancelled).

PaigeMiller · Posted 03-19-2023 05:57 AM

@Nietzsche wrote:

Hi, I was just reading about the end= option in the SET statement of the DATA step in this thread.

I don't think the OP of that thread ever made another post about whether a start= option exists.

When I type start= or begin= in the SET statement, no pop-up with links to the documentation appears, so I assume a start= option does not actually exist in the SET statement of the DATA step?

What would a START= option do? Would it signal the beginning of a data step (similar to END= signalling the end of a data step)? If that's what you want, you can use

if _n_=1 then do;
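
A minimal sketch of that idea, assuming a simple step with a single SET statement and no explicit loops, so that the first iteration coincides with the first observation read. The data set names and the is_first / is_last variables are made up for illustration:

```sas
/* Hypothetical input, for illustration only */
data a;
  do obs = 1 to 5;
    output;
  end;
run;

data b;
  set a end=eof;
  /* _N_ and the END= variable are not written to the output data set,
     so copy them into ordinary variables if the flags should be kept */
  is_first = (_n_ = 1);
  is_last  = eof;
  if _n_ = 1 then do;
    /* work that should happen once, at the start of the step */
    put "Starting: first obs read is " obs=;
  end;
  if eof then put "Finishing: last obs read is " obs=;
run;
```

Here the test on _N_ plays the role a START= option would have played, while END= covers the other end of the input.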

--
Paige Miller

ErikLund_Jensen · Posted 03-19-2023 06:08 AM

data a;
  do obs = 1 to 5;
    output;
  end;
run;

data b;
  set a end=eof;
  if _N_ = 1 then firstobs = 1;
  if eof then lastobs = 1;
run;

data c;
  set a (firstobs=2 obs=4) end=eof;
  if _N_ = 1 then firstobs = 1;
  if eof then lastobs = 1;
run;

ErikLund_Jensen · Posted 03-19-2023 06:20 AM

Hi @Nietzsche

Sorry, my text seems to have disappeared.

There is no start= option. The end=xxx option sets the variable xxx to 1 (true) when the current observation is the last one read into the program data vector. The automatic variable _N_ holds the number of the current observation read into the program data vector, so the condition _N_ = 1 is true for the first observation read. Neither the variable created by the end= option nor the automatic variable _N_ is written to the output data set.

Note that the data set options firstobs= and obs= control which observations are read into the program data vector and are applied first, so end= and _N_ work on the resulting subset. Try the code in the previous post and see what happens.

yabwon · Posted 03-19-2023 04:41 PM

@ErikLund_Jensen, if I may, let me continue your thread and add something more.

The firstobs= and obs= options work before end= and _N_, but we have to be careful when combining them with a WHERE statement:

data have;
  do x = 1 to 3;
    output;
  end;
run;


data want;
  set have(firstobs=2);
  where x > 1;
run;

In this case the WHERE statement first removes x=1 from the input data set, and then firstobs=2 removes x=2 from what is left after filtering, so only x=3 is read.

And a note about "start=": testing "_N_=1" is one approach, but when we are reading several data sets with a single SET statement, for example, we can use the CUROBS= option to find out which observation we are reading into the PDV, e.g.

data A B C;
  do x = 1 to 3;
    output;
  end;
run;

data ABC;
 set A B C curobs=curobs;

 if curobs=1 then output;
run;

So "curobs=1" tests whether we are reading the first observation from a given data set (provided, of course, that the data set has a first observation at all, which is not always the case).

Bart

_______________
Polish SAS Users Group: www.polsug.com and communities.sas.com/polsug

"SAS Packages: the way to share" at SGF2020 Proceedings (the latest version), GitHub Repository, and YouTube Video.
Hands-on-Workshop: "Share your code with SAS Packages"
"My First SAS Package: A How-To" at SGF2021 Proceedings

SAS Ballot Ideas: one: SPF in SAS, two, and three
SAS Documentation

Tom · Posted 03-19-2023 05:53 PM

Let me correct your language. _N_ does not count observations. It counts iterations of the data step. The confusion arises because in the normal simple data step:

data new;
  set old;
run;

they amount to the same thing.

But once you get more complicated, say by using a DOW loop, they diverge. For example, in this data step the value of _N_ can be seen as a count of the number of ID values seen.

data want;
  do until(last.id);
     set old;
     by id;
     total=sum(total,amount);
  end;
  keep id total;
run;
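
To make the divergence concrete, here is a small self-contained sketch. The input data set old, with its id and amount values, is invented for illustration, as are the helper variables nobs and iteration; it is one way to compare the two counts, not the only one.

```sas
/* Hypothetical input: three IDs with a varying number of rows each,
   already sorted by id as the BY statement requires */
data old;
  input id amount;
  datalines;
1 10
1 20
2 5
3 1
3 2
3 3
;
run;

data want;
  /* nobs counts the observations read within the current BY group */
  do nobs = 1 by 1 until (last.id);
    set old;
    by id;
    total = sum(total, amount);
  end;
  /* copy _N_ into an ordinary variable, since _N_ itself is never written */
  iteration = _N_;
  keep id total nobs iteration;
run;
```

With this input, WANT should contain three rows, one per ID, with iteration running 1, 2, 3 while nobs is 2, 1, 3. Six observations were read, but the step produced output on only three iterations.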

But even in the simple data step you can see that the value of _N_ is different from "the number of observations read in". The most obvious case is when it increments beyond the number of observations in the source data set, since such a data step ends at the SET statement and not at the RUN statement.

data want;
  put _n_= eof= ;
  set sashelp.class(obs=2) end=eof;
run;

_N_=1 eof=0
_N_=2 eof=0
_N_=3 eof=1
NOTE: There were 2 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.WANT has 2 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

yabwon · Posted 03-20-2023 06:08 AM

I would go even one step further.

"_N_ does not count observations. It counts iterations of the data step." In fact, _N_ does not count iterations either; it is a placeholder for the value of an internal iteration counter. Indeed, you can modify it, and at the beginning of the next iteration it is automatically reset to the current iteration number:

data have;
  do x = "A", "B", "C";
    output;
  end;
run;

data _null_;
  put "1)" _all_;
  set have;
  put "2)" _all_;

  do _N_ = 1 to 5;
    put _N_= @;
  end;
  put;

  put "3)" _all_;
  put;
run;

Log:

1    data have;
2      do x = "A", "B", "C";
3        output;
4      end;
5    run;

NOTE: The data set WORK.HAVE has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds


6
7    data _null_;
8      put "1)" _all_;
9      set have;
10     put "2)" _all_;
11
12     do _N_ = 1 to 5;
13       put _N_= @;
14     end;
15     put;
16
17     put "3)" _all_;
18     put;
19   run;

1)x=  _ERROR_=0 _N_=1
2)x=A _ERROR_=0 _N_=1
_N_=1 _N_=2 _N_=3 _N_=4 _N_=5
3)x=A _ERROR_=0 _N_=6

1)x=A _ERROR_=0 _N_=2
2)x=B _ERROR_=0 _N_=2
_N_=1 _N_=2 _N_=3 _N_=4 _N_=5
3)x=B _ERROR_=0 _N_=6

1)x=B _ERROR_=0 _N_=3
2)x=C _ERROR_=0 _N_=3
_N_=1 _N_=2 _N_=3 _N_=4 _N_=5
3)x=C _ERROR_=0 _N_=6

1)x=C _ERROR_=0 _N_=4
NOTE: There were 3 observations read from the data set WORK.HAVE.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

A very good read about looping is the article "The Magnificent DO" by Paul Dorfman ( @hashman ): https://support.sas.com/resources/papers/proceedings13/126-2013.pdf

Bart

hashman · Posted 03-20-2023 10:22 AM

Hi Bartek,

You're exactly right. Methinks the necessary and sufficient definition could be this:

Regardless of its current value, _N_ is assigned the next consecutive natural number every time program control is passed to the top of the DATA step (by the action of the implied loop).

Thus, since at first program control is at the top of the implied loop, 1 is moved to _N_. The next time program control is passed to the top of the implied loop, 2 is moved to _N_, and so forth. Hence, as you have indicated, the program can assign any numeric value to _N_ between two consecutive returns of program control to the top of the DATA step, yet it has no effect on the new value moved to _N_ at the top of the DATA step from the independent internal counter.

Perhaps one could say that an internal equivalent of the statement:

_N_ = monotonic() ;

is executed at the top of the implied loop.

Thanks for the plug 😉.

Regards,

Paul D.
