weg
Obsidian | Level 7

Hello, I am a beginning SAS programmer trying to wrap my head around the idea behind putting a SET statement inside of a DO loop. My general understanding of `set lib.table` is that it reads the _N_'th observation from the given table, where _N_ is the PDV's internal counter.  Inside of a DO loop I was surprised to find that it looks like it has its own _N_ that starts back from 1, i.e.:

 if _N_ = 1 then do i=10 to 20; set lib.table; end;

will give you the first 11 observations from lib.table regardless of the index of i, or the current _N_ of the data step.  That is, the above code appears no different than if I had started on the 10th observation for an appropriately sized data set:

 if _N_ = 10 then do i=10 to 20; set lib.table; end;

 

 

Also, the following code threw me for a loop:

data test;
        input name : $ 13. rate;
	datalines;
A 1
B 2
C 3
;
run;

data test2;
	set test;

	do var=1 to 2;
		set test;
		output;
	end;
run;

Here the output table is:

name    rate    var
A       1       1
B       2       2
C       3       1

Q1   So my understanding is that the data step starts its iteration at _N_ = 1, gets to the DO loop and writes the A and B rows to the output data set, exits the loop, goes to _N_ = 2, starts the DO loop and, surprisingly, somehow retains its internal _N2_ value to go to the C row, goes to var=2, runs out of data, and so calls a quit, and both the DO loop and the data step are exited.   Is this a correct understanding? It's kind of throwing me for a loop that the internal DO statement's variables are scoped to the entire data step instead of each iteration...

 

my other questions

Q2   It kind of bothers me that SET changes its meaning inside of a DO statement (the idea seems similar to subsetting IF statements, in that some code is changing the behavior of other code without referencing it), so I'd love some reason I can give myself for why that is rational behavior, and what class of other SAS commands I can expect to change behavior inside of something else.

 

Q3    Is my idea right that everything hangs on some internal _N_ counter? Using SET multiple times doesn't go to the next row, so it has to be some inherent property of what is calling it that indicates which row SET will access.

 

Q4    Say I wanted to use SET and a DO loop on the 50th to 60th observations.  Is there a nice way to do that instead of some IF statement while iterating through every observation?

 

thank you

Accepted Solutions
Quentin
Super User

Hi,

 

These are really EXCELLENT questions, especially for a beginner.  I wish I had asked these questions earlier in my SAS programming.  It took me a few years to really understand the DATA step.

 

Importantly, your general understanding is wrong:

> My general understanding of `set lib.table` is that it reads the _N_'th observation from the given table where _N_ is the PDV's internal counter.

 

The DATA step is an implied loop.  _N_ is simply a counter of the number of times the loop has executed.  There is no causal relationship / dependency between _N_ and the SET statement.  When the SET statement executes, it reads the next record from a data set.  What does "next record" mean? The SET statement has its own pointer which tracks which record to read from a data set.  There is often a correlation between _N_ and the SET statement because many steps happen to execute the SET statement once for each iteration of the DATA step.  But that is just a correlation.

 

The below step iterates 20 times.  The log shows that on each iteration of the DATA step, one record is read (and output):

 

data want ;
  put "Top of loop " _N_= ;
  set sashelp.class ;
  put "Bottom of loop " _N_= Name= ;
run ;

 

On the 20th iteration of the loop, the SET statement executes.  Because there is no next record to read, it hits the end of file marker and that causes the DATA step to stop executing.  Note that the "Bottom of loop" PUT statement does not execute on the 20th iteration, because the step stopped executing when the SET statement executed.
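
If a step needs to act on the final record before that implicit stop, the END= option of the SET statement is one way to detect it. A minimal sketch, where `last` is just an illustrative variable name and not something from the step above:

```sas
data _null_ ;
  /* END= names a temporary variable that is set to 1 once SET has read the final record */
  set sashelp.class end=last ;
  if last then put "Final record was read on iteration " _N_= Name= ;
run ;
```

With the 19 records in sashelp.class, the message should appear when _N_=19. The step itself still ends on the 20th iteration, when the SET statement finds nothing left to read.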

 

 

The below step iterates only four times, but also reads and outputs all 19 records from sashelp.class.

data want ;
  put "Top of loop " _N_= ;
  do i=1 to 5 ;
    set sashelp.class ;
    output ;
    put "Inside DO loop " _N_= i= Name= ;
  end ;
  put "Bottom of loop " _N_= Name= /;
run ;

 

On the first iteration of the DATA step (_N_=1), the explicit DO loop iterates 5 times, so the SET statement reads the first 5 records.  On the second iteration of the DATA step (_N_=2), the explicit DO loop iterates 5 times, so the SET statement reads records 6-10.  The third iteration (_N_=3) reads records 11-15 in the same way.  On the 4th iteration of the DATA step (_N_=4), the explicit DO loop starts again.  For i=1 to i=4, the SET statement reads records 16-19.  When i=5, the SET statement tries to read the next record, hits the end of file marker, and the DATA step stops executing immediately.

 

In answer to other questions:

1. The SET statement does NOT change its meaning inside of a DO loop.  That would be chaos if it did.  If this is not clear, please post an example where you think this is happening.

2. Each SET statement that is reading from a data set uses its own internal pointer to keep track of which record to read.  If there are two SET statements in a step, each still has its own pointer, and the two SET statements are independent of each other.  
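
The independence of the two pointers can be seen in a small sketch like the one below. It assumes nothing beyond sashelp.class; `Name2` is a made-up name used only so the second read does not overwrite the first in the PDV.

```sas
data _null_ ;
  set sashelp.class (keep=Name) ;                       /* first SET, its own pointer  */
  set sashelp.class (keep=Name rename=(Name=Name2)) ;   /* second SET, its own pointer */
  put _N_= Name= Name2= ;
run ;
```

On every iteration both SET statements execute once, so each advances by one record, and Name and Name2 should show the same value on each line of the log. The second SET does not pick up where the first one left off.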

 

To work through your second example (sorry, I'm out of time to write more), I would recommend you add some PUT statements, something like below, and remember that each SET statement has its own pointer, the two SET statements are independent of each other, and there is only one PDV for the step.

data test2;
  length name $13 ;
  put "Top of DATA step " (_N_ Var Name Rate)(=) ;
  set test;
  put "Before DO loop " (_N_ Var Name Rate)(=) ;

  do var=1 to 2;
    put "Top of DO loop "  (_N_ Var Name Rate)(=) ; 
    set test;
    output;
    put "Bottom of DO loop "  (_N_ Var Name Rate)(=) ; 
  end;
  put "Bottom of DATA step " (_N_ Var Name Rate)(=)  /;
run;

 

If you can't figure out what is happening, respond with more questions. I or others will happily explain more.  These are good questions, and will raise a lot of issues critical to understanding DATA step programming.  

BASUG is hosting free webinars. Next up: Mike Sale presenting Data Warehousing with SAS, April 10 at noon ET. Register now at the Boston Area SAS Users Group event page: https://www.basug.org/events.

ChrisNZ
Tourmaline | Level 20

Please take note that:

1. Each SET statement uses its own internal pointer.

2. _N_ is a data step iteration counter.

This should answer your questions and explain what you are seeing. Tell us if it doesn't.

 

GGO
Obsidian | Level 7

Understanding the "Whitlock DO-loop" should also help, and who better to explain it than Paul Dorfman:


https://www.lexjansen.com/nesug/nesug08/hw/hw02.pdf
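
For orientation, here is a minimal sketch of the pattern that name usually refers to: a SET statement placed inside a DO UNTIL loop that runs once per BY group. It is not taken from the paper; the data set names `class_sorted` and `height_by_sex` and the variable `total_height` are illustrative assumptions.

```sas
proc sort data=sashelp.class out=work.class_sorted ;
  by Sex ;
run ;

data work.height_by_sex (keep=Sex total_height) ;
  /* total_height is not read from the input, so it is reset to missing
     at the top of each DATA step iteration, i.e. once per BY group */
  do until (last.Sex) ;
    set work.class_sorted ;
    by Sex ;
    total_height = sum(total_height, Height) ;
  end ;
  /* implicit OUTPUT happens here: one row per value of Sex */
run ;
```

Each iteration of the DATA step reads a whole BY group, which only works because the SET statement keeps its own position between iterations.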

weg
Obsidian | Level 7

Looks like I have some reading to do, but at least it's a well-defined concept. It just seems so janky.

weg
Obsidian | Level 7

 

By internal pointer, I am assuming that it works the same way as _N_? and it only is thrown out when the data step completes?

ChrisNZ
Tourmaline | Level 20

> By internal pointer, I am assuming that it works the same way as _N_? and it only is thrown out when the data step completes?

The internal pointer is invisible to you. It is used each time a SET statement is executed.

 

Some users have actually requested that it be made public, in some form.

Tom
Super User

@ChrisNZ wrote:

> Please take note that:
>
> 1. Each SET statement uses its own internal pointer.
>
> 2. _N_ is a data step iteration counter.
>
> This should answer your questions and explain what you are seeing. Tell us if it doesn't.

You left out the fact that most SAS data steps stop in the middle of the code, when they attempt to read past the end of one of the inputs.

 

ChrisNZ
Tourmaline | Level 20

> You left out the fact that most SAS data step stop in the middle of the code when they attempt to read past the end of one of the inputs.

 

Indeed that's a relevant fact. Add that as point 3.

Tom
Super User

Too many questions in one post.  Let's look at your first example:

data test;
  input name :$13. rate @@;
datalines;
A 1 B 2 C 3
;

data test2;
  set test;
  do var=1 to 2;
    set test;
    output;
  end;
run;

So when _N_=1 you read the first observation.  Then in the loop you read the first two observations and write them.

On _N_=2 you read the second observation.  Then in the loop you read the third observation and write it. Then you attempt to read a fourth observation and run out of data, so the step ends.  Check the log; the notes will show this.

NOTE: There were 2 observations read from the data set WORK.TEST.
NOTE: There were 3 observations read from the data set WORK.TEST.
NOTE: The data set WORK.TEST2 has 3 observations and 3 variables.

 

Tom
Super User

> Q2   It kind of bothers me that set changes its meaning inside of a do statement (the idea seems similar to subsetting if statements in that some code is changing the behavior of other code without referencing it), so I'd love some reason I can give myself for why that is rational behavior, and what class of other sas commands I can expect to change behaviors inside of something else.

I have no idea what this question is about. The meaning of a SET statement does not change. It means read an observation from a dataset (or series of datasets).

weg
Obsidian | Level 7

I have a much better understanding of what's happening thanks to everyone's answers. The idea of having an internal counter tripped me up, and I did not expect the SET statement to have one.  I expected do i=1 to 10; set table; output;  end; to call set 10 times and write whatever the _N_th observation was 10 times to the output table.  Also, the fact that the internal pointer belongs to each SET statement and seems to last until the data step ends is different than, say, in Python, where the counter only lasts until the loop finishes.

 

 

I appreciate everyone's help.

Tom
Super User

We used to call languages like SAS 4GLs, fourth generation languages, as opposed to machine language, symbolic assembly languages, or programming languages (like FORTRAN or COBOL).  You are trying to think of the data step language as if it were one of those lower level languages.

 

It helps to think in terms of the data step operating on the whole input instead of thinking you have to program the loop over the input.  More like the set operation logic of SQL code.  Or more like the object oriented concepts of modern languages.

 

Although as your examples show you can get the data step to operate at the lower level of detail if you have to.

Quentin
Super User

@Tom wrote:

> It helps to think in terms of the data step operating on the whole input instead of thinking you have to program the loop over the input.  More like the set operation logic of SQL code.  Or more like the object oriented concepts of modern languages.

 

I'm really surprised you would say that, Tom.  When I first learned SAS, I didn't think about the DATA step as iterating, and didn't think of the PDV.  So when I tried to learn about what the RETAIN statement does, or how the MERGE statement works, or BY group processing, or LAG, or ... it was an almost hopeless exercise.  I've always thought that in order to use the DATA step, it was essential to understand how it iterates through data as it reads, which is very different than the set operation logic of SQL.

 

You really find it helpful to think of the DATA step as operating on a set, rather than iterating over it?

Tom
Super User

Absolutely it helps.  Take the example question about finding the 50th observation.  That is a question about what input should be given to the data step. So it should be solved by identifying the data by the keys in the data, not by its position. So the first response should be to subset the data on the way into the data step with a WHERE statement or WHERE= dataset option.   And if your data model is poor and you do need to resort to filtering by position in the dataset, then use the dataset options OBS= and FIRSTOBS=. No need to begin by thinking you have to open the file, initialize a pointer, read from the pointer, increment the pointer, ... whatever low level operations actually have to happen. SAS has happily shielded you from having to think about that stuff.
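
As a rough sketch of the positional approach, the step below reads observations 5 through 10 of sashelp.class (a small stand-in; for the original question the options would be `firstobs=50 obs=60` on the asker's own table). The second step uses the POINT= option of SET, which was not mentioned in this thread and is shown only as an alternative; the output data set names and the variable `p` are illustrative.

```sas
/* FIRSTOBS= is the first observation to read and OBS= is the last one
   (OBS= is an observation number, not a count) */
data work.rows_5_to_10 ;
  set sashelp.class (firstobs=5 obs=10) ;
run ;

/* Alternative: direct access by observation number with POINT=.
   STOP is needed because SET with POINT= never reaches end of file. */
data work.rows_5_to_10_point ;
  do p = 5 to 10 ;
    set sashelp.class point=p ;
    output ;
  end ;
  stop ;
run ;
```

Both steps should produce the same six observations. The first form leaves the looping to the DATA step itself, which is closer to the approach described above.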

 

That said, all of the quirky "features" of the data step (retain, 1 to many merge, many to many merge) can be explained more clearly once you understand how the data step actually operates and what each of the individual statements actually does.  But that is NOT the first way to approach how to use a data step to perform a task.
