I have the following dataset:
data banks;
  /* the colon modifier reads up to 13 characters, stopping at the first blank */
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;
I am comparing two blocks of code.
First code block:
data newbank (drop = year);
  do year = 1 to 3;
    set banks;
    output;
  end;
run;

proc print data = newbank;
run;
Output of the first code block:
Obs  name           rate
1    FirstCapital   0.0718
2    DirectBank     0.0721
3    VirtualDirect  0.0728
Second code block:
data newbank;
  set banks;
  output;
  set banks;
  output;
  set banks;
  output;
run;

proc print data = newbank;
run;
Output of the second code block:
Obs  name           rate
1    FirstCapital   0.0718
2    FirstCapital   0.0718
3    FirstCapital   0.0718
4    DirectBank     0.0721
5    DirectBank     0.0721
6    DirectBank     0.0721
7    VirtualDirect  0.0728
8    VirtualDirect  0.0728
9    VirtualDirect  0.0728
Question: Why doesn't the first code block have the same output as the second code block? Why is there a difference if I use a set statement in a do loop vs using a set statement not in a do loop?
Accepted Solutions
Hi:
In Programming 1 (a free e-learning class) we discuss the concept of the Program Data Vector (PDV) and how SAS loads the input dataset information into the PDV and how information is written OUT of the PDV to the new dataset.
Take a look at these alternate scenarios. I have changed your original code so that the output datasets are numbered, YEAR is kept, and a new variable named WHATSET tells you which SET statement in the program supplied the values for that row. When you understand the PDV, you will understand why my #1 and my #2 are the same and have only 3 observations in the output dataset. You will also understand why the variable WHATSET is 'set3' in #6, but is 'set1' in #1 and #2.
#3 and #4 produce identical output -- 9 observations -- one program with a DO loop and one program without a DO loop. #5 also produces 9 observations, but with slightly different values for YEAR and WHATSET.
Here's the code that produces all the outputs, including the 2 shown above.
data banks;
  /* the colon modifier reads up to 13 characters, stopping at the first blank */
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;
/* #1: one SET statement inside a DO loop, so one read position in banks */
data newbank1;
  do year = 1 to 3;
    set banks;
    whatset = 'set1';
    output;
  end;
run;

proc print data = newbank1;
  title 'NewBank1 -- SET inside DO loop';
run;

/* #2: no DO loop; the implied DATA step loop reads one obs per iteration */
data newbank2;
  set banks;
  year = _n_;
  whatset = 'set1';
  output;
run;

proc print data = newbank2;
  title 'NewBank2 -- NO DO loop at all';
run;

/* #3: one read per iteration, three explicit OUTPUT statements */
data newbank3;
  set banks;
  whatset = 'set1';
  year = 1;
  output;
  year = 2;
  output;
  year = 3;
  output;
run;

proc print data = newbank3;
  title 'NewBank3 -- NO DO loop but have 3 outputs';
run;

/* #4: same as #3, with the three OUTPUTs written as a DO loop */
data newbank4;
  set banks;
  whatset = 'set1';
  do year = 1 to 3;
    output;
  end;
run;

proc print data = newbank4;
  title 'NewBank4 -- SET outside DO loop but OUTPUT inside DO loop';
run;

/* #5: three SET statements, each followed by its own OUTPUT */
data newbank5;
  set banks;
  whatset = 'set1';
  year = _n_;
  output;
  set banks;
  whatset = 'set2';
  year = _n_;
  output;
  set banks;
  whatset = 'set3';
  year = _n_;
  output;
run;

proc print data = newbank5;
  title 'NewBank5 -- NO DO Loop 3 SET statements and 3 OUTPUTS';
run;

/* #6: three SET statements but one OUTPUT; the last WHATSET assignment wins */
data newbank6;
  set banks;
  whatset = 'set1';
  year = _n_;
  set banks;
  whatset = 'set2';
  year = _n_;
  set banks;
  whatset = 'set3';
  year = _n_;
  output;
run;

proc print data = newbank6;
  title 'NewBank6 -- NO DO Loop 3 SET statements ONLY 1 OUTPUT';
run;
What did you expect to get from your original code with the SET inside the DO loop? Based on how SAS works, the DO Loop didn't get you much if all you wanted to do was read one record and write one record. The DATA step (as you can see by #2) is an implied looping structure. It reads every obs in the input dataset until the end of file is reached. In #2, for every obs that is read, an obs is written. You have to understand HOW to use the SET statement without a DO loop and how to use the OUTPUT statement without a DO loop before you start to put a SET statement inside a DO loop (IMO).
Cynthia
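One way to watch the PDV in scenario #6, offered as a minimal sketch rather than as part of the original answer: replace the output dataset with `_null_` and write selected PDV values to the log after each SET statement. The PUT statements and their labels are additions for illustration; the `banks` dataset is recreated so the sketch runs on its own.

```sas
/* Recreate the sample data so the sketch is self-contained */
data banks;
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

/* Same statements as #6, but logging the PDV instead of writing a dataset */
data _null_;
  set banks;
  whatset = 'set1';
  put 'after first SET:  ' _n_= name= whatset=;
  set banks;
  whatset = 'set2';
  put 'after second SET: ' _n_= name= whatset=;
  set banks;
  whatset = 'set3';
  put 'after third SET:  ' _n_= name= whatset=;
run;
```

Within one iteration the three log lines should show the same NAME, because each SET statement reads from its own position in banks, while WHATSET is overwritten twice before the step reaches the bottom. With a single OUTPUT at the bottom, as in #6, only the 'set3' value is ever written.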
What's happening in the second code block is that you effectively have three copies of the banks data set open, one per SET statement, each with its own read position. Each iteration of the data step therefore writes the same observation three times, once per SET statement, before moving on to the next observation.
You sometimes see the same data set opened more than once in a data step, but usually in the context of a DOW loop: http://support.sas.com/resources/papers/proceedings12/052-2012.pdf
(editor's note: removed trailing blank from link)
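For readers who have not met the pattern, here is a hedged sketch of a "double DOW" loop, where the same data set is deliberately read by two SET statements in one data step. It is not taken from the linked paper. It assumes the `sashelp.class` sample table, and the names `class_sorted`, `class_dev`, `_count`, `_total`, `mean_height` and `deviation` are made up for illustration.

```sas
/* BY-group processing needs sorted input */
proc sort data=sashelp.class out=class_sorted;
  by sex;
run;

data class_dev;
  /* First pass: read one BY group and accumulate its total height */
  do _count = 1 by 1 until (last.sex);
    set class_sorted;
    by sex;
    _total = sum(_total, height);
  end;
  mean_height = _total / _count;

  /* Second pass: a second SET statement re-reads the same group
     from its own read position and writes one row per observation */
  do until (last.sex);
    set class_sorted;
    by sex;
    deviation = height - mean_height;
    output;
  end;

  drop _count _total;
run;
```

Each iteration of the data step handles one whole BY group: the first loop computes the group mean and the second loop writes the group's rows with that mean attached.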
I see. What about in the first code block? Don't I have the data set open three times as well since the set statement is inside a DO loop?
Also, can you resend the pdf link? It doesn't seem to work for me.
No, because in the first code block you have only one SET statement. The first time it executes it opens the data set and reads the first observation, the second time it reads the second observation, and so on. In the second code block you have three SET statements, each of which opens the data set and reads the first observation, and so on.
I'll try the link again http://support.sas.com/resources/papers/proceedings12/052-2012.pdf
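The single-SET case can be checked in the log with a small sketch like the one below. It reuses the first code block from the question, but writes to `_null_` and adds a PUT statement; both changes are additions for illustration.

```sas
/* Recreate the sample data so the sketch is self-contained */
data banks;
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

data _null_;
  do year = 1 to 3;
    set banks;              /* the only SET statement: one read position */
    put _n_= year= name=;   /* log which iteration read which bank */
  end;
run;
```

The log should show three lines that share the same value of _N_ but name three different banks, which confirms that all three reads happen inside one iteration of the data step and that the one SET statement advances through banks each time it executes.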
Ah got it. That was helpful. Thank you again!
I would accept this as a solution as well but I'm not sure how to have multiple answers as solutions.
Thank you for the thorough response. It is very helpful. From all your examples, I am beginning to understand that there are two different loops going on in #1. That is, first, there's one loop for the do loop (year = 1 to year = 3). And then, second, there's the implicit loop of the data step. This reads all the observations in the input dataset (from the set statement) independent of the do loop. Please correct/clarify if my understanding is wrong. Otherwise, this has been very helpful. Thanks again.
To further give you food for thought, run this:
data banks;
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
fourthbank 0.05
;
run;

data newbank (drop = year);
  do year = 1 to 3;
    set banks;
    output;
  end;
run;

proc print data = newbank;
run;
and try to explain to us what happens.
I see. Thank you for this as this is a good exercise. My explanation is that the do loop will go from year = 1 to year = 3 by 1s. In addition to this, there's the implicit loop of the data step. This will read the input dataset one observation at a time until the end of the dataset. Since there are four observations, the output dataset will also have four observations even though the do loop only goes from 1 to 3. So, even though the set statement is inside a do loop, the set statement will be controlled by the implicit data step loop. That is, all the observations in the set statement will only be read once.
Please correct/clarify my understanding as you see fit.
@danielhanbitlee wrote:
I see. Thank you for this as this is a good exercise. My explanation is that the do loop will go from year = 1 to year = 3 by 1s. In addition to this, there's the implicit loop of the data step. This will read the input dataset one observation at a time until the end of the dataset. Since there are four observations, the output dataset will also have four observations even though the do loop only goes from 1 to 3. So, even though the set statement is inside a do loop, the set statement will be controlled by the implicit data step loop. That is, all the observations in the set statement will only be read once.
Please correct/clarify my understanding as you see fit.
Most SAS data steps do not stop at the end of the step. Instead they stop when a SET or INPUT statement tries to read past the end of its input. In your situation the data step tries to read three observations per iteration, so the step iterates only two times. The first iteration completes the DO loop. The second stops on its second pass through the DO loop, when the SET statement tries to read past the end of the input dataset.
This means that a normal data step with just a single SET or INPUT statement will iterate N+1 times, where N is the number of input observations. The last iteration stops when it tries to read past the end of the input and never reaches the end of the data step (where the implied OUTPUT statement runs).
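The stopping point described above can be traced in the log with a sketch like this one. It is an illustration only: the dataset name `banks4` and the PUT labels are made up here, and the four rows are the ones from the exercise earlier in the thread.

```sas
/* Four-observation version of the sample data (hypothetical name banks4) */
data banks4;
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
fourthbank 0.05
;
run;

data _null_;
  put 'iteration starts: ' _n_=;
  do year = 1 to 3;
    put '  about to execute SET: ' year=;
    set banks4;                        /* the step ends here once the input is exhausted */
    put '  SET returned: ' name=;
  end;
  put 'iteration reached the bottom: ' _n_=;
run;
```

The first iteration should log all three reads and then reach the bottom. The second iteration should log the read of fourthbank, then an "about to execute SET" line for year=2, and nothing after it: the step ends inside the SET statement, so the final PUT never runs for that iteration.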
Almost completely correct. The data step has an implicit "on end-of-file goto end", so as soon as the set statement tries to read past eof, it exits the loop without the error you'd get in another programming language.
The set is primarily controlled by the do loop, but it also has this "safety valve" that terminates the data step when the do loop makes its second iteration in the second iteration of the data step, where it tries to read a fifth observation that isn't there.
See this example:
data test;
  put "_n_=" _n_;
  put "eof before: " eof;
  set sashelp.class end=eof;
  put "eof after: " eof;
run;
You can see that the data step does 20 iterations, although sashelp.class only has 19 observations. And eof is set when the last observation is read, carries over, and is acted upon as soon as set tries to read past eof.
Thank you for the response. I think I'm beginning to understand better. So, just to make sure, let me try to explain.
I will use the code that you gave previously, except I won't drop the variable "year" from the do loop:
data banks;
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
fourthbank 0.05
;
run;

data newbank;
  do year = 1 to 3;
    set banks;
    output;
  end;
run;
Output:
Obs  year  name           rate
1    1     FirstCapital   0.0718
2    2     DirectBank     0.0721
3    3     VirtualDirect  0.0728
4    1     fourthbank     0.0500
Here's my explanation of what is going on (I also have a question, marked "Question" in the list):
1. First iteration of data step
  - First iteration of do loop (year = 1)
    - First observation of the dataset is read and output is given (code: set banks; output;)
  - Second iteration of do loop (year = 2)
    - Second observation of the dataset is read and output is given (code: set banks; output;)
  - Third iteration of do loop (year = 3)
    - Third observation of the dataset is read and output is given (code: set banks; output;)
  - End of do loop and end of first iteration of data step
2. Second iteration of data step
  - First iteration of do loop (year = 1)
    - Fourth observation of the dataset is read and output is given (code: set banks; output;)
    - Question: How does SAS know to read the fourth observation of the input dataset here? Is there some sort of a pointer?
    - End of file reached. EOF = 1
  - Second iteration of do loop (year = 2)
    - Terminate data step because EOF = 1. No output is given
Please correct/clarify.
You are absolutely correct.
Regarding your question about how SAS knows which observation to read next:
Each set (or merge) statement keeps its own pointer(s) throughout execution and carries them over from one data step iteration to the next. These pointers can be manipulated (see key= and point= options for the set statement), which can be the base for some tricky programming (but caution needs to be exercised).
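As a small, hedged illustration of steering the read position yourself, the sketch below uses the POINT= and NOBS= options of the SET statement to read banks back to front. The output name `banks_reversed` and the index variable `i` are made up for this example, and KEY= is not shown because it needs an indexed data set.

```sas
/* Recreate the sample data so the sketch is self-contained */
data banks;
  input name : $13. rate;
  datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

data banks_reversed;
  /* NOBS= is filled in at compile time, so nobs is usable before SET executes */
  do i = nobs to 1 by -1;
    set banks point=i nobs=nobs;   /* POINT= jumps directly to observation i */
    output;
  end;
  stop;   /* direct access never hits end of file, so end the step explicitly */
run;

proc print data = banks_reversed;
run;
```

This is where the caution comes in: with POINT= the step no longer stops by reading past the end of the input, so leaving out the STOP statement would make the data step loop indefinitely.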
Got it. Makes more sense now. Thank you so much!