danielhanbitlee
Calcite | Level 5

I have the following dataset:

 

 

data banks;
	input name : $ 13. rate;
	datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

I am comparing two blocks of code.

 

 

First code block:

 

data newbank (drop = year);
	do year=1 to 3;
		set banks;
		output;
	end;
run;
proc print data = newbank;
run;

 

 

Output to first code block:

 

Obs	name	rate
1	FirstCapital	0.0718
2	DirectBank	0.0721
3	VirtualDirect	0.0728

 

Second code block:

 

data newbank;
	set banks;
		output;
	set banks;
		output;
	set banks;
		output;
run;
proc print data = newbank;
run;

 

Output to second code block:

 

Obs	name	rate
1	FirstCapital	0.0718
2	FirstCapital	0.0718
3	FirstCapital	0.0718
4	DirectBank	0.0721
5	DirectBank	0.0721
6	DirectBank	0.0721
7	VirtualDirect	0.0728
8	VirtualDirect	0.0728
9	VirtualDirect	0.0728

 

Question: Why doesn't the first code block have the same output as the second code block? Why is there a difference if I use a set statement in a do loop vs using a set statement not in a do loop?

Accepted Solutions
Cynthia_sas
SAS Super FREQ

Hi:

  In Programming 1 (a free e-learning class) we discuss the concept of the Program Data Vector (PDV) and how SAS loads the input dataset information into the PDV and how information is written OUT of the PDV to the new dataset.

 

  Take a look at these alternate scenarios. I have changed your original code so that the output files are numbered, YEAR is kept, and a new variable named WHATSET tells you which SET statement in the program supplied the variable information for that row. When you understand the PDV, you will understand why my #1 and my #2 are the same and only have 3 observations in the output dataset. And, when you understand the PDV, you will understand why the variable WHATSET is 'set3' in #6, but is 'set1' in #1 and #2.

set_do_loop.png

 

 

  #3 and #4 produce identical output -- 9 observations -- one program with a DO loop and one program without a DO loop. #5 also produces 9 observations, but with slightly different values for YEAR and WHATSET.

 

  Here's the code that produces all the outputs, including the 2 shown above.

data banks;
	input name : $ 13. rate;
	datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

data newbank1;
	do year=1 to 3;
		set banks;
	    whatset='set1';
		output;
	end;
run;
proc print data = newbank1;
title 'NewBank1 -- SET inside DO loop';
run;

data newbank2;
    set banks;
	year = _n_;
	whatset = 'set1';
    output;
run;
proc print data = newbank2;
title 'NewBank2 -- NO DO loop at all';
run;


data newbank3;
    set banks;
	whatset = 'set1';
	year=1;
    output;
	year = 2;
	output;
	year = 3;
	output;
run;
proc print data = newbank3;
title 'NewBank3 -- NO DO loop but have 3 outputs';
run;



data newbank4;
    set banks;
	whatset = 'set1';
	do year=1 to 3;
		output;
	end;
run;
proc print data = newbank4;
title 'NewBank4 -- SET outside DO loop but OUTPUT inside DO loop';
run;


data newbank5;
     set banks ;
	    whatset = 'set1';
	    year = _n_;
		output;
	set banks ;
	    whatset = 'set2';
	    year = _n_;
		output;
	set banks;
	    whatset = 'set3';
	    year = _n_;
		output;
run;
proc print data = newbank5;
  title 'NewBank5 -- NO DO Loop 3 SET statements and 3 OUTPUTS';
run;


data newbank6;
     set banks ;
	    whatset = 'set1';
	    year = _n_;
	set banks ;
	    whatset = 'set2';
	    year = _n_;
	set banks;
	    whatset = 'set3';
	    year = _n_;
		output;
run;
proc print data = newbank6;
  title 'NewBank6 -- NO DO Loop 3 SET statements ONLY 1 OUTPUT';
run;

 

  What did you expect to get from your original code with the SET inside the DO loop? Based on how SAS works, the DO Loop didn't get you much if all you wanted to do was read one record and write one record. The DATA step (as you can see by #2) is an implied looping structure. It reads every obs in the input dataset until the end of file is reached. In #2, for every obs that is read, an obs is written. You have to understand HOW to use the SET statement without a DO loop and how to use the OUTPUT statement without a DO loop before you start to put a SET statement inside a DO loop (IMO).

 

Cynthia

ChrisBrooks
Ammonite | Level 13

What's happening in the second code block is that you effectively have three copies of the banks data set open, one per SET statement. Each iteration of the data step therefore writes the same observation three times, once for each SET statement, before moving on to the next observation.

 

You sometimes see the same data set opened more than once in a data step, but usually in the context of a DOW loop: http://support.sas.com/resources/papers/proceedings12/052-2012.pdf

 

(editor's note: removed trailing blank from link)
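
For readers who cannot open the link, here is a minimal sketch of a DOW-style loop that reads banks twice in one data step. It is not taken from the paper; the names rate_vs_mean, n, total, mean_rate, diff, last1 and last2 are made up for illustration, and it assumes the three-row banks data set from the question.

```sas
data banks;
    input name : $13. rate;
    datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

data rate_vs_mean;
    /* First pass: this SET statement reads every observation and
       accumulates the total. After the loop, n holds the row count. */
    do n = 1 by 1 until (last1);
        set banks end=last1;
        total = sum(total, rate);
    end;
    mean_rate = total / n;

    /* Second pass: a separate SET statement has its own pointer,
       so it starts again from the first observation of banks. */
    do until (last2);
        set banks end=last2;
        diff = rate - mean_rate;
        output;
    end;
    drop n total;
run;

proc print data = rate_vs_mean;
run;
```

Both passes finish inside the first iteration of the data step. On the second iteration the first SET statement tries to read past the end of banks and the step stops, so the output should contain one row per bank.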

danielhanbitlee
Calcite | Level 5

I see. What about in the first code block? Don't I have the data set open three times as well since the set statement is inside a DO loop?

danielhanbitlee
Calcite | Level 5

Also, can you resend the pdf link? It doesn't seem to work for me.

ChrisBrooks
Ammonite | Level 13

No, because in the first code block you only have one SET statement. The first time it executes it opens the data set and reads the first observation, the second time it reads the second observation, and so on. In the second code block you have three SET statements, each of which opens the data set separately and reads the first observation, then the second, and so on.

 

I'll try the link again http://support.sas.com/resources/papers/proceedings12/052-2012.pdf
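
A small sketch that makes the separate read positions visible in the log. It assumes the banks data set from the question; name_a and name_b are made-up names, used only so that the second SET statement does not overwrite the value read by the first.

```sas
data banks;
    input name : $13. rate;
    datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

data _null_;
    /* Two SET statements on the same data set: each one keeps
       its own position, so both read the same row number in
       any given iteration of the data step. */
    set banks (keep=name rename=(name=name_a));
    set banks (keep=name rename=(name=name_b));
    put _n_= name_a= name_b=;
run;
```

The log should show three lines, and on each line name_a and name_b hold the same bank, because the two SET statements advance in step.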

danielhanbitlee
Calcite | Level 5

Ah got it. That was helpful. Thank you again!

 

I would accept this as a solution as well but I'm not sure how to have multiple answers as solutions.

danielhanbitlee
Calcite | Level 5

Thank you for the thorough response. It is very helpful. From all your examples, I am beginning to understand that there are two different loops going on in #1. That is, first, there's one loop for the do loop (year = 1 to year = 3). And then, second, there's the implicit loop of the data step. This reads all the observations in the input dataset (from the set statement) independent of the do loop. Please correct/clarify if my understanding is wrong. Otherwise, this has been very helpful. Thanks again.

Kurt_Bremser
Super User

To further give you food for thought, run this:

data banks;
	input name : $ 13. rate;
	datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
fourthbank 0.05
;
run;

data newbank (drop = year);
	do year=1 to 3;
		set banks;
		output;
	end;
run;

proc print data = newbank;
run;

and try to explain to us what happens.

danielhanbitlee
Calcite | Level 5

I see. Thank you for this as this is a good exercise. My explanation is that the do loop will go from year = 1 to year = 3 by 1s. In addition to this, there's the implicit loop of the data step. This will read the input dataset one observation at a time until the end of the dataset. Since there are four observations, the output dataset will also have four observations even though the do loop only goes from 1 to 3. So, even though the set statement is inside a do loop, the set statement will be controlled by the implicit data step loop. That is, all the observations in the set statement will only be read once.

 

Please correct/clarify my understanding as you see fit.

Tom
Super User

@danielhanbitlee wrote:

I see. Thank you for this as this is a good exercise. My explanation is that the do loop will go from year = 1 to year = 3 by 1s. In addition to this, there's the implicit loop of the data step. This will read the input dataset one observation at a time until the end of the dataset. Since there are four observations, the output dataset will also have four observations even though the do loop only goes from 1 to 3. So, even though the set statement is inside a do loop, the set statement will be controlled by the implicit data step loop. That is, all the observations in the set statement will only be read once.

 

Please correct/clarify my understanding as you see fit.


Most SAS data steps do not stop at the bottom of the step. Instead, they stop when a read goes past the end of the input. In your situation the data step tries to read three observations per iteration, so the step iterates only two times. The first iteration completes the DO loop; the second stops on its second pass through the DO loop, when SET tries to read past the end of the input dataset.

 

This means that a normal data step with just a single SET or INPUT statement will iterate N+1 times.   The last iteration will stop when it reads past the input and never make it to the end of the data step (where the implied OUTPUT statement runs).

Kurt_Bremser
Super User

Almost completely correct. The data step has an implicit "on end-of-file goto end", so as soon as the set statement tries to read past eof, it exits the loop without the error you'd get in another programming language.

The set is primarily controlled by the do loop, but it also has this "safety valve" that terminates the data step when the do loop makes its second iteration in the second iteration of the data step, where it tries to read a fifth observation that isn't there.

See this example:

data test;
put "_n_=" _n_;
put "eof before: " eof;
set sashelp.class end=eof;
put "eof after: " eof;
run;

You can see that the data step does 20 iterations, although sashelp.class only has 19 observations. And eof is set when the last observation is read, carries over, and is acted upon as soon as set tries to read past eof.

 

danielhanbitlee
Calcite | Level 5

Thank you for the response. I think I'm beginning to understand better. So, just to make sure, let me try to explain.

 

I will use the code that you gave previously, except I won't drop the variable "year" from the do loop:

 

 

data banks;
	input name : $ 13. rate;
	datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
fourthbank 0.05
;
run;

data newbank;
	do year=1 to 3;
		set banks;
		output;
	end;
run;

Output:

 

 

 

Obs	year	name	rate
1	1	FirstCapital	0.0718
2	2	DirectBank	0.0721
3	3	VirtualDirect	0.0728
4	1	fourthbank	0.0500

Here's my explanation of what is going on (also I have a question in red font color):

 

 

1. First iteration of data step

  1. First iteration of do loop (year = 1)
    1. First observation of the dataset is read and output is given (code: set banks; output;)
  2. Second iteration of do loop (year = 2)
    1. Second observation of the dataset is read and output is given (code: set banks; output;)
  3. Third iteration of do loop (year = 3)
    1. Third observation of the dataset is read and output is given (code: set banks; output;)
  4. End of do loop and end of first iteration of data step

2. Second iteration of data step

  1. First iteration of do loop (year = 1)
    1. Fourth observation of the dataset is read and output is given (code: set banks; output;)
      • How does SAS know to read the fourth observation of the input dataset here? Is there some sort of a pointer?
    2. End of file reached. EOF = 1
  2. Second iteration of do loop (year = 2)
    1. Terminate data step because EOF  = 1. No output is given

Please correct/clarify.
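
The steps above can be checked in the log by adding PUT statements around the SET statement. This is only a tracing sketch on the four-row banks data set; the message text is made up.

```sas
data banks;
    input name : $13. rate;
    datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
fourthbank 0.05
;
run;

data newbank;
    put 'Top of data step, ' _n_=;
    do year = 1 to 3;
        put '  about to execute SET, ' year=;
        set banks;          /* the step stops here on a read past end of file */
        put '  read ' name=;
        output;
    end;
    put 'Bottom of data step, ' _n_=;
run;
```

In the second iteration of the data step the log should show "about to execute SET" for year=2 with no "read" line after it, and no "Bottom of data step" message for that iteration. That is the point where the step terminates.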

Kurt_Bremser
Super User

You are absolutely correct.

Regarding your question in red:

Each set (or merge) statement keeps its own pointer(s) throughout execution and carries them over from one data step iteration to the next. These pointers can be manipulated (see key= and point= options for the set statement), which can be the base for some tricky programming (but caution needs to be exercised).
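
As a hedged illustration of the POINT= option, the sketch below reads banks from the last observation to the first. The names reversed, i and nobs are made up; the three-row banks data set from the question is assumed.

```sas
data banks;
    input name : $13. rate;
    datalines;
FirstCapital 0.0718
DirectBank 0.0721
VirtualDirect 0.0728
;
run;

data reversed;
    /* NOBS= is filled in at compile time, so nobs already holds
       the row count when the DO statement first executes. */
    do i = nobs to 1 by -1;
        set banks point=i nobs=nobs;   /* read observation number i directly */
        output;
    end;
    /* Direct access never triggers end of file, so without STOP
       the data step would loop forever. */
    stop;
run;

proc print data = reversed;
run;
```

This is the caution mentioned above in practice: once you move the pointer yourself with POINT=, the usual "stop on reading past the end" safety valve no longer applies, and the STOP statement has to end the step. The POINT= and NOBS= variables are temporary and are not written to the output data set.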

danielhanbitlee
Calcite | Level 5

Got it. Makes more sense now. Thank you so much!
