This is from another SAS practice questoin.
We have a raw data file:
1001 77,164.19 76,804.75 74,384.27
1002 76,612.93 81,456.34 82,063.97
1003 82,185.16 79,742.33
We want to transform the raw data file into the following data set:
Store Sales Month
1001 77,164.19 1
1001 76,804.75 2
1001 74,384.27 3
1002 76,612.93 1
1002 81,456.34 2
1002 82,063.97 3
1003 82,185.16 1
1003 79,742.33 2
There are several alternatives, to choose from, the correct one being:
data perm.topstores; infile sales98 missover; input Store Sales : comma. @; Month=0; do while (sales ne .); month + 1; output; input sales : comma. @; end; run;
Now, could someone walk me through what's happening in the different steps?
I think I understand the virst relevant line:
input Store Sales : comma. @;
Store Sales
1001 77,164.19
do while (sales ne .); month + 1; output; input sales : comma. @; end;
You've picked a good problem to deepen your understanding of how the DATA step works. Here's where your logic breaks down.
The INPUT statement is more than a roadmap on how to read the incoming lines of data. It's actually a set of instructions, which execute repeatedly (for each line of data. So for the first INPUT statement, and the first line of data, this is all you get:
Store Sales
1001 77164.19
Those values just sit in memory, awaiting further developments. Next, the software adds MONTH, with a value of 0. Then the DO loop begins, since SALES > . is true. It increases MONTH, giving you:
Store Sales Month
1001 77164.19 1
The OUTPUT statement executes next, copying these values from memory to the output data set. (Technically, I'm skipping a little bit of technical detail here, not important for our purposes.)
Then the second INPUT statement executes, giving you:
Store Sales Month
1001 76804.75 1
Again, those values are just sitting in memory. The DO loop continues, increasing MONTH to 2. Then the OUTPUT statement executes again, copying the current values from memory to the output data set.
The DO loop again checks the value of SALES and finds that it is greater than missing. So the DO loop executes again.
In a nutshell, those are the next principles to absorb, to increase your understanding of the DATA step.
The missover option tells SAS to return a missing value when you try to read more items than are present on the current line.
The @ holds the input line.
So the do while keeps reading from the input line until the end is reached (input returns a missing), at which point it goes to the next data step iteration, which causes an automatic skip to the next input line.
You've picked a good problem to deepen your understanding of how the DATA step works. Here's where your logic breaks down.
The INPUT statement is more than a roadmap on how to read the incoming lines of data. It's actually a set of instructions, which execute repeatedly (for each line of data. So for the first INPUT statement, and the first line of data, this is all you get:
Store Sales
1001 77164.19
Those values just sit in memory, awaiting further developments. Next, the software adds MONTH, with a value of 0. Then the DO loop begins, since SALES > . is true. It increases MONTH, giving you:
Store Sales Month
1001 77164.19 1
The OUTPUT statement executes next, copying these values from memory to the output data set. (Technically, I'm skipping a little bit of technical detail here, not important for our purposes.)
Then the second INPUT statement executes, giving you:
Store Sales Month
1001 76804.75 1
Again, those values are just sitting in memory. The DO loop continues, increasing MONTH to 2. Then the OUTPUT statement executes again, copying the current values from memory to the output data set.
The DO loop again checks the value of SALES and finds that it is greater than missing. So the DO loop executes again.
In a nutshell, those are the next principles to absorb, to increase your understanding of the DATA step.
Pretty much.
Values for the current observation are always "in memory" during the execution of a data step. That is why you can use the variables in operations.
Variables are not really "overwritten". They have values. The values vary across observations of the dataset and can vary during the execution of the code. Remember that SAS will compile your data step code before it starts running it. So the INPUT statement itself doesn't define a variable. I just reads values into the variable. It might appear to define the variable because allows lazy programming. You don't have to define your variables before using them. SAS will make a guess at how to define the variable based on how you first use it in the code.
SAS keeps a pointer to where on the line you are currently reading. (You can use the COLUMN= option on the INFILE statement to have it share that value as a variable in your data step.) So when you again read SALES it starts looking from where it left off after the last read (because of the trialing @ on the input statement).
You interpret correctly. While this is an oversimplification, it helps to think of there being three storage locations:
So to address your questions ...
The OUTPUT statement copies from values in memory to the output data set. It has no impact on the values in memory.
"Overwriting SALES" means changing the value of SALES in memory. In this case, that is accomplished by executing an INPUT statement that reads from the raw data line, and replaces the value in memory. So it's not a new variable with the same name ... it's the value in memory being replaced by a new value.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.
Find more tutorials on the SAS Users YouTube channel.