Solved: Re: Multiple observations from one record - understanding the logic

Syntas_error · Posted 03-16-2019 05:33 AM

This is from another SAS practice questoin.

We have a raw data file:

1001 77,164.19 76,804.75 74,384.27

1002 76,612.93 81,456.34 82,063.97

1003 82,185.16 79,742.33

We want to transform the raw data file into the following data set:

Store Sales Month

1001 77,164.19 1

1001 76,804.75 2

1001 74,384.27 3

1002 76,612.93 1

1002 81,456.34 2

1002 82,063.97 3

1003 82,185.16 1

1003 79,742.33 2

There are several alternatives, to choose from, the correct one being:

data perm.topstores;
infile sales98 missover;
input Store Sales : comma. @;
Month=0;
do while (sales ne .);
month + 1;
output;
input sales : comma. @;
end;
run;

Now, could someone walk me through what's happening in the different steps?

I think I understand the virst relevant line:

input Store Sales : comma. @;

This simply specifies the variables, the ':' alows us to use an informat for 'sales', so we can process those non-standard values. The @-sign should "lock the row into place" allowing multiple observations to be read from the same row. Though in my understanding, this would mean values from each row should be assigned to Store "and" Sales, so for the first row we would get something like:

Store Sales

1001 77,164.19

76,804.75 74,384.27

The next segment of code is hard to follow:

do while (sales ne .);
month + 1;
output;
input sales : comma. @;
end;

What exactly does "do while (sales ne .);", mean, do we execute the steps as long as the values of sales aren't missing?

How exacly does this step specify that the counter is to be reset for each new store?

Astounding · Posted 03-16-2019 09:12 AM

You've picked a good problem to deepen your understanding of how the DATA step works. Here's where your logic breaks down.

The INPUT statement is more than a roadmap on how to read the incoming lines of data. It's actually a set of instructions, which execute repeatedly (for each line of data. So for the first INPUT statement, and the first line of data, this is all you get:

Store Sales

1001 77164.19

Those values just sit in memory, awaiting further developments. Next, the software adds MONTH, with a value of 0. Then the DO loop begins, since SALES > . is true. It increases MONTH, giving you:

Store Sales Month

1001 77164.19 1

The OUTPUT statement executes next, copying these values from memory to the output data set. (Technically, I'm skipping a little bit of technical detail here, not important for our purposes.)

Then the second INPUT statement executes, giving you:

Store Sales Month

1001 76804.75 1

Again, those values are just sitting in memory. The DO loop continues, increasing MONTH to 2. Then the OUTPUT statement executes again, copying the current values from memory to the output data set.

The DO loop again checks the value of SALES and finds that it is greater than missing. So the DO loop executes again.

In a nutshell, those are the next principles to absorb, to increase your understanding of the DATA step.

View solution in original post

Kurt_Bremser · Posted 03-16-2019 08:13 AM

The missover option tells SAS to return a missing value when you try to read more items than are present on the current line.

The @ holds the input line.

So the do while keeps reading from the input line until the end is reached (input returns a missing), at which point it goes to the next data step iteration, which causes an automatic skip to the next input line.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Astounding · Posted 03-16-2019 09:12 AM

You've picked a good problem to deepen your understanding of how the DATA step works. Here's where your logic breaks down.

The INPUT statement is more than a roadmap on how to read the incoming lines of data. It's actually a set of instructions, which execute repeatedly (for each line of data. So for the first INPUT statement, and the first line of data, this is all you get:

Store Sales

1001 77164.19

Those values just sit in memory, awaiting further developments. Next, the software adds MONTH, with a value of 0. Then the DO loop begins, since SALES > . is true. It increases MONTH, giving you:

Store Sales Month

1001 77164.19 1

The OUTPUT statement executes next, copying these values from memory to the output data set. (Technically, I'm skipping a little bit of technical detail here, not important for our purposes.)

Then the second INPUT statement executes, giving you:

Store Sales Month

1001 76804.75 1

Again, those values are just sitting in memory. The DO loop continues, increasing MONTH to 2. Then the OUTPUT statement executes again, copying the current values from memory to the output data set.

The DO loop again checks the value of SALES and finds that it is greater than missing. So the DO loop executes again.

In a nutshell, those are the next principles to absorb, to increase your understanding of the DATA step.

Syntas_error · Posted 03-16-2019 11:12 AM

Okay, for that to be true a coule of things need to follow I think:

1. The inputed values needs to remain in the memory after they've been outputed.

2. the variable "sales" needs to be overwritten by the new variable with the same name.

3. Despite (2), the second input statement picks a new value (one step to the right) for the new variable.

Do I interpret this correctly?

Tom · Posted 03-16-2019 12:27 PM

Pretty much.

Values for the current observation are always "in memory" during the execution of a data step. That is why you can use the variables in operations.

Variables are not really "overwritten". They have values. The values vary across observations of the dataset and can vary during the execution of the code. Remember that SAS will compile your data step code before it starts running it. So the INPUT statement itself doesn't define a variable. I just reads values into the variable. It might appear to define the variable because allows lazy programming. You don't have to define your variables before using them. SAS will make a guess at how to define the variable based on how you first use it in the code.

SAS keeps a pointer to where on the line you are currently reading. (You can use the COLUMN= option on the INFILE statement to have it share that value as a variable in your data step.) So when you again read SALES it starts looking from where it left off after the last read (because of the trialing @ on the input statement).

Astounding · Posted 03-17-2019 09:33 AM

You interpret correctly. While this is an oversimplification, it helps to think of there being three storage locations:

The raw data line that the program reads from
The values in memory
The values written to the output data set

So to address your questions ...

The OUTPUT statement copies from values in memory to the output data set. It has no impact on the values in memory.

"Overwriting SALES" means changing the value of SALES in memory. In this case, that is accomplished by executing an INPUT statement that reads from the raw data line, and replaces the value in memory. So it's not a new variable with the same name ... it's the value in memory being replaced by a new value.

Multiple observations from one record - understanding the logic

Re: Multiple observations from one record - understanding the logic

Re: Multiple observations from one record - understanding the logic

Re: Multiple observations from one record - understanding the logic

Re: Multiple observations from one record - understanding the logic

Re: Multiple observations from one record - understanding the logic

Re: Multiple observations from one record - understanding the logic

SAS Innovate 2025: Save the Date