Solved: Re: Column sum using do until loop

smw10 · Posted 03-01-2017 09:05 AM

This is kind of a basic question. For reference, my data has 100s of observations and the following code narrows these observations down into each unique Dis PerformanceLevel pair. Why, when I use the following code, do I obtain an output that shows every unique Dis by Performance Level with a column sum of PL1 (the column sum is by Dis PerformanceLevel? I was looking for this result, but I don't understand why it is happening:

data SSR2 (keep=Dis PerformanceLevel sumPL1);
do i=1 by 1 until (last.PerformanceLevel);
set SSR1;
by Dis PerformanceLevel;
sumPL1=sum(sumPL1,PL1);
end;
output SSR2;
run;

Additionally, why, when I use the following code, do I not obtain the same result? The result is just my original data, not unique Dis PerformanceLevel pairs and this code outputs sumPL1 as row sums rather than column sums:

data SSR3 (keep=Dis PerformanceLevel sumPL1);
set SSR1;
by Dis PerformanceLevel;
sumPL1=sum(sumPL1,PL1);
run;

mkeintz · Posted 03-01-2017 09:37 AM

A data step is an iterative procedure. Every time a sas program reaches the bottom of the data step, it returns to the top. It typically stops only when there is an attempt to read past the end of a data set (i.e. when the SET statement has exhausted the incoming data set).

So when you do this:

data new;

set old;

sumpl=sum(sumpl1,pl1);

run;

there will be N(Old) iterations of the data step, where N(old) is number of observations in data set OLD. At the beginning of each iteration all newly created vars (ie. not inherited from OLD), are reset to missing. So sumpl1 is reset to missing N(old) times. In the middle of the data step you get the sum of that missing and PL1, i.e. sumpl1 just gets the value of pl1. At the bottom of each data step is an implicit OUTPUT statement, resulting in a new var, sumpl1, that simply duplicates the old var.

However, in this program

data ssr2;

do i=1 y 1 until (last.performancelevel);

set ssr1;

by dis performancelevel;

sumpl1=sum(sumpl1,pl1);

end;

output ssr2;

run;

you have an iterated do loop with a SET statement inside. This particular loop tells sas (1) to expect ssr1 to be sorted by dis/performancelevel, and (2) to stop looping when the last record for a given dis/performancelevel combination has been reached. Inside the do loop you are adding to sumpl1, without it being reset to missing. Then the output statement (superfluous in your case) writes out a record. How many record does it write out? Only as many as there are dis/performancelevel combinations. Only after a complete combination has been processed, does the data step reach bottom and iterate to the top, where sumpl1 is finaly reset to missing.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

View solution in original post

mkeintz · Posted 03-01-2017 09:15 AM

The variable sumpl1 is a "new" variable (i.e. not brought in by a SET or MERGE), and therefore is reset to missing at the top of every iteration of the data step. However, in your first program you only have as many data step iterations as there are dis/performancelevel combinations. But in your second program you have as many iterations of the data step as there are observations in SSR1.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

smw10 · Posted 03-01-2017 09:19 AM

Thank you for your response. I am still kind of confused. Is there any way you can dumb that down more for me?

art297 · Posted 03-01-2017 09:33 AM

To fully understand what goes on under the hood, you'll have to do some reading. Here is a good place to start:

http://support.sas.com/documentation/cdl/en/lrcon/62955/HTML/default/viewer.htm#a000961108.htm

You can control whether a sum retains its previous value by using a different form of the sum statement, namely one that automatically retains the value. E.g., the following datastep does the same thing as your initial one:

data SSR4 (keep=Dis PerformanceLevel sumPL1);
  set SSR1;
  by Dis PerformanceLevel;
  if first.performancelevel then sumPL1=PL1;
  else sumPL1+PL1;
  if last.performancelevel then output;
run;

HTH,

Art, CEO, AnalystFinder.com

smw10 · Posted 03-01-2017 10:43 AM

I read the link and it helped me understand what's happening in the background. Thank you!

mkeintz · Posted 03-01-2017 09:37 AM

A data step is an iterative procedure. Every time a sas program reaches the bottom of the data step, it returns to the top. It typically stops only when there is an attempt to read past the end of a data set (i.e. when the SET statement has exhausted the incoming data set).

So when you do this:

data new;

set old;

sumpl=sum(sumpl1,pl1);

run;

there will be N(Old) iterations of the data step, where N(old) is number of observations in data set OLD. At the beginning of each iteration all newly created vars (ie. not inherited from OLD), are reset to missing. So sumpl1 is reset to missing N(old) times. In the middle of the data step you get the sum of that missing and PL1, i.e. sumpl1 just gets the value of pl1. At the bottom of each data step is an implicit OUTPUT statement, resulting in a new var, sumpl1, that simply duplicates the old var.

However, in this program

data ssr2;

do i=1 y 1 until (last.performancelevel);

set ssr1;

by dis performancelevel;

sumpl1=sum(sumpl1,pl1);

end;

output ssr2;

run;

you have an iterated do loop with a SET statement inside. This particular loop tells sas (1) to expect ssr1 to be sorted by dis/performancelevel, and (2) to stop looping when the last record for a given dis/performancelevel combination has been reached. Inside the do loop you are adding to sumpl1, without it being reset to missing. Then the output statement (superfluous in your case) writes out a record. How many record does it write out? Only as many as there are dis/performancelevel combinations. Only after a complete combination has been processed, does the data step reach bottom and iterate to the top, where sumpl1 is finaly reset to missing.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

smw10 · Posted 03-01-2017 10:42 AM

Thank you for clarifying. I have accepted your response as the solution. I understand now that sumPL1 isn't being reset to missing until it reaches the last observation for a given Dis PerformanceLevel pair. So, I want to test out my logic and see if it makes sense to you:

The data set starts and it enters the loop that starts at 1 and increments by 1 observation until the last observation in the FIRST Dis PerformanceLevel pair is found. Next, it reads in the data that exists in SSR1. It then creates a variable named sumpl1 which intially starts as missing but is immediately given a value of the sumpl1+pl1. The loop then returns to the top of the loop and repeats until it reaches the last observation of the FIRST Dis PerformanceLevel pair. The loop ends and it outputs the data to SSR2. Then it returns to the top of the data step and enters the loop, iterating until the last observation of the SECOND Dis PerformanceLevel pair is found. sumpl1 is set to missing and is then given a value of sumpl1+pl1. etc...

Is this correct? Additionally, I have two final questions:

1. why does it matter that the set statement is inside of the loop?

2. how does it know to output the distinct Dis PerformanceLevel pairs? For example, my dataset contains the following:

Dis PerformanceLevel PL1

1 1 1

1 2 1

I'd expect the result to be (even though I don't want the result to be like this):

Dis PerformanceLevel PL1 sumPL1

1 1 1 1

1 1 1 2

1 1 1 3

1 2 1 1

1 2 1 2

But instead I get (which is the result I want):

Dis PerformanceLevel PL1 sumPL1

1 1 1 3

1 2 1 2

mkeintz · Posted 03-01-2017 11:23 AM

@smw10 wrote:

.Thank you for clarifying. I have accepted your response as the solution. I understand now that sumPL1 isn't being reset to missing until it reaches the last observation for a given Dis PerformanceLevel pair. So, I want to test out my logic and see if it makes sense to you:

The data set starts and it enters the loop that starts at 1 and increments by 1 observation until the last observation in the FIRST Dis PerformanceLevel pair is found. Next, it reads in the data that exists in SSR1. It then creates a variable named sumpl1 which initially starts as missing but is immediately given a value of the sumpl1+pl1. The loop then returns to the top of the loop and repeats until it reaches the last observation of the FIRST Dis PerformanceLevel pair. The loop ends and it outputs the data to SSR2. Then it returns to the top of the data step and enters the loop, iterating until the last observation of the SECOND Dis PerformanceLevel pair is found. sumpl1 is set to missing and is then given a value of sumpl1+pl1. etc...

Is this correct? Additionally, I have two final questions:

1. why does it matter that the set statement is inside of the loop?

2. how does it know to output the distinct Dis PerformanceLevel pairs? For example, my dataset contains the following:

Dis PerformanceLevel PL1

1 1 1

1 1 1

1 1 1

1 2 1

1 2 1

I'd expect the result to be (even though I don't want the result to be like this):

Dis PerformanceLevel PL1 sumPL1

1 1 1 1

1 1 1 2

1 1 1 3

1 2 1 1

1 2 1 2

But instead I get (which is the result I want):

Dis PerformanceLevel PL1 sumPL1

1 1 1 3

1 2 1 2

You are essentially correct in your understanding. However, I would modify "it enters the loop that starts at 1 and increments by 1 observation" to "it enters the loop that starts at I=1 and increments I by 1" Then because there is a "set" statement inside that loop, a single observation is read for each iteration of the do loop.

As to your questions:

"Why does it matter that the set statement is inside the loop?" In this case, precisely because it defers reset of sumpl1 to missing.

There are other benefits to this technique as well. For instance you could run the loop once to get the mean of some value, and then rerun the same loop a second time to (1) reread the same obs, (2) calculate, say, a deviation from the newly calculated mean, and (3) have an explicit output statement inside the 2nd do loop, resulting in the same number of new records as old records, but now including within-by-group deviation from the mean.

"How does it know to output once per dis/perfermancelevel". When there is no explicit output statement in the data step, then there is an implied OUTPUT statement at the bottom, i.e. just before the "run;" statement. In the case of the do-loop, this part of the data step only happens at the end of each dis/performancelevel group, i.e. once per dis/perf... group. If you had put an explicit OUTPUT statement inside the do loop, then you would get one output observation for each input observation, showing the gradual accumulation of the sumpl1 value. And there would be no implicit output at the bottom of the data step.

Finally you could have avoided the do loop entirely and still get the desired results with this more traditional data step program:

data want;

set have;

by dis performancelevel;

retain sumpl1;

if first.performancelevel then sumpl1=0;

sumpl1=sum(sumpl1,pl1);

if last.performancelevel;

run;

Here, the RETAIN statement tells sas to NOT reset sumpl1 to missing at the top of each data step iteration (avoiding the problem you encountered absent the do loop). The "if first.perfo...." statement identifies the beginning of each by group level, and resets sumpl1 to zero. Then it increments sumpl1 by PL1 for every incoming obs.

The "if last.performancelevel;" is a subsetting if statement (notice there is no "then" clause). This means only observations that satisfy the if-condition are passed down to subsequent statement. Which in turn means that only the last observation for each by-group is exposed to the implicit output at the end of the data step.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

smw10 · Posted 03-01-2017 11:41 AM

I am 99% of the way to understanding this all. I now understand how it creates a column sum and only outputs the last observation of the Dis/PerformanceLevel pair. The only thing I'm still confused on is the set statement being within the do loop. I'm sorry if I am being bothersome, it's just really confusing me how the set statement being inside the do-loop defers sumpl1 from intializing to missing. I did try to run the set statement before the do-loop, rather than within the loop, to see if I could figure out the difference and SAS wouldn't complete the task so I ended up stopping it after a few minutes.

mkeintz · Posted 03-01-2017 02:02 PM

Run this program, which takes the program in question and intersperses PUT statements which will report on the log the dynamically changes values of sumpl1, as well as the data step iteration value _N_ and the do loop iteration value I, with the corresponding values of SUMPL1. Look at how SUMPL1 changes from location A to B1 to B2 (multiple B1's and B2's per A) to C.

Run this with just 2 or 3 dis/performancegroup collections:

data ssr2;
  put / 'Location A: Data Step iteration=' _n_  dis= performancelevel= sumpl1= ;
  do i=1 by 1 until (last.performancelevel);
    put +4 'Location B1:  Data Step Iteration ' _n_= '  Do loop iteration' I=  sumpl1=;
    set ssr1;
    by dis performancelevel;
    sumpl1=sum(sumpl1,pl1);
    put +4 'Location B2:  Data Step Iteration ' _n_= '  Do loop iteration' I=  sumpl1= pl1=;
  end;
  put 'Location C: Data Step iteration ' _N_= dis= performancelevel= sumpl1=;
  output ssr2;
run;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

smw10 · Posted 03-01-2017 08:08 PM

This cleared everything up. Thanks for teaching me the "put" statement. If you don't mind me asking, why does placing the set statement before the loop result in a never ending loop?

mkeintz · Posted 03-01-2017 11:00 PM

Putting a SET statement in front of the do loop will not make for a never ending loop, but removing the set statement from inside the loop will do so. After all, how can the "UNTIL (last.performmancelevel)" condition (which terminates the loop) ever be met if no data observations are being read in?

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

smw10 · Posted 03-01-2017 11:59 PM

Maybe I'm not understanding how it works correctly. I thought if you set it beforehand that the do until loop knew to use the already set dataset since it was already read in.

Catch up on SAS Innovate 2026

SAS Training: Just a Click Away