## Averages within a data set

Occasional Contributor
Posts: 7

# Averages within a data set

I found the following code to calculate the average (below).  I'm a beginner and would like to know if this code is optimal for datasources with over 4 million records?  The goal is to have the capability to write a data set that calculates the averages.  Again, this is my first time so I might be in the wrong forum.  Sorry.

data have;

input id month packs @@;

datalines

1 1 10 1 2 20 1 8 99 1 5 30 1 1 30

2 1 100 2 3 200 2 7 999 2 3 300 2 8 888

3 10 999 3 11 999

;

run;
quit;

data must;

total = 0;

n = 0;

do until(last.id)

set have;

by id

total + month;

n+ month;

end;

if n then average = total / n;

do until (last.id)

set have;

by id;

output;

end;

run;

Posts: 3,167

## Re: Averages within a data set

Is there a particular reason that you choose not to use Proc summary or Proc SQL?

Haikuo

Super Contributor
Posts: 464

## Re: Averages within a data set

aren't total and n holding the same value throughout the entire program?

wouldn't he be better off using proc means mean?

Occasional Contributor
Posts: 7

## Re: Averages within a data set

Hai.kuo,

I'd like to have the flexibility view various variables like:

Employee          Department      Salary     Type                        Number of Employees      Employee Average Salary      Last year's Average Salary   Difference/Ratio change

John Doe           Finance           25000      Mid Manager            20                                  20000                                    17000

Plus, using a data set procedure allows me to double check my work against a SQL procedure.

Thanks,

Vincent

Super User
Posts: 23,700

## Re: Averages within a data set

Depends on if your data is sorted and you need it re-merged.

If the data is not sorted and you don't need it remerged then this is not the most efficient way.

PROC Star
Posts: 8,164

## Re: Averages within a data set

As a beginner, an important lesson to learn is NOT to omit the necessary semi-colons.  Your code wouldn't even run as posted.

Just as important, do you really want to calculate the average month or are you really trying to calculate the average packs?

Regardless, given that your data is already in ID order, yes it will run quicker than any of the procs.  However, the caveat is that you have to type all of the code.  When you are doing either multiple and/or more complex calculations, writing the code will probably take you longer than the running time you will save.

Occasional Contributor
Posts: 7

## Re: Averages within a data set

Arthur,

This was just an example.  I was more so looking to see if I could run code like this for a datasource that contain millions of records.

Thanks,

PROC Star
Posts: 8,164

## Re: Averages within a data set

Prior to posting my response, I expanded your example to have 4.5 million records.  Your code (with semicolons added) ran faster than proc means/summary and proc sql.

Occasional Contributor
Posts: 7

## Re: Averages within a data set

Arthur,

Just to clarify your response, the code above does run faster?

Thanks,

PROC Star
Posts: 8,164

## Re: Averages within a data set

Yes, the code you originally proposed runs faster.  The problem is the time required to write it which, for anything complicated, could easily take longer than the time saved and introduce more opportunity for error.

Super User
Posts: 23,700

## Re: Averages within a data set

Including a sort?

PROC Star
Posts: 8,164

## Re: Averages within a data set

No, the OP's example data was already sorted.  However, to make coding even more complex, even if it weren't sorted, I would have to think that including a sort-via-hash would still be faster.

I'm not recommending the code-your-own approach, just answering the original question whether data step processing is more efficient (processing wise) than other methods.

Super User
Posts: 6,762

## Re: Averages within a data set

Is your data originally in a SAS data set, or are you first reading in a sorted raw data file?  If you are starting with the sorted raw data file, you could modify the first DATA step to produce your averages.

Occasional Contributor
Posts: 7