Averages within a data set

SAS_dj1999 · Posted 10-11-2012 03:16 PM

I found the following code to calculate the average (below). I'm a beginner and would like to know if this code is optimal for datasources with over 4 million records? The goal is to have the capability to write a data set that calculates the averages. Again, this is my first time so I might be in the wrong forum. Sorry.

data have;

input id month packs @@;

datalines

1 1 10 1 2 20 1 8 99 1 5 30 1 1 30

2 1 100 2 3 200 2 7 999 2 3 300 2 8 888

3 10 999 3 11 999

;

run;
quit;

data must;

total = 0;

n = 0;

do until(last.id)

set have;

by id

total + month;

n+ month;

end;

if n then average = total / n;

do until (last.id)

set have;

by id;

output;

end;

run;

Haikuo · Posted 10-11-2012 03:24 PM

Is there a particular reason that you choose not to use Proc summary or Proc SQL?

Haikuo

Tal · Posted 10-11-2012 03:52 PM

aren't total and n holding the same value throughout the entire program?

wouldn't he be better off using proc means mean?

SAS_dj1999 · Posted 10-11-2012 05:03 PM

Hai.kuo,

I'd like to have the flexibility view various variables like:

Employee Department Salary Type Number of Employees Employee Average Salary Last year's Average Salary Difference/Ratio change

John Doe Finance 25000 Mid Manager 20 20000 17000

Plus, using a data set procedure allows me to double check my work against a SQL procedure.

Thanks,

Vincent

Reeza · Posted 10-11-2012 03:29 PM

Depends on if your data is sorted and you need it re-merged.

If the data is not sorted and you don't need it remerged then this is not the most efficient way.

art297 · Posted 10-11-2012 05:06 PM

As a beginner, an important lesson to learn is NOT to omit the necessary semi-colons. Your code wouldn't even run as posted.

Just as important, do you really want to calculate the average month or are you really trying to calculate the average packs?

Regardless, given that your data is already in ID order, yes it will run quicker than any of the procs. However, the caveat is that you have to type all of the code. When you are doing either multiple and/or more complex calculations, writing the code will probably take you longer than the running time you will save.

SAS_dj1999 · Posted 10-11-2012 05:29 PM

Arthur,

This was just an example. I was more so looking to see if I could run code like this for a datasource that contain millions of records.

Thanks,

art297 · Posted 10-11-2012 05:34 PM

Prior to posting my response, I expanded your example to have 4.5 million records. Your code (with semicolons added) ran faster than proc means/summary and proc sql.

SAS_dj1999 · Posted 10-11-2012 05:42 PM

Arthur,

Just to clarify your response, the code above does run faster?

Thanks,

art297 · Posted 10-11-2012 06:24 PM

Yes, the code you originally proposed runs faster. The problem is the time required to write it which, for anything complicated, could easily take longer than the time saved and introduce more opportunity for error.

Reeza · Posted 10-11-2012 06:39 PM

Including a sort?

art297 · Posted 10-11-2012 06:57 PM

No, the OP's example data was already sorted. However, to make coding even more complex, even if it weren't sorted, I would have to think that including a sort-via-hash would still be faster.

I'm not recommending the code-your-own approach, just answering the original question whether data step processing is more efficient (processing wise) than other methods.

Astounding · Posted 10-11-2012 07:32 PM

Is your data originally in a SAS data set, or are you first reading in a sorted raw data file? If you are starting with the sorted raw data file, you could modify the first DATA step to produce your averages.

SAS_dj1999 · Posted 10-12-2012 10:23 AM

First. Arthur, Thanks!

Astounding, my data is a SQL server.

Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

Re: Averages within a data set

The 2025 SAS Hackathon has begun!

SAS Training: Just a Click Away