10-11-2012 03:16 PM
I found the following code to calculate the average (below). I'm a beginner and would like to know if this code is optimal for datasources with over 4 million records? The goal is to have the capability to write a data set that calculates the averages. Again, this is my first time so I might be in the wrong forum. Sorry.
input id month packs @@;
1 1 10 1 2 20 1 8 99 1 5 30 1 1 30
2 1 100 2 3 200 2 7 999 2 3 300 2 8 888
3 10 999 3 11 999
total = 0;
n = 0;
total + month;
if n then average = total / n;
do until (last.id)
10-11-2012 05:03 PM
I'd like to have the flexibility view various variables like:
Employee Department Salary Type Number of Employees Employee Average Salary Last year's Average Salary Difference/Ratio change
John Doe Finance 25000 Mid Manager 20 20000 17000
Plus, using a data set procedure allows me to double check my work against a SQL procedure.
10-11-2012 03:29 PM
Depends on if your data is sorted and you need it re-merged.
If the data is not sorted and you don't need it remerged then this is not the most efficient way.
10-11-2012 05:06 PM
As a beginner, an important lesson to learn is NOT to omit the necessary semi-colons. Your code wouldn't even run as posted.
Just as important, do you really want to calculate the average month or are you really trying to calculate the average packs?
Regardless, given that your data is already in ID order, yes it will run quicker than any of the procs. However, the caveat is that you have to type all of the code. When you are doing either multiple and/or more complex calculations, writing the code will probably take you longer than the running time you will save.
10-11-2012 05:34 PM
Prior to posting my response, I expanded your example to have 4.5 million records. Your code (with semicolons added) ran faster than proc means/summary and proc sql.
10-11-2012 06:24 PM
Yes, the code you originally proposed runs faster. The problem is the time required to write it which, for anything complicated, could easily take longer than the time saved and introduce more opportunity for error.
10-11-2012 06:57 PM
No, the OP's example data was already sorted. However, to make coding even more complex, even if it weren't sorted, I would have to think that including a sort-via-hash would still be faster.
I'm not recommending the code-your-own approach, just answering the original question whether data step processing is more efficient (processing wise) than other methods.
10-11-2012 07:32 PM
Is your data originally in a SAS data set, or are you first reading in a sorted raw data file? If you are starting with the sorted raw data file, you could modify the first DATA step to produce your averages.