I found the following code to calculate the average (below). I'm a beginner and would like to know if this code is optimal for datasources with over 4 million records? The goal is to have the capability to write a data set that calculates the averages. Again, this is my first time so I might be in the wrong forum. Sorry.
data have;
input id month packs @@;
datalines
1 1 10 1 2 20 1 8 99 1 5 30 1 1 30
2 1 100 2 3 200 2 7 999 2 3 300 2 8 888
3 10 999 3 11 999
;
run;
quit;
data must;
total = 0;
n = 0;
do until(last.id)
set have;
by id
total + month;
n+ month;
end;
if n then average = total / n;
do until (last.id)
set have;
by id;
output;
end;
run;
Is there a particular reason that you choose not to use Proc summary or Proc SQL?
Haikuo
aren't total and n holding the same value throughout the entire program?
wouldn't he be better off using proc means mean?
Hai.kuo,
I'd like to have the flexibility view various variables like:
Employee Department Salary Type Number of Employees Employee Average Salary Last year's Average Salary Difference/Ratio change
John Doe Finance 25000 Mid Manager 20 20000 17000
Plus, using a data set procedure allows me to double check my work against a SQL procedure.
Thanks,
Vincent
Depends on if your data is sorted and you need it re-merged.
If the data is not sorted and you don't need it remerged then this is not the most efficient way.
As a beginner, an important lesson to learn is NOT to omit the necessary semi-colons. Your code wouldn't even run as posted.
Just as important, do you really want to calculate the average month or are you really trying to calculate the average packs?
Regardless, given that your data is already in ID order, yes it will run quicker than any of the procs. However, the caveat is that you have to type all of the code. When you are doing either multiple and/or more complex calculations, writing the code will probably take you longer than the running time you will save.
Arthur,
This was just an example. I was more so looking to see if I could run code like this for a datasource that contain millions of records.
Thanks,
Prior to posting my response, I expanded your example to have 4.5 million records. Your code (with semicolons added) ran faster than proc means/summary and proc sql.
Arthur,
Just to clarify your response, the code above does run faster?
Thanks,
Yes, the code you originally proposed runs faster. The problem is the time required to write it which, for anything complicated, could easily take longer than the time saved and introduce more opportunity for error.
Including a sort?
No, the OP's example data was already sorted. However, to make coding even more complex, even if it weren't sorted, I would have to think that including a sort-via-hash would still be faster.
I'm not recommending the code-your-own approach, just answering the original question whether data step processing is more efficient (processing wise) than other methods.
Is your data originally in a SAS data set, or are you first reading in a sorted raw data file? If you are starting with the sorted raw data file, you could modify the first DATA step to produce your averages.
First. Arthur, Thanks!
Astounding, my data is a SQL server.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.