BookmarkSubscribeRSS Feed
SAS_dj1999
Calcite | Level 5

I found the following code to calculate the average (below).  I'm a beginner and would like to know if this code is optimal for datasources with over 4 million records?  The goal is to have the capability to write a data set that calculates the averages.  Again, this is my first time so I might be in the wrong forum.  Sorry.

data have;

input id month packs @@;

datalines

1 1 10 1 2 20 1 8 99 1 5 30 1 1 30

2 1 100 2 3 200 2 7 999 2 3 300 2 8 888

3 10 999 3 11 999

;

run;
quit;

data must;

total = 0;

n = 0;

do until(last.id)

set have;

by id

total + month;

n+ month;

end;

if n then average = total / n;

do until (last.id)

set have;

by id;

output;

end;

run;

13 REPLIES 13
Haikuo
Onyx | Level 15

Is there a particular reason that you choose not to use Proc summary or Proc SQL?

Haikuo

Tal
Pyrite | Level 9 Tal
Pyrite | Level 9

aren't total and n holding the same value throughout the entire program?

wouldn't he be better off using proc means mean?

SAS_dj1999
Calcite | Level 5

Hai.kuo,

I'd like to have the flexibility view various variables like:

Employee          Department      Salary     Type                        Number of Employees      Employee Average Salary      Last year's Average Salary   Difference/Ratio change

John Doe           Finance           25000      Mid Manager            20                                  20000                                    17000

Plus, using a data set procedure allows me to double check my work against a SQL procedure.

Thanks,

Vincent

Reeza
Super User

Depends on if your data is sorted and you need it re-merged.

If the data is not sorted and you don't need it remerged then this is not the most efficient way.

art297
Opal | Level 21

As a beginner, an important lesson to learn is NOT to omit the necessary semi-colons.  Your code wouldn't even run as posted.

Just as important, do you really want to calculate the average month or are you really trying to calculate the average packs?

Regardless, given that your data is already in ID order, yes it will run quicker than any of the procs.  However, the caveat is that you have to type all of the code.  When you are doing either multiple and/or more complex calculations, writing the code will probably take you longer than the running time you will save.

SAS_dj1999
Calcite | Level 5

Arthur,

This was just an example.  I was more so looking to see if I could run code like this for a datasource that contain millions of records.

Thanks,

art297
Opal | Level 21

Prior to posting my response, I expanded your example to have 4.5 million records.  Your code (with semicolons added) ran faster than proc means/summary and proc sql.

SAS_dj1999
Calcite | Level 5

Arthur,

Just to clarify your response, the code above does run faster?

Thanks,

art297
Opal | Level 21

Yes, the code you originally proposed runs faster.  The problem is the time required to write it which, for anything complicated, could easily take longer than the time saved and introduce more opportunity for error.

Reeza
Super User

Including a sort?

art297
Opal | Level 21

No, the OP's example data was already sorted.  However, to make coding even more complex, even if it weren't sorted, I would have to think that including a sort-via-hash would still be faster.

I'm not recommending the code-your-own approach, just answering the original question whether data step processing is more efficient (processing wise) than other methods.

Astounding
PROC Star

Is your data originally in a SAS data set, or are you first reading in a sorted raw data file?  If you are starting with the sorted raw data file, you could modify the first DATA step to produce your averages.

SAS_dj1999
Calcite | Level 5

First.  Arthur, Thanks!

Astounding, my data is a SQL server.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 13 replies
  • 1188 views
  • 0 likes
  • 6 in conversation