Hi All,
I need your suggestions to make my logic a little more efficient/faster, since I am dealing with HUGE data.
I have one big table on which I am calculating the average of a column twice (see the AVGONE and AVGTWO datasets below), using two different BY groups.
Now I would like to join these two new columns, avg1 and avg2, back to the big (master) table, which requires joining on different BY groups.
I know I can do it in several PROC SQL steps, but can I do it in one PROC SQL or data step, or preferably with hash programming, which should be faster than several PROC SQLs?
Thanks in advance for your help.
/*master table*/
data big;
input A1 A2 A3 A4 Rank revenue ;
cards;
1 1 1 1 1 100
1 1 1 1 2 120
1 1 2 1 3 90
1 1 2 1 4 12
1 1 3 2 1 100
1 1 3 2 2 107
1 1 3 3 3 198
1 1 3 4 4 87
1 1 4 4 5 18
1 1 4 5 6 54
1 1 4 5 7 178
;
run;
/* average with different by group */
proc sql;
create table avgone as
select a1, a2, a3 , avg(revenue) as avg1
from big
group by a1,a2,a3;
quit;
/* average with different by group */
proc sql;
create table avgtwo as
select a1, a2, a4 , avg(revenue) as avg2
from big
group by a1,a2,a4;
quit;
/* Now join avgone and avgtwo back with big table */
SAS SQL remerges summaries automatically when the select list contains columns not involved in the grouping. Try it!
The inclusion of summaries in the table sounds like poor modelling, because it stores redundant information, and modifying your data will likely make your summaries inconsistent.
PG
You could avoid joining the result back to the detail data by having all columns in the SELECT clause. I don't know how efficient that would be in your case. And I get a little tired of always pointing to hash programming to solve performance issues; it often seems that you are reinventing the wheel. There are standard procedures and SQL for a reason. Do you really need to rejoin aggregate information to detail rows? What is the final goal/application? To me it sounds like bad modelling.
I cannot have all column names in the SELECT statement since I am grouping by a smaller subset of columns... if I do that, it will not summarize my data.
Are you saying hash programming is not going to solve the performance issue?
My final goal is to add two new variables to the original big table; those two new variables are the averages of the revenue column, calculated over different BY groups.
I'm not sure why it is bad modelling..
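For reference, here is a minimal data-step hash sketch of the lookup being asked about, assuming the AVGONE and AVGTWO summary tables created earlier; the output dataset name BIG_WITH_AVGS is illustrative:

```sas
/* Sketch: add avg1 and avg2 to BIG in a single pass using hash lookups.
   Assumes AVGONE (keys a1,a2,a3; data avg1) and AVGTWO (keys a1,a2,a4;
   data avg2) already exist, as created by the PROC SQL steps above. */
data big_with_avgs;
  /* bring avg1 and avg2 into the PDV without reading any rows */
  if 0 then set avgone(keep=avg1) avgtwo(keep=avg2);
  if _n_ = 1 then do;
    declare hash h1(dataset:'avgone');
    h1.defineKey('a1','a2','a3');
    h1.defineData('avg1');
    h1.defineDone();
    declare hash h2(dataset:'avgtwo');
    h2.defineKey('a1','a2','a4');
    h2.defineData('avg2');
    h2.defineDone();
  end;
  set big;
  /* find() fills avg1/avg2 on a key match; clear them otherwise */
  if h1.find() ne 0 then call missing(avg1);
  if h2.find() ne 0 then call missing(avg2);
run;
```

Both summary tables must fit in memory for this to pay off, but BIG itself is read only once and needs no sorting.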
Got it..Thanks for your feedback..
Linus, sorry to jump into this thread, but I am also curious about your and PGstats's responses.
I do know that SQL can have a SELECT clause that differs from the GROUP BY clause, but since Anna_nag has to do it twice, how would this work in one SQL statement? Or are you saying to simply do it in two SQL statements, merge them (which becomes your big table), and live with the performance issues?
Ex: I know you can do this, and it is a "solution" to the original poster's question.
proc sql;
create table avgtwo as
select a1, a2,a3, a4 , avg(revenue) as avg2
from big
group by a1,a2,a4;
quit;
This would calculate the average over a1, a2, and a4 across the associated rows (including a3). However, how would you then calculate the average over a1, a2, a3 without doing another CREATE TABLE and a merge? I realize the merge is more efficient since the tables would share all variables and there would be only one merge. I'm just curious if this is what you were implying. So the final code would look like:
proc sql;
create table avgone as
select a1, a2,a3, a4 , avg(revenue) as avg1
from big
group by a1,a2,a3;
quit;
proc sql;
create table avgtwo as
select a1, a2,a3, a4 , avg(revenue) as avg2
from big
group by a1,a2,a4;
quit;
proc sort data=avgone;
by a1 a2 a3 a4;
run;
proc sort data=avgtwo;
by a1 a2 a3 a4;
run;
data answer;
merge avgone avgtwo;
by a1 a2 a3 a4;
run;
You could do:
proc sql;
create table bigger as
select *, mean(revenue) as rev124
from (select *, mean(revenue) as rev123 from big group by a1, a2, a3)
group by a1, a2, a4;
quit;
PG
No problem Anotherdream.
No, not merge. First round: calculate the first mean and store it with the detail data.
Second round: calculate the second means from the round-one result table.
proc sql;
create table avgone as
select *, avg(revenue) as avg1
from big
group by a1,a2,a3;
quit;
/* average with different by group */
proc sql;
create table avgtwo as
select * , avg(revenue) as avg2
from avgone
group by a1,a2,a4;
quit;
Ah, that makes total sense. Sorry, I did not conceptually get that. I am so unused to the SELECT list being allowed to differ from the GROUP BY that I never thought of a two-step solution like yours and PG's.
Sorry for the confusion; that's actually a very cool trick.
Thanks everyone.. Got some really good feedback from you all.
This may prove irrelevant to your final solution, but notice ...
You are already processing your huge data set twice, to get the average revenue for two different groupings. A single PROC SUMMARY would produce both levels of summarization with a single pass through the data. The value of switching to PROC SUMMARY depends on how large your summarized tables are, relative to the original huge table. The best methods of joining the averages may also depend on the size of the summarized tables. How many records are you seeing in AVGONE and AVGTWO? (And how many in the original table BIG?)
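The single-pass idea above could be sketched like this, using the TYPES statement to request exactly the two class combinations; the output dataset name AVGS and the stat name avg_rev are illustrative:

```sas
/* Sketch: one pass over BIG producing both summary levels.
   TYPES restricts the output to the two requested groupings;
   the automatic _TYPE_ variable tells them apart. */
proc summary data=big;
  class a1 a2 a3 a4;
  var revenue;
  types a1*a2*a3  a1*a2*a4;
  output out=avgs mean=avg_rev;
run;
```

AVGS then holds both sets of averages, distinguishable by _TYPE_ (or by which class variables are non-missing), and can be split or looked up when joining back to BIG.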