BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Anna_nag
Obsidian | Level 7

Hi All,

I need you suggestions to make my logic little efficient/faster since I am dealing with HUGE data.

I have one big table on which I am calculating average of a column twice (see below avgone and avgtwo dataset) by two different by groups.

Now I would like to add these two columns avg1 and avg2 back with big table (left join/master table) which has different by group to join with.

I know I can do it in several proc sql steps but can I do it in one step of proc sql/data step or preferably hash programming which is faster than several proc sqls?

Thanks in advance for your help.

/*master table*/

data big;

input A1   A2  A3  A4  Rank  revenue ;

cards;

1      1      1     1     1            100

1      1      1     1     2            120    

1      1      2     1     3            90

1      1      2     1     4            12

1      1      3     2     1            100

1      1      3     2     2            107

1      1      3     3     3            198

1      1      3     4     4            87  

1      1      4     4     5            18

1      1      4     5     6            54

1      1      4     5     7            178

;

run;

/* average with different by group */

proc sql;

create table avgone as

select a1, a2, a3 , avg(revenue) as avg1

from big

group by a1,a2,a3;

quit;

/* average with different by group */

proc sql;

create table avgtwo as

select a1, a2, a4 , avg(revenue) as avg2

from big

group by a1,a2,a4;

quit;

/* Now join avgone and avgtwo back with big table */

1 ACCEPTED SOLUTION

Accepted Solutions
PGStats
Opal | Level 21

SAS SQL remerges summaries automatically when the select list contains columns not involved in the grouping. Try it!

The inclusion of summaries in the table sounds like poor modelling because it is redundant information and  modifying your data will likely make your summaries inconsistent.

PG

PG

View solution in original post

10 REPLIES 10
LinusH
Tourmaline | Level 20

You could avoid joining back the result to the detail data by having all column in the select clause. Don't know how efficient that will be in your case. And I get a little tired of always pointing out hash programming to solve performance issues. It seem often that you are reinventing the week.  There are standard procedures and SQL for a reason. Do you really need to rejoin aggregate information to detail rows? What is the final goal/application?  To me it sounds like bad modelling.

Data never sleeps
Anna_nag
Obsidian | Level 7

I cannot have all column names in select statement since I am grouping it with smaller subset of columns... if I do that it will not summarize my data.

Are you saying hash programming is not going to solve performance issue?

My final goal is to add two new variables to the original big table and those two new variables are the average of revenue column calculated by different by group.

not sure why it is bad modeling..

PGStats
Opal | Level 21

SAS SQL remerges summaries automatically when the select list contains columns not involved in the grouping. Try it!

The inclusion of summaries in the table sounds like poor modelling because it is redundant information and  modifying your data will likely make your summaries inconsistent.

PG

PG
Anna_nag
Obsidian | Level 7

Got it..Thanks for your feedback..

Anotherdream
Quartz | Level 8

Linus sorry to jump into this thread but I am also curious to your and PGstats responses.

I do know that sql can have different select clauses over group by clauses, but since Anna_nag has to do it twice how would this work in one sql statement? Or are you saying to simply do it in two sql statements and then merge them (which becomes your big table) and then live with the performance issues?

Ex: I know you can do this, and this is a "solution" to the original posters question.

proc sql;

create table avgtwo as

select a1, a2,a3, a4 , avg(revenue) as avg2

from big

group by a1,a2,a4;

quit;

Which would calculate the average over a1,a2, and a4 across the assocaited rows (including A3). however how would you then calculate the average of a1,a2,a3 without doing another create table and merge?  I realize the merge is more efficient since the tables would share all variables and it would only be one merge. I'm just curious if this is what you were implying. So the final code would look like

proc sql;

create table avgone as

select a1, a2,a3, a4 , avg(revenue) as avg1

from big

group by a1,a2,a3;

quit;

proc sql;

create table avgtwo as

select a1, a2,a3, a4 , avg(revenue) as avg2

from big

group by a1,a2,a4;

quit;

proc sort data=avgone;

by a1 a2 a3 a4;

run;

proc sort data=avgtwo;

by a1 a2 a3 a4;

run;

data answer;

merge avgone avgtwo;

by a1 a2 a3 a4;

run;

PGStats
Opal | Level 21

You could do:

proc sql;

create table bigger as

select *, mean(revenue) as rev124

from (select *, mean(revenue) as rev123 from big group by a1, a2, a3)

group by a1, a2, a4;

quit;

PG

PG
LinusH
Tourmaline | Level 20

No problem Anotherdream.

No, not merge. First round calculate the first mean, sr´tore with detail data.

Second round, calculate the the second means from round one result table.

proc sql;

create table avgone as

select *, avg(revenue) as avg1

from big

group by a1,a2,a3;

quit;

/* average with different by group */

proc sql;

create table avgtwo as

select * , avg(revenue) as avg2

from avgone

group by a1,a2,a4;

quit;

Data never sleeps
Anotherdream
Quartz | Level 8

Ah that makes total sense. SOrry I did not conceptually get that. I am so not used to the select being able to be different from the group I never thought of a sub stepping solution like you and PG have.

Sorry for the confusion, that's actually a very cool trick.

Anna_nag
Obsidian | Level 7

Thanks everyone.. Got some really good feedback from you all.

Astounding
PROC Star

This may prove irrelevant to your final solution, but notice ...

You are already processing your huge data set twice, to get the average revenue for two different groupings.  A single PROC SUMMARY would produce both levels of summarization with a single pass through the data.  The value of switching to PROC SUMMARY depends on how large your summarized tables are, relative to the original huge table.  The best methods of joining the averages may also depend on the size of the summarized tables.  How many records are you seeing in AVGONE and AVGTWO?  (And how many in the original table BIG?)

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 10 replies
  • 1240 views
  • 2 likes
  • 5 in conversation