Solved: Re: How to join 3 tables with different by group in fewer steps?

Anna_nag · Posted 09-30-2014 02:01 PM

Hi All,

I need you suggestions to make my logic little efficient/faster since I am dealing with HUGE data.

I have one big table on which I am calculating average of a column twice (see below avgone and avgtwo dataset) by two different by groups.

Now I would like to add these two columns avg1 and avg2 back with big table (left join/master table) which has different by group to join with.

I know I can do it in several proc sql steps but can I do it in one step of proc sql/data step or preferably hash programming which is faster than several proc sqls?

Thanks in advance for your help.

/*master table*/

data big;

input A1 A2 A3 A4 Rank revenue ;

cards;

1 1 1 1 1 100

1 1 1 1 2 120

1 1 2 1 3 90

1 1 2 1 4 12

1 1 3 2 1 100

1 1 3 2 2 107

1 1 3 3 3 198

1 1 3 4 4 87

1 1 4 4 5 18

1 1 4 5 6 54

1 1 4 5 7 178

;

run;

/* average with different by group */

proc sql;

create table avgone as

select a1, a2, a3 , avg(revenue) as avg1

from big

group by a1,a2,a3;

quit;

/* average with different by group */

proc sql;

create table avgtwo as

select a1, a2, a4 , avg(revenue) as avg2

from big

group by a1,a2,a4;

quit;

/* Now join avgone and avgtwo back with big table */

PGStats · Posted 09-30-2014 04:19 PM

SAS SQL remerges summaries automatically when the select list contains columns not involved in the grouping. Try it!

The inclusion of summaries in the table sounds like poor modelling because it is redundant information and modifying your data will likely make your summaries inconsistent.

PG

View solution in original post

LinusH · Posted 09-30-2014 02:51 PM

You could avoid joining back the result to the detail data by having all column in the select clause. Don't know how efficient that will be in your case. And I get a little tired of always pointing out hash programming to solve performance issues. It seem often that you are reinventing the week. There are standard procedures and SQL for a reason. Do you really need to rejoin aggregate information to detail rows? What is the final goal/application? To me it sounds like bad modelling.

Data never sleeps

Anna_nag · Posted 09-30-2014 03:18 PM

I cannot have all column names in select statement since I am grouping it with smaller subset of columns... if I do that it will not summarize my data.

Are you saying hash programming is not going to solve performance issue?

My final goal is to add two new variables to the original big table and those two new variables are the average of revenue column calculated by different by group.

not sure why it is bad modeling..

PGStats · Posted 09-30-2014 04:19 PM

SAS SQL remerges summaries automatically when the select list contains columns not involved in the grouping. Try it!

The inclusion of summaries in the table sounds like poor modelling because it is redundant information and modifying your data will likely make your summaries inconsistent.

PG

Anna_nag · Posted 09-30-2014 04:29 PM

Got it..Thanks for your feedback..

Anotherdream · Posted 09-30-2014 04:35 PM

Linus sorry to jump into this thread but I am also curious to your and PGstats responses.

I do know that sql can have different select clauses over group by clauses, but since Anna_nag has to do it twice how would this work in one sql statement? Or are you saying to simply do it in two sql statements and then merge them (which becomes your big table) and then live with the performance issues?

Ex: I know you can do this, and this is a "solution" to the original posters question.

proc sql;

create table avgtwo as

select a1, a2,a3, a4 , avg(revenue) as avg2

from big

group by a1,a2,a4;

quit;

Which would calculate the average over a1,a2, and a4 across the assocaited rows (including A3). however how would you then calculate the average of a1,a2,a3 without doing another create table and merge? I realize the merge is more efficient since the tables would share all variables and it would only be one merge. I'm just curious if this is what you were implying. So the final code would look like

proc sql;

create table avgone as

select a1, a2,a3, a4 , avg(revenue) as avg1

from big

group by a1,a2,a3;

quit;

proc sql;

create table avgtwo as

select a1, a2,a3, a4 , avg(revenue) as avg2

from big

group by a1,a2,a4;

quit;

proc sort data=avgone;

by a1 a2 a3 a4;

run;

proc sort data=avgtwo;

by a1 a2 a3 a4;

run;

data answer;

merge avgone avgtwo;

by a1 a2 a3 a4;

run;

PGStats · Posted 09-30-2014 04:46 PM

You could do:

proc sql;

create table bigger as

select *, mean(revenue) as rev124

from (select *, mean(revenue) as rev123 from big group by a1, a2, a3)

group by a1, a2, a4;

quit;

PG

LinusH · Posted 09-30-2014 04:59 PM

No problem Anotherdream.

No, not merge. First round calculate the first mean, sr´tore with detail data.

Second round, calculate the the second means from round one result table.

proc sql;

create table avgone as

select *, avg(revenue) as avg1

from big

group by a1,a2,a3;

quit;

/* average with different by group */

proc sql;

create table avgtwo as

select * , avg(revenue) as avg2

from avgone

group by a1,a2,a4;

quit;

Data never sleeps

Anotherdream · Posted 09-30-2014 05:05 PM

Ah that makes total sense. SOrry I did not conceptually get that. I am so not used to the select being able to be different from the group I never thought of a sub stepping solution like you and PG have.

Sorry for the confusion, that's actually a very cool trick.

Anna_nag · Posted 10-01-2014 12:07 PM

Thanks everyone.. Got some really good feedback from you all.

Astounding · Posted 09-30-2014 04:39 PM

This may prove irrelevant to your final solution, but notice ...

You are already processing your huge data set twice, to get the average revenue for two different groupings. A single PROC SUMMARY would produce both levels of summarization with a single pass through the data. The value of switching to PROC SUMMARY depends on how large your summarized tables are, relative to the original huge table. The best methods of joining the averages may also depend on the size of the summarized tables. How many records are you seeing in AVGONE and AVGTWO? (And how many in the original table BIG?)

SAS Innovate 2025: Call for Content

Classroom Training Available!