BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
happy_sas_kitty
Obsidian | Level 7

This question has been asked before but I did not understand the explanation.

Old link: https://communities.sas.com/t5/General-SAS-Programming/PROC-SQL/td-p/314917

The data is here:

data work.pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;
run;

We want to get a result that looks like this:

Jobcode Salary Avg
------- ------ -----
PT1 50000 60000
PT1 70000 60000
PT1 60000 60000
PT2 80000 85000
PT2 90000 85000
PT3 100000 100000

It seems that the results come from average(salary) by group, ordered by id.

Why does the code below work? The part I don't understand is highlighted in red.

select
Jobcode,

Salary,
(select avg(Salary) from WORK.PILOTS as P1
where P1.Jobcode=P2.Jobcode) as Avg
from WORK.PILOTS as P2
order by Id
;

Here's what I'm thinking about the red part subquery:

select avg(salary) from work.pilots as p1

this gives us 1 value of avg(all salaries) in p1. It has no jobcode column. Thus the next line "where p1.jobcode = p2.jobcode" should not work.

This would make more sense:

select avg(Salary) from WORK.PILOTS as P1
group by jobcode

However this code, although giving us 3 averages for 3 jobcodes, does not have an one-to-one relationship with pilots.id

Can anyone explain step by step on how the codes marked in red works?

 

1 ACCEPTED SOLUTION

Accepted Solutions
Kurt_Bremser
Super User

You are mostly correct.

Wrong is this:

"It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing."

That's what a data step does, in SQL the values are always taken from the incoming data; missing values occur only

  • if missing values are present in an incoming table
  • or, in a (outer/left/right) join, one of the contributing tables has no match; then its values will be missing

But that's a minor detail.

 

"If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?"

SAS SQL does some optimization behind the scenes, so (that's a guess) it should see this construct and summarize once, like this:

 

create table _temp_ as
  select
    jobcode,
    avg(salary) as avg
  from work.pilots
;
create table want as
  select
    p.Jobcode,
    p.Salary,
    t.avg
  from work.pilots p left join _temp_ t
  on p.jobcode = t.jobcode
;

where _temp_ stands for a "hidden" table that exists only for the duration of the query.

 

But you may be right, and code like this could be (considerably) less performant than a separate summary:

data pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;

proc summary data=pilots;
class jobcode;
var salary;
output
  out=avg (keep=jobcode avg)
  mean(salary)=avg
;
run;

data want;
set pilots;
if _n_ = 1
then do;
  length avg 8;
  declare hash a (dataset:"avg");
  a.definekey("jobcode");
  a.definedata("avg");
  a.definedone();
end;
if a.find() ne 0 then avg = .;
run;

This method completely avoids any sorting (all "sorted" tables are built solely in memory) and will be the most performant way to tackle your issue.

 

View solution in original post

6 REPLIES 6
happy_sas_kitty
Obsidian | Level 7
somehow the "red" option does not work in "SAS code" area. and I can't edit my own post..
The part where I don't understand is this:
(select avg(Salary) from WORK.PILOTS as P1where P1.Jobcode=P2.Jobcode) as Avg
s_lassen
Meteorite | Level 14

The stuff in the paranthesis is a subquery; meaning that for jobcode PT1 the average salary for the first 3 rows (the ones having that jobcode) is selected.

 

So for those rows, the subquery is equivalent to 

select avg(Salary) from WORK.PILOTS as P1 where P1.Jobcode='PT1'

which returns a single value - and so on for the other jobcodes.

 

Your GROUP BY suggestion would not work in such a subquery, as it would not return a single row. 

 

You could, in SAS SQL, use a GROUP BY to get the same result like this:

proc sql;
  select
    ID,
    Jobcode,
    Salary,
    avg(Salary) as avg 
    from WORK.PILOTS 
    group by Jobcode
    order by Id; 
quit;

The reason it is not done like that is probably that this type of query would not work in most other SQL dialects, whereas the subquery solution would.

happy_sas_kitty
Obsidian | Level 7
Thank you for the fast reply. I somehow still don't understand what's going on.
how did
(select avg(Salary) from WORK.PILOTS as P1where P1.Jobcode=P2.Jobcode) as Avg
transform into this:
select avg(Salary) from WORK.PILOTS as P1 where P1.Jobcode='PT1'
I probably need a step by step explanation.

after reading your reply, my guess is, the whole thing happens like this:
the proc sql runs row by row on the outer query.
It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing.
on 1st row: it loads id, jobcode, salary from p2. It sees column Avg and tries to calculate it.
Avg is calculated from: (select avg(Salary) from WORK.PILOTS as P1where P1.Jobcode=P2.Jobcode)
since currently we are loading from p2 (because the from p2 clause) , we see that p2.jobcode on this 1st row is 'PT1'
Then, sas looks at "select avg(Salary) from WORK.PILOTS as P1 where P1.Jobcode='PT1'" , and returns the avg value for jobcode='PT1', which is one single value, to avg column.
If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?
Kurt_Bremser
Super User

You are mostly correct.

Wrong is this:

"It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing."

That's what a data step does, in SQL the values are always taken from the incoming data; missing values occur only

  • if missing values are present in an incoming table
  • or, in a (outer/left/right) join, one of the contributing tables has no match; then its values will be missing

But that's a minor detail.

 

"If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?"

SAS SQL does some optimization behind the scenes, so (that's a guess) it should see this construct and summarize once, like this:

 

create table _temp_ as
  select
    jobcode,
    avg(salary) as avg
  from work.pilots
;
create table want as
  select
    p.Jobcode,
    p.Salary,
    t.avg
  from work.pilots p left join _temp_ t
  on p.jobcode = t.jobcode
;

where _temp_ stands for a "hidden" table that exists only for the duration of the query.

 

But you may be right, and code like this could be (considerably) less performant than a separate summary:

data pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;

proc summary data=pilots;
class jobcode;
var salary;
output
  out=avg (keep=jobcode avg)
  mean(salary)=avg
;
run;

data want;
set pilots;
if _n_ = 1
then do;
  length avg 8;
  declare hash a (dataset:"avg");
  a.definekey("jobcode");
  a.definedata("avg");
  a.definedone();
end;
if a.find() ne 0 then avg = .;
run;

This method completely avoids any sorting (all "sorted" tables are built solely in memory) and will be the most performant way to tackle your issue.

 

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 6 replies
  • 1144 views
  • 2 likes
  • 3 in conversation