This question has been asked before but I did not understand the explanation.
Old link: https://communities.sas.com/t5/General-SAS-Programming/PROC-SQL/td-p/314917
The data is here:
data work.pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;
run;
We want to get a result that looks like this:
Jobcode Salary Avg
------- ------ -----
PT1 50000 60000
PT1 70000 60000
PT1 60000 60000
PT2 80000 85000
PT2 90000 85000
PT3 100000 100000
It seems that the results come from average(salary) by group, ordered by id.
Why does the code below work? The part I don't understand is highlighted in red.
select
Jobcode,
Salary,
(select avg(Salary) from WORK.PILOTS as P1
where P1.Jobcode=P2.Jobcode) as Avg
from WORK.PILOTS as P2
order by Id
;
Here's what I'm thinking about the red part subquery:
select avg(salary) from work.pilots as p1
this gives us 1 value of avg(all salaries) in p1. It has no jobcode column. Thus the next line "where p1.jobcode = p2.jobcode" should not work.
This would make more sense:
select avg(Salary) from WORK.PILOTS as P1
group by jobcode
However this code, although giving us 3 averages for 3 jobcodes, does not have an one-to-one relationship with pilots.id
Can anyone explain step by step on how the codes marked in red works?
You are mostly correct.
Wrong is this:
"It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing."
That's what a data step does, in SQL the values are always taken from the incoming data; missing values occur only
But that's a minor detail.
"If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?"
SAS SQL does some optimization behind the scenes, so (that's a guess) it should see this construct and summarize once, like this:
create table _temp_ as
select
jobcode,
avg(salary) as avg
from work.pilots
;
create table want as
select
p.Jobcode,
p.Salary,
t.avg
from work.pilots p left join _temp_ t
on p.jobcode = t.jobcode
;
where _temp_ stands for a "hidden" table that exists only for the duration of the query.
But you may be right, and code like this could be (considerably) less performant than a separate summary:
data pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;
proc summary data=pilots;
class jobcode;
var salary;
output
out=avg (keep=jobcode avg)
mean(salary)=avg
;
run;
data want;
set pilots;
if _n_ = 1
then do;
length avg 8;
declare hash a (dataset:"avg");
a.definekey("jobcode");
a.definedata("avg");
a.definedone();
end;
if a.find() ne 0 then avg = .;
run;
This method completely avoids any sorting (all "sorted" tables are built solely in memory) and will be the most performant way to tackle your issue.
It is a sub-select that calculates the avg for the jobcode currently encountered in the main SELECT.
The sub-select is necessary in SQL dialects that do not allow re-merging. Since SAS SQL does allow this, the simpler code presented by @s_lassen will solve your issue.
The stuff in the paranthesis is a subquery; meaning that for jobcode PT1 the average salary for the first 3 rows (the ones having that jobcode) is selected.
So for those rows, the subquery is equivalent to
select avg(Salary) from WORK.PILOTS as P1 where P1.Jobcode='PT1'
which returns a single value - and so on for the other jobcodes.
Your GROUP BY suggestion would not work in such a subquery, as it would not return a single row.
You could, in SAS SQL, use a GROUP BY to get the same result like this:
proc sql;
select
ID,
Jobcode,
Salary,
avg(Salary) as avg
from WORK.PILOTS
group by Jobcode
order by Id;
quit;
The reason it is not done like that is probably that this type of query would not work in most other SQL dialects, whereas the subquery solution would.
You are mostly correct.
Wrong is this:
"It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing."
That's what a data step does, in SQL the values are always taken from the incoming data; missing values occur only
But that's a minor detail.
"If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?"
SAS SQL does some optimization behind the scenes, so (that's a guess) it should see this construct and summarize once, like this:
create table _temp_ as
select
jobcode,
avg(salary) as avg
from work.pilots
;
create table want as
select
p.Jobcode,
p.Salary,
t.avg
from work.pilots p left join _temp_ t
on p.jobcode = t.jobcode
;
where _temp_ stands for a "hidden" table that exists only for the duration of the query.
But you may be right, and code like this could be (considerably) less performant than a separate summary:
data pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;
proc summary data=pilots;
class jobcode;
var salary;
output
out=avg (keep=jobcode avg)
mean(salary)=avg
;
run;
data want;
set pilots;
if _n_ = 1
then do;
length avg 8;
declare hash a (dataset:"avg");
a.definekey("jobcode");
a.definedata("avg");
a.definedone();
end;
if a.find() ne 0 then avg = .;
run;
This method completely avoids any sorting (all "sorted" tables are built solely in memory) and will be the most performant way to tackle your issue.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.