You are mostly correct.
Wrong is this:
"It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing."
That's what a data step does, in SQL the values are always taken from the incoming data; missing values occur only
if missing values are present in an incoming table
or, in a (outer/left/right) join, one of the contributing tables has no match; then its values will be missing
But that's a minor detail.
"If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?"
SAS SQL does some optimization behind the scenes, so (that's a guess) it should see this construct and summarize once, like this:
create table _temp_ as
select
jobcode,
avg(salary) as avg
from work.pilots
;
create table want as
select
p.Jobcode,
p.Salary,
t.avg
from work.pilots p left join _temp_ t
on p.jobcode = t.jobcode
;
where _temp_ stands for a "hidden" table that exists only for the duration of the query.
But you may be right, and code like this could be (considerably) less performant than a separate summary:
data pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;
proc summary data=pilots;
class jobcode;
var salary;
output
out=avg (keep=jobcode avg)
mean(salary)=avg
;
run;
data want;
set pilots;
if _n_ = 1
then do;
length avg 8;
declare hash a (dataset:"avg");
a.definekey("jobcode");
a.definedata("avg");
a.definedone();
end;
if a.find() ne 0 then avg = .;
run;
This method completely avoids any sorting (all "sorted" tables are built solely in memory) and will be the most performant way to tackle your issue.
... View more