Solved: Re: Old SAS proc sql question needs explanation

happy_sas_kitty · Posted 08-27-2020 01:45 AM

This question has been asked before but I did not understand the explanation.

Old link: https://communities.sas.com/t5/General-SAS-Programming/PROC-SQL/td-p/314917

The data is here:

data work.pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;
run;

We want to get a result that looks like this:

Jobcode Salary Avg
------- ------ -----
PT1 50000 60000
PT1 70000 60000
PT1 60000 60000
PT2 80000 85000
PT2 90000 85000
PT3 100000 100000

It seems that the results come from average(salary) by group, ordered by id.

Why does the code below work? The part I don't understand is highlighted in red.

select
Jobcode,
Salary,
(select avg(Salary) from WORK.PILOTS as P1
where P1.Jobcode=P2.Jobcode) as Avg
from WORK.PILOTS as P2
order by Id
;

Here's what I'm thinking about the red part subquery:

select avg(salary) from work.pilots as p1

this gives us 1 value of avg(all salaries) in p1. It has no jobcode column. Thus the next line "where p1.jobcode = p2.jobcode" should not work.

This would make more sense:

select avg(Salary) from WORK.PILOTS as P1
group by jobcode

However this code, although giving us 3 averages for 3 jobcodes, does not have an one-to-one relationship with pilots.id

Can anyone explain step by step on how the codes marked in red works?

Kurt_Bremser · Posted 08-27-2020 04:07 AM

You are mostly correct.

Wrong is this:

"It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing."

That's what a data step does, in SQL the values are always taken from the incoming data; missing values occur only

if missing values are present in an incoming table
or, in a (outer/left/right) join, one of the contributing tables has no match; then its values will be missing

But that's a minor detail.

"If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?"

SAS SQL does some optimization behind the scenes, so (that's a guess) it should see this construct and summarize once, like this:

create table _temp_ as
  select
    jobcode,
    avg(salary) as avg
  from work.pilots
;
create table want as
  select
    p.Jobcode,
    p.Salary,
    t.avg
  from work.pilots p left join _temp_ t
  on p.jobcode = t.jobcode
;

where _temp_ stands for a "hidden" table that exists only for the duration of the query.

But you may be right, and code like this could be (considerably) less performant than a separate summary:

data pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;

proc summary data=pilots;
class jobcode;
var salary;
output
  out=avg (keep=jobcode avg)
  mean(salary)=avg
;
run;

data want;
set pilots;
if _n_ = 1
then do;
  length avg 8;
  declare hash a (dataset:"avg");
  a.definekey("jobcode");
  a.definedata("avg");
  a.definedone();
end;
if a.find() ne 0 then avg = .;
run;

This method completely avoids any sorting (all "sorted" tables are built solely in memory) and will be the most performant way to tackle your issue.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

View solution in original post

happy_sas_kitty · Posted 08-27-2020 01:48 AM

somehow the "red" option does not work in "SAS code" area. and I can't edit my own post..
The part where I don't understand is this:
(select avg(Salary) from WORK.PILOTS as P1where P1.Jobcode=P2.Jobcode) as Avg

Kurt_Bremser · Posted 08-27-2020 02:30 AM

It is a sub-select that calculates the avg for the jobcode currently encountered in the main SELECT.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Kurt_Bremser · Posted 08-27-2020 02:33 AM

The sub-select is necessary in SQL dialects that do not allow re-merging. Since SAS SQL does allow this, the simpler code presented by @s_lassen will solve your issue.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

s_lassen · Posted 08-27-2020 02:20 AM

The stuff in the paranthesis is a subquery; meaning that for jobcode PT1 the average salary for the first 3 rows (the ones having that jobcode) is selected.

So for those rows, the subquery is equivalent to

select avg(Salary) from WORK.PILOTS as P1 where P1.Jobcode='PT1'

which returns a single value - and so on for the other jobcodes.

Your GROUP BY suggestion would not work in such a subquery, as it would not return a single row.

You could, in SAS SQL, use a GROUP BY to get the same result like this:

proc sql;
  select
    ID,
    Jobcode,
    Salary,
    avg(Salary) as avg 
    from WORK.PILOTS 
    group by Jobcode
    order by Id; 
quit;

The reason it is not done like that is probably that this type of query would not work in most other SQL dialects, whereas the subquery solution would.

happy_sas_kitty · Posted 08-27-2020 03:25 AM

Thank you for the fast reply. I somehow still don't understand what's going on.
how did
(select avg(Salary) from WORK.PILOTS as P1where P1.Jobcode=P2.Jobcode) as Avg
transform into this:
select avg(Salary) from WORK.PILOTS as P1 where P1.Jobcode='PT1'
I probably need a step by step explanation.

after reading your reply, my guess is, the whole thing happens like this:
the proc sql runs row by row on the outer query.
It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing.
on 1st row: it loads id, jobcode, salary from p2. It sees column Avg and tries to calculate it.
Avg is calculated from: (select avg(Salary) from WORK.PILOTS as P1where P1.Jobcode=P2.Jobcode)
since currently we are loading from p2 (because the from p2 clause) , we see that p2.jobcode on this 1st row is 'PT1'
Then, sas looks at "select avg(Salary) from WORK.PILOTS as P1 where P1.Jobcode='PT1'" , and returns the avg value for jobcode='PT1', which is one single value, to avg column.
If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?

Kurt_Bremser · Posted 08-27-2020 04:07 AM

You are mostly correct.

Wrong is this:

"It sees the "from pilots as p2" first, so it sets up the row(pdv?) id, jobcode, salary, avg and sets all to missing."

That's what a data step does, in SQL the values are always taken from the incoming data; missing values occur only

if missing values are present in an incoming table
or, in a (outer/left/right) join, one of the contributing tables has no match; then its values will be missing

But that's a minor detail.

"If I get this correctly, for jobcode='PT1', sas calculated this avg(salary) 3 times?"

SAS SQL does some optimization behind the scenes, so (that's a guess) it should see this construct and summarize once, like this:

create table _temp_ as
  select
    jobcode,
    avg(salary) as avg
  from work.pilots
;
create table want as
  select
    p.Jobcode,
    p.Salary,
    t.avg
  from work.pilots p left join _temp_ t
  on p.jobcode = t.jobcode
;

where _temp_ stands for a "hidden" table that exists only for the duration of the query.

But you may be right, and code like this could be (considerably) less performant than a separate summary:

data pilots;
infile datalines;
input id name $ Jobcode $ Salary;
datalines;
001 Albert PT1 50000
002 Brenda PT1 70000
003 Carl PT1 60000
004 Donna PT2 80000
005 Edward PT2 90000
006 Flora PT3 100000
;

proc summary data=pilots;
class jobcode;
var salary;
output
  out=avg (keep=jobcode avg)
  mean(salary)=avg
;
run;

data want;
set pilots;
if _n_ = 1
then do;
  length avg 8;
  declare hash a (dataset:"avg");
  a.definekey("jobcode");
  a.definedata("avg");
  a.definedone();
end;
if a.find() ne 0 then avg = .;
run;

This method completely avoids any sorting (all "sorted" tables are built solely in memory) and will be the most performant way to tackle your issue.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away