Proc SQL

Hakob

How do you find the 3rd highest salary using a self-join?

Tom

@Hakob wrote:

How do you find the 3rd highest salary using a self-join?

Can you provide more context please?

Why SQL? There are better tools for this type of question.

Hakob

It was interview question.

PaigeMiller

@Hakob wrote:

How do you find the 3rd highest salary using a self-join?

Maxim 14 — Use the Right Tool. SQL is a poor choice. PROC RANK, or PROC SORT, will be easier to program and probably run faster.

--
Paige Miller

Hakob

It was interview question.

PaigeMiller

The answer I would give in the interview is the same answer I gave you. SQL is not the right tool for this.

--
Paige Miller

mkeintz

An interviewer insisting on PROC SQL code for the 3rd highest value is not attempting to assess your SAS expertise.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

dxiao2017

would that be using order by desc together with number? The output gives nth highest salary, not only the 3rd one. Something like this:

proc sql number;
select id, salary,var1,var2
   from tab1 left join
        tab2
   on tab1.id=tab2.id
   order by salary desc;
quit;

Ksharp

As others said, in SAS world, proc sql is not right tool to process this kind of question. Pick up the right tool like data step or proc univariate.

Anyway, here is an example SQL to solve this question.

data have;
 set sashelp.baseball;
 keep name salary;
run;

proc sql;
select distinct a.*
 from have as a,have as b
  where a.salary < b.salary
   group by a.salary
    having count(distinct b.salary)=2;
quit;


/*Check the result is right or not*/
ods select     ExtremeValues;
proc univariate data=have nextrval=3;
var salary;
run;

dxiao2017

Hi @Ksharp , I tested your code, it works good, I think this is the only way to solve @Hakob 's question, which in fact is, using a single step to select and output the 3rd largest value of a column, and it can be achieved only through sql self join. Any other techniques consume more than one step. I tried it using a sql step and a data step, which uses one more step than yours but the logic is simple and easier. All of the above mentioned code and output are as follows.

1) using one single step to select the 3rd largest value for a column, according to @Ksharp 's solution (a single proc sql self join step):

data have;
   input id salary;
   datalines;
1 12
2 34
3 43
4 32
5 21
;
run;
/*find the 3rd highest salary
using a self-join*/
proc sql;
select distinct a.*
   from have as a,
        have as b
   where a.salary<b.salary
   group by a.salary
   having count(distinct b.salary)=2;
quit;

2) using a proc sql and a data step:

proc sql;
create table salranks as
select id,
       salary
   from have
   order by salary desc;
select * from salranks;
quit;
data saltop3(keep=id salary);
   set salranks;
   n=_n_;
   if n=3;
run;
proc print data=saltop3;run;

dxiao2017

Hi Ksharp, I know your code perhaps is the only solution for the question, and I tried to understand in more details and code chunks how it produced the final result but was not able to understand. The code(sql self join) produces a cartesian product first (am I right about it) and what's next? Could you explain it, thanks very much!

Tom

The WHERE clause limits the cartesian product results that are considered to just those where the salary from a the second copy is larger than the current salary. Then the HAVING clause further limits the records to only those that have exactly two different salaries that are larger.

Which is ONE definition of the third largest. In this algorithm multiple similar salaries are counted as one. So in the series 100,100,95,95,95,90,80 it will pick 90.

Another definition might select 95 instead since there are two observations (both with salary=100) above it. That you could do even simpler in SQL if the goal is just for a HUMAN to see the answer (and not put the answer into a dataset or something a computer could use. Just use the OUTOBS= option of PROC SQL.

proc sql outobs=3;
select * 
  from have 
  order by salary desc
;
quit;

Note that with PROC RANK you can choose between these two and many other options for how to rank the values.

Ksharp

Tom,
Thumbs up!
This SQL solution is only suited for small data.
For big table, it would run into ages .
That is reason that picking up a right tool is so important.

Hakob

Hello, thank you for your answer.

dxiao2017

Hi @Hakob , because I remember there is a function can return the 1st, 2nd, and 3rd largest or smallest value, I searched all over the help document and find it is the largest(n, var) function, the code is as follows:

data salary;
   input id salary1-salary4;
   datalines;
1 12 34 50 70
2 34 45 52 12
3 43 12 36 22
4 32 29 80 37
5 21 10 90 43
;
run;
proc print data=salary;run;
data want;
   set salary;
   salary3rd=largest(3,of salary1-salary4);
run;
proc print data=want;run;

This function also works in proc sql (however, I cannot write the function as largest(3, of salary1-salary4), if I write it like this I got error message):

proc sql;
select largest(3,salary1,salary2,salary3,salary4)
       as salary3rd
   from salary;
quit;

One problem is if the dataset only have one salary column, if I want to use this function, I need to transpose the dataset first (not sure I am right about this or not) because the code for one column does not work and I got error message like this:

proc sql;
select largest(3,salary1)
       as salary3rd
   from salary;
quit;

69         proc sql;
 70         select largest(3,salary1)
 71                as salary3rd
 72            from salary;
 NOTE: Invalid argument 1 to function LARGEST. Missing values may be generated.
 73         quit;

Other techniques to find out the 3rd largest value may include: proc univariate, which by default gives 5 largest and smallest values in the output and proc sort.

Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

Re: Proc SQL

2025 SAS Hackathon: There is still time!

SAS Training: Just a Click Away