BookmarkSubscribeRSS Feed
Hakob
Fluorite | Level 6

How do you find the 3rd highest salary using a self-join?

16 REPLIES 16
Tom
Super User Tom
Super User

@Hakob wrote:

How do you find the 3rd highest salary using a self-join?


Can you provide more context please?

Why SQL?  There are better tools for this type of question.

Hakob
Fluorite | Level 6

It was interview question.

PaigeMiller
Diamond | Level 26

@Hakob wrote:

How do you find the 3rd highest salary using a self-join?


Maxim 14 — Use the Right Tool. SQL is a poor choice. PROC RANK, or PROC SORT, will be easier to program and probably run faster.

--
Paige Miller
Hakob
Fluorite | Level 6

It was interview question.

PaigeMiller
Diamond | Level 26

The answer I would give in the interview is the same answer I gave you. SQL is not the right tool for this.

--
Paige Miller
mkeintz
PROC Star

 

An interviewer insisting on PROC SQL code for the 3rd highest value is not attempting to assess your SAS expertise.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
dxiao2017
Lapis Lazuli | Level 10

would that be using order by desc together with number? The output gives nth highest salary, not only the 3rd one. Something like this: 

proc sql number;
select id, salary,var1,var2
   from tab1 left join
        tab2
   on tab1.id=tab2.id
   order by salary desc;
quit;
Ksharp
Super User

As others said, in SAS world, proc sql is not right tool to process this kind of question. Pick up the right tool like data step or proc univariate.

Anyway, here is an example SQL to solve this question.

 

data have;
 set sashelp.baseball;
 keep name salary;
run;

proc sql;
select distinct a.*
 from have as a,have as b
  where a.salary < b.salary
   group by a.salary
    having count(distinct b.salary)=2;
quit;


/*Check the result is right or not*/
ods select     ExtremeValues;
proc univariate data=have nextrval=3;
var salary;
run;

Ksharp_0-1757055169808.png

 

 

dxiao2017
Lapis Lazuli | Level 10

Hi @Ksharp , I tested your code, it works good, I think this is the only way to solve @Hakob 's question, which in fact is, using a single step to select and output the 3rd largest value of a column, and it can be achieved only through sql self join. Any other techniques consume more than one step. I tried it using a sql step and a data step, which uses one more step than yours but the logic is simple and easier. All of the above mentioned code and output are as follows.

 

1) using one single step to select the 3rd largest value for a column, according to @Ksharp 's solution (a single proc sql self join step):

data have;
   input id salary;
   datalines;
1 12
2 34
3 43
4 32
5 21
;
run;
/*find the 3rd highest salary
using a self-join*/
proc sql;
select distinct a.*
   from have as a,
        have as b
   where a.salary<b.salary
   group by a.salary
   having count(distinct b.salary)=2;
quit;

dxiao2017_0-1757066937456.png

 

2) using a proc sql and a data step:

proc sql;
create table salranks as
select id,
       salary
   from have
   order by salary desc;
select * from salranks;
quit;
data saltop3(keep=id salary);
   set salranks;
   n=_n_;
   if n=3;
run;
proc print data=saltop3;run;

dxiao2017_1-1757067137506.png

dxiao2017
Lapis Lazuli | Level 10
Hi Ksharp, I know your code perhaps is the only solution for the question, and I tried to understand in more details and code chunks how it produced the final result but was not able to understand. The code(sql self join) produces a cartesian product first (am I right about it) and what's next? Could you explain it, thanks very much!
Tom
Super User Tom
Super User

The WHERE clause limits the cartesian product results that are considered to just those where the salary from a the second copy is larger than the current salary.  Then the HAVING clause further limits the records to only those that have exactly two different salaries that are larger. 

 

 

Which is ONE definition of the third largest.  In this algorithm multiple similar salaries are counted as one.  So in the series 100,100,95,95,95,90,80 it will pick 90. 

 

Another definition might select 95 instead since there are  two observations (both with salary=100) above it.  That you could do even simpler in SQL if the goal is just for a HUMAN to see the answer (and not put the answer into a dataset or something a computer could use.  Just use the OUTOBS= option of PROC SQL.

proc sql outobs=3;
select * 
  from have 
  order by salary desc
;
quit;

Note that with PROC RANK you can choose between these two and many other options for how to rank the values.

Ksharp
Super User
Tom,
Thumbs up!
This SQL solution is only suited for small data.
For big table, it would run into ages .
That is reason that picking up a right tool is so important.
Hakob
Fluorite | Level 6

Hello, thank you for your answer.

dxiao2017
Lapis Lazuli | Level 10

Hi @Hakob , because I remember there is a function can return the 1st, 2nd, and 3rd largest or smallest value, I searched all over the help document and find it is the largest(n, var) function, the code is as follows:

data salary;
   input id salary1-salary4;
   datalines;
1 12 34 50 70
2 34 45 52 12
3 43 12 36 22
4 32 29 80 37
5 21 10 90 43
;
run;
proc print data=salary;run;
data want;
   set salary;
   salary3rd=largest(3,of salary1-salary4);
run;
proc print data=want;run;

dxiao2017_0-1757054807894.png

This function also works in proc sql (however, I cannot write the function as largest(3, of salary1-salary4), if I write it like this I got error message):

proc sql;
select largest(3,salary1,salary2,salary3,salary4)
       as salary3rd
   from salary;
quit;

dxiao2017_1-1757055218866.png

One problem is if the dataset only have one salary column, if I want to use this function, I need to transpose the dataset first (not sure I am right about this or not) because the code for one column does not work and I got error message like this:

proc sql;
select largest(3,salary1)
       as salary3rd
   from salary;
quit;
69         proc sql;
 70         select largest(3,salary1)
 71                as salary3rd
 72            from salary;
 NOTE: Invalid argument 1 to function LARGEST. Missing values may be generated.
 73         quit;

 

Other techniques to find out the 3rd largest value may include: proc univariate, which by default gives 5 largest and smallest values in the output and proc sort.

 

 

 

 

hackathon24-white-horiz.png

2025 SAS Hackathon: There is still time!

Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!

Register Now

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 16 replies
  • 7663 views
  • 17 likes
  • 6 in conversation