- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How do you find the 3rd highest salary using a self-join?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
@Hakob wrote:
How do you find the 3rd highest salary using a self-join?
Can you provide more context please?
Why SQL? There are better tools for this type of question.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It was interview question.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It was interview question.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The answer I would give in the interview is the same answer I gave you. SQL is not the right tool for this.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
An interviewer insisting on PROC SQL code for the 3rd highest value is not attempting to assess your SAS expertise.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
would that be using order by desc together with number? The output gives nth highest salary, not only the 3rd one. Something like this:
proc sql number;
select id, salary,var1,var2
from tab1 left join
tab2
on tab1.id=tab2.id
order by salary desc;
quit;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
As others said, in SAS world, proc sql is not right tool to process this kind of question. Pick up the right tool like data step or proc univariate.
Anyway, here is an example SQL to solve this question.
data have;
set sashelp.baseball;
keep name salary;
run;
proc sql;
select distinct a.*
from have as a,have as b
where a.salary < b.salary
group by a.salary
having count(distinct b.salary)=2;
quit;
/*Check the result is right or not*/
ods select ExtremeValues;
proc univariate data=have nextrval=3;
var salary;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Ksharp , I tested your code, it works good, I think this is the only way to solve @Hakob 's question, which in fact is, using a single step to select and output the 3rd largest value of a column, and it can be achieved only through sql self join. Any other techniques consume more than one step. I tried it using a sql step and a data step, which uses one more step than yours but the logic is simple and easier. All of the above mentioned code and output are as follows.
1) using one single step to select the 3rd largest value for a column, according to @Ksharp 's solution (a single proc sql self join step):
data have;
input id salary;
datalines;
1 12
2 34
3 43
4 32
5 21
;
run;
/*find the 3rd highest salary
using a self-join*/
proc sql;
select distinct a.*
from have as a,
have as b
where a.salary<b.salary
group by a.salary
having count(distinct b.salary)=2;
quit;
2) using a proc sql and a data step:
proc sql;
create table salranks as
select id,
salary
from have
order by salary desc;
select * from salranks;
quit;
data saltop3(keep=id salary);
set salranks;
n=_n_;
if n=3;
run;
proc print data=saltop3;run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
The WHERE clause limits the cartesian product results that are considered to just those where the salary from a the second copy is larger than the current salary. Then the HAVING clause further limits the records to only those that have exactly two different salaries that are larger.
Which is ONE definition of the third largest. In this algorithm multiple similar salaries are counted as one. So in the series 100,100,95,95,95,90,80 it will pick 90.
Another definition might select 95 instead since there are two observations (both with salary=100) above it. That you could do even simpler in SQL if the goal is just for a HUMAN to see the answer (and not put the answer into a dataset or something a computer could use. Just use the OUTOBS= option of PROC SQL.
proc sql outobs=3;
select *
from have
order by salary desc
;
quit;
Note that with PROC RANK you can choose between these two and many other options for how to rank the values.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Thumbs up!
This SQL solution is only suited for small data.
For big table, it would run into ages .
That is reason that picking up a right tool is so important.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello, thank you for your answer.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hi @Hakob , because I remember there is a function can return the 1st, 2nd, and 3rd largest or smallest value, I searched all over the help document and find it is the largest(n, var) function, the code is as follows:
data salary;
input id salary1-salary4;
datalines;
1 12 34 50 70
2 34 45 52 12
3 43 12 36 22
4 32 29 80 37
5 21 10 90 43
;
run;
proc print data=salary;run;
data want;
set salary;
salary3rd=largest(3,of salary1-salary4);
run;
proc print data=want;run;
This function also works in proc sql (however, I cannot write the function as largest(3, of salary1-salary4), if I write it like this I got error message):
proc sql;
select largest(3,salary1,salary2,salary3,salary4)
as salary3rd
from salary;
quit;
One problem is if the dataset only have one salary column, if I want to use this function, I need to transpose the dataset first (not sure I am right about this or not) because the code for one column does not work and I got error message like this:
proc sql;
select largest(3,salary1)
as salary3rd
from salary;
quit;
69 proc sql; 70 select largest(3,salary1) 71 as salary3rd 72 from salary; NOTE: Invalid argument 1 to function LARGEST. Missing values may be generated. 73 quit;
Other techniques to find out the 3rd largest value may include: proc univariate, which by default gives 5 largest and smallest values in the output and proc sort.