Re: nth max value using proc sql and datastep

pavank · Posted 11-28-2023 11:36 PM

data s;
input id price;
datalines;
100 25
100 60
100 65
100 80
100 91
101 55
101 25
101 85
101 100
;
run;

/* proc sql method : Second max price for each id */
proc sql;
    select *
    from s as a
    where  2=(select distinct price
        from s as b
        where a.price <= b.price
        ) ;
quit;

/* Datastep method :Second max price for each id */

data rank;
set s;
run;

proc sort data=s;
by descending id;
run;

data ranking;
set s;
by  descending id ;
if first.id then ranks=1;
else ranks+1;
proc print noobs;
run;


data maxvalue ;
set ranking  ;
if _n_=2;
proc print ;
run;

Q1.)Here I didn't get correct output for nth max value as of now i want to find second max price for each id how to fix this problem

Kurt_Bremser · Posted 11-29-2023 02:21 AM

You want the second highest price per id, so you need to sort by id and descending price.

Then do

data want;
set have;
by id;
if first.id
then count = 1;
else count + 1;
if count = 2;
drop count;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 11-29-2023 02:21 AM

You could use Proc Rank

data have;
  input id price;
datalines;
100 25
100 25
100 60
100 60
100 65
100 80
100 91
101 55
101 25
101 85
101 100
;

proc sort data=have out=have_sorted;
  by id;
run;

proc rank data=have_sorted ties=dense out=want(where=(rank=2));
  by id;
  var price;
  ranks rank;
run;

proc print data=want;
run;

pavank · Posted 11-29-2023 02:26 AM

But interviewer ask proc sql and datastep only not procedures now a days interviews ask most of secnario questions

PaigeMiller · Posted 11-29-2023 07:13 AM

@pavank wrote:
But interviewer ask proc sql and datastep only not procedures now a days interviews ask most of secnario questions

The correct answer to the interviewer is that you should not try to do this in PROC SQL when much simpler methods are available in SAS.

--
Paige Miller

s_lassen · Posted 11-29-2023 02:24 AM

I do not quite understand what you are trying to do with the SQL code.

A data step solution can be done like this:

proc sort data=s;
  by id descending price;
run;

data ranks;
  do rank=1 by 1 until(last.id);
    set s;
    by id;
    output;
    end;
run;

An then it just to use "where rank=2" if that is the rank you want.

s_lassen · Posted 11-29-2023 04:28 AM

SAS SQL is not really suited for this kind of question (other dialects have RANK summary functions, which can be used), so to get the second highest PRICE for each ID you will have to do something like

proc sql;
  select id,max(price) from(
    select * from s except
    select id,max(price) from s
    group by id)
  group by id;

in other words, subtract the rows with the highest values from the table in order to find the second highest. If you want to use SQL to find the third or fourth highest value, things begin to get complicated, with a larger number of nested queries.

Not that this solution will handle ties differently than the data step I wrote, as that will give different ranks to tied values - a data step that gives the same rank to ties can be written like this:

proc sort data=s;
  by id descending price;
run;

data ranks;
  rank=0;
  do _N_=1 by 1 until(last.id);
    set s;
    by id;
    if _N_=1 or price ne lag(price) then rank+1;
    output;
    end;
run;

But probably the most natural solution in SAS is to use PROC RANK, as suggested by @Patrick

Ksharp · Posted 11-29-2023 06:51 AM

Here is for PROC SQL solution.But I would not advocate it.

data s;
input id price;
datalines;
100 25
100 60
100 65
100 80
100 91
101 55
101 25
101 85
101 100
;
run;

%let n=2;
proc sql;
create table want as
select a.*
 from s as a,s as b
  where a.id=b.id and a.price<=b.price
   group by a.id,a.price
     having count(*)<=&n.;
quit;

pavank · Posted 12-01-2023 01:38 AM

Hi Sharp

Thanks for your solution

But interviewers ask reason why we can not achieved in proc sql can you explain how to explain proper reason

Kurt_Bremser · Posted 12-01-2023 01:54 AM

SQL does not have tools for sequences (with the exception of the undocumented MONOTONIC function, which does not work as expected under certain circumstances).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ksharp · Posted 12-01-2023 03:24 AM

As @Kurt_Bremser said PROC SQL is standard SQL (unlike ORACLE , DB2 ) can not handle the sequence records , the order of record does not honored by PROC SQL. That is why I do not recommend SQL for top nth MAX or MIN question . Date Step is right tool .

As you can see ,although I got solution by PROC SQL, that is still NOT a efficient way , if you have a big table , my PROC SQL would run forever .

PaigeMiller · Posted 12-01-2023 06:53 AM

I would say that PROC RANK ought to be preferred over a DATA step here. DATA steps require the user to create and then program the logic, not really that easy for most of us (even me). Whereas PROC RANK simply requires the normal information that most PROCs require: what data set, what variable(s), what output; the user does not have to create and then program the logic. Also, PROC RANK can handle ties; whereas handling ties in a DATA step requires the user to program additional logic.

--
Paige Miller

ballardw · Posted 12-01-2023 11:00 AM

Assuming the data set is amenable to transposing then Proc Transpose and Data step with the LARGEST function or CALL SORTN on an array of the transposed values would work depending on exactly how the output is desired.

/* assuming Have is output of transposed data by Id */
data ranks;
   set have;
   array c(*) col: ;
   do rank=1 to n(of c(*));
      value = largest(rank,of c(*));
      output;
   end;
   keep id rank value;
run;

Skeleton of program for relatively "nice" input data and ignores any tie breaking.

But first ask the interviewer about which rules for tie-breaking are to be used for ranking tied values. Can't start the process without all the constraints.

mkeintz · Posted 12-02-2023 12:42 AM

If the data are already sorted by ID, but not necessarily by price within ID, then a single DATA step as below:

data s;
input id price;
datalines;
100 25
100 60
100 65
100 80
100 91
101 55
101 25
101 85
101 100
run;

%let rank=2;

data nth_largest (label="for N=&rank");
  set s;
  by id;
  array larg {&rank} _temporary_;
  if price>larg{1} then do;
    larg{1}=price;
    call sort(of larg{*});
  end;
  if last.id;
  price=larg{1};
  call missing(of larg{*});
run;

The array LARG, after the call sort, will be sorted from smallest (larg{1}) to largest (larg{&rank}). Because the array is _temporary_, its values are retained across observations.

Additional note:

There has been some discussion of tied values. In the code above, if there are duplicate instances of high values, each instance is treated as a separate value in the sort call (i.e. for ID=100, if there were two instances of price=91, then 91 would be the 2nd largest, not 85).

But if you want to control for ties (i.e. you want the nth largest distinct value), just change

  if price>larg{1} then do;

to

  if price>larg{1} and whichn(price,of larg{*})=0 then do;

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Sajid01 · Posted 12-04-2023 11:18 AM

Hello
This is a common question in SAS interviews.
This is my solution. If there are duplicate values for the nth position, reports them.
May not be the most elegant or efficient approach but it does works for the interview.

data s;
input Id Price;
datalines;
100 25
100 60
100 65
100 80
100 91
101 55
101 25
101 85
101 100
103 100
104 91
;
run;
%Let N=2;
PROC SQL;
Select Id, Price from s s1
where (&N.-1)=(select count(Distinct Price) from s s2 where s2.Price> s1.Price);
QUIT;

SAS Innovate 2025: Save the Date

SAS Training: Just a Click Away