Hello
I am trying to find the equivalent of Excel's large function which returns the k'th largest element in a list.
My problem is the following: "Given a dataset with, say, 10^6 observations and 25 variables return the kth largest element for variables x1 to x25."
I tried a few things with proc sort, proc rank, data steps (obs=, firstobs=), _N_, some sql code. I also tried percentiles, but none of the percentiles options (there are 5 different ones) do exactly correspond to the kth largest element.
Thus, none of the solutions I tried are really satisfying.
The main issues I have are:
1. I do not want to execute some code (say a macro) for each of the variables. Let's say I do not want to sort my data set 25 times. This is to slow.
2. Proc Rank seems to work reasonably fast and it can work on all of the variables at the same time. However, none of the TIES= options suits my needs, since I want the ranks to be 1,2,3,4,..., 999999, 1000000 (simply 1-n all integers without interleaving any numbers)
Can anybody help me with that?
I timed my code again. It turns out that sorting 25 times (for all different variables) takes about 9 seconds (using "sasfile ds open"). I think I can actually live with that. Even though proc rank can give me all the ranks in 2.5 seconds.
Bernhard
For now I have settled with a macro (see below).
I cannot get rid of the feelign that there is a nicer way to do this.
%macro large(data=ds,out=out,k=,vars=_all_);
/*returns the kth largest values*/
%local i nm;
proc datasets nolist;delete __temp:;quit;
data __temp__;set &data.(keep=&vars. obs=1);run;
proc contents noprint data=__temp__ out=__cont__(keep=name);quit;
proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;
sasfile &data. open;
%do i=1 %to %sysfunc(countw(&_varlist_.));
%let nm=%scan(&_varlist_.,&i.);
proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;
%do j=1 %to %sysfunc(countw(&k.));
%let kk=%scan(&k.,&j.);
%if &j. eq 1 %then %do;
data __temp___&i.;
format k 14.;
set __temp__(obs=&kk. firstobs=&kk.);
k=&kk.;
run;
%end;
%else %do;
data __temp&kk.;
format k 14.;
set __temp__(obs=&kk. firstobs=&kk.);
k=&kk.;
run;
proc append base=__temp___&i. data=__temp&kk.;quit;
%end;
%end;
%end;
data &out.;
merge __temp___:;
by k;
run;
sasfile &data. close;
%mend;
%large(data=import2,k=10 100 1000);
Hi Bernhard,
The largest function will identify the nth largest value from a set of variables you specify.
DATA WANT;
SET HAVE;
LARGEST = LARGEST(1,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X25);
RUN;
Which can be consolidated to:
DATA WANT;
SET HAVE;
LARGEST = LARGEST(1,of X1-X25);
RUN;
You could also use an array to resort the values, that way the value in x5 is always the 5th largest of each observation.
DATA TWO;
KEEP REORDERED:;
SET HAVE;
ARRAY X(25);
ARRAY REORDERED(25);
DO K=1 TO 25;
REORDERED(K)=largest(K, OF X1-X25);
END;
RUN;
Regards,
Scott
Hi Scott
Thanks for your first answer. I have noticed this function as well. But it is not doing what I want.
Consider the dataset sashelp.cars. Say I want to find the 5th largest value of cylinders. The result should be 10. Since there are three observations with 12 cylinders and two with 10 cylinders.
If I take you code for this example, I get an error (the error is obvious to me since the function is built for a different use case)
DATA WANT;
SET sashelp.cars;
LARGEST = LARGEST(5,cylinders);
RUN;
I need to have a look at the second answer.....
For now I have settled with a macro (see below).
I cannot get rid of the feelign that there is a nicer way to do this.
%macro large(data=ds,out=out,k=,vars=_all_);
/*returns the kth largest values*/
%local i nm;
proc datasets nolist;delete __temp:;quit;
data __temp__;set &data.(keep=&vars. obs=1);run;
proc contents noprint data=__temp__ out=__cont__(keep=name);quit;
proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;
sasfile &data. open;
%do i=1 %to %sysfunc(countw(&_varlist_.));
%let nm=%scan(&_varlist_.,&i.);
proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;
%do j=1 %to %sysfunc(countw(&k.));
%let kk=%scan(&k.,&j.);
%if &j. eq 1 %then %do;
data __temp___&i.;
format k 14.;
set __temp__(obs=&kk. firstobs=&kk.);
k=&kk.;
run;
%end;
%else %do;
data __temp&kk.;
format k 14.;
set __temp__(obs=&kk. firstobs=&kk.);
k=&kk.;
run;
proc append base=__temp___&i. data=__temp&kk.;quit;
%end;
%end;
%end;
data &out.;
merge __temp___:;
by k;
run;
sasfile &data. close;
%mend;
%large(data=import2,k=10 100 1000);
Here is another method I found in my code repository. The previous example output to variables with a prefix of reordered. The below will sort and output to the variables with a X prefix.
DATA WANT;
SET HAVE;
ARRAY A{25} X1-X25;
** SET SORT PARAMETERS ;
SEQ = "A" ; * A = ASCENDING, D = DESCENDING ;
NODUPKEY = 0 ; * 0 = DUPLICATES ALLOWED, 1 = DUPLICATES NOT ALLOWED ;
DCL HASH HH (HASHEXP: 0, ORDERED: SEQ) ;
DCL HITER HI ("HH") ;
HH.DEFINEKEY ("K","N") ; * N - EXTRA ENUMERATING KEY ;
HH.DEFINEDATA ("K") ; * K AUTOMATICALLY ASSUMES ARRAY DATA TYPE ;
HH.DEFINEDONE ( ) ;
** LOAD COMPOSITE (K N) KEY ON THE TABLE ;
** IF DUPLICATES TO BE RETAINED, SET 0 <- N ;
DO J = LBOUND (A) TO HBOUND (A) ;
N = J * ^ NODUPKEY ;
K = A {J} ;
HH.REPLACE() ;
END ;
** USE ITERATOR HI TO RELOAD ARRAY FROM HH TABLE, NOW IN ORDER ;
N = LBOUND (A) - 1 ;
DO RC = HI.FIRST() BY 0 WHILE ( RC = 0 ) ;
N = N + 1 ;
A {N} = K ;
RC = HI.NEXT() ;
END ;
Q = N ;
** FILL ARRAY TAIL WITH MISSING VALUES IF DUPLICATES ARE DELETE ;
DO N = Q + 1 TO HBOUND (A) ;
A {N} = . ;
END ;
** CHECK IF ARRAY IS NOW SORTED ;
SORTED = 1 ;
DO N = LBOUND (A) + 1 TO Q WHILE ( SORTED ) ;
IF A {N - 1} > A {N} THEN SORTED = 0 ;
END ;
PUT SORTED = ;
KEEP X:;
RUN ;
Good Luck
Just out of curiosity would proc rank help?
EJ
Note quite. Although proc rank is very efficient, I have not found a way with proc rank to number all observations by 1..... n . Although it can handle tied ranks (where multiple observations have the same value) in various ways, none of the options for TIES= does what I want.
Depending on what K is, ie fixed or dynamic perhaps the extreme values section of Proc univariate can help:
Base SAS(R) 9.2 Procedures Guide: Statistical Procedures, Third Edition
Sadly ties are counted as multiple occurrences, so it may not help either.
title 'Extreme Blood Pressure Values';
ods select ExtremeValues;
proc univariate data=BPressure nextrval=5;
var Systolic Diastolic;
run;
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.