BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bernhard
Calcite | Level 5

Hello

I am trying to find the equivalent of Excel's large function which returns the k'th largest element in a list.

My problem is the following: "Given a dataset with, say, 10^6 observations and 25 variables return the kth largest element for variables x1 to x25."

I tried a few things with proc sort, proc rank, data steps (obs=, firstobs=), _N_, some sql code. I also tried percentiles, but none of the percentiles options (there are 5 different ones) do exactly correspond to the kth largest element.

Thus, none of the solutions I tried are really satisfying.

The main issues I have are:

1. I do not want to execute some code (say a macro) for each of the variables. Let's say I do not want to sort my data set 25 times. This is to slow.

2. Proc Rank seems to work reasonably fast and it can work on all of the variables at the same time. However, none of the TIES= options suits my needs, since I want the ranks to be 1,2,3,4,..., 999999, 1000000 (simply 1-n all integers without interleaving any numbers)

Can anybody help me with that?

I timed my code again. It turns out that sorting 25 times (for all different variables) takes about 9 seconds (using "sasfile ds open"). I think I can actually live with that. Even though proc rank can give me all the ranks in 2.5 seconds.

Bernhard

1 ACCEPTED SOLUTION

Accepted Solutions
bernhard
Calcite | Level 5

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

  %let nm=%scan(&_varlist_.,&i.);

  proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

  %do j=1 %to %sysfunc(countw(&k.));

  %let kk=%scan(&k.,&j.);

  %if &j. eq 1 %then %do;

  data __temp___&i.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  %end;

  %else %do;

  data __temp&kk.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  proc append base=__temp___&i. data=__temp&kk.;quit;

  %end;

  %end;

%end;

data &out.;

  merge __temp___:;

  by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

View solution in original post

7 REPLIES 7
Scott_Mitchell
Quartz | Level 8

Hi Bernhard,

The largest function will identify the nth largest value from a set of variables you specify.

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X25);

RUN;

Which can be consolidated to:

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,of X1-X25);

RUN;

You could also use an array to resort the values, that way the value in x5 is always the 5th largest of each observation.

DATA TWO;

  KEEP REORDERED:;

  SET HAVE;

  ARRAY X(25);

  ARRAY REORDERED(25);

  DO K=1 TO 25;

    REORDERED(K)=largest(K, OF X1-X25);

  END;

RUN;

Regards,

Scott

bernhard
Calcite | Level 5

Hi Scott

Thanks for your first answer. I have noticed this function as well. But it is not doing what I want.

Consider the dataset sashelp.cars. Say I want to find the 5th largest value of cylinders. The result should be 10. Since there are three observations with 12 cylinders and two with 10 cylinders.

If I take you code for this example, I get an error (the error is obvious to me since the function is built for a different use case)

DATA WANT;

SET sashelp.cars;

  LARGEST = LARGEST(5,cylinders);

RUN;

I need to have  a look at the second answer.....

bernhard
Calcite | Level 5

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

  %let nm=%scan(&_varlist_.,&i.);

  proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

  %do j=1 %to %sysfunc(countw(&k.));

  %let kk=%scan(&k.,&j.);

  %if &j. eq 1 %then %do;

  data __temp___&i.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  %end;

  %else %do;

  data __temp&kk.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  proc append base=__temp___&i. data=__temp&kk.;quit;

  %end;

  %end;

%end;

data &out.;

  merge __temp___:;

  by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

Scott_Mitchell
Quartz | Level 8

Here is another method I found in my code repository.  The previous example output to variables with a prefix of reordered.  The below will sort and output to the variables with a X prefix.

DATA WANT;

SET HAVE;

ARRAY A{25} X1-X25;

** SET SORT PARAMETERS ;

SEQ = "A" ; * A = ASCENDING, D = DESCENDING ;

NODUPKEY = 0 ; * 0 = DUPLICATES ALLOWED, 1 = DUPLICATES NOT ALLOWED ;

DCL HASH HH (HASHEXP: 0, ORDERED: SEQ) ;

DCL HITER HI ("HH") ;

HH.DEFINEKEY ("K","N") ; * N - EXTRA ENUMERATING KEY ;

HH.DEFINEDATA ("K") ; * K AUTOMATICALLY ASSUMES ARRAY DATA TYPE ;

HH.DEFINEDONE ( ) ;

** LOAD COMPOSITE (K N) KEY ON THE TABLE ;

** IF DUPLICATES TO BE RETAINED, SET 0 <- N ;

DO J = LBOUND (A) TO HBOUND (A) ;

N = J * ^ NODUPKEY ;

K = A {J} ;

HH.REPLACE() ;

END ;

** USE ITERATOR HI TO RELOAD ARRAY FROM HH TABLE, NOW IN ORDER ;

N = LBOUND (A) - 1 ;

DO RC = HI.FIRST() BY 0 WHILE ( RC = 0 ) ;

N = N + 1 ;

A {N} = K ;

RC = HI.NEXT() ;

END ;

Q = N ;

** FILL ARRAY TAIL WITH MISSING VALUES IF DUPLICATES ARE DELETE ;

DO N = Q + 1 TO HBOUND (A) ;

A {N} = . ;

END ;

** CHECK IF ARRAY IS NOW SORTED ;

SORTED = 1 ;

DO N = LBOUND (A) + 1 TO Q WHILE ( SORTED ) ;

IF A {N - 1} > A {N} THEN SORTED = 0 ;

END ;

PUT SORTED = ;

KEEP X:;

RUN ;

Good Luck

esjackso
Quartz | Level 8

Just out of curiosity would proc rank help?

EJ

bernhard
Calcite | Level 5

Note quite. Although proc rank is very efficient, I have not found a way with proc rank to number all observations by 1..... n . Although it can handle tied ranks (where multiple observations have the same value) in various ways, none of the options for TIES= does what I want.

Reeza
Super User

Depending on what K is, ie fixed or dynamic perhaps the extreme values section of Proc univariate can help:

Base SAS(R) 9.2 Procedures Guide: Statistical Procedures, Third Edition

Sadly ties are counted as multiple occurrences, so it may not help either.

title 'Extreme Blood Pressure Values';

ods select ExtremeValues;

proc univariate data=BPressure nextrval=5;

  var Systolic Diastolic;

run;

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 7 replies
  • 2434 views
  • 0 likes
  • 4 in conversation