BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
bernhard
Calcite | Level 5

Hello

I am trying to find the equivalent of Excel's large function which returns the k'th largest element in a list.

My problem is the following: "Given a dataset with, say, 10^6 observations and 25 variables return the kth largest element for variables x1 to x25."

I tried a few things with proc sort, proc rank, data steps (obs=, firstobs=), _N_, some sql code. I also tried percentiles, but none of the percentiles options (there are 5 different ones) do exactly correspond to the kth largest element.

Thus, none of the solutions I tried are really satisfying.

The main issues I have are:

1. I do not want to execute some code (say a macro) for each of the variables. Let's say I do not want to sort my data set 25 times. This is to slow.

2. Proc Rank seems to work reasonably fast and it can work on all of the variables at the same time. However, none of the TIES= options suits my needs, since I want the ranks to be 1,2,3,4,..., 999999, 1000000 (simply 1-n all integers without interleaving any numbers)

Can anybody help me with that?

I timed my code again. It turns out that sorting 25 times (for all different variables) takes about 9 seconds (using "sasfile ds open"). I think I can actually live with that. Even though proc rank can give me all the ranks in 2.5 seconds.

Bernhard

1 ACCEPTED SOLUTION

Accepted Solutions
bernhard
Calcite | Level 5

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

  %let nm=%scan(&_varlist_.,&i.);

  proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

  %do j=1 %to %sysfunc(countw(&k.));

  %let kk=%scan(&k.,&j.);

  %if &j. eq 1 %then %do;

  data __temp___&i.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  %end;

  %else %do;

  data __temp&kk.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  proc append base=__temp___&i. data=__temp&kk.;quit;

  %end;

  %end;

%end;

data &out.;

  merge __temp___:;

  by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

View solution in original post

7 REPLIES 7
Scott_Mitchell
Quartz | Level 8

Hi Bernhard,

The largest function will identify the nth largest value from a set of variables you specify.

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X25);

RUN;

Which can be consolidated to:

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,of X1-X25);

RUN;

You could also use an array to resort the values, that way the value in x5 is always the 5th largest of each observation.

DATA TWO;

  KEEP REORDERED:;

  SET HAVE;

  ARRAY X(25);

  ARRAY REORDERED(25);

  DO K=1 TO 25;

    REORDERED(K)=largest(K, OF X1-X25);

  END;

RUN;

Regards,

Scott

bernhard
Calcite | Level 5

Hi Scott

Thanks for your first answer. I have noticed this function as well. But it is not doing what I want.

Consider the dataset sashelp.cars. Say I want to find the 5th largest value of cylinders. The result should be 10. Since there are three observations with 12 cylinders and two with 10 cylinders.

If I take you code for this example, I get an error (the error is obvious to me since the function is built for a different use case)

DATA WANT;

SET sashelp.cars;

  LARGEST = LARGEST(5,cylinders);

RUN;

I need to have  a look at the second answer.....

bernhard
Calcite | Level 5

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

  %let nm=%scan(&_varlist_.,&i.);

  proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

  %do j=1 %to %sysfunc(countw(&k.));

  %let kk=%scan(&k.,&j.);

  %if &j. eq 1 %then %do;

  data __temp___&i.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  %end;

  %else %do;

  data __temp&kk.;

  format k 14.;

  set __temp__(obs=&kk. firstobs=&kk.);

  k=&kk.;

  run;

  proc append base=__temp___&i. data=__temp&kk.;quit;

  %end;

  %end;

%end;

data &out.;

  merge __temp___:;

  by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

Scott_Mitchell
Quartz | Level 8

Here is another method I found in my code repository.  The previous example output to variables with a prefix of reordered.  The below will sort and output to the variables with a X prefix.

DATA WANT;

SET HAVE;

ARRAY A{25} X1-X25;

** SET SORT PARAMETERS ;

SEQ = "A" ; * A = ASCENDING, D = DESCENDING ;

NODUPKEY = 0 ; * 0 = DUPLICATES ALLOWED, 1 = DUPLICATES NOT ALLOWED ;

DCL HASH HH (HASHEXP: 0, ORDERED: SEQ) ;

DCL HITER HI ("HH") ;

HH.DEFINEKEY ("K","N") ; * N - EXTRA ENUMERATING KEY ;

HH.DEFINEDATA ("K") ; * K AUTOMATICALLY ASSUMES ARRAY DATA TYPE ;

HH.DEFINEDONE ( ) ;

** LOAD COMPOSITE (K N) KEY ON THE TABLE ;

** IF DUPLICATES TO BE RETAINED, SET 0 <- N ;

DO J = LBOUND (A) TO HBOUND (A) ;

N = J * ^ NODUPKEY ;

K = A {J} ;

HH.REPLACE() ;

END ;

** USE ITERATOR HI TO RELOAD ARRAY FROM HH TABLE, NOW IN ORDER ;

N = LBOUND (A) - 1 ;

DO RC = HI.FIRST() BY 0 WHILE ( RC = 0 ) ;

N = N + 1 ;

A {N} = K ;

RC = HI.NEXT() ;

END ;

Q = N ;

** FILL ARRAY TAIL WITH MISSING VALUES IF DUPLICATES ARE DELETE ;

DO N = Q + 1 TO HBOUND (A) ;

A {N} = . ;

END ;

** CHECK IF ARRAY IS NOW SORTED ;

SORTED = 1 ;

DO N = LBOUND (A) + 1 TO Q WHILE ( SORTED ) ;

IF A {N - 1} > A {N} THEN SORTED = 0 ;

END ;

PUT SORTED = ;

KEEP X:;

RUN ;

Good Luck

esjackso
Quartz | Level 8

Just out of curiosity would proc rank help?

EJ

bernhard
Calcite | Level 5

Note quite. Although proc rank is very efficient, I have not found a way with proc rank to number all observations by 1..... n . Although it can handle tied ranks (where multiple observations have the same value) in various ways, none of the options for TIES= does what I want.

Reeza
Super User

Depending on what K is, ie fixed or dynamic perhaps the extreme values section of Proc univariate can help:

Base SAS(R) 9.2 Procedures Guide: Statistical Procedures, Third Edition

Sadly ties are counted as multiple occurrences, so it may not help either.

title 'Extreme Blood Pressure Values';

ods select ExtremeValues;

proc univariate data=BPressure nextrval=5;

  var Systolic Diastolic;

run;

sas-innovate-2024.png

Don't miss out on SAS Innovate - Register now for the FREE Livestream!

Can't make it to Vegas? No problem! Watch our general sessions LIVE or on-demand starting April 17th. Hear from SAS execs, best-selling author Adam Grant, Hot Ones host Sean Evans, top tech journalist Kara Swisher, AI expert Cassie Kozyrkov, and the mind-blowing dance crew iLuminate! Plus, get access to over 20 breakout sessions.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 7 replies
  • 2004 views
  • 0 likes
  • 4 in conversation