Solved: Equivalent of Excel's "Large" function

bernhard · Posted 09-18-2013 06:19 AM

Hello

I am trying to find the equivalent of Excel's large function which returns the k'th largest element in a list.

My problem is the following: "Given a dataset with, say, 10^6 observations and 25 variables return the kth largest element for variables x1 to x25."

I tried a few things with proc sort, proc rank, data steps (obs=, firstobs=), _N_, some sql code. I also tried percentiles, but none of the percentiles options (there are 5 different ones) do exactly correspond to the kth largest element.

Thus, none of the solutions I tried are really satisfying.

The main issues I have are:

1. I do not want to execute some code (say a macro) for each of the variables. Let's say I do not want to sort my data set 25 times. This is to slow.

2. Proc Rank seems to work reasonably fast and it can work on all of the variables at the same time. However, none of the TIES= options suits my needs, since I want the ranks to be 1,2,3,4,..., 999999, 1000000 (simply 1-n all integers without interleaving any numbers)

Can anybody help me with that?

I timed my code again. It turns out that sorting 25 times (for all different variables) takes about 9 seconds (using "sasfile ds open"). I think I can actually live with that. Even though proc rank can give me all the ranks in 2.5 seconds.

Bernhard

bernhard · Posted 09-18-2013 07:56 AM

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

%let nm=%scan(&_varlist_.,&i.);

proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

%do j=1 %to %sysfunc(countw(&k.));

%let kk=%scan(&k.,&j.);

%if &j. eq 1 %then %do;

data __temp___&i.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

%end;

%else %do;

data __temp&kk.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

proc append base=__temp___&i. data=__temp&kk.;quit;

%end;

data &out.;

merge __temp___:;

by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

View solution in original post

Scott_Mitchell · Posted 09-18-2013 06:58 AM

Hi Bernhard,

The largest function will identify the nth largest value from a set of variables you specify.

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X25);

RUN;

Which can be consolidated to:

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,of X1-X25);

RUN;

You could also use an array to resort the values, that way the value in x5 is always the 5th largest of each observation.

DATA TWO;

KEEP REORDERED:;

SET HAVE;

ARRAY X(25);

ARRAY REORDERED(25);

DO K=1 TO 25;

REORDERED(K)=largest(K, OF X1-X25);

END;

RUN;

Regards,

Scott

bernhard · Posted 09-18-2013 07:52 AM

Hi Scott

Thanks for your first answer. I have noticed this function as well. But it is not doing what I want.

Consider the dataset sashelp.cars. Say I want to find the 5th largest value of cylinders. The result should be 10. Since there are three observations with 12 cylinders and two with 10 cylinders.

If I take you code for this example, I get an error (the error is obvious to me since the function is built for a different use case)

DATA WANT;

SET sashelp.cars;

LARGEST = LARGEST(5,cylinders);

RUN;

I need to have a look at the second answer.....

bernhard · Posted 09-18-2013 07:56 AM

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

%let nm=%scan(&_varlist_.,&i.);

proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

%do j=1 %to %sysfunc(countw(&k.));

%let kk=%scan(&k.,&j.);

%if &j. eq 1 %then %do;

data __temp___&i.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

%end;

%else %do;

data __temp&kk.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

proc append base=__temp___&i. data=__temp&kk.;quit;

%end;

data &out.;

merge __temp___:;

by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

Scott_Mitchell · Posted 09-18-2013 07:26 AM

Here is another method I found in my code repository. The previous example output to variables with a prefix of reordered. The below will sort and output to the variables with a X prefix.

DATA WANT;

SET HAVE;

ARRAY A{25} X1-X25;

** SET SORT PARAMETERS ;

SEQ = "A" ; * A = ASCENDING, D = DESCENDING ;

NODUPKEY = 0 ; * 0 = DUPLICATES ALLOWED, 1 = DUPLICATES NOT ALLOWED ;

DCL HASH HH (HASHEXP: 0, ORDERED: SEQ) ;

DCL HITER HI ("HH") ;

HH.DEFINEKEY ("K","N") ; * N - EXTRA ENUMERATING KEY ;

HH.DEFINEDATA ("K") ; * K AUTOMATICALLY ASSUMES ARRAY DATA TYPE ;

HH.DEFINEDONE ( ) ;

** LOAD COMPOSITE (K N) KEY ON THE TABLE ;

** IF DUPLICATES TO BE RETAINED, SET 0 <- N ;

DO J = LBOUND (A) TO HBOUND (A) ;

N = J * ^ NODUPKEY ;

K = A {J} ;

HH.REPLACE() ;

END ;

** USE ITERATOR HI TO RELOAD ARRAY FROM HH TABLE, NOW IN ORDER ;

N = LBOUND (A) - 1 ;

DO RC = HI.FIRST() BY 0 WHILE ( RC = 0 ) ;

N = N + 1 ;

A {N} = K ;

RC = HI.NEXT() ;

END ;

Q = N ;

** FILL ARRAY TAIL WITH MISSING VALUES IF DUPLICATES ARE DELETE ;

DO N = Q + 1 TO HBOUND (A) ;

A {N} = . ;

END ;

** CHECK IF ARRAY IS NOW SORTED ;

SORTED = 1 ;

DO N = LBOUND (A) + 1 TO Q WHILE ( SORTED ) ;

IF A {N - 1} > A {N} THEN SORTED = 0 ;

END ;

PUT SORTED = ;

KEEP X:;

RUN ;

Good Luck

esjackso · Posted 09-18-2013 08:45 AM

Just out of curiosity would proc rank help?

EJ

bernhard · Posted 09-18-2013 09:44 AM

Note quite. Although proc rank is very efficient, I have not found a way with proc rank to number all observations by 1..... n . Although it can handle tied ranks (where multiple observations have the same value) in various ways, none of the options for TIES= does what I want.

Reeza · Posted 09-18-2013 10:48 AM

Depending on what K is, ie fixed or dynamic perhaps the extreme values section of Proc univariate can help:

Base SAS(R) 9.2 Procedures Guide: Statistical Procedures, Third Edition

Sadly ties are counted as multiple occurrences, so it may not help either.

title 'Extreme Blood Pressure Values';

ods select ExtremeValues;

proc univariate data=BPressure nextrval=5;

var Systolic Diastolic;

run;

Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Re: Equivalent of Excel's "Large" function

Registration is open

SAS Training: Just a Click Away