turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Find a Community

- Home
- /
- SAS Programming
- /
- General Programming
- /
- Equivalent of Excel's "Large" function

Topic Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

09-18-2013 06:19 AM

Hello

I am trying to find the equivalent of Excel's large function which returns the k'th largest element in a list.

My problem is the following: "Given a dataset with, say, 10^6 observations and 25 variables return the kth largest element for variables x1 to x25."

I tried a few things with proc sort, proc rank, data steps (obs=, firstobs=), _N_, some sql code. I also tried percentiles, but none of the percentiles options (there are 5 different ones) do exactly correspond to the kth largest element.

Thus, none of the solutions I tried are really satisfying.

The main issues I have are:

1. I do not want to execute some code (say a macro) for each of the variables. Let's say I do not want to sort my data set 25 times. This is to slow.

2. Proc Rank seems to work reasonably fast and it can work on all of the variables at the same time. However, none of the TIES= options suits my needs, since I want the ranks to be 1,2,3,4,..., 999999, 1000000 (simply 1-n all integers without interleaving any numbers)

Can anybody help me with that?

I timed my code again. It turns out that sorting 25 times (for all different variables) takes about 9 seconds (using "sasfile ds open"). I think I can actually live with that. Even though proc rank can give me all the ranks in 2.5 seconds.

Bernhard

Accepted Solutions

Solution

09-18-2013
07:56 AM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bernhard

09-18-2013 07:56 AM

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

%let nm=%scan(&_varlist_.,&i.);

proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

%do j=1 %to %sysfunc(countw(&k.));

%let kk=%scan(&k.,&j.);

%if &j. eq 1 %then %do;

data __temp___&i.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

%end;

%else %do;

data __temp&kk.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

proc append base=__temp___&i. data=__temp&kk.;quit;

%end;

%end;

%end;

data &out.;

merge __temp___:;

by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

All Replies

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bernhard

09-18-2013 06:58 AM

Hi Bernhard,

The largest function will identify the nth largest value from a set of variables you specify.

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14,X15,X16,X17,X18,X19,X20,X21,X22,X23,X24,X25);

RUN;

Which can be consolidated to:

DATA WANT;

SET HAVE;

LARGEST = LARGEST(1,of X1-X25);

RUN;

You could also use an array to resort the values, that way the value in x5 is always the 5th largest of each observation.

DATA TWO;

KEEP REORDERED:;

SET HAVE;

ARRAY X(25);

ARRAY REORDERED(25);

DO K=1 TO 25;

REORDERED(K)=largest(K, OF X1-X25);

END;

RUN;

Regards,

Scott

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to Scott_Mitchell

09-18-2013 07:52 AM

Hi Scott

Thanks for your first answer. I have noticed this function as well. But it is not doing what I want.

Consider the dataset sashelp.cars. Say I want to find the 5th largest value of cylinders. The result should be 10. Since there are three observations with 12 cylinders and two with 10 cylinders.

If I take you code for this example, I get an error (the error is obvious to me since the function is built for a different use case)

DATA WANT;

SET sashelp.cars;

LARGEST = LARGEST(5,cylinders);

RUN;

I need to have a look at the second answer.....

Solution

09-18-2013
07:56 AM

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bernhard

09-18-2013 07:56 AM

For now I have settled with a macro (see below).

I cannot get rid of the feelign that there is a nicer way to do this.

%macro large(data=ds,out=out,k=,vars=_all_);

/*returns the kth largest values*/

%local i nm;

proc datasets nolist;delete __temp:;quit;

data __temp__;set &data.(keep=&vars. obs=1);run;

proc contents noprint data=__temp__ out=__cont__(keep=name);quit;

proc sql noprint;select name into: _varlist_ separated by " " from __cont__;quit;

sasfile &data. open;

%do i=1 %to %sysfunc(countw(&_varlist_.));

%let nm=%scan(&_varlist_.,&i.);

proc sort data=&data. out=__temp__(keep=&nm.);by descending &nm.;quit;

%do j=1 %to %sysfunc(countw(&k.));

%let kk=%scan(&k.,&j.);

%if &j. eq 1 %then %do;

data __temp___&i.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

%end;

%else %do;

data __temp&kk.;

format k 14.;

set __temp__(obs=&kk. firstobs=&kk.);

k=&kk.;

run;

proc append base=__temp___&i. data=__temp&kk.;quit;

%end;

%end;

%end;

data &out.;

merge __temp___:;

by k;

run;

sasfile &data. close;

%mend;

%large(data=import2,k=10 100 1000);

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bernhard

09-18-2013 07:26 AM

Here is another method I found in my code repository. The previous example output to variables with a prefix of reordered. The below will sort and output to the variables with a X prefix.

DATA WANT;

SET HAVE;

ARRAY A{25} X1-X25;

** SET SORT PARAMETERS ;

SEQ = "A" ; * A = ASCENDING, D = DESCENDING ;

NODUPKEY = 0 ; * 0 = DUPLICATES ALLOWED, 1 = DUPLICATES NOT ALLOWED ;

DCL HASH HH (HASHEXP: 0, ORDERED: SEQ) ;

DCL HITER HI ("HH") ;

HH.DEFINEKEY ("K","N") ; * N - EXTRA ENUMERATING KEY ;

HH.DEFINEDATA ("K") ; * K AUTOMATICALLY ASSUMES ARRAY DATA TYPE ;

HH.DEFINEDONE ( ) ;

** LOAD COMPOSITE (K N) KEY ON THE TABLE ;

** IF DUPLICATES TO BE RETAINED, SET 0 <- N ;

DO J = LBOUND (A) TO HBOUND (A) ;

N = J * ^ NODUPKEY ;

K = A {J} ;

HH.REPLACE() ;

END ;

** USE ITERATOR HI TO RELOAD ARRAY FROM HH TABLE, NOW IN ORDER ;

N = LBOUND (A) - 1 ;

DO RC = HI.FIRST() BY 0 WHILE ( RC = 0 ) ;

N = N + 1 ;

A {N} = K ;

RC = HI.NEXT() ;

END ;

Q = N ;

** FILL ARRAY TAIL WITH MISSING VALUES IF DUPLICATES ARE DELETE ;

DO N = Q + 1 TO HBOUND (A) ;

A {N} = . ;

END ;

** CHECK IF ARRAY IS NOW SORTED ;

SORTED = 1 ;

DO N = LBOUND (A) + 1 TO Q WHILE ( SORTED ) ;

IF A {N - 1} > A {N} THEN SORTED = 0 ;

END ;

PUT SORTED = ;

KEEP X:;

RUN ;

Good Luck

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bernhard

09-18-2013 08:45 AM

Just out of curiosity would proc rank help?

EJ

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to esjackso

09-18-2013 09:44 AM

Note quite. Although proc rank is very efficient, I have not found a way with proc rank to number all observations by 1..... n . Although it can handle tied ranks (where multiple observations have the same value) in various ways, none of the options for TIES= does what I want.

- Mark as New
- Bookmark
- Subscribe
- RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Posted in reply to bernhard

09-18-2013 10:48 AM

Depending on what K is, ie fixed or dynamic perhaps the extreme values section of Proc univariate can help:

Base SAS(R) 9.2 Procedures Guide: Statistical Procedures, Third Edition

Sadly ties are counted as multiple occurrences, so it may not help either.

title 'Extreme Blood Pressure Values';

ods select ExtremeValues;

proc univariate data=BPressure nextrval=5;

var Systolic Diastolic;

run;