Re: Getting Top and Bottom 5 distinct values

jerry898969 · Posted 02-03-2011 09:45 AM

I have 32 variables that I need to get the top 5 and bottom five distinct values. I'm not sure what would be the cleanest way to do this. Any help pointing me in the right direction would be greatly appreciated.

Thank you

Peter_C · Posted 02-03-2011 10:48 AM

proc means
topic IDGROUP

see Example 12: Identifying the Top Three Extreme Values with the Output Statistics
at http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a001016379.htm added link to example Message was edited by: Peter.C

data_null__ · Posted 02-05-2011 09:42 AM

As mentioned the DISTINCT restriction means that to do this with PROC SUMMARY would require the data be summarized and then the top and bottom five chosen. With lots of levels of the analysis variables which are used as CLASS variables in the first step this would lose the performance race.

[pre]
%let vars=Age Height Weight;
proc summary data=sashelp.class chartype;
class &vars;
ways 1;
output out=freqs(rename=(_type_=type _freq_=N));
run;
proc print data=freqs;
run;
proc summary data=freqs;
by type;
freq N;
output out=bot idgroup(min(&vars) obs missing out[5](&vars N)=);
output out=top idgroup(max(&vars) obs missing out[5](&vars N)=);
run;
data five;
length position $3;
set bot top indsname=indsname;
by type;
position = scan(indsname,-1);
run;
proc print data=five;
run;
[/pre]

NickR · Posted 02-03-2011 11:01 AM

Alternatively,

You may use proc sql or proc sort to select distinct values, order them by their key variables, and use point option in data step to select the records of your choice.

proc sql noprint;
create table two as select distinct * from one order by var1,var2;
select count(*) into :n from two;
quit;

data three;
do rcrd=1,2,3,4,5,(&n-4),(&n-3),(&n-2),(&n-1),&n;
set two point=rcrd;
output;
end;
stop;
run;

polingjw · Posted 02-03-2011 11:12 AM

You could also easily transpose the dataset and use arrays to find the 5 smallest and largest. Here is an example with a made up dataset:

[pre]
data test;
array var{32};
do x=1 to 1000;
do i=-15 to 16;
var{i+16} = i*ranuni(0);
end;
output;
end;
drop i x;
run;

proc transpose data=test out=test_transpose name=var;
run;

data results;
set test_transpose;
array col{*} col:;
array max_{5};
array min_{5};
do i=1 to 5;
max_{i} = max(of col

);
min_{i} = min(of col

);
do j=1 to dim(col);
if col{j}=min_{i} then col{j}= .;
if col{j}=max_{i} then col{j}=.;
end;
end;
keep var max_: min_:;
run;
[/pre]

jerry898969 · Posted 02-03-2011 11:41 AM

Thank you everyone for your help.

This is how I ended up doing it:

proc sql outobs=5 ;
create table temp_top5
as
select distinct variable1, 'TOP' as source
from temp
order by variable1
;
quit ;

I do the same for the bottom and it gave me what I was looking for. Thank you for all of your suggestions.

polingjw · Posted 02-03-2011 11:59 AM

Are you planning to repeat those two SQL steps for all 32 variables in your dataset? Why use 64 steps to accomplish something that could be done so much more concisely? Or, maybe I just misunderstood your original posting.

Peter_C · Posted 02-03-2011 12:01 PM

does that mean running once for each variable? 32times !

If your input data set will fit in memory, before the 32 runs, do
sasfile work.temp open ;

SASKiwi · Posted 02-03-2011 04:23 PM

Here is another way of doing this:

ods output ExtremeValues = extreme;

proc univariate data = sashelp.class;
var age;
run;

ods output close;

The ODS statement creates a dataset EXTREME with the top and bottom 5 values.

If you have more than one variable just list them in the VAR statement. Message was edited by: SASKiwi

chang_y_chung_hotmail_com · Posted 02-03-2011 04:24 PM

Here is a one-pass solution that does not involve transposing variables. There are many ways to make my solution run even faster, but left as an exercise for those who are interested. 🙂

/* modified polingjw^s test data */

data test;

array var{32} var01-var32;

do x=1 to 1e6;

do i=-15 to 16;

var{i+16} = i*ranuni(0);

end;

output;

end;

drop i x;

run;

/* polingjw */

proc transpose data=test out=test_transpose name=var;

run;

data polingjw;

set test_transpose;

array col{*} col:;

array max_{5};

array min_{5};

do i=1 to 5;

max_{i} = max(of col

);

min_{i} = min(of col

);

do j=1 to dim(col);

if col{j}=min_{i} then col{j}= .;

if col{j}=max_{i} then col{j}=.;

end;

keep var max_: min_:;

run;

/* re-shape to long for comparisons */

data polingjwLong;

set polingjw;

length var $32 type $6 num value 8;

keep var type num value;

array max_(5) max_1 - max_5;

array min_(5) min_1 - min_5;

do num = 1 to 5;

type = "top";

value = max_(num);

output;

end;

do num = 1 to 5;

type = "bottom";

value = min_(num);

output;

end;

run;

/* chang */

%let maxNumVars = 3000;

%let top = 5;

data chang;

set test end=end;

array top(1:&maxNumVars,-&top:&top) _temporary_;

array cur(*) _numeric_;

do k = 1 to dim(cur);

if missing(cur(k)) then leave;

link doTop;

link doBottom;

end;

if end then link doOutput;

return;

doTop:

do j = 1 to &top;

if missing(top(k,j)) then do;

top(k,j) = cur(k);

leave;

end;

if (cur(k) < top(k,j)) then continue;

if (cur(k) = top(k,j)) then leave;

do i = &top to j+1 by -1;

top(k, i) = top(k, i-1);

end;

top(k, j) = cur(k);

leave;

end;

return;

doBottom:

do j = 1 to &top;

if missing(top(k,-j)) then do;

top(k,-j) = cur(k);

leave;

end;

if (cur(k) > top(k,-j)) then continue;

if (cur(k) = top(k,-j)) then leave;

do i = &top to j+1 by -1;

top(k,-i) = top(k,-i+1);

end;

top(k,-j) = cur(k);

leave;

end;

return;

doOutput:

length var $32 type $6;

do k = 1 to dim(cur);

var = vname(cur(k));

type = "top";

do num = 1 to &top;

value = top(k, num);

output;

end;

type = "bottom";

do num = 1 to &top;

value = top(k,-num);

output;

end;

keep var type num value;

return;

run;

proc compare base=polingjwLong comp=chang;

run;

/* on log, in part

NOTE: No unequal values were found. All values compared are exactly equal.

*/

polingjw · Posted 02-04-2011 11:02 AM

Chang,

I’m impressed by your much more efficient program. In a dataset with one million observations, your program runs in about 1/4th the time that mine does. I actually thought about using a multi-dimensional array solution when I first read the original posting but quickly gave up once I started getting confused by my own program. Thanks for demonstrating the better path forward!

Peter_C · Posted 02-04-2011 11:19 AM

polingjw and chang
how does the array approach compare with the ods output approach from SASKIWI at http://support.sas.com/forums/thread.jspa?messageID=49801#49801

peterC
(an interested by-stander)

polingjw · Posted 02-04-2011 11:33 AM

On my computer, Chang’s solution took 32.54 seconds, real time. The univariate procedure took 1:59.84, almost four times as long. However, the results are not equivalent. The original posting asked for the top five and bottom five distinct values. The univariate procedure does not produce distinct results.

Peter_C · Posted 02-04-2011 12:50 PM

thank you
that is a stunning comparison!

Ksharp · Posted 02-16-2011 11:25 PM

Hi.
polingjw .
You are wrong ,
In the documentation about proc univariate ,there is an option which can select distinct top and bottom value.
If you would ,can search it.
proc univariate is more powerful than you image.

Ksharp

Registration is open

SAS Training: Just a Click Away