BookmarkSubscribeRSS Feed
jerry898969
Pyrite | Level 9
I have 32 variables that I need to get the top 5 and bottom five distinct values. I'm not sure what would be the cleanest way to do this. Any help pointing me in the right direction would be greatly appreciated.

Thank you
19 REPLIES 19
Peter_C
Rhodochrosite | Level 12
proc means
topic IDGROUP

see Example 12: Identifying the Top Three Extreme Values with the Output Statistics
at http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a001016379.htm added link to example Message was edited by: Peter.C
data_null__
Jade | Level 19
As mentioned the DISTINCT restriction means that to do this with PROC SUMMARY would require the data be summarized and then the top and bottom five chosen. With lots of levels of the analysis variables which are used as CLASS variables in the first step this would lose the performance race.

[pre]
%let vars=Age Height Weight;
proc summary data=sashelp.class chartype;
class &vars;
ways 1;
output out=freqs(rename=(_type_=type _freq_=N));
run;
proc print data=freqs;
run;
proc summary data=freqs;
by type;
freq N;
output out=bot idgroup(min(&vars) obs missing out[5](&vars N)=);
output out=top idgroup(max(&vars) obs missing out[5](&vars N)=);
run;
data five;
length position $3;
set bot top indsname=indsname;
by type;
position = scan(indsname,-1);
run;
proc print data=five;
run;
[/pre]
NickR
Quartz | Level 8
Alternatively,

You may use proc sql or proc sort to select distinct values, order them by their key variables, and use point option in data step to select the records of your choice.

proc sql noprint;
create table two as select distinct * from one order by var1,var2;
select count(*) into :n from two;
quit;

data three;
do rcrd=1,2,3,4,5,(&n-4),(&n-3),(&n-2),(&n-1),&n;
set two point=rcrd;
output;
end;
stop;
run;
polingjw
Quartz | Level 8
You could also easily transpose the dataset and use arrays to find the 5 smallest and largest. Here is an example with a made up dataset:

[pre]
data test;
array var{32};
do x=1 to 1000;
do i=-15 to 16;
var{i+16} = i*ranuni(0);
end;
output;
end;
drop i x;
run;

proc transpose data=test out=test_transpose name=var;
run;

data results;
set test_transpose;
array col{*} col:;
array max_{5};
array min_{5};
do i=1 to 5;
max_{i} = max(of col
  • );
    min_{i} = min(of col
  • );
    do j=1 to dim(col);
    if col{j}=min_{i} then col{j}= .;
    if col{j}=max_{i} then col{j}=.;
    end;
    end;
    keep var max_: min_:;
    run;
    [/pre]
  • jerry898969
    Pyrite | Level 9
    Thank you everyone for your help.

    This is how I ended up doing it:

    proc sql outobs=5 ;
    create table temp_top5
    as
    select distinct variable1, 'TOP' as source
    from temp
    order by variable1
    ;
    quit ;

    I do the same for the bottom and it gave me what I was looking for. Thank you for all of your suggestions.
    polingjw
    Quartz | Level 8
    Are you planning to repeat those two SQL steps for all 32 variables in your dataset? Why use 64 steps to accomplish something that could be done so much more concisely? Or, maybe I just misunderstood your original posting.
    Peter_C
    Rhodochrosite | Level 12
    does that mean running once for each variable? 32times !

    If your input data set will fit in memory, before the 32 runs, do
    sasfile work.temp open ;
    SASKiwi
    PROC Star
    Here is another way of doing this:

    ods output ExtremeValues = extreme;

    proc univariate data = sashelp.class;
    var age;
    run;

    ods output close;

    The ODS statement creates a dataset EXTREME with the top and bottom 5 values.

    If you have more than one variable just list them in the VAR statement. Message was edited by: SASKiwi
    chang_y_chung_hotmail_com
    Obsidian | Level 7
    Here is a one-pass solution that does not involve transposing variables. There are many ways to make my solution run even faster, but left as an exercise for those who are interested. 🙂



       /* modified polingjw^s test data */


       data test;


          array var{32} var01-var32;


          do x=1 to 1e6;


             do i=-15 to 16;


                var{i+16} = i*ranuni(0);


             end;


             output;


          end;


          drop i x;


       run;


     


       /* polingjw */


       proc transpose data=test out=test_transpose name=var;


       run;


       data polingjw;


          set test_transpose;


          array col{*} col:;


          array max_{5};


          array min_{5};


          do i=1 to 5;


             max_{i} = max(of col

  • );

  •          min_{i} = min(of col

  • );

  •          do j=1 to dim(col);


                if col{j}=min_{i} then col{j}= .;


                if col{j}=max_{i} then col{j}=.;


             end;


          end;


          keep  var max_: min_:;


       run;


       /* re-shape to long for comparisons */


       data polingjwLong;


         set polingjw;


         length var $32 type $6 num value 8;


         keep var type num value;


         array max_(5) max_1 - max_5;


         array min_(5) min_1 - min_5;


         do num = 1 to 5;


            type = "top";


            value = max_(num);


            output;


         end;


         do num = 1 to 5;


            type = "bottom";


            value = min_(num);


            output;


         end;


       run;   


     


       /* chang */


       %let maxNumVars = 3000; 


       %let top = 5; 


     


       data chang;


          set test end=end;


     


          array top(1:&maxNumVars,-&top:&top) _temporary_;


          array cur(*) _numeric_;


          do k = 1 to dim(cur);


             if missing(cur(k)) then leave;


             link doTop;


             link doBottom;


          end;


     


          if end then link doOutput;


          return;


     


          doTop:


             do j = 1 to ⊤


                if missing(top(k,j)) then do;


                   top(k,j) = cur(k);


                   leave;


                end


                if (cur(k) < top(k,j)) then continue;


                if (cur(k) = top(k,j)) then leave;       


                do i = &top to j+1 by -1;


                   top(k, i) = top(k, i-1);


                end;


                top(k, j) = cur(k);


                leave;


             end


          return;


          


          doBottom:


             do j = 1 to &top;


                if missing(top(k,-j)) then do;


                   top(k,-j) = cur(k);


                   leave;


                end;


                if (cur(k) > top(k,-j)) then continue;


                if (cur(k) = top(k,-j)) then leave;


                do i = &top to j+1 by -1;


                   top(k,-i) = top(k,-i+1);


                end;


                top(k,-j) = cur(k);


                leave;


             end


          return;


     


          doOutput:


             length var $32 type $6;


             do k = 1 to dim(cur);


                var = vname(cur(k));


                type = "top";


                do num = 1 to &top;


                   value = top(k, num); 


                   output;


                end;


                type = "bottom";


                do num = 1 to &top;


                   value = top(k,-num);


                   output;


                end;


             end;


             keep var type num value;


          return;


       run


     


       proc compare base=polingjwLong comp=chang;


       run;


       /* on log, in part


       NOTE: No unequal values were found. All values compared are exactly equal.


       */

    polingjw
    Quartz | Level 8
    Chang,

    I’m impressed by your much more efficient program. In a dataset with one million observations, your program runs in about 1/4th the time that mine does. I actually thought about using a multi-dimensional array solution when I first read the original posting but quickly gave up once I started getting confused by my own program. Thanks for demonstrating the better path forward!
    Peter_C
    Rhodochrosite | Level 12
    polingjw and chang
    how does the array approach compare with the ods output approach from SASKIWI at http://support.sas.com/forums/thread.jspa?messageID=49801#49801

    peterC
    (an interested by-stander)
    polingjw
    Quartz | Level 8
    On my computer, Chang’s solution took 32.54 seconds, real time. The univariate procedure took 1:59.84, almost four times as long. However, the results are not equivalent. The original posting asked for the top five and bottom five distinct values. The univariate procedure does not produce distinct results.
    Peter_C
    Rhodochrosite | Level 12
    thank you
    that is a stunning comparison!
    Ksharp
    Super User
    Hi.
    polingjw .
    You are wrong ,
    In the documentation about proc univariate ,there is an option which can select distinct top and bottom value.
    If you would ,can search it.
    proc univariate is more powerful than you image.


    Ksharp

    sas-innovate-2024.png

    Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

    Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

     

    Register now!

    What is Bayesian Analysis?

    Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

    Find more tutorials on the SAS Users YouTube channel.

    Click image to register for webinarClick image to register for webinar

    Classroom Training Available!

    Select SAS Training centers are offering in-person courses. View upcoming courses for:

    View all other training opportunities.

    Discussion stats
    • 19 replies
    • 14495 views
    • 1 like
    • 10 in conversation