DATA Step, Macro, Functions and more

Top 20% and Bottom 10% of observations?

Accepted Solution Solved
Reply
Super Contributor
Posts: 351
Accepted Solution

Top 20% and Bottom 10% of observations?

Hello:

I have the follow sample dataset.  I would like to get 20% observation of RR column and 10% observation of RR column.  Please advice how to do it.  Thanks.

 

data test;

infile datalines dsd;

input Group RR;

datalines;

372, 9.78,

840, 10.2,

320, 10.87,

341, 11.02,

341, 11.02

370, 11.18,

320, 11.61,

322, 11.63,

340, 11.85,

346, 11.97,

343, 12.47,

371, 13.51,

620, 13.89,

830, 16.98,

350, 26.8,

145, 29.9,

830, 32.53,

510, 36.05,

180, 40.64,

181, 46.82,

;

 

 My final data set for Top 20% would be:

372, 9.78,

840, 10.2,

320, 10.87,

341, 11.02,

 

My final data set for Bottom 10% would be:

180, 40.64,

181, 46.82,

 

My actual dataset is over 1K observations. Its observations varies based on the different conditions, and kind of hard to court. 


Accepted Solutions
Solution
a month ago
PROC Star
Posts: 142

Re: Top 20% and Bottom 10% of observations?

SET with POINT= is a good solution here:

Data top20 bottom10;
  do _N_=1 to ceil(nobs*.2);
    set have point=_N_ nobs=nobs;
    output top20;
    end;
  do _N_=ceil(nobs*.9) to nobs;
    set have point=_N_;
    output bottom10;
    end;
  stop;
run;

I used the CEIL for the first loop so as to get at least one observation, even if there are less than 5 observations in the input. But you may want to use ROUND instead. In the bottom you will always get at least one obs, unless you use something like ROUND((NOBS+1)*.9).

View solution in original post


All Replies
Super User
Posts: 11,779

Re: Top 20% and Bottom 10% of observations?

Please clarify: by "top 20%" do you mean the first 20% of the records as they appear in your data set or the 20% smallest values for the variable RR? If the first is your data sorted as you want it?

Super User
Super User
Posts: 8,365

Re: Top 20% and Bottom 10% of observations?

I assumed he wanted first and last 20% of obs.

Super Contributor
Posts: 351

Re: Top 20% and Bottom 10% of observations?

As you could see the sample dataset, I did sort the data first by RR.  Therefore,  I would like to get the top 20% of smallest RR value.

Super User
Posts: 11,779

Re: Top 20% and Bottom 10% of observations?


ybz12003 wrote:

As you could see the sample dataset, I did sort the data first by RR.  Therefore,  I would like to get the top 20% of smallest RR value.


I asked for clarification because you showed a data set. It could very well just have happed that you entered data by order.

Most of the requests about "top" anything usually relate to largest though. So seeing "top" with the smallest values begged for some clarity.

 

A better example would have been to show your data input with data in an unordered fashion then show sort code.

Super User
Posts: 20,716

Re: Top 20% and Bottom 10% of observations?

Use PROC RANK with groups = 10 to get the variable into deciles. Then select the top 20 (rank 8 &9) and the bottom 10 (Rank 0). 

Note how ties are handled which is something you should think about, most other solutions will not deal with ties.

Super User
Super User
Posts: 8,365

Re: Top 20% and Bottom 10% of observations?

Something like:

proc sql noprint;
  select int(NOBS * 0.2) into :TWENTY from DICTIONARY.TABLES where libname="SASHELP" and memname="CARS";
  select int(NOBS * 0.1) into :TEN from DICTIONARY.TABLES where libname="SASHELP" and memname="CARS";
  select NOBS into :TOT from DICTIONARY.TABLES where libname="SASHELP" and memname="CARS";  
quit;

%put 20=&twenty., 10=&ten., Tot=&tot.;

data want_twenty want_ten;
  set sashelp.cars;
  if _n_ < &twenty. or _n_ > (&tot. - &twenty.) then output want_twenty;
  if _n_ < &ten. or _n_ > (&tot. - &ten.) then output want_ten;
run;
 
Super Contributor
Posts: 351

Re: Top 20% and Bottom 10% of observations?

Hi RW9:

 

I modified your code as list below.  However,  I got an error message from the log.  Any idea why?

 

 

proc sort data=ABC.notnosmisc; by RR; run;

proc sql;

select int(NOBS *0.2) into : TWENTY from ABC.notnosmisc;

quit;

ERROR: The following columns were not found in the contributing tables: NOBS.

 

 

Respected Advisor
Posts: 2,048

Re: Top 20% and Bottom 10% of observations?

[ Edited ]

@Reeza has provided the best answer, in my opinion. It's extremely simple, and it has the ability to handle ties properly, while the other methods do not. Don't try to create code that does this for yourself, when SAS has already done so.

--
Paige Miller
Super User
Super User
Posts: 8,365

Re: Top 20% and Bottom 10% of observations?

I would agree here, that @Reeza has the better answer, assuming that is what you want and only that.

To note, the reason you get an error is because you have replace dictionary.tables with the name of your dataset.  Dictionary.tables is the same as sashelp.vtable in that it holds metadata about the table.  It is from there that the observation count is being taken.  So you need to select from dictionary.tables where libname and memname are your libname and memname.

Highlighted
Super User
Posts: 5,714

Re: Top 20% and Bottom 10% of observations?

Given that your data set is already sorted by RR, you could take the simple route:

 

data want;

set have nobs=_nobs_;

if _n_ / _nobs_ < 0.2 or _n_ / _nobs_ > 0.9;

run;

Solution
a month ago
PROC Star
Posts: 142

Re: Top 20% and Bottom 10% of observations?

SET with POINT= is a good solution here:

Data top20 bottom10;
  do _N_=1 to ceil(nobs*.2);
    set have point=_N_ nobs=nobs;
    output top20;
    end;
  do _N_=ceil(nobs*.9) to nobs;
    set have point=_N_;
    output bottom10;
    end;
  stop;
run;

I used the CEIL for the first loop so as to get at least one observation, even if there are less than 5 observations in the input. But you may want to use ROUND instead. In the bottom you will always get at least one obs, unless you use something like ROUND((NOBS+1)*.9).

Contributor
Posts: 21

Re: Top 20% and Bottom 10% of observations?

Hi @ybz12003 .. it will be something like below , 

 

data test;
infile datalines dsd;
input Group RR;
obs=_n_;
datalines;
372, 9.78,
840, 10.2,
320, 10.87,
341, 11.02,
341, 11.02
370, 11.18,
320, 11.61,
322, 11.63,
340, 11.85,
346, 11.97,
343, 12.47,
371, 13.51,
620, 13.89,
830, 16.98,
350, 26.8,
145, 29.9,
830, 32.53,
510, 36.05,
180, 40.64,
181, 46.82,
;
run;

proc sort data= test out=test1;
by descending obs;
run;

data test2;
set test1;
oops1=_n_;
run;

proc sql;
create table first_20 as select group,RR from test
having obs/ Max(obs) <=0.2;
create table last_10 as select group,RR from test2
having oops1/ Max(oops1) <=0.1;
quit;

 

Super Contributor
Posts: 351

Re: Top 20% and Bottom 10% of observations?

Thanks for everyone's great advice.   So far, I have the codes written by Astound and S_lassen are working in my practice.   But one observation is different between Astound and S_lassen's codes.   Though Astound is simpler, S_lassen is more comprehensive to cover the decimal issue.  

In the end, I truly thank everyone spending your precious time answer my question and helping me!

Super User
Posts: 20,716

Re: Top 20% and Bottom 10% of observations?

Percentiles don't have a standard definition, kind of like a month, so you can have variability in the answers. 

 


ybz12003 wrote:

Thanks for everyone's great advice.   So far, I have the codes written by Astound and S_lassen are working in my practice.   But one observation is different between Astound and S_lassen's codes.   Though Astound is simpler, S_lassen is more comprehensive to cover the decimal issue.  

In the end, I truly thank everyone spending your precious time answer my question and helping me!


 

☑ This topic is solved.

Need further help from the community? Please ask a new question.

Discussion stats
  • 14 replies
  • 370 views
  • 10 likes
  • 8 in conversation