Hello:
I have the follow sample dataset. I would like to get 20% observation of RR column and 10% observation of RR column. Please advice how to do it. Thanks.
data test;
infile datalines dsd;
input Group RR;
datalines;
372, 9.78,
840, 10.2,
320, 10.87,
341, 11.02,
341, 11.02
370, 11.18,
320, 11.61,
322, 11.63,
340, 11.85,
346, 11.97,
343, 12.47,
371, 13.51,
620, 13.89,
830, 16.98,
350, 26.8,
145, 29.9,
830, 32.53,
510, 36.05,
180, 40.64,
181, 46.82,
;
My final data set for Top 20% would be:
372, 9.78,
840, 10.2,
320, 10.87,
341, 11.02,
My final data set for Bottom 10% would be:
180, 40.64,
181, 46.82,
My actual dataset is over 1K observations. Its observations varies based on the different conditions, and kind of hard to court.
SET with POINT= is a good solution here:
Data top20 bottom10; do _N_=1 to ceil(nobs*.2); set have point=_N_ nobs=nobs; output top20; end; do _N_=ceil(nobs*.9) to nobs; set have point=_N_; output bottom10; end; stop; run;
I used the CEIL for the first loop so as to get at least one observation, even if there are less than 5 observations in the input. But you may want to use ROUND instead. In the bottom you will always get at least one obs, unless you use something like ROUND((NOBS+1)*.9).
Please clarify: by "top 20%" do you mean the first 20% of the records as they appear in your data set or the 20% smallest values for the variable RR? If the first is your data sorted as you want it?
I assumed he wanted first and last 20% of obs.
As you could see the sample dataset, I did sort the data first by RR. Therefore, I would like to get the top 20% of smallest RR value.
@ybz12003 wrote:
As you could see the sample dataset, I did sort the data first by RR. Therefore, I would like to get the top 20% of smallest RR value.
I asked for clarification because you showed a data set. It could very well just have happed that you entered data by order.
Most of the requests about "top" anything usually relate to largest though. So seeing "top" with the smallest values begged for some clarity.
A better example would have been to show your data input with data in an unordered fashion then show sort code.
Use PROC RANK with groups = 10 to get the variable into deciles. Then select the top 20 (rank 8 &9) and the bottom 10 (Rank 0).
Note how ties are handled which is something you should think about, most other solutions will not deal with ties.
Something like:
proc sql noprint; select int(NOBS * 0.2) into :TWENTY from DICTIONARY.TABLES where libname="SASHELP" and memname="CARS"; select int(NOBS * 0.1) into :TEN from DICTIONARY.TABLES where libname="SASHELP" and memname="CARS"; select NOBS into :TOT from DICTIONARY.TABLES where libname="SASHELP" and memname="CARS"; quit; %put 20=&twenty., 10=&ten., Tot=&tot.; data want_twenty want_ten; set sashelp.cars; if _n_ < &twenty. or _n_ > (&tot. - &twenty.) then output want_twenty; if _n_ < &ten. or _n_ > (&tot. - &ten.) then output want_ten; run;
Hi RW9:
I modified your code as list below. However, I got an error message from the log. Any idea why?
proc sort data=ABC.notnosmisc; by RR; run;
proc sql;
select int(NOBS *0.2) into : TWENTY from ABC.notnosmisc;
quit;
ERROR: The following columns were not found in the contributing tables: NOBS.
@Reeza has provided the best answer, in my opinion. It's extremely simple, and it has the ability to handle ties properly, while the other methods do not. Don't try to create code that does this for yourself, when SAS has already done so.
I would agree here, that @Reeza has the better answer, assuming that is what you want and only that.
To note, the reason you get an error is because you have replace dictionary.tables with the name of your dataset. Dictionary.tables is the same as sashelp.vtable in that it holds metadata about the table. It is from there that the observation count is being taken. So you need to select from dictionary.tables where libname and memname are your libname and memname.
Given that your data set is already sorted by RR, you could take the simple route:
data want;
set have nobs=_nobs_;
if _n_ / _nobs_ < 0.2 or _n_ / _nobs_ > 0.9;
run;
SET with POINT= is a good solution here:
Data top20 bottom10; do _N_=1 to ceil(nobs*.2); set have point=_N_ nobs=nobs; output top20; end; do _N_=ceil(nobs*.9) to nobs; set have point=_N_; output bottom10; end; stop; run;
I used the CEIL for the first loop so as to get at least one observation, even if there are less than 5 observations in the input. But you may want to use ROUND instead. In the bottom you will always get at least one obs, unless you use something like ROUND((NOBS+1)*.9).
Hi @ybz12003 .. it will be something like below ,
data test;
infile datalines dsd;
input Group RR;
obs=_n_;
datalines;
372, 9.78,
840, 10.2,
320, 10.87,
341, 11.02,
341, 11.02
370, 11.18,
320, 11.61,
322, 11.63,
340, 11.85,
346, 11.97,
343, 12.47,
371, 13.51,
620, 13.89,
830, 16.98,
350, 26.8,
145, 29.9,
830, 32.53,
510, 36.05,
180, 40.64,
181, 46.82,
;
run;
proc sort data= test out=test1;
by descending obs;
run;
data test2;
set test1;
oops1=_n_;
run;
proc sql;
create table first_20 as select group,RR from test
having obs/ Max(obs) <=0.2;
create table last_10 as select group,RR from test2
having oops1/ Max(oops1) <=0.1;
quit;
Thanks for everyone's great advice. So far, I have the codes written by Astound and S_lassen are working in my practice. But one observation is different between Astound and S_lassen's codes. Though Astound is simpler, S_lassen is more comprehensive to cover the decimal issue.
In the end, I truly thank everyone spending your precious time answer my question and helping me!
Percentiles don't have a standard definition, kind of like a month, so you can have variability in the answers.
@ybz12003 wrote:
Thanks for everyone's great advice. So far, I have the codes written by Astound and S_lassen are working in my practice. But one observation is different between Astound and S_lassen's codes. Though Astound is simpler, S_lassen is more comprehensive to cover the decimal issue.
In the end, I truly thank everyone spending your precious time answer my question and helping me!
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.