- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
How to get the obs between 2.5% and 97.5% percentile of the sum in the dataset? The sum is in dollar 10.2 format. Thanks.
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am going to assume that you have a variable in your data set named "Sum", and that you want the observations in the middle 95th percentile of that variable:
/* generate some example datas */
data Have;
format Sum DOLLAR10.2;
do i = 1 to 100;
Sum = 1000*i;
output;
end;
/* Find the 2.5th and 97.5th percentiles for the SUM variable */
proc univariate data=Have noprint;
var Sum;
output out=Pctls pctlpre=Pctl_ pctlpts=2.5, 97.5;
run;
/* copy the lower and upper pctl value into macro variables */
data _null_;
set Pctls;
call symputx("LowerPctl", Pctl_2_5);
call symputx("UpperPctl", Pctl_97_5);
run;
/* extract the middle 95% of the data */
data Middle;
set Have;
where &LowerPctl <= Sum and Sum <= &UpperPctl;
run;
If you want the sum of these values, you can use PROC MEANS:
proc means data=middle SUM;
var Sum;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I assume you mean PERCENTILE, but you did not say that.
If that is correct, then this makes sense "How to get the obs between 2.5% and 97.5%" but the next words "of the sum" make no sense in this context. Please explain further.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
percentile
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I asked for an explanation, I don't see one.
Paige Miller
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
If you mean "sum the values of the observations between the 2.5th and 97.5 percentiles of the variable"
Steps: 1) determine the given percentiles, Proc univariate may be best here.
2) filter the data with a Where statement or data set option using the values of the percentiles
3) sum the variable
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
I am going to assume that you have a variable in your data set named "Sum", and that you want the observations in the middle 95th percentile of that variable:
/* generate some example datas */
data Have;
format Sum DOLLAR10.2;
do i = 1 to 100;
Sum = 1000*i;
output;
end;
/* Find the 2.5th and 97.5th percentiles for the SUM variable */
proc univariate data=Have noprint;
var Sum;
output out=Pctls pctlpre=Pctl_ pctlpts=2.5, 97.5;
run;
/* copy the lower and upper pctl value into macro variables */
data _null_;
set Pctls;
call symputx("LowerPctl", Pctl_2_5);
call symputx("UpperPctl", Pctl_97_5);
run;
/* extract the middle 95% of the data */
data Middle;
set Have;
where &LowerPctl <= Sum and Sum <= &UpperPctl;
run;
If you want the sum of these values, you can use PROC MEANS:
proc means data=middle SUM;
var Sum;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Is your data already sorted? If so, you can use the metadata item NOBS (number of obs in the dataset) to generate an upper and lower bound for observations to be submitted to the summing process:
proc sql noprint;
select nobs, floor(nobs/40)+1) , floor(nobs-nobs/40)
into :nobs, :LOWER_BOUND , :UPPER_BOUND
from dictionary.tables where libname='WORK' and memname='HAVE';
quit;
%put &=nobs &=lower_bound &=upper_bound;
proc means data=have (firstobs=&lower_bound obs=&upper_bound) sum n min max;
var x;
run;
To be precise, this code EXCLUDES as close to the lower and upper 2.5% that is achievable, but never excludes less than the upper and lower 2.5 percent.
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set
Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets
--------------------------