Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Home
- /
- Programming
- /
- SAS Procedures
- /
- How to exclude highest and lowest value from dataset?

Options

- RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Posted 06-28-2011 05:58 AM
(8915 views)

Hi,

google wasn't much help on my problem so I try to find a solution here.

I have 12 observations from an experiment but only want to use 10 for the Anova. Therefore I want to exclude the lowest and the highest value. which command in the data step does that? i have the observation sorted by size with proc sort but don't find the right coding to do that.

Thanks for your help.

9 REPLIES 9

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

After you sort the data by the variable, use the first. and last. to delete the records.

For example:

proc sort data=test;

by x;

data test;

set test;

by x;

if first.x then delete;

if last.x then delete;

run;

Dataset test should now have every record but the lowest and highest value of x.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

proc sort data=dummy;

by l;

run;

data dummy;

set dummy;

by l;

if first.l then delete;

if last.l then delete;

run;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

Unless I've misunderstood the OP, I don't think this would work unless the dataset were sorted by an additional field that was the same in all rows. Worst case, if all values are unique then every row would be deleted.

For example:

**data **test;

doi = **1 **to **12**;

n = ranuni(-**1**);

output;

end;

drop i;

**run**;

**proc sort** data=test;

by n;

**run**;

**data **test_out1;

set test;

by n;

if first.n then delete;

if last.n then delete;

**run**;

This results in no rows ever being output. It would work perfectly if there were an additional field that was also part of the sort (e.g. 1 or more categories): such as below:

**data **test2;

cat = **1**;

do i = **1** to**12**;

n = ranuni(-**1**);

output;

end;

cat = **2**;

do i = **1** to**12**;

n = ranuni(-**1**);

output;

end;

drop i;

**run**;

**proc sort** data=test2;

by cat n;

**run**;

**data **test_out2;

set test2;

by cat;

if first.cat then delete;

if last.cat then delete;

**run**;

One question for the op though - what happens if the first two rows have the same value? If you would want to remove both, then a two step approach is probably required - use proc freq (etc.) to pull the min/max values, and then link the results to the original dataset to remove these relevant rows.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

If the dataset is sorted properly, and you truly only wanted to delete the very first record and the very last record, then this simple snippet will do it:

data truncated;

set original end=eof;

if _n_ = 1 | eof then delete;

run;

If there are multiple smallest values or maximum values, only one of them will be deleted.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

"If _n_=1| eof then delete;" means delete the first or the last observation.

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

proc sql;

create table work.anova_obs as

select * from work.obs

where value > (select min(value) from work.obs)

and value <(select max(value) from work.obs);

quit;

That should work if the value column doesn't have duplicates. if it does and you only want to exclude one of them, then something like this might work.

proc sql;

select count(*)-1, count(*)-2 into :FirstObs, :SecondObs from work.obs;

quit;

proc sort in=work.obs out=work.obs;

by value;

run;

option obs=&FirstObs;

data work.temp;

set work.obs;

run;

proc sort in=work.temp out=work.temp;

by descending value;

run;

option obs=&SecondObs;

data work.anova_obs;

set work.temp

run;

option obs=max;

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

You have the data sorted by size already? Then

data want;

set have (firstobs=2) end=end_of_have;

if end_of_have=0;

run;

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for

Allow PROC SORT to output multiple datasets

--------------------------

- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content

To help with your Googling, this technique (removing the extreme observations) is known as "trimming". You can use PROC UNIVARIATE to get the trimmed means. You can read about some of the statistical shortcomings that are associated with trimming (or "Winsorizing") data,

If you are trying to perform a robust ANOVA, the best way is to use a statistical procedure that is not sensitive to outliers. There is an example in the SAS documentation that shows how to use PROC ROBUSTREG to perform a robust ANOVA. By using PROC ROBUSTREG, you will not throw away good data unnecessarily, which is what happens when you automatically discar the minimum and maximum values in each group.

Registration is open! SAS is returning to Vegas for an AI and analytics experience like no other! Whether you're an executive, manager, end user or SAS partner, SAS Innovate is designed for everyone on your team. Register for just $495 by 12/31/2023.

**If you are interested in speaking, there is still time to submit a session idea. More details are posted on the website. **

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.