DATA Step, Macro, Functions and more

Replacing outliers with missing values

Accepted Solution Solved
Reply
Respected Advisor
Posts: 4,651
Accepted Solution

Replacing outliers with missing values

Hello wise people, I wonder if there is a better way to remove outliers from a dataset than this. I would particularly like to get rid of the special missing value.

Dataset myData contains variables date and station which identify observations uniquely. It also contains many numeric variables such as cond, c_org_diss, C_inor_diss, etc. I create another dataset outliersList with variables date and station and the name of the variable which should be set to missing. Here is what I have :

data outliersList;
format date yymmdd10.;
length var $32;
input date :yymmdd10. station $ var $;
value = .X;   /* Special missing value */
datalines;
2009-07-17 BSF-1 C_inor_diss
2009-07-17 BSF-1 C_org_diss
2009-07-17 BSF-2 C_inor_diss
2009-07-17 BSF-2 C_org_diss
2010-05-17 BSF-3 cond
2010-07-07 BSF-2 C_inor_diss
2010-07-07 BSF-2 C_org_diss
;

proc transpose data=outliersList out=outliersUpdate(drop=_NAME_);
by date station;
id var;
var value;
run;

data myData;
update myData outliersUpdate;
by date station;
run;

PG

PG

Accepted Solutions
Solution
‎01-26-2013 11:28 PM
Super User
Super User
Posts: 6,500

Re: Replacing outliers with missing values

You can just replace the .X with . by looping over all of your numeric variables.

data want ;

  update master transactions ;

  array x _numeric_;

  do over x ;  if x=.X then x=.; end;

run;

But when we used to use actual transaction datasets 30 years ago we always used special missing ._ (dot underscore) to convert a value to missing.  Checking the documentation you can see why:

If you want the resulting value in the master data set to be a regular missing value, use a single underscore (_) to represent missing values in the transaction data set. The resulting value in the master data set will be a period (.) for missing numeric values and a blank for missing character values.

View solution in original post


All Replies
Super Contributor
Posts: 1,636

Re: Replacing outliers with missing values

Hi PG,

sorry,I am not wise enough to understand your question:smileysilly::smileyconfused::smileyblush:. You say " I would particularly like to get rid of the special missing value.", then you let value = .X;   /* Special missing value */.

Don't you get a lot of .X after you run your code?

data myData;

update myData outliersUpdate;

by date station;

run;

Respected Advisor
Posts: 4,651

Re: Replacing outliers with missing values

Hi Linlin, I get .X only at the places where outliers were removed. But I would rather have "normal" missing values in those places.  - PG

PG
Super User
Super User
Posts: 6,500

Re: Replacing outliers with missing values

In your first step replace the .X with regular missing. Then use the UPDATEMODE=nomissingcheck option on the UPDATE statement.

http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a002645456.htm

Respected Advisor
Posts: 4,651

Re: Replacing outliers with missing values

Thanks Tom, but that would'n work. It would create an update dataset filled with missing values, the ones corresponding to my desired changes and the ones created by proc transpose. The update step would then zap all variables mentioned in the update dataset for all observations that match. That's not what I want. I want to zap only one variable per line in the update dataset. That's why I use the special missing value; it is treated differently than the missing values generated by transpose. - PG

PG
Solution
‎01-26-2013 11:28 PM
Super User
Super User
Posts: 6,500

Re: Replacing outliers with missing values

You can just replace the .X with . by looping over all of your numeric variables.

data want ;

  update master transactions ;

  array x _numeric_;

  do over x ;  if x=.X then x=.; end;

run;

But when we used to use actual transaction datasets 30 years ago we always used special missing ._ (dot underscore) to convert a value to missing.  Checking the documentation you can see why:

If you want the resulting value in the master data set to be a regular missing value, use a single underscore (_) to represent missing values in the transaction data set. The resulting value in the master data set will be a period (.) for missing numeric values and a blank for missing character values.
Respected Advisor
Posts: 4,651

Re: Replacing outliers with missing values

You're right. That is a simple solution. Thanks!

PG
Respected Advisor
Posts: 3,777

Re: Replacing outliers with missing values

I'm confused.  Wouldn't it be better for PGStats to just change VALUE=.X to VALUE=._;  Then you don't need loop on array.

Respected Advisor
Posts: 4,651

Re: Replacing outliers with missing values

Well, this special missing values business can rapidly become very messy. I would rather avoid it entirely. Tom's idea brings me back to safe ground and I like that. Look at this :

data _null_;

y = ._;

if y = .

then put "Yippee!";

else put "Bummer!";

run;

They would look the same, have the same meaning and yet not be equal. A recipe for cryptic errors. Frightening!

PG

PG
Respected Advisor
Posts: 3,777

Re: Replacing outliers with missing values

._ updates non missing to . "standard missing" in the new master.

With regards to you out of context example.


17         data _null_;
18            y = ._;
19            if missing(y)
20               then put "Yippee!";
21               else put "Bummer!";
22            run;

Yippee!

Super User
Super User
Posts: 6,500

Re: Replacing outliers with missing values

If you use ._ in the TRANSACTION file the values they replace will become regular missing in the result dataset.  So it saves having to convert them yourself.  You want to be able to distinguish between the missing and the special missing in the transaction dataset.

Respected Advisor
Posts: 4,651

Re: Replacing outliers with missing values

Thank you DN and Tom for your wisdom! Now I get it! - PG

PG
☑ This topic is SOLVED.

Need further help from the community? Please ask a new question.

Discussion stats
  • 11 replies
  • 931 views
  • 6 likes
  • 4 in conversation