BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
PGStats
Opal | Level 21

Hello wise people, I wonder if there is a better way to remove outliers from a dataset than this. I would particularly like to get rid of the special missing value.

Dataset myData contains variables date and station which identify observations uniquely. It also contains many numeric variables such as cond, c_org_diss, C_inor_diss, etc. I create another dataset outliersList with variables date and station and the name of the variable which should be set to missing. Here is what I have :

data outliersList;
format date yymmdd10.;
length var $32;
input date :yymmdd10. station $ var $;
value = .X;   /* Special missing value */
datalines;
2009-07-17 BSF-1 C_inor_diss
2009-07-17 BSF-1 C_org_diss
2009-07-17 BSF-2 C_inor_diss
2009-07-17 BSF-2 C_org_diss
2010-05-17 BSF-3 cond
2010-07-07 BSF-2 C_inor_diss
2010-07-07 BSF-2 C_org_diss
;

proc transpose data=outliersList out=outliersUpdate(drop=_NAME_);
by date station;
id var;
var value;
run;

data myData;
update myData outliersUpdate;
by date station;
run;

PG

PG
1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

You can just replace the .X with . by looping over all of your numeric variables.

data want ;

  update master transactions ;

  array x _numeric_;

  do over x ;  if x=.X then x=.; end;

run;

But when we used to use actual transaction datasets 30 years ago we always used special missing ._ (dot underscore) to convert a value to missing.  Checking the documentation you can see why:

If you want the resulting value in the master data set to be a regular missing value, use a single underscore (_) to represent missing values in the transaction data set. The resulting value in the master data set will be a period (.) for missing numeric values and a blank for missing character values.

View solution in original post

11 REPLIES 11
Linlin
Lapis Lazuli | Level 10

Hi PG,

sorry,I am not wise enough to understand your question:smileysilly::smileyconfused::smileyblush:. You say " I would particularly like to get rid of the special missing value.", then you let value = .X;   /* Special missing value */.

Don't you get a lot of .X after you run your code?

data myData;

update myData outliersUpdate;

by date station;

run;

PGStats
Opal | Level 21

Hi Linlin, I get .X only at the places where outliers were removed. But I would rather have "normal" missing values in those places.  - PG

PG
Tom
Super User Tom
Super User

In your first step replace the .X with regular missing. Then use the UPDATEMODE=nomissingcheck option on the UPDATE statement.

http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a002645456.htm

PGStats
Opal | Level 21

Thanks Tom, but that would'n work. It would create an update dataset filled with missing values, the ones corresponding to my desired changes and the ones created by proc transpose. The update step would then zap all variables mentioned in the update dataset for all observations that match. That's not what I want. I want to zap only one variable per line in the update dataset. That's why I use the special missing value; it is treated differently than the missing values generated by transpose. - PG

PG
Tom
Super User Tom
Super User

You can just replace the .X with . by looping over all of your numeric variables.

data want ;

  update master transactions ;

  array x _numeric_;

  do over x ;  if x=.X then x=.; end;

run;

But when we used to use actual transaction datasets 30 years ago we always used special missing ._ (dot underscore) to convert a value to missing.  Checking the documentation you can see why:

If you want the resulting value in the master data set to be a regular missing value, use a single underscore (_) to represent missing values in the transaction data set. The resulting value in the master data set will be a period (.) for missing numeric values and a blank for missing character values.
PGStats
Opal | Level 21

You're right. That is a simple solution. Thanks!

PG
data_null__
Jade | Level 19

I'm confused.  Wouldn't it be better for PGStats to just change VALUE=.X to VALUE=._;  Then you don't need loop on array.

PGStats
Opal | Level 21

Well, this special missing values business can rapidly become very messy. I would rather avoid it entirely. Tom's idea brings me back to safe ground and I like that. Look at this :

data _null_;

y = ._;

if y = .

then put "Yippee!";

else put "Bummer!";

run;

They would look the same, have the same meaning and yet not be equal. A recipe for cryptic errors. Frightening!

PG

PG
data_null__
Jade | Level 19

._ updates non missing to . "standard missing" in the new master.

With regards to you out of context example.


17         data _null_;
18            y = ._;
19            if missing(y)
20               then put "Yippee!";
21               else put "Bummer!";
22            run;

Yippee!

Tom
Super User Tom
Super User

If you use ._ in the TRANSACTION file the values they replace will become regular missing in the result dataset.  So it saves having to convert them yourself.  You want to be able to distinguish between the missing and the special missing in the transaction dataset.

PGStats
Opal | Level 21

Thank you DN and Tom for your wisdom! Now I get it! - PG

PG

SAS Innovate 2025: Save the Date

 SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!

Save the date!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 11 replies
  • 4092 views
  • 6 likes
  • 4 in conversation