Hello wise people, I wonder if there is a better way to remove outliers from a dataset than this. I would particularly like to get rid of the special missing value.
Dataset myData contains variables date and station which identify observations uniquely. It also contains many numeric variables such as cond, c_org_diss, C_inor_diss, etc. I create another dataset outliersList with variables date and station and the name of the variable which should be set to missing. Here is what I have :
data outliersList;
format date yymmdd10.;
length var $32;
input date :yymmdd10. station $ var $;
value = .X; /* Special missing value */
datalines;
2009-07-17 BSF-1 C_inor_diss
2009-07-17 BSF-1 C_org_diss
2009-07-17 BSF-2 C_inor_diss
2009-07-17 BSF-2 C_org_diss
2010-05-17 BSF-3 cond
2010-07-07 BSF-2 C_inor_diss
2010-07-07 BSF-2 C_org_diss
;
proc transpose data=outliersList out=outliersUpdate(drop=_NAME_);
by date station;
id var;
var value;
run;
data myData;
update myData outliersUpdate;
by date station;
run;
PG
You can just replace the .X with . by looping over all of your numeric variables.
data want ;
update master transactions ;
array x _numeric_;
do over x ; if x=.X then x=.; end;
run;
But when we used to use actual transaction datasets 30 years ago we always used special missing ._ (dot underscore) to convert a value to missing. Checking the documentation you can see why:
If you want the resulting value in the master data set to be a regular missing value, use a single underscore (_) to represent missing values in the transaction data set. The resulting value in the master data set will be a period (.) for missing numeric values and a blank for missing character values.
Hi PG,
sorry,I am not wise enough to understand your question:smileysilly::smileyconfused::smileyblush:. You say " I would particularly like to get rid of the special missing value.", then you let value = .X; /* Special missing value */.
Don't you get a lot of .X after you run your code?
data myData;
update myData outliersUpdate;
by date station;
run;
Hi Linlin, I get .X only at the places where outliers were removed. But I would rather have "normal" missing values in those places. - PG
In your first step replace the .X with regular missing. Then use the UPDATEMODE=nomissingcheck option on the UPDATE statement.
http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a002645456.htm
Thanks Tom, but that would'n work. It would create an update dataset filled with missing values, the ones corresponding to my desired changes and the ones created by proc transpose. The update step would then zap all variables mentioned in the update dataset for all observations that match. That's not what I want. I want to zap only one variable per line in the update dataset. That's why I use the special missing value; it is treated differently than the missing values generated by transpose. - PG
You can just replace the .X with . by looping over all of your numeric variables.
data want ;
update master transactions ;
array x _numeric_;
do over x ; if x=.X then x=.; end;
run;
But when we used to use actual transaction datasets 30 years ago we always used special missing ._ (dot underscore) to convert a value to missing. Checking the documentation you can see why:
If you want the resulting value in the master data set to be a regular missing value, use a single underscore (_) to represent missing values in the transaction data set. The resulting value in the master data set will be a period (.) for missing numeric values and a blank for missing character values.
You're right. That is a simple solution. Thanks!
I'm confused. Wouldn't it be better for PGStats to just change VALUE=.X to VALUE=._; Then you don't need loop on array.
Well, this special missing values business can rapidly become very messy. I would rather avoid it entirely. Tom's idea brings me back to safe ground and I like that. Look at this :
data _null_;
y = ._;
if y = .
then put "Yippee!";
else put "Bummer!";
run;
They would look the same, have the same meaning and yet not be equal. A recipe for cryptic errors. Frightening!
PG
._ updates non missing to . "standard missing" in the new master.
With regards to you out of context example.
17 data _null_;
18 y = ._;
19 if missing(y)
20 then put "Yippee!";
21 else put "Bummer!";
22 run;
Yippee!
If you use ._ in the TRANSACTION file the values they replace will become regular missing in the result dataset. So it saves having to convert them yourself. You want to be able to distinguish between the missing and the special missing in the transaction dataset.
Thank you DN and Tom for your wisdom! Now I get it! - PG
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.