
## replacing extreme values with missing

I have a data set with both continuous and categorical variables. I need to find extreme values in the continuous variables and replace them with missing values. I've gotten this far:

```
/* Calculate Median and IQR */
PROC UNIVARIATE DATA = kddcup98 NOPRINT;
    VAR DemAge
        DemMedHomeValue
        DemMedIncome
        DemPctVeterans
        PromCnt12
        PromCnt36
        PromCntAll
        PromCntCard12
        PromCntCard36
        PromCntCardAll
        TARGET_D;
    OUTPUT OUT = boxStats p25 = p25 p75 = p75 QRANGE = iqr;
RUN;

DATA _null_;
    SET boxStats;
    CALL symput ('p25',p25);
    CALL symput ('p75',p75);
    CALL symput ('iqr', iqr);
RUN;

%PUT &p25;
%PUT &p75;
%PUT &iqr;

DATA trimmed;
    SET kddcup98;
    ARRAY change _numeric_;
    DO OVER change;
        IF (change > &p75 + 1.5 * &iqr) OR (change < &p25 - 1.5 * &iqr) THEN change = .;
    END;
RUN;

/* List Variables with Missing Values */
PROC MEANS DATA=trimmed NMISS N;
    TITLE 'trimmed Variables with Number of Missing Values (NMISS) and Number of Numeric Values (N)';
RUN;
```

The only problem is that it miscalculates the number of extreme values. In some cases, it considers most of the values as extreme.


## Re: replacing extreme values with missing

Are you trying to trim or Winsorize each variable? If so, please read "Winsorization: The good, the bad, and the ugly," which discusses the statistical implications of getting rid of extreme values. If you decide to proceed and Winsorize your data, the article also contains links to a second article about how to Winsorize, and you can easily modify it to replace extreme values with missing values.
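As a rough sketch of that modification for the posted program (not the code from the linked articles): the `OUTPUT` statement in the question supplies only one name per statistic, so only the quartiles of the first `VAR` variable (`DemAge`) are saved, and those fences are then applied to every numeric variable through `ARRAY change _numeric_`. That would explain why most values of some variables are flagged. The version below computes separate fences for each variable. The macro variable `contVars` and the output names `q1_1-q1_11`, `q3_1-q3_11`, and `iqr_1-iqr_11` are names chosen here for illustration, and PROC MEANS is swapped in for PROC UNIVARIATE.

```sas
/* Continuous variables to screen (same list as in the question) */
%LET contVars = DemAge DemMedHomeValue DemMedIncome DemPctVeterans
                PromCnt12 PromCnt36 PromCntAll
                PromCntCard12 PromCntCard36 PromCntCardAll TARGET_D;

/* One Q1, Q3, and IQR per variable, in the order of the VAR list */
PROC MEANS DATA = kddcup98 NOPRINT;
    VAR &contVars;
    OUTPUT OUT = boxStats (DROP = _TYPE_ _FREQ_)
           Q1 = q1_1-q1_11 Q3 = q3_1-q3_11 QRANGE = iqr_1-iqr_11;
RUN;

DATA trimmed;
    SET kddcup98;
    /* Read the one-row statistics once; SET retains them for every row */
    IF _N_ = 1 THEN SET boxStats;
    ARRAY change {*} &contVars;
    ARRAY q1  {*} q1_1-q1_11;
    ARRAY q3  {*} q3_1-q3_11;
    ARRAY iqr {*} iqr_1-iqr_11;
    DO i = 1 TO DIM(change);
        /* Compare each variable with its own fences */
        IF (change{i} > q3{i} + 1.5 * iqr{i}) OR
           (change{i} < q1{i} - 1.5 * iqr{i}) THEN change{i} = .;
    END;
    DROP i q1_: q3_: iqr_:;
RUN;
```

Listing the continuous variables explicitly, instead of using `_numeric_`, also keeps ID columns and numerically coded categorical variables out of the trimming step.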

If you only want the trimmed or Winsorized means and StdDev, you can use the ROBUSTSCALE option, the TRIMMED= option, and the WINSORIZED= option on the PROC UNIVARIATE statement to obtain robust estimates without modifying the original data.
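A minimal sketch of that approach, using the data set from the question. The proportion 0.1 (10% trimmed or Winsorized in each tail) and the three variables shown are arbitrary choices for illustration:

```sas
/* Robust estimates only; kddcup98 itself is not changed */
PROC UNIVARIATE DATA = kddcup98 ROBUSTSCALE TRIMMED = 0.1 WINSORIZED = 0.1;
    VAR DemAge DemMedHomeValue DemMedIncome;
    /* Show only the robust tables and suppress the default output */
    ODS SELECT TrimmedMeans WinsorizedMeans RobustScale;
RUN;
```

An integer value such as `TRIMMED = 5` is interpreted as a number of observations to trim from each tail instead of a proportion.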
