DATA Step, Macro, Functions and more

Replacing Outlier with 95 percentile

Reply
Contributor
Posts: 38

Replacing Outlier with 95 percentile

Hallo,

 

do anyone here know an efficient way of replacing (upper) extreme values with (95th) percentile.

I have a Dataset with around 600 variables and want to get rid of  extreme values of all 600 variables with 95th percentile without doing a lot of copy past work.

 

Any idea ?

 

thanks in advance

Ehsan

SAS Super FREQ
Posts: 3,753

Re: Replacing Outlier with 95 percentile

Posted in reply to ehsanmath

This is called "Winsorizing" the data.  When you Winsorize, you replace extreme values by a specified percentile. Usually Winsorization is a two-sided operation. For example, you replace small values by the 5th percentile and large values by the 95th percentile.

 

If your goal is to compute the Winsorized mean, this comes out of PROC UNIVARIATE automatically by using the WINSORIZED=0.05 option. 

 

If you absolutely insist on Winsorizing the data yourself, it can get a little tricky. Search this forum for "winsorize" and see comments by @FreelanceReinhard. You can also read the article "How to Winsorize data in SAS", which explains some of the issues and offers a SAS/IML solution.

Super User
Posts: 10,028

Re: Replacing Outlier with 95 percentile

Posted in reply to ehsanmath
Contributor
Posts: 38

Re: Replacing Outlier with 95 percentile

Hi guys,

 

Thanks a lot for your suggestions and help.

With your help i wrote the followoing code to solve my problem. I am just copy pasting my code here

--------------------------------START-------------------------------------------------

proc sql;
        select
            trim(name), trim(name)||'p' into :avl_selected_100K_spalten separated by ' ', :avl_selected_100K_spalten_P99 separated by ' '
        from
            dictionary.columns
        where
            libname = 'WORK'
            and
            memname = 'AVL_SELECTED_100K_PATCH'
            and
            upcase(name) not in('PERSON_ID');
    quit;

proc univariate data=AVL_SELECTED_100K_Patch noprint;
   var &avl_selected_100K_spalten.;
   output out=AVL_SELECTED_100K_P99 pctlpts  =99
                           pctlpre  = &avl_selected_100K_spalten.
                           pctlname = p;
run;

data AVL_SELECTED_100K_P99;
    set AVL_SELECTED_100K_P99;
    _TYPE_=0;
run;

PROC SORT
    DATA=AVL_SELECTED_100K_P99
    OUT=AVL_SELECTED_100K_P99_sort;
    BY _TYPE_;
RUN;

PROC TRANSPOSE DATA=AVL_SELECTED_100K_P99_sort
    OUT=Trans_AVL_SELECTED_100K_P99_sort
    PREFIX=Column
    NAME=Column_name
    LABEL=Label;
    BY _TYPE_;
    VAR &avl_selected_100K_spalten_P99.;
RUN;

data AVL_SELECTED_100K_P99 (drop=_TYPE_ Label rename=(column1=P99));
    set Trans_AVL_SELECTED_100K_P99_sort;
    Column_name=substrn(Column_name,1,length(Column_name)-1);
run;

proc sql ;
        select Column_name, P99 into :Column_nameP99 separated by ' ', :Column_nameP99_value separated by ' ' from AVL_SELECTED_100K_P99;
    quit;

%macro computePercentile();

    %let count = 0;
        %do %while (%qscan(&Column_nameP99.,%eval(&count + 1),%str( )) ne );
            %let count = %eval(&count + 1);
        %end;


    %do i = 1 %to &count;

    %let currentVar&i = %qscan(&Column_nameP99.,&i,%str( ));
    %let currentValue&i = %qscan(&Column_nameP99_value.,&i,%str( ));

    data AVL_SELECTED_100K_Patch;

        set AVL_SELECTED_100K_Patch;
 
        if &&currentVar&i.. gt &&currentValue&i.. then &&currentVar&i.. = &&currentValue&i..;
         
    run;
    %end;

%mend;

%computePercentile();

--------------------------------------END-----------------------------------------------

Ask a Question
Discussion stats
  • 3 replies
  • 834 views
  • 0 likes
  • 3 in conversation