SAS/IML Software and Matrix Computations

harmonic · Posted 03-11-2025 12:24 PM

Hello community,

I would like to apply outlier detection using the hampel identifier, I've found this article

https://blogs.sas.com/content/iml/2021/06/01/hampel-filter-robust-outliers.html

but I want to apply that to my series.

I have a sample dataset for example like that

data DIFF_Q_NUM;
    format date_gas yymmdd10.; /* Formatta la data */
    do date_gas = '01JAN2023'd to '31DEC2023'd; /* Loop sulle date per un anno */
        do cod_reg_clim = 1 to 5; /* Cinque regioni climatiche */
            do i = 1 to 3; /* Tre zone climatiche per ogni regione */
                zona_clim = byte(64 + i); /* Converte 1->A, 2->B, 3->C */
                diff = rand("Normal", 0, 10); /* Valore casuale con media 0 e deviazione 10 */
                output;
            end;
        end;
    end;
run;

My y target is variable 'diff', I would like to find the outliers for each combination of cod_reg_clim and zona_clim

and thick them in my dataset.

Thank you in advance

Rick_SAS · Posted 03-12-2025 10:50 AM

It sounds like you want to perform the same analysis for each unique combination of subset of cod_reg_clim and zona_clim

To do this, I suggest sorting the data by the groups. I also find it useful to add an index variable that enumerates the unique combinations:

data DIFF_Q_NUM;
    array clim[3] $1 _temporary_ ('A' 'B' 'C');
    format date_gas yymmdd10.; /* Formatta la data */
    do date_gas = '01JAN2023'd to '31DEC2023'd; /* Loop sulle date per un anno */
        do cod_reg_clim = 1 to 5; /* Cinque regioni climatiche */
            do i = 1 to 3; /* Tre zone climatiche per ogni regione */
                zona_clim = clim[i]; 
                diff = rand("Normal", 0, 10); /* Valore casuale con media 0 e deviazione 10 */
                output;
            end;
        end;
    end;
run;

/* first, sort for BY-group analysis */
proc sort data=DIFF_Q_NUM out=DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim date_gas;
run;

/* add a single ID variable for each BY group */
data DIFF_Q_NUM_SORT;
set DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim;
if first.zona_clim then 
   ID + 1;
run;

Now the data are ready for a BY-group analysis. In PROC IML, there are several ways to do this. The easiest to explain is the unique-loc method. See The UNIQUE-LOC trick: A real treat! - The DO Loop

But also see the article BY-group processing in SAS/IML - The DO Loop

The following program reads all observations into IML vectors, then extracts the time series for each combination of the BY groups. It calls the HameplID function that you want to use on each time series and forms a vector that indicates which observations are outliers. It then writes that vector to a SAS data set. You can merge the outlier vector and the sorted data.

proc iml;
/* Function from https://blogs.sas.com/content/iml/2021/05/26/running-median-smoother.html
   Extract the symmetric sequence about x[i], which is x[i-k-1:i+k+1] if
   1 <= i-k-1  and  i+k+1 <= nrow(x).
   Otherwise, pad the sequence. You can pad with a user-defined value 
   (such as missing or zero) or you can repeat the first or last value.
   This function loops over the number of columns and extracts each window.
   INPUT: y is the time series
          k is the radius of the window. The window length is 2*k+1 centered at y[i]
          padVal can be missing or a character value. If character,
                 use y[1] to pad on the left and y[n] to pad on the right.
*/
start extractWindows(y, k, padVal="REPEAT");
   n = nrow(y);
   M = j(2*k+1, n, .);
   /* create new series, z, by padding k values to the ends of y */
   if type(padVal)='N' then
      z = repeat(padVal, k) // y // repeat(padVal, k); /* pad with user value */
   else
      z = repeat(y[1], k) // y // repeat(y[n], k);     /* Tukey's padding */
   do i = 1 to n;
      range = i : i+2*k;   /* centered at k+i */
      M[,i] = z[range];
   end;
   return M;
finish;
/* Hampel identifier: Use moving median and MAD to identify outliers 
   https://blogs.sas.com/content/iml/2021/06/01/hampel-filter-robust-outliers.html */
start HampelID(y, k, multiplier=3);
   M = extractWindows(y, k, "REPEAT");
   med = T( median(M) );
   Nmad = T( mad(M, "NMAD") ); 
   idx = loc( abs(y-med) > multiplier*Nmad );
   return idx;
finish;

/* read in the series for each BY-group combination of cod_reg_clim and zona_clim */
use DIFF_Q_NUM_SORT;
read all var {diff ID};
close;

winLen = 6;                    /* set the window length */
Outlier = j(nrow(diff), 1, 0);  /* most obs are not outliers */
firstObs = 1;
do i = 1 to max(ID);
   groupIdx = loc(ID=i);
   y = diff[groupIdx];
   outlierIdx = HampelID(y, winLen);
   if ncol(OutlierIdx) > 0 then    /* there are outliers for this group */
      Outlier[firstObs - 1 + outlierIdx] = 1;   /* mark the outliers */
   firstObs = firstObs + ncol(groupIdx);
end;
create HampelOutliers var "Outlier";
append;
close HampelOutliers;
QUIT;

/* merge the Hampel outlier variable */
data DIFF_Q_HAMPEL;
merge DIFF_Q_NUM_SORT HampelOutliers;
run;
/* visualize the time series and outliers */
proc sgpanel data=DIFF_Q_HAMPEL;
panelby zona_clim cod_reg_clim / layout=lattice onepanel; 
scatter x=date_gas y=diff / group=Outlier;
run;

harmonic · Posted 03-12-2025 11:20 AM

How can I verify that the filter works? Showing the band lower and upper to check the outliers points?

Rick_SAS · Posted 03-12-2025 11:25 AM

That is shown in the blog post: The Hampel identifier: Robust outlier detection in a time series - The DO Loop

Try to figure it out and write back if you get stuck.

harmonic · Posted 03-12-2025 11:57 AM

Yes but in your code I can't see the lower and upper bound. I would like to see in my dataset the bands because I see that the extreme values are not detected as outliers.

Rick_SAS · Posted 03-13-2025 06:36 AM

Yes. I was suggesting that you to try to implement the method yourself. You can use the following function to return the rolling median and the Hampel bands for each series:

start HampelPred(y, k, multiplier=3);
   M = extractWindows(y, k, "REPEAT");
   med = T( median(M) );
   Nmad = T( mad(M, "NMAD") ); 
   pred = med;
   lower = pred - multiplier*Nmad;
   upper = pred + multiplier*Nmad;
   return pred || lower || upper;
finish;

Now call that function inside the loop over the BY groups and include the results in the output data set.

SAS/IML Software and Matrix Computations

Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Follow Us

What is...

SAS/IML Software and Matrix Computations

Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Re: Hampel identifier for outlier detection

Our biggest data and AI event of the year.

Follow Us

What is...