SAS/IML Software and Matrix Computations

Statistical programming, matrix languages, and more
BookmarkSubscribeRSS Feed
harmonic
Quartz | Level 8

Hello community,

 

I would like to apply outlier detection using the hampel identifier, I've found this article

https://blogs.sas.com/content/iml/2021/06/01/hampel-filter-robust-outliers.html

but I want to apply that to my series.

 

I have a sample dataset for example like that

 

data DIFF_Q_NUM;
    format date_gas yymmdd10.; /* Formatta la data */
    do date_gas = '01JAN2023'd to '31DEC2023'd; /* Loop sulle date per un anno */
        do cod_reg_clim = 1 to 5; /* Cinque regioni climatiche */
            do i = 1 to 3; /* Tre zone climatiche per ogni regione */
                zona_clim = byte(64 + i); /* Converte 1->A, 2->B, 3->C */
                diff = rand("Normal", 0, 10); /* Valore casuale con media 0 e deviazione 10 */
                output;
            end;
        end;
    end;
run;


 

 

My y target is variable 'diff', I would like to find the outliers for each combination of cod_reg_clim and zona_clim

 

and thick them in my dataset.

 

Thank you in advance

 

 

 

5 REPLIES 5
Rick_SAS
SAS Super FREQ

It sounds like you want to perform the same analysis for each unique combination of subset of  cod_reg_clim and zona_clim

 

To do this, I suggest sorting the data by the groups. I also find it useful to add an index variable that enumerates the unique combinations:

 

data DIFF_Q_NUM;
    array clim[3] $1 _temporary_ ('A' 'B' 'C');
    format date_gas yymmdd10.; /* Formatta la data */
    do date_gas = '01JAN2023'd to '31DEC2023'd; /* Loop sulle date per un anno */
        do cod_reg_clim = 1 to 5; /* Cinque regioni climatiche */
            do i = 1 to 3; /* Tre zone climatiche per ogni regione */
                zona_clim = clim[i]; 
                diff = rand("Normal", 0, 10); /* Valore casuale con media 0 e deviazione 10 */
                output;
            end;
        end;
    end;
run;

/* first, sort for BY-group analysis */
proc sort data=DIFF_Q_NUM out=DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim date_gas;
run;

/* add a single ID variable for each BY group */
data DIFF_Q_NUM_SORT;
set DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim;
if first.zona_clim then 
   ID + 1;
run;

Now the data are ready for a BY-group analysis. In PROC IML, there are several ways to do this. The easiest to explain is the unique-loc method. See The UNIQUE-LOC trick: A real treat! - The DO Loop

But also see the article BY-group processing in SAS/IML - The DO Loop

 

The following program reads all observations into IML vectors, then extracts the time series for each combination of the BY groups. It calls the HameplID function that you want to use on each time series and forms a vector that indicates which observations are outliers. It then writes that vector to a SAS data set. You can merge the outlier vector and the sorted data. 

proc iml;
/* Function from https://blogs.sas.com/content/iml/2021/05/26/running-median-smoother.html
   Extract the symmetric sequence about x[i], which is x[i-k-1:i+k+1] if
   1 <= i-k-1  and  i+k+1 <= nrow(x).
   Otherwise, pad the sequence. You can pad with a user-defined value 
   (such as missing or zero) or you can repeat the first or last value.
   This function loops over the number of columns and extracts each window.
   INPUT: y is the time series
          k is the radius of the window. The window length is 2*k+1 centered at y[i]
          padVal can be missing or a character value. If character,
                 use y[1] to pad on the left and y[n] to pad on the right.
*/
start extractWindows(y, k, padVal="REPEAT");
   n = nrow(y);
   M = j(2*k+1, n, .);
   /* create new series, z, by padding k values to the ends of y */
   if type(padVal)='N' then
      z = repeat(padVal, k) // y // repeat(padVal, k); /* pad with user value */
   else
      z = repeat(y[1], k) // y // repeat(y[n], k);     /* Tukey's padding */
   do i = 1 to n;
      range = i : i+2*k;   /* centered at k+i */
      M[,i] = z[range];
   end;
   return M;
finish;
/* Hampel identifier: Use moving median and MAD to identify outliers 
   https://blogs.sas.com/content/iml/2021/06/01/hampel-filter-robust-outliers.html */
start HampelID(y, k, multiplier=3);
   M = extractWindows(y, k, "REPEAT");
   med = T( median(M) );
   Nmad = T( mad(M, "NMAD") ); 
   idx = loc( abs(y-med) > multiplier*Nmad );
   return idx;
finish;

/* read in the series for each BY-group combination of cod_reg_clim and zona_clim */
use DIFF_Q_NUM_SORT;
read all var {diff ID};
close;

winLen = 6;                    /* set the window length */
Outlier = j(nrow(diff), 1, 0);  /* most obs are not outliers */
firstObs = 1;
do i = 1 to max(ID);
   groupIdx = loc(ID=i);
   y = diff[groupIdx];
   outlierIdx = HampelID(y, winLen);
   if ncol(OutlierIdx) > 0 then    /* there are outliers for this group */
      Outlier[firstObs - 1 + outlierIdx] = 1;   /* mark the outliers */
   firstObs = firstObs + ncol(groupIdx);
end;
create HampelOutliers var "Outlier";
append;
close HampelOutliers;
QUIT;

/* merge the Hampel outlier variable */
data DIFF_Q_HAMPEL;
merge DIFF_Q_NUM_SORT HampelOutliers;
run;
/* visualize the time series and outliers */
proc sgpanel data=DIFF_Q_HAMPEL;
panelby zona_clim cod_reg_clim / layout=lattice onepanel; 
scatter x=date_gas y=diff / group=Outlier;
run;

Capture.PNG

harmonic
Quartz | Level 8

How can I verify that the filter works? Showing the band lower and upper to check the outliers points?


Rick_SAS
SAS Super FREQ

That is shown in the blog post: The Hampel identifier: Robust outlier detection in a time series - The DO Loop

Try to figure it out and write back if you get stuck.

harmonic
Quartz | Level 8

Yes but in your code I can't see the lower and upper bound. I would like to see in my dataset the bands because I see that the extreme values are not detected as outliers.

harmonic_0-1741795013240.png

 

Rick_SAS
SAS Super FREQ

Yes. I was suggesting that you to try to implement the method yourself. You can use the following function to return the rolling median and the Hampel bands for each series:

start HampelPred(y, k, multiplier=3);
   M = extractWindows(y, k, "REPEAT");
   med = T( median(M) );
   Nmad = T( mad(M, "NMAD") ); 
   pred = med;
   lower = pred - multiplier*Nmad;
   upper = pred + multiplier*Nmad;
   return pred || lower || upper;
finish;

Now call that function inside the loop over the BY groups and include the results in the output data set.

sas-innovate-white.png

Our biggest data and AI event of the year.

Don’t miss the livestream kicking off May 7. It’s free. It’s easy. And it’s the best seat in the house.

Join us virtually with our complimentary SAS Innovate Digital Pass. Watch live or on-demand in multiple languages, with translations available to help you get the most out of every session.

 

Register now!

From The DO Loop
Want more? Visit our blog for more articles like these.
Discussion stats
  • 5 replies
  • 850 views
  • 2 likes
  • 2 in conversation