- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Hello community,
I would like to apply outlier detection using the hampel identifier, I've found this article
https://blogs.sas.com/content/iml/2021/06/01/hampel-filter-robust-outliers.html
but I want to apply that to my series.
I have a sample dataset for example like that
data DIFF_Q_NUM;
format date_gas yymmdd10.; /* Formatta la data */
do date_gas = '01JAN2023'd to '31DEC2023'd; /* Loop sulle date per un anno */
do cod_reg_clim = 1 to 5; /* Cinque regioni climatiche */
do i = 1 to 3; /* Tre zone climatiche per ogni regione */
zona_clim = byte(64 + i); /* Converte 1->A, 2->B, 3->C */
diff = rand("Normal", 0, 10); /* Valore casuale con media 0 e deviazione 10 */
output;
end;
end;
end;
run;
My y target is variable 'diff', I would like to find the outliers for each combination of cod_reg_clim and zona_clim
and thick them in my dataset.
Thank you in advance
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
It sounds like you want to perform the same analysis for each unique combination of subset of cod_reg_clim and zona_clim
To do this, I suggest sorting the data by the groups. I also find it useful to add an index variable that enumerates the unique combinations:
data DIFF_Q_NUM;
array clim[3] $1 _temporary_ ('A' 'B' 'C');
format date_gas yymmdd10.; /* Formatta la data */
do date_gas = '01JAN2023'd to '31DEC2023'd; /* Loop sulle date per un anno */
do cod_reg_clim = 1 to 5; /* Cinque regioni climatiche */
do i = 1 to 3; /* Tre zone climatiche per ogni regione */
zona_clim = clim[i];
diff = rand("Normal", 0, 10); /* Valore casuale con media 0 e deviazione 10 */
output;
end;
end;
end;
run;
/* first, sort for BY-group analysis */
proc sort data=DIFF_Q_NUM out=DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim date_gas;
run;
/* add a single ID variable for each BY group */
data DIFF_Q_NUM_SORT;
set DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim;
if first.zona_clim then
ID + 1;
run;
Now the data are ready for a BY-group analysis. In PROC IML, there are several ways to do this. The easiest to explain is the unique-loc method. See The UNIQUE-LOC trick: A real treat! - The DO Loop
But also see the article BY-group processing in SAS/IML - The DO Loop
The following program reads all observations into IML vectors, then extracts the time series for each combination of the BY groups. It calls the HameplID function that you want to use on each time series and forms a vector that indicates which observations are outliers. It then writes that vector to a SAS data set. You can merge the outlier vector and the sorted data.
proc iml;
/* Function from https://blogs.sas.com/content/iml/2021/05/26/running-median-smoother.html
Extract the symmetric sequence about x[i], which is x[i-k-1:i+k+1] if
1 <= i-k-1 and i+k+1 <= nrow(x).
Otherwise, pad the sequence. You can pad with a user-defined value
(such as missing or zero) or you can repeat the first or last value.
This function loops over the number of columns and extracts each window.
INPUT: y is the time series
k is the radius of the window. The window length is 2*k+1 centered at y[i]
padVal can be missing or a character value. If character,
use y[1] to pad on the left and y[n] to pad on the right.
*/
start extractWindows(y, k, padVal="REPEAT");
n = nrow(y);
M = j(2*k+1, n, .);
/* create new series, z, by padding k values to the ends of y */
if type(padVal)='N' then
z = repeat(padVal, k) // y // repeat(padVal, k); /* pad with user value */
else
z = repeat(y[1], k) // y // repeat(y[n], k); /* Tukey's padding */
do i = 1 to n;
range = i : i+2*k; /* centered at k+i */
M[,i] = z[range];
end;
return M;
finish;
/* Hampel identifier: Use moving median and MAD to identify outliers
https://blogs.sas.com/content/iml/2021/06/01/hampel-filter-robust-outliers.html */
start HampelID(y, k, multiplier=3);
M = extractWindows(y, k, "REPEAT");
med = T( median(M) );
Nmad = T( mad(M, "NMAD") );
idx = loc( abs(y-med) > multiplier*Nmad );
return idx;
finish;
/* read in the series for each BY-group combination of cod_reg_clim and zona_clim */
use DIFF_Q_NUM_SORT;
read all var {diff ID};
close;
winLen = 6; /* set the window length */
Outlier = j(nrow(diff), 1, 0); /* most obs are not outliers */
firstObs = 1;
do i = 1 to max(ID);
groupIdx = loc(ID=i);
y = diff[groupIdx];
outlierIdx = HampelID(y, winLen);
if ncol(OutlierIdx) > 0 then /* there are outliers for this group */
Outlier[firstObs - 1 + outlierIdx] = 1; /* mark the outliers */
firstObs = firstObs + ncol(groupIdx);
end;
create HampelOutliers var "Outlier";
append;
close HampelOutliers;
QUIT;
/* merge the Hampel outlier variable */
data DIFF_Q_HAMPEL;
merge DIFF_Q_NUM_SORT HampelOutliers;
run;
/* visualize the time series and outliers */
proc sgpanel data=DIFF_Q_HAMPEL;
panelby zona_clim cod_reg_clim / layout=lattice onepanel;
scatter x=date_gas y=diff / group=Outlier;
run;
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
How can I verify that the filter works? Showing the band lower and upper to check the outliers points?
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
That is shown in the blog post: The Hampel identifier: Robust outlier detection in a time series - The DO Loop
Try to figure it out and write back if you get stuck.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes but in your code I can't see the lower and upper bound. I would like to see in my dataset the bands because I see that the extreme values are not detected as outliers.
- Mark as New
- Bookmark
- Subscribe
- Mute
- RSS Feed
- Permalink
- Report Inappropriate Content
Yes. I was suggesting that you to try to implement the method yourself. You can use the following function to return the rolling median and the Hampel bands for each series:
start HampelPred(y, k, multiplier=3);
M = extractWindows(y, k, "REPEAT");
med = T( median(M) );
Nmad = T( mad(M, "NMAD") );
pred = med;
lower = pred - multiplier*Nmad;
upper = pred + multiplier*Nmad;
return pred || lower || upper;
finish;
Now call that function inside the loop over the BY groups and include the results in the output data set.