It sounds like you want to perform the same analysis for each unique combination of subset of cod_reg_clim and zona_clim
To do this, I suggest sorting the data by the groups. I also find it useful to add an index variable that enumerates the unique combinations:
data DIFF_Q_NUM;
array clim[3] $1 _temporary_ ('A' 'B' 'C');
format date_gas yymmdd10.; /* Formatta la data */
do date_gas = '01JAN2023'd to '31DEC2023'd; /* Loop sulle date per un anno */
do cod_reg_clim = 1 to 5; /* Cinque regioni climatiche */
do i = 1 to 3; /* Tre zone climatiche per ogni regione */
zona_clim = clim[i];
diff = rand("Normal", 0, 10); /* Valore casuale con media 0 e deviazione 10 */
output;
end;
end;
end;
run;
/* first, sort for BY-group analysis */
proc sort data=DIFF_Q_NUM out=DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim date_gas;
run;
/* add a single ID variable for each BY group */
data DIFF_Q_NUM_SORT;
set DIFF_Q_NUM_SORT;
by cod_reg_clim zona_clim;
if first.zona_clim then
ID + 1;
run;
Now the data are ready for a BY-group analysis. In PROC IML, there are several ways to do this. The easiest to explain is the unique-loc method. See The UNIQUE-LOC trick: A real treat! - The DO Loop
But also see the article BY-group processing in SAS/IML - The DO Loop
The following program reads all observations into IML vectors, then extracts the time series for each combination of the BY groups. It calls the HameplID function that you want to use on each time series and forms a vector that indicates which observations are outliers. It then writes that vector to a SAS data set. You can merge the outlier vector and the sorted data.
proc iml;
/* Function from https://blogs.sas.com/content/iml/2021/05/26/running-median-smoother.html
Extract the symmetric sequence about x[i], which is x[i-k-1:i+k+1] if
1 <= i-k-1 and i+k+1 <= nrow(x).
Otherwise, pad the sequence. You can pad with a user-defined value
(such as missing or zero) or you can repeat the first or last value.
This function loops over the number of columns and extracts each window.
INPUT: y is the time series
k is the radius of the window. The window length is 2*k+1 centered at y[i]
padVal can be missing or a character value. If character,
use y[1] to pad on the left and y[n] to pad on the right.
*/
start extractWindows(y, k, padVal="REPEAT");
n = nrow(y);
M = j(2*k+1, n, .);
/* create new series, z, by padding k values to the ends of y */
if type(padVal)='N' then
z = repeat(padVal, k) // y // repeat(padVal, k); /* pad with user value */
else
z = repeat(y[1], k) // y // repeat(y[n], k); /* Tukey's padding */
do i = 1 to n;
range = i : i+2*k; /* centered at k+i */
M[,i] = z[range];
end;
return M;
finish;
/* Hampel identifier: Use moving median and MAD to identify outliers
https://blogs.sas.com/content/iml/2021/06/01/hampel-filter-robust-outliers.html */
start HampelID(y, k, multiplier=3);
M = extractWindows(y, k, "REPEAT");
med = T( median(M) );
Nmad = T( mad(M, "NMAD") );
idx = loc( abs(y-med) > multiplier*Nmad );
return idx;
finish;
/* read in the series for each BY-group combination of cod_reg_clim and zona_clim */
use DIFF_Q_NUM_SORT;
read all var {diff ID};
close;
winLen = 6; /* set the window length */
Outlier = j(nrow(diff), 1, 0); /* most obs are not outliers */
firstObs = 1;
do i = 1 to max(ID);
groupIdx = loc(ID=i);
y = diff[groupIdx];
outlierIdx = HampelID(y, winLen);
if ncol(OutlierIdx) > 0 then /* there are outliers for this group */
Outlier[firstObs - 1 + outlierIdx] = 1; /* mark the outliers */
firstObs = firstObs + ncol(groupIdx);
end;
create HampelOutliers var "Outlier";
append;
close HampelOutliers;
QUIT;
/* merge the Hampel outlier variable */
data DIFF_Q_HAMPEL;
merge DIFF_Q_NUM_SORT HampelOutliers;
run;
/* visualize the time series and outliers */
proc sgpanel data=DIFF_Q_HAMPEL;
panelby zona_clim cod_reg_clim / layout=lattice onepanel;
scatter x=date_gas y=diff / group=Outlier;
run;
... View more