Hi all,
I need to cluster many time series.
In the past (SAS 9.4), I used PROC TIMESERIES, which was limited to a single process, so computing the distances between all the time series took a long time.
Now I'm on Viya 3.5, and I've tried both the TSMODEL approach with the TSD package and a rewrite using the timeData.runTimeCode action.
But I'm still in the same situation, even though I'm running the code on a Viya 3.5 machine with 80 physical CPUs.
I have over 8k series of daily data spanning Jan 2008 to Apr 2021.
Here is the code with TSMODEL:
proc tsmodel data=casuser.gnc_vol_t_tra_impute_all outlog=casuser.outlog
outobj=(of=casuser.outtsddist(replace=YES) );
var _TR1_SI_30:;
id data interval=day;
require tsd;
submit;
declare object f(DTW);
declare object of(OUTTSD);
rc=f.Initialize();
rc=f.SetTarget(&si_remi_30);
rc=f.SetOption("METRIC", "RSQRDEV", "NORMALIZE", "STD", "TRIM", "BOTH");
rc=f.Run();
if rc < 0 then
stop;
rc=of.Collect(f);
if rc < 0 then
stop;
endsubmit;
print outlog;
run;
And here is the code with PROC CAS:
%macro cmpcode();
declare object f(DTW);
declare object of(OUTTSD);
rc=f.Initialize();
rc=f.SetTarget(&batch1_p);
rc=f.SetOption('METRIC', 'RSQRDEV', 'NORMALIZE', 'STD', 'TRIM', 'BOTH');
rc=f.Run();
if rc < 0 then
stop;
rc=of.Collect(f);
if rc < 0 then
stop;
%mend;
proc cas;
* like proc contents or SQL on Dictionary libref ;
table.columnInfo result=allvars / table={name="gnc_vol_t_tra_impute_all"};
run;
saveresult allvars casout="myallvars";
* reading vars with a custom filter for interval vars;
table.fetch result=selectedVars / table={name='myallvars', where="
Column not like '_TR1_tot_%' and Column not in ('_TR1_period', '_TR1_residuo',
'_TR1_settimana', 'data', 'int_conf', '_NAME_', '_PGNC_')
"}, fetchvars={{name='Column'}} to=&limit maxrows=&limit;
run;
* array creation for runtimecode and DST object ;
varList=${};
oth_varlist=${};
do row over selectedVars.Fetch;
singleVar=compress(row.Column);
/* varList[row._Index_]= "{name="||quote(singleVar) || "}"; */
varList[row._Index_]= singleVar;
end;
print varList;
cmpcode="%cmpcode()";
timeData.runTimeCode result=run /
table={name="gnc_vol_t_tra_impute_all"}
logControl={{keep=TRUE, sev="ERROR"}}
require={{pkg="TSD"}}
series=varList
timeid="data"
interval="day"
objOut={
{objRef="of", table={name='outtsddist' replace=TRUE}}
}
logout ={name="TSMODEL_LOG" replace=True}
code=cmpcode;
run;
quit;
And here are the elapsed times I measured:
N# Vars | Secs |
10 | 3.60 |
20 | 14.40 |
40 | 57.60 |
80 | 230.40 |
160 | 921.60 |
320 | 3686.40 |
If I project the time needed for the 8k time series, I will need over 40 days of calculation.
My question is: has SAS implemented some faster algorithms like MASS (https://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html)?
If not, any suggestion is welcome!
Are you trying to find the pairwise distance between these 8000 time series using DTW distance measure?
The code below gives the pairwise distances between your time series, assuming the column names are column1, column2, ...
cas mycas;
libname mycas cas;
data mycas.testdata;
input time column1 column2 column3;
datalines;
1 2 3 0
2 4 5 9
3 6 3 0
4 7 3 1
5 3 3 2
6 8 6 3
7 9 3 4
8 3 8 0
9 10 7 5
10 11 9 3
;
run;
%macro t();
proc tsmodel data=mycas.testdata
outscalar=mycas.outscalar;
id time interval=obs;
var _numeric_;
outscalars %do i=1 %to 3; %do j=&i.+1 %to 3; measure_&i._&j. %end;%end;;
require tsa;
submit;
declare object TSA(tsa);
%do i=1 %to 2;
%do j=&i+1 %to 3;
rc = TSA.SIMILARITY(column&i, column&j, 'SQRDEV', 'NONE' , , , , , measure_&i._&j.);
%end;
%end;
endsubmit;
run;
%mend;
%t;
Hi Andrea,
N# Vars | Pairs (DTW distances requested) | Secs |
10 | 45 | 3.60 |
20 | 190 | 14.40 |
40 | 780 | 57.60 |
80 | 3,160 | 230.40 |
160 | 12,720 | 921.60 |
320 | 51,040 | 3686.40 |
Let's look at the timing per pair:
3.60 sec / 45 = 0.080 sec per pair
14.40 sec / 190 = 0.076 sec
57.60 sec / 780 = 0.074 sec
…
3686.40 sec / 51040 = 0.072 sec
The cost per pair stays roughly constant, so I can say the DTW distance calculation in PROC TSMODEL is scalable: the total time grows in proportion to the number of pairs.
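The per-pair arithmetic can be checked outside SAS with a short script. This is only a sketch in Python: the timings are the ones reported in this thread, while the 8,000-series projection assumes exactly 8,000 series and that the per-pair cost measured at 320 series stays constant, neither of which the thread confirms.

```python
# Timings reported in this thread: number of series -> elapsed seconds.
timings = {10: 3.60, 20: 14.40, 40: 57.60, 80: 230.40, 160: 921.60, 320: 3686.40}


def n_pairs(n):
    """Number of unordered pairs among n series: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Seconds spent per pairwise distance at each problem size.
per_pair = {n: secs / n_pairs(n) for n, secs in timings.items()}

for n, secs in timings.items():
    print(f"{n:>4} series  {n_pairs(n):>6} pairs  {per_pair[n]:.3f} sec/pair")

# Hypothetical projection (assumption): exactly 8,000 series, and the
# per-pair cost observed at 320 series carries over unchanged.
n_big = 8000
projected_secs = n_pairs(n_big) * per_pair[320]
projected_days = projected_secs / 86400
print(f"{n_big} series  {n_pairs(n_big)} pairs  ~{projected_days:.1f} days")
```

Under those assumptions the projection comes out a little under 27 days for 8,000 series. Because the pair count grows quadratically, a larger series count pushes the figure up quickly.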
Thanks Taiyenong,
I know that TSMODEL runs on a single machine, but while monitoring it I saw that 2 CPUs were at 100% while the rest were idle.
With your suggestion about YWINSISE, I cut the time needed by almost 50%.
But I still can't understand why the DTW algorithm is not fully parallelized, since it just calculates all the pairwise combinations of the provided time series.
Thanks again!
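To illustrate why the pairwise computation looks parallelizable in principle, here is a minimal sketch in Python. It is not SAS code and not the TSD package's implementation. `dtw_distance` is a textbook DTW with a squared-deviation cost and none of the NORMALIZE or TRIM options. `split_round_robin` and `run_batch` are hypothetical helper names. The data are the three toy columns from the example earlier in this thread. Whether and how such batches map onto TSMODEL or CAS workers is not something this thread establishes.

```python
from itertools import combinations


def dtw_distance(a, b):
    """Textbook DTW distance with squared-deviation local cost (no window)."""
    inf = float("inf")
    prev = [inf] * (len(b) + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            # Cheapest of: step in a only, step in both, step in b only.
            curr[j] = cost + min(prev[j], prev[j - 1], curr[j - 1])
        prev = curr
    return prev[len(b)]


def split_round_robin(items, n_batches):
    """Deal items into n_batches lists of near-equal size (hypothetical helper)."""
    return [items[k::n_batches] for k in range(n_batches)]


# Toy data copied from the three-column example in this thread.
series = {
    "column1": [2, 4, 6, 7, 3, 8, 9, 3, 10, 11],
    "column2": [3, 5, 3, 3, 3, 6, 3, 8, 7, 9],
    "column3": [0, 9, 0, 1, 2, 3, 4, 0, 5, 3],
}


def run_batch(batch):
    """Compute the distances for one batch; no batch depends on another."""
    return {(i, j): dtw_distance(series[i], series[j]) for i, j in batch}


pairs = list(combinations(sorted(series), 2))
batches = split_round_robin(pairs, 2)

# Each batch could in principle go to a separate worker; here they run in turn.
merged = {}
for batch in batches:
    merged.update(run_batch(batch))

serial = run_batch(pairs)
print(merged == serial)
```

Since no pair's distance depends on any other pair, the batches can be computed in any order and merged afterwards, which is the property the question above relies on.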