Re: Time series distances issue with time needed for over 8k series

andrea_magatti · Posted 06-09-2021 04:27 AM

Hi all,

I need to cluster many time series.

In the past (SAS 9.4), I used PROC TIMESERIES, which is constrained to a single process, so measuring the distance between every pair of series took a long time.

Now I'm on Viya 3.5, and I've tried both PROC TSMODEL with the TSD package and a rewrite of the code using the timeData.runTimeCode action.

But I'm still in the same situation, even though I'm running the code on a Viya 3.5 machine with 80 physical CPUs.

I've got over 8k series, with daily data spanning from Jan2008 up to Apr2021.

Here is the code with TSMODEL:

proc tsmodel data=casuser.gnc_vol_t_tra_impute_all outlog=casuser.outlog 
		outobj=(of=casuser.outtsddist(replace=YES) );
	var _TR1_SI_30:;
	id data interval=day;
	require tsd;
	submit;
	declare object f(DTW);
	declare object of(OUTTSD);
	rc=f.Initialize();
	rc=f.SetTarget(&si_remi_30);
	rc=f.SetOption("METRIC", "RSQRDEV", "NORMALIZE", "STD", "TRIM", "BOTH");
	rc=f.Run();

	if rc < 0 then
		stop;
	rc=of.Collect(f);

	if rc < 0 then
		stop;
	endsubmit;
	print outlog;
run;

And here is the code with PROC CAS:

%macro cmpcode();
	declare object f(DTW);
	declare object of(OUTTSD);
	rc=f.Initialize();
	rc=f.SetTarget(&batch1_p);
	rc=f.SetOption('METRIC', 'RSQRDEV', 'NORMALIZE', 'STD', 'TRIM', 'BOTH');
	rc=f.Run();
	if rc < 0 then
		stop;
	rc=of.Collect(f);
	if rc < 0 then
		stop;
%mend;


proc cas;

	* like proc contents or SQL on Dictionary libref ;
	table.columnInfo result=allvars / table={name="gnc_vol_t_tra_impute_all"};
	run;
	saveresult allvars casout="myallvars";
	* reading vars with a custom filter for interval vars;
	table.fetch result=selectedVars / table={name='myallvars', where="
					Column not like '_TR1_tot_%' and Column not in ('_TR1_period', '_TR1_residuo', 
					'_TR1_settimana', 'data', 'int_conf', '_NAME_', '_PGNC_')
					"}, fetchvars={{name='Column'}} to=&limit maxrows=&limit;
	run;
	
	* array creation for runtimecode and DST object ;
	varList=${};
	oth_varlist=${};
	do row over selectedVars.Fetch;
		singleVar=compress(row.Column);
/* 		varList[row._Index_]= "{name="||quote(singleVar) || "}"; */
		varList[row._Index_]= singleVar;
	end;
	print varList;

	cmpcode="%cmpcode()";
	timeData.runTimeCode result=run /
		table={name="gnc_vol_t_tra_impute_all"}		
		logControl={{keep=TRUE, sev="ERROR"}}
		require={{pkg="TSD"}}
		series=varList
		timeid="data"
		interval="day"
		objOut={
			{objRef="of", table={name='outtsddist' replace=TRUE}}
			}
		logout ={name="TSMODEL_LOG" replace=True}
		code=cmpcode;
	run;
quit;

And here are the elapsed times:

No. of vars	Seconds
10	3.60
20	14.40
40	57.60
80	230.40
160	921.60
320	3,686.40

If I project the time needed for the 8k time series, I will need over 40 days of calculation.
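As a rough cross-check of this projection, the timings in the table can be fitted with a quadratic curve. The sketch below is in Python rather than SAS; the model t = c * n**2 and the helper name projected_days are assumptions introduced for illustration, not something measured on the Viya machine.

```python
# Timings copied from the table above: (number of series, elapsed seconds).
timings = [(10, 3.60), (20, 14.40), (40, 57.60),
           (80, 230.40), (160, 921.60), (320, 3686.40)]

# Assumption: elapsed time grows with the square of the series count,
# t = c * n**2. Estimate c as the average of t / n**2 over the measured runs.
c = sum(t / n**2 for n, t in timings) / len(timings)


def projected_days(n_series: int) -> float:
    """Projected elapsed time in days for an all-pairs run over n_series."""
    return c * n_series**2 / 86_400  # 86,400 seconds in a day


for n in (8_000, 10_000):
    print(f"{n} series: about {projected_days(n):.1f} days")
```

Every row of the table gives the same constant, c = 0.036 seconds. Under that fit, exactly 8,000 series would need roughly 27 days, and the 40-day mark is reached near 9,800 series, so the estimate depends on how far "over 8k" the real count is.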

My question is: has SAS implemented some faster algorithms like MASS (https://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html)?

If not, any suggestion is welcome!

imvash · Posted 06-09-2021 03:10 PM

Are you trying to find the pairwise distances between these 8000 time series using the DTW distance measure?

imvash · Posted 06-09-2021 03:28 PM

This gives the pairwise distance between your time series, given that the column names are column1, column2, ...

cas mycas;
libname mycas cas;

data mycas.testdata;
	input time column1 column2 column3;
	datalines;
	1   2	3	0
	2   4	5	9
	3   6	3	0
	4   7	3	1
	5   3	3	2
	6   8	6	3
	7   9	3	4
	8   3	8	0
	9   10	7	5
	10  11	9	3
	;
run;

%macro t();
proc tsmodel data=mycas.testdata
	outscalar=mycas.outscalar;
	id time interval=obs;
	var _numeric_;
	outscalars %do i=1 %to 3; %do j=&i.+1 %to 3; measure_&i._&j. %end;%end;;
	require tsa;
	submit;
		declare object TSA(tsa);
		%do i=1 %to 2;
			%do j=&i+1 %to 3;
				rc = TSA.SIMILARITY(column&i, column&j, 'SQRDEV', 'NONE' , , , , , measure_&i._&j.);
			%end;
		%end;
	endsubmit;
run;
%mend;
%t;

taiyeong · Posted 06-09-2021 09:12 PM

Hi Andrea

First, PROC TSMODEL uses a single machine unless BY-group processing is used.
According to your program, you requested the following number of DTW distances:

No. of vars	Requested DTW pairs	Seconds
10	45	3.60
20	190	14.40
40	780	57.60
80	3,160	230.40
160	12,720	921.60
320	51,040	3,686.40

Let’s look at the timing per pair:

3.60 sec / 45 = 0.080 sec per pair

14.40 sec / 190 = 0.076 sec

57.60 sec / 780 = 0.074 sec

…

3,686.40 sec / 51,040 = 0.072 sec

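So the time per pair stays roughly constant as the number of series grows, and I can say the DTW distance calculation in PROC TSMODEL is scalable. The total run time grows quickly only because the number of pairs does.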
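The per-pair figures can be reproduced with a few lines of arithmetic. This is a minimal Python sketch, not SAS; it assumes the run computes one distance for every unordered pair of series, that is n(n-1)/2 distances, and the function name per_pair_seconds is made up for the example.

```python
from math import comb

# Elapsed seconds per number of series, copied from the timing table.
timings = {10: 3.60, 20: 14.40, 40: 57.60, 80: 230.40, 160: 921.60, 320: 3686.40}


def per_pair_seconds(n_series: int, elapsed: float) -> tuple[int, float]:
    """Return (number of unordered pairs, elapsed seconds per pair)."""
    pairs = comb(n_series, 2)  # n * (n - 1) / 2
    return pairs, elapsed / pairs


for n, secs in timings.items():
    pairs, per_pair = per_pair_seconds(n, secs)
    print(f"{n:>4} series  {pairs:>6} pairs  {per_pair:.3f} s per pair")
```

The per-pair cost settles between 0.07 and 0.08 seconds, so doubling the number of series roughly quadruples the total only because it roughly quadruples the number of pairs.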

You may use the XWINSIZE and YWINSIZE options (the expansion and contraction limits) in the TSD package to speed up the DTW distance calculation.
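To illustrate why a window limit speeds things up, here is a generic DTW sketch in Python with a symmetric band (often called a Sakoe-Chiba band). It is not the TSD package's implementation, and the single window parameter is an assumption that only approximates separate expansion and contraction limits. The two sample series are column1 and column2 from the test data earlier in the thread.

```python
from math import inf


def dtw_distance(x, y, window=None):
    """Squared-deviation DTW cost between x and y, plus the cells evaluated.

    window=None fills the whole len(x)-by-len(y) table; an integer restricts
    row i to columns j with |i - j| <= window.
    """
    n, m = len(x), len(y)
    if window is None:
        window = max(n, m)
    # The band must be at least |n - m| wide or the end cell is unreachable.
    window = max(window, abs(n - m))
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    cells = 0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cells += 1
    return cost[n][m], cells


a = [2, 4, 6, 7, 3, 8, 9, 3, 10, 11]
b = [3, 5, 3, 3, 3, 6, 3, 8, 7, 9]
full_cost, full_cells = dtw_distance(a, b)
band_cost, band_cells = dtw_distance(a, b, window=2)
print(full_cells, band_cells)
```

For two series of length T, the unconstrained table has T * T cells, while a band of half-width w touches at most (2w + 1) * T of them. With daily data from 2008 to 2021 each series has several thousand points, so a narrow band removes most of the work, at the price of a distance that can only be equal to or larger than the unconstrained one.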

MASS is used to do a similarity search given a series or a subsequence. If you want to achieve a similar goal to the MASS example, you have to set one target series (a query series) and pass all your 8k series as inputs in the TSD package. In that case, based on your timing table, I guess the run time will be about 560 seconds (0.07 sec x 8k).

You may also look at the motif score (MTFSCORE) object in the MTF package. The MTFSCORE object executes the scoring action that, given a motif sequence, finds motif instances in new sequences.
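The gap between the two workloads is easy to quantify. The Python sketch below assumes exactly 8,000 series and the rough cost of 0.07 seconds per distance taken from the timing table; both constants are approximations rather than measurements.

```python
from math import comb

N_SERIES = 8_000          # assumed series count ("over 8k" in the thread)
SECONDS_PER_PAIR = 0.07   # rough per-distance cost from the timing table

all_pairs = comb(N_SERIES, 2)   # every series against every other series
one_query = N_SERIES            # one target series against all inputs

all_pairs_seconds = all_pairs * SECONDS_PER_PAIR
one_query_seconds = one_query * SECONDS_PER_PAIR

print(f"all pairs: {all_pairs:,} distances, about {all_pairs_seconds / 86_400:.1f} days")
print(f"one query: {one_query:,} distances, about {one_query_seconds:.0f} seconds")
```

A single query against all series is a linear workload, while clustering needs the full distance matrix, which is quadratic. That is why the same per-distance cost gives minutes in one case and weeks in the other.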

andrea_magatti · Posted 06-14-2021 02:30 AM

Thanks Taiyeong,

I know that TSMODEL runs on a single machine, but while monitoring it I saw that the machine was using 2 CPUs at 100% while the rest were idle.

With your suggestion about YWINSIZE, I cut the time needed by almost 50%.

But I still can't understand why the DTW calculation is not fully parallelized, since it just computes the distance for every pairwise combination of the provided time series.

Thanks again!
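The pairwise distances are independent of one another, so in principle the pair list can be split into batches that run side by side. The Python sketch below only shows the splitting step; the function name pair_batches and the series names are hypothetical, and how such batches would map onto SAS (for example BY groups or separate sessions) is not covered in this thread.

```python
from itertools import combinations


def pair_batches(names, n_batches):
    """Split all unordered pairs of names into n_batches near-equal lists."""
    pairs = list(combinations(names, 2))
    # Round-robin slicing keeps batch sizes within one pair of each other.
    return [pairs[k::n_batches] for k in range(n_batches)]


# Hypothetical series names; the real table has thousands of columns.
series = [f"series_{i}" for i in range(1, 11)]
batches = pair_batches(series, 4)
for k, batch in enumerate(batches):
    print(f"batch {k}: {len(batch)} pairs, first = {batch[0]}")
```

Each pair appears in exactly one batch, so the partial results can simply be concatenated afterwards. For thousands of series the pair list would be generated lazily rather than held in memory, which this sketch does not attempt.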