andrea_magatti
Obsidian | Level 7

Hi all,

I need to cluster many time series.

In the past (SAS 9.4), I used proc timeseries, which was constrained to a single process, so measuring the distance between all the time series provided took a long time.

Now I'm on Viya 3.5, and I've tried both the TSMODEL approach with the TSD package and a rewrite of the code using the timeData.runTimeCode action.

But still, I'm in the same situation, even if I'm running the code on a Viya 3.5 machine with 80 physical CPUs.

I've got over 8k series, with daily data spanning from Jan2008 up to Apr2021.

Here is the code with TSMODEL:

proc tsmodel data=casuser.gnc_vol_t_tra_impute_all outlog=casuser.outlog 
		outobj=(of=casuser.outtsddist(replace=YES) );
	var _TR1_SI_30:;
	id data interval=day;
	require tsd;
	submit;
	declare object f(DTW);
	declare object of(OUTTSD);
	rc=f.Initialize();
	rc=f.SetTarget(&si_remi_30);
	rc=f.SetOption("METRIC", "RSQRDEV", "NORMALIZE", "STD", "TRIM", "BOTH");
	rc=f.Run();

	if rc < 0 then
		stop;
	rc=of.Collect(f);

	if rc < 0 then
		stop;
	endsubmit;
	print outlog;
run;

And here is the code with PROC CAS:

%macro cmpcode();
	declare object f(DTW);
	declare object of(OUTTSD);
	rc=f.Initialize();
	rc=f.SetTarget(&batch1_p);
	rc=f.SetOption('METRIC', 'RSQRDEV', 'NORMALIZE', 'STD', 'TRIM', 'BOTH');
	rc=f.Run();
	if rc < 0 then
		stop;
	rc=of.Collect(f);
	if rc < 0 then
		stop;
%mend;


proc cas;

	* like proc contents or SQL on Dictionary libref ;
	table.columnInfo result=allvars / table={name="gnc_vol_t_tra_impute_all"};
	run;
	saveresult allvars casout="myallvars";
	* reading vars with a custom filter for interval vars;
	table.fetch result=selectedVars / table={name='myallvars', where="
					Column not like '_TR1_tot_%' and Column not in ('_TR1_period', '_TR1_residuo', 
					'_TR1_settimana', 'data', 'int_conf', '_NAME_', '_PGNC_')
					"}, fetchvars={{name='Column'}} to=&limit maxrows=&limit;
	run;
	
	* array creation for runtimecode and DST object ;
	varList=${};
	oth_varlist=${};
	do row over selectedVars.Fetch;
		singleVar=compress(row.Column);
/* 		varList[row._Index_]= "{name="||quote(singleVar) || "}"; */
		varList[row._Index_]= singleVar;
	end;
	print varList;

	cmpcode="%cmpcode()";
	timeData.runTimeCode result=run /
		table={name="gnc_vol_t_tra_impute_all"}		
		logControl={{keep=TRUE, sev="ERROR"}}
		require={{pkg="TSD"}}
		series=varList
		timeid="data"
		interval="day"
		objOut={
			{objRef="of", table={name='outtsddist' replace=TRUE}}
			}
		logout ={name="TSMODEL_LOG" replace=True}
		code=cmpcode;
	run;
quit;

And here are the elapsed times:

N# Vars        Secs
     10        3,60
     20       14,40
     40       57,60
     80      230,40
    160      921,60
    320    3.686,40

If I project the time needed for the 8k time series, I will need over 40 days of calculation.

 

My question is: has SAS implemented a faster algorithm such as MASS (https://www.cs.unm.edu/~mueen/FastestSimilaritySearch.html)?

If not, any suggestion is welcome!

4 REPLIES
imvash
SAS Employee

Are you trying to find the pairwise distance between these 8000 time series using DTW distance measure?

imvash
SAS Employee

This gives the pairwise distances between your time series, assuming the column names are column1, column2, ...

cas mycas;
libname mycas cas;

data mycas.testdata;
	input time column1 column2 column3;
	datalines;
	1   2	3	0
	2   4	5	9
	3   6	3	0
	4   7	3	1
	5   3	3	2
	6   8	6	3
	7   9	3	4
	8   3	8	0
	9   10	7	5
	10  11	9	3
	;
run;

%macro t();
proc tsmodel data=mycas.testdata
	outscalar=mycas.outscalar;
	id time interval=obs;
	var _numeric_;
	outscalars %do i=1 %to 3; %do j=&i.+1 %to 3; measure_&i._&j. %end;%end;;
	require tsa;
	submit;
		declare object TSA(tsa);
		%do i=1 %to 2;
			%do j=&i+1 %to 3;
				rc = TSA.SIMILARITY(column&i, column&j, 'SQRDEV', 'NONE' , , , , , measure_&i._&j.);
			%end;
		%end;
	endsubmit;
run;
%mend;
%t;
taiyeong
SAS Employee

Hi Andrea

 

  1. First, PROC TSMODEL uses a single machine unless BY-group processing is used.
  2. According to your program, you requested the following numbers of DTW distances:

N#vars    Requested DTW distances        Secs
    10        45 pairs                   3,60
    20       190 pairs                  14,40
    40       780 pairs                  57,60
    80     3,160 pairs                 230,40
   160    12,720 pairs                 921,60
   320    51,040 pairs               3.686,40

Let's look at the timing per pair:

3.60 sec / 45 = 0.080 sec per pair

14.40 sec / 190 = 0.076 sec per pair

57.60 sec / 780 = 0.074 sec per pair

3686.40 sec / 51,040 = 0.072 sec per pair

 

So I can say the DTW distance calculation in proc tsmodel is scalable.
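
The pair counts and per-pair timings can be re-derived with a few lines of arithmetic. This is a minimal Python sketch (not SAS code), assuming the timings from the original post written with a decimal point. The names `timings` and `n_pairs` are illustrative only, and the projection to exactly 8,000 series is a hypothetical extrapolation, not a measurement.

```python
# Timings from the original post: number of series -> elapsed seconds.
timings = {10: 3.60, 20: 14.40, 40: 57.60, 80: 230.40, 160: 921.60, 320: 3686.40}


def n_pairs(n):
    """Number of unordered pairs among n series: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Cost of one DTW distance at each problem size.
per_pair = {n: secs / n_pairs(n) for n, secs in timings.items()}
for n, cost in per_pair.items():
    print(f"{n:>4} series: {n_pairs(n):>6} pairs, {cost:.3f} s per pair")

# Hypothetical projection: exactly 8,000 series at the per-pair cost
# observed for 320 series.
projected_pairs = n_pairs(8000)
projected_days = projected_pairs * per_pair[320] / 86400
print(f"8000 series: {projected_pairs} pairs, about {projected_days:.1f} days")
```

With a roughly constant cost per pair, the total run time grows with the number of pairs, that is, quadratically in the number of series. Under the stated assumption the projection comes to about 27 days for exactly 8,000 series, and more series would push it higher.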

 

  3. You may use the XWINSIZE and YWINSIZE options (the expansion and contraction limits) in the TSD package to speed up the DTW distance calculation.

 

  4. MASS is used to do a similarity search given a series or a subsequence. If you want to achieve a goal similar to the MASS example, you have to set a target series (a query series) and set all your 8k series as inputs in the TSD package. If so, based on your timing table, I guess the run time will be about 560 seconds (0.07 x 8k). You may also look at the motif score (MTFSCORE) object in the MTF package. The MTFSCORE object executes the scoring action that, given a motif sequence, finds motif instances in new sequences.
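
To illustrate why limiting expansion and contraction speeds up DTW, here is a minimal Python sketch (not SAS code) of a generic DTW with a band constraint around the diagonal, often called a Sakoe-Chiba band. It is not the TSD package's implementation, and whether XWINSIZE and YWINSIZE correspond exactly to such a band is an assumption. The function name `dtw_distance`, the `window` parameter, and the sine-wave test data are hypothetical.

```python
import math


def dtw_distance(x, y, window=None):
    """Squared-deviation DTW between sequences x and y.

    window limits how far the warping path may stray from the diagonal;
    None means no limit. Returns (distance, cost-matrix cells evaluated).
    """
    n, m = len(x), len(y)
    if window is None:
        window = max(n, m)
    window = max(window, abs(n - m))  # the band must reach the final cell
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    cells = 0
    for i in range(1, n + 1):
        # Only visit columns inside the band around the diagonal.
        for j in range(max(1, i - window), min(m, i + window) + 1):
            d = (x[i - 1] - y[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cells += 1
    return cost[n][m], cells


# Two hypothetical series: the same sine wave, shifted by 5 time steps.
a = [math.sin(t / 10) for t in range(200)]
b = [math.sin((t + 5) / 10) for t in range(200)]

full, full_cells = dtw_distance(a, b)
banded, banded_cells = dtw_distance(a, b, window=10)
print(f"unconstrained: {full_cells} cells, banded: {banded_cells} cells")
```

The banded run evaluates far fewer cells of the cost matrix, which is where the time saving comes from. The trade-off is that the constrained distance can never be smaller than the unconstrained one, so a window that is too tight may overstate how different two series are.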
andrea_magatti
Obsidian | Level 7

Thanks Taiyeong,

I know that TSMODEL runs on a single machine, but while monitoring it I saw that it was using 2 CPUs at 100% while the rest were idle.

With your suggestion about YWINSIZE, I cut the time needed by almost 50%.

But I still can't understand why the DTW algorithm is not fully parallelized, since it just calculates the distance for every pairwise combination of the provided time series.
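
Since each pairwise distance depends only on its own two series, the workload can in principle be split into independent batches. The following is a minimal Python sketch (not SAS code) of that partitioning step only. The helper `split_pairs` and the `series_N` names are hypothetical, and how each batch would actually be handed to TSMODEL (for example through BY-group processing, as mentioned earlier in the thread) is an assumption that is not shown here.

```python
from itertools import combinations


def split_pairs(names, n_batches):
    """Deal all unordered pairs of series names into nearly equal batches."""
    batches = [[] for _ in range(n_batches)]
    for k, pair in enumerate(combinations(names, 2)):
        batches[k % n_batches].append(pair)  # round-robin assignment
    return batches


names = [f"series_{i}" for i in range(1, 11)]  # hypothetical column names
batches = split_pairs(names, 4)
for b, batch in enumerate(batches):
    print(f"batch {b}: {len(batch)} pairs")
```

Each batch could then be processed by a separate worker, with the partial distance tables concatenated at the end.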

Thanks again!

 
