Solved: Re: Clustering different Time series

harmonic · Posted 06-19-2024 09:24 AM

Hello community,

I would like to cluster my 20 time series, each covering a three-year range.

For instance, I would like to compare the distributions, and not just graphically.

With this view, maybe I can put Distributions 1, 3 and 4 in Cluster A and Distribution 2 in Cluster B.

Is there a statistical method that is more accurate than this graphical comparison?

Ksharp · Posted 06-19-2024 10:40 PM

Very interesting question.

You could use the R-square of an OLS regression to check whether two time series are similar.

@Rick_SAS might have a better idea.

data have;
 set sashelp.stocks;
 keep stock close date;
run;

/*First check the time series by stocks*/
proc sgpanel data=have;
panelby stock/onepanel columns=1;
series x=date y=close;
run;

/*Calculate the RSquare of OLS for checking the difference between two series*/
proc sort data=have;by date stock;run;
proc transpose data=have out=have2(drop=_NAME_);
by date;
var close;
id stock;
run;

proc reg data=have2 noprint rsquare outest=outest;
IBM_Intel:       model IBM=Intel;
IBM_Microsoft:   model IBM=Microsoft;
Intel_Microsoft: model Intel=Microsoft;
quit;
proc print data=outest noobs;run;

You can see that Intel and Microsoft have the largest R-square, which means they are the most similar series.

You also need to set a CUTOFF value to cluster these time series.

harmonic · Posted 06-20-2024 04:29 AM

To cluster 20 time series, should I compare every pair of series? And how would I then choose the number of clusters with this method, since it only measures the similarity between two series?

Ksharp · Posted 06-20-2024 05:03 AM

If you have lots of variables to compare, you could also try the Pearson correlation coefficient from PROC CORR.

Since corr**2 = RSquare for a regression with a single predictor, that is a lot easier to code.

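/*Keep only the variables that are needed*/
data have;
 set sashelp.stocks;
 keep stock close date;
run;

/*Reshape to one row per date and one column per stock*/
proc sort data=have;by date stock;run;
proc transpose data=have out=have2(drop=_NAME_);
by date;
var close;
id stock;
run;

/*Pairwise Pearson correlations, saved in the OUTP data set*/
proc corr data=have2 outp=outp noprint;
var IBM Intel Microsoft;
run;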
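
The OUTP= data set stores the correlations as a square matrix, which is awkward to rank once there are 20 series. The following is only a sketch, assuming the `outp` data set created above; the names `pairs`, `series1`, `series2`, `corr` and `rsquare` are made up for illustration. It reshapes the matrix into one row per pair and squares the correlation to get the R-square:

```sas
/* Reshape the correlation matrix into one row per pair of series */
data pairs;
  set outp(where=(_TYPE_='CORR'));   /* keep only the correlation rows */
  length series1 series2 $32;
  array s{*} IBM Intel Microsoft;    /* same list as the VAR statement */
  series1 = _NAME_;
  do j = 1 to dim(s);
    series2 = vname(s{j});
    /* keep each unordered pair once and skip the diagonal */
    if upcase(series1) < upcase(series2) then do;
      corr = s{j};
      rsquare = corr**2;             /* R-square of the one-predictor regression */
      output;
    end;
  end;
  keep series1 series2 corr rsquare;
run;

/* Most similar pairs first */
proc sort data=pairs; by descending rsquare; run;
proc print data=pairs noobs; run;
```

With 20 series you would list all 20 variables in the ARRAY statement, which gives 190 pairs to compare against the cutoff.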

Intel and Microsoft are still the most similar pair.

As for how to choose the number of clusters, that is hard to deal with.

You need to set a cutoff value to form the clusters.

E.g.

Here, if you decide that corr>0.6 means two series are the same, then Intel and Microsoft form one cluster and IBM is another cluster.

If you have more stocks, it gets harder to code the step that produces the clusters.

E.g.

a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86

If you set cutoff=0.8, then

a b 1
b c 1
d b 0
e d 0
e f 1

So, scanning them by eye:

a, b, c is one cluster

e, f is another cluster

d is another cluster

You could write code to do this automatically, but that is another story (it is a tree-search problem).
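
As a rough illustration of that automatic step, here is a sketch that treats every pair at or above the cutoff as a link and keeps merging linked series until nothing changes (in graph terms, it finds connected components). It is not the only way to do it. The data set names `pairs` and `clusters`, the variables `node1`, `node2` and `cluster`, and the macro variable `cutoff` are all assumptions for this example:

```sas
/* The five example pairs from above */
data pairs;
  input node1 $ node2 $ corr;
  datalines;
a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86
;
run;

%let cutoff = 0.8;   /* assumed cutoff, as in the example */

data _null_;
  length node $8 cluster c1 c2 8;
  /* hash table: series name -> current cluster label */
  declare hash h(ordered:'a');
  h.defineKey('node');
  h.defineData('node', 'cluster');
  h.defineDone();

  /* Pass 1: every series starts in its own cluster */
  do until (eof);
    set pairs end=eof;
    node = node1;
    if h.check() ne 0 then do; nextid + 1; cluster = nextid; rc = h.add(); end;
    node = node2;
    if h.check() ne 0 then do; nextid + 1; cluster = nextid; rc = h.add(); end;
  end;

  /* Repeated sweeps: both ends of a linked pair take the smaller label,
     until a full sweep changes nothing */
  do until (changed = 0);
    changed = 0;
    do i = 1 to n;
      set pairs point=i nobs=n;
      if corr >= &cutoff then do;
        node = node1; rc = h.find(); c1 = cluster;
        node = node2; rc = h.find(); c2 = cluster;
        if c1 ne c2 then do;
          cluster = min(c1, c2);
          node = node1; rc = h.replace();
          node = node2; rc = h.replace();
          changed = 1;
        end;
      end;
    end;
  end;

  rc = h.output(dataset: 'clusters');
  stop;
run;

proc sort data=clusters; by cluster node; run;
proc print data=clusters noobs; run;
```

For the example data this should reproduce the groups found by eye: a, b, c together, e and f together, and d alone. The cluster numbers themselves are arbitrary labels. Changing `cutoff` and rerunning shows how the number of clusters depends on it.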

harmonic · Posted 06-20-2024 05:17 AM

I already used PROC TSMODEL to calculate the Ward distance and a triangular matrix, which I passed to PROC CLUSTER and PROC TREE; this is the result.
I would like to know whether there is a different method, because for an R-squared around 0.7 I get 8 clusters. Is this maybe because the series cannot be separated into 3 clusters?

Ksharp · Posted 06-20-2024 05:41 AM

The method I demonstrated is different from PROC CLUSTER or PROC FASTCLUS.

Once the cutoff value is set, the number of clusters is fixed.

E.g.

a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86

If you set cutoff=0.9 then

a b 1
b c 0
d b 0
e d 0
e f 0

a,b is one cluster

c is one cluster

d is one cluster

e is one cluster

f is one cluster

That gives five clusters, unlike the three clusters I showed above.

If you really want to decide the number of clusters, you could try Principal Component Analysis. See Rick_SAS's blog here:

https://blogs.sas.com/content/iml/2014/11/07/distribution-of-blood-types.html

But you still need to decide it yourself.

Deciding the number of clusters is a well-known, unsolved statistical question.

P.S. If you want to use SAS/ETS to solve this problem, I suggest you post your question in the Forecasting forum:

https://communities.sas.com/t5/SAS-Forecasting-and-Econometrics/bd-p/forecasting_econometrics

Experts in time series analysis there could give you constructive advice.

Ksharp · Posted 06-20-2024 05:47 AM

BTW, if you want to "separate the series with 3 clusters", you could try k-means clustering with PROC FASTCLUS and its MAXCLUSTERS= option.

Another way is a nearest-neighbor style method with PROC MODECLUS and its R= option.
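
A minimal sketch of the PROC FASTCLUS route, using SASHELP.STOCKS as a stand-in for the 20 series. It assumes every series has the same number of observations on the same dates, and it standardizes each series first so that the clusters reflect shape rather than price level; that preprocessing is my assumption, not a requirement. The data set names `long`, `longz`, `wide` and `kmeans` are made up for the example:

```sas
/* Sort so that each series is in date order */
proc sort data=sashelp.stocks(keep=stock date close) out=long;
  by stock date;
run;

/* Standardize each series to mean 0 and standard deviation 1 */
proc standard data=long mean=0 std=1 out=longz;
  by stock;
  var close;
run;

/* One row per series, one column per time point (t1, t2, ...) */
proc transpose data=longz out=wide(drop=_NAME_) prefix=t;
  by stock;
  var close;
run;

/* Fixed number of clusters: 2 here because the demo has only 3 series */
proc fastclus data=wide maxclusters=2 out=kmeans noprint;
  var t:;
  id stock;
run;

proc print data=kmeans noobs;
  var stock cluster distance;
run;
```

With 20 series you would set MAXCLUSTERS=3 to force exactly three groups. Whether three groups describe the data well is still something you have to judge yourself.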
