harmonic
Obsidian | Level 7

Hello community,

 

I would like to cluster my 20 time series, each of which covers a three-year range.

 

For instance, I would like to compare the distributions with a numerical measure, not just graphically.

harmonic_0-1718803363834.png

 

From this view, I could perhaps put Distributions 1, 3 and 4 in Cluster A and Distribution 2 in Cluster B.

 

Is there a statistical method that is more accurate than this graphical comparison?


6 REPLIES
Ksharp
Super User

Very interesting question.

You could use the R-square of an OLS regression to check whether two time series are similar.

@Rick_SAS may have a better idea.

data have;
 set sashelp.stocks;
 keep stock close date;
run;

/*First check the time series by stocks*/
proc sgpanel data=have;
panelby stock/onepanel columns=1;
series x=date y=close;
run;

/*Calculate the RSquare of OLS for checking the difference between two series*/
proc sort data=have;by date stock;run;
proc transpose data=have out=have2(drop=_NAME_);
by date;
var close;
id stock;
run;

proc reg data=have2 noprint rsquare outest=outest;
IBM_Intel:       model IBM=Intel;
IBM_Microsoft:   model IBM=Microsoft;
Intel_Microsoft: model Intel=Microsoft;
quit;
proc print data=outest noobs;run;

Ksharp_0-1718851109206.png

 

You can see that Intel and Microsoft have the highest R-square, which means they are the most similar pair of series.

You also need to set a cutoff value to cluster these time series.

 

harmonic
Obsidian | Level 7

To cluster 20 time series, should I compare every pair of series? And how would I choose the number of clusters with this method, given that it only measures the similarity between two series?

Ksharp
Super User

If you have lots of variables to compare, you could also try the Pearson correlation coefficient from PROC CORR.

Since corr**2 = R-square for a regression with a single predictor, that is a lot easier to code.

data have;
 set sashelp.stocks;
 keep stock close date;
run;
proc sort data=have;by date stock;run;
proc transpose data=have out=have2(drop=_NAME_);
by date;
var close;
id stock;
run;

proc corr data=have2 outp=outp noprint;
var IBM Intel Microsoft;
run;

Ksharp_0-1718873712535.png

Intel and Microsoft are still the most similar pair.

As for choosing the number of clusters, that is hard. You need to set a cutoff value to form the clusters.

For example, if you decide that corr>0.6 means two series belong together, then Intel and Microsoft form one cluster and IBM forms another.
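The cutoff rule above can be sketched as a short DATA step. This is only an illustration: it assumes the OUTP data set from the PROC CORR step above, and the names PAIRS, SERIES1, SERIES2, CORR and SAME_CLUSTER, as well as the CUTOFF macro variable, are made up for this example.

```sas
/* Sketch: list each pair of series once and flag pairs above the cutoff. */
/* Assumes OUTP from the PROC CORR step above; all new names are illustrative. */
%let cutoff=0.6;

data pairs;
  set outp(where=(_TYPE_='CORR'));
  array s{*} IBM Intel Microsoft;
  length series1 series2 $32;
  series1=_NAME_;
  /* _N_ is the row index of the correlation matrix,            */
  /* so starting at _N_+1 visits the upper triangle only        */
  do j=_n_+1 to dim(s);
    series2=vname(s{j});
    corr=s{j};
    same_cluster=(corr>&cutoff);   /* 1 = above the cutoff, 0 = not */
    output;
  end;
  keep series1 series2 corr same_cluster;
run;

proc print data=pairs noobs; run;
```

With 20 series you would list all 20 variables in the ARRAY statement, which gives 190 pairs to flag.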

 

If you have more stocks, it becomes harder to code the cluster assignment.

For example:

a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86

If you set cutoff=0.8, then

a b 1
b c 1
d b 0
e d 0
e f 1

So, scanning them by eye:

a, b, c is one cluster

e, f is another cluster

d is another cluster

You could write code to do this automatically, but that is another story (it is a tree/graph search problem).
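One possible way to automate the grouping is to treat the series as nodes, treat each pair above the cutoff as a link, and call every connected component a cluster. The sketch below does that with the CONCOMP statement of PROC OPTNET, so it assumes a SAS/OR license. The data set names (PAIRS, NODES, LINKS, CLUSTERS) and variable names are illustrative, and the toy correlations are the ones from the example above.

```sas
/* Sketch: connected components of the "above cutoff" graph as clusters. */
/* Requires SAS/OR (PROC OPTNET); all data set and variable names are illustrative. */
data pairs;
  input series1 $ series2 $ corr;
  datalines;
a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86
;
run;

/* List every series once, so that isolated ones such as d are not lost */
data nodes;
  set pairs(keep=series1 rename=(series1=node))
      pairs(keep=series2 rename=(series2=node));
run;
proc sort data=nodes nodupkey; by node; run;

/* Keep only the pairs above the cutoff as links */
data links;
  set pairs;
  where corr>0.8;
run;

proc optnet data_nodes=nodes data_links=links out_nodes=clusters;
  data_nodes_var node=node;
  data_links_var from=series1 to=series2;
  concomp;   /* adds a component id for each node to OUT_NODES= */
run;

proc print data=clusters noobs; run;
```

Note that this chains similarities: a and c land in the same cluster because both are linked to b, even if a and c themselves were below the cutoff.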

 

harmonic
Obsidian | Level 7

I already used PROC TSMODEL to calculate the Ward distance and the triangular matrix, which I then passed to PROC CLUSTER and PROC TREE; this is the result.
I would like to know whether there is a different method, because with an R-square around 0.7 I get 8 clusters. Is this perhaps because the series cannot be separated into 3 clusters?

harmonic_0-1718874949047.png

 

Ksharp
Super User

The method I demonstrated is different from PROC CLUSTER or PROC FASTCLUS.

Once the cutoff value is set, the number of clusters is fixed.

For example:

a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86  

If you set cutoff=0.9, then

a b 1
b c 0
d b 0
e d 0
e f 0

a,b is one cluster

c is one cluster

d is one cluster

e is one cluster

f is one cluster

That gives five clusters, unlike the three clusters I showed above.

 

If you really want to decide the number of clusters, you could try Principal Component Analysis. See Rick_SAS's blog here:

https://blogs.sas.com/content/iml/2014/11/07/distribution-of-blood-types.html

But you still need to make the decision yourself. Choosing the number of clusters is an open statistical question.

 

 

P.S. If you want to use SAS/ETS to solve this problem, I suggest posting your question in the Forecasting forum:

https://communities.sas.com/t5/SAS-Forecasting-and-Econometrics/bd-p/forecasting_econometrics

Experts in time series analysis there could give you constructive advice.

 

Ksharp
Super User
BTW, if you want to "separate the series with 3 clusters", you could try k-means clustering with PROC FASTCLUS and the MAXCLUSTERS= option.

Another way is the nearest-neighbor style (nonparametric density) method in PROC MODECLUS with the R= option.
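As a rough sketch of the PROC FASTCLUS route, the series first have to be reshaped so that each series is one observation and each date is one variable. The example below uses SASHELP.STOCKS as a stand-in for the real data; the data set names LONG, LONG_STD, WIDE and CLUS and the prefix T are assumptions made for this illustration. Because the demo data has only three series, MAXCLUSTERS=2 is used here; with 20 series you would set MAXCLUSTERS=3.

```sas
/* Sketch: one row per series, one column per date, then k-means. */
/* SASHELP.STOCKS stands in for the real data; new names are illustrative. */
proc sort data=sashelp.stocks(keep=stock date close) out=long;
  by stock date;
run;

/* Standardize each series so clusters reflect shape rather than price level */
proc stdize data=long out=long_std method=std;
  by stock;
  var close;
run;

/* Reshape: each stock becomes one observation, each date one variable */
proc transpose data=long_std out=wide(drop=_name_) prefix=t;
  by stock;
  id date;
  var close;
run;

/* MAXCLUSTERS=2 only because the demo has three series */
proc fastclus data=wide maxclusters=2 out=clus noprint;
  var t:;
run;

proc print data=clus noobs;
  var stock cluster distance;
run;
```

After the per-series standardization, the squared Euclidean distance between two series equals 2(n-1)(1-r), where r is their Pearson correlation and n is the number of dates, so this clustering is driven by the same similarity measure as the correlation approach earlier in the thread. It assumes all series share the same dates with no missing values.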
