Clustering different time series

☑ This topic is **solved**.
Posted 06-19-2024 09:24 AM
(488 views)

Hello community,

I would like to cluster my 20 time series, each covering a three-year range.

For instance, I would like to compare their distributions, and not just graphically.

With this view, maybe I can put Distributions 1, 3 and 4 in Cluster A and Distribution 2 in Cluster B.

Is there a statistical method more accurate than this graphical example?

1 ACCEPTED SOLUTION

If you have lots of variables to compare, you could also try the Pearson correlation coefficient from PROC CORR.

Since corr**2 = RSquare for a regression with a single regressor, that is a lot easier to code.

```
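/* Keep only the columns needed */
data have;
  set sashelp.stocks;
  keep stock close date;
run;

/* One row per date, one column per stock */
proc sort data=have;
  by date stock;
run;
proc transpose data=have out=have2(drop=_NAME_);
  by date;
  var close;
  id stock;
run;

/* Pairwise Pearson correlations, written to the OUTP data set */
proc corr data=have2 outp=outp noprint;
  var IBM Intel Microsoft;
run;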
```

Again, Intel and Microsoft are the most similar.

Choosing the number of clusters is the hard part.

You need to set a cutoff value to form clusters.

For example, if you decide that corr>0.6 means two series are the same, then Intel and Microsoft are one cluster and IBM is another cluster.
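The cutoff step could be sketched like this. It rebuilds the correlation matrix with the same steps as above, then reshapes it into one row per pair and flags the pairs that pass the cutoff. The names `pairs`, `series1`, `series2`, `same_cluster` and the macro variable `cutoff` are only illustrative, and 0.6 is just the example threshold from this post, not a recommended value.

```sas
%let cutoff = 0.6;  /* illustrative threshold, choose your own */

/* Same preparation as above: one column per stock, then correlations */
proc sort data=sashelp.stocks out=have(keep=stock close date);
  by date stock;
run;
proc transpose data=have out=have2(drop=_NAME_);
  by date;
  var close;
  id stock;
run;
proc corr data=have2 outp=outp noprint;
  var IBM Intel Microsoft;
run;

/* Reshape the square matrix into one row per unordered pair */
data pairs(keep=series1 series2 corr same_cluster);
  set outp(where=(_TYPE_ = 'CORR'));
  length series1 series2 $32;
  array v{*} IBM Intel Microsoft;
  series1 = _NAME_;
  do j = 1 to dim(v);
    series2 = vname(v{j});
    corr = v{j};
    same_cluster = (corr > &cutoff);  /* 1 = treat the two series as one group */
    /* skip the diagonal and the mirrored half of the matrix */
    if upcase(series1) < upcase(series2) then output;
  end;
run;

proc print data=pairs noobs;
run;
```

With 20 series you would list all 20 variables in the VAR statement and in the array, which gives 190 pairs to inspect.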

If you have more stocks, it becomes hard to code the cluster assignment.

For example:

a b 0.98

b c 0.89

d b 0.2

e d 0.3

e f 0.86

If you set cutoff=0.8, then

a b 1

b c 1

d b 0

e d 0

e f 1

Scanning them by eye:

a, b, c is one cluster

e, f is another cluster

d is another cluster

You could write code to do this automatically, but that is another story (it is a tree-search problem).
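One possible way to automate the grouping is sketched below. It is not the tree search mentioned above but a simple label-propagation pass: every series starts as its own cluster, and each pair that passes the cutoff pulls both members to the alphabetically smaller label until nothing changes. The data set names `pairs` and `clusters`, the variable names, and the macro variable `cutoff` are assumptions made for this illustration.

```sas
%let cutoff = 0.8;  /* illustrative threshold */

data pairs;
  input series1 $ series2 $ corr;
  datalines;
a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86
;
run;

data _null_;
  length node cluster c1 c2 $8;
  call missing(node, cluster, c1, c2);

  /* Hash table mapping each series (node) to its current cluster label */
  declare hash lab(ordered: 'a');
  lab.defineKey('node');
  lab.defineData('node', 'cluster');
  lab.defineDone();

  /* Pass 1: every series starts as its own cluster */
  do i = 1 to n;
    set pairs point=i nobs=n;
    node = series1; cluster = series1;
    if lab.check() ne 0 then rc = lab.add();
    node = series2; cluster = series2;
    if lab.check() ne 0 then rc = lab.add();
  end;

  /* Pass 2: merge labels across pairs above the cutoff until stable */
  do until (changed = 0);
    changed = 0;
    do i = 1 to n;
      set pairs point=i nobs=n;
      if corr >= &cutoff then do;
        node = series1; rc = lab.find(); c1 = cluster;
        node = series2; rc = lab.find(); c2 = cluster;
        if c1 ne c2 then do;
          /* both members take the alphabetically smaller label */
          if c1 < c2 then cluster = c1;
          else cluster = c2;
          node = series1; rc = lab.replace();
          node = series2; rc = lab.replace();
          changed = 1;
        end;
      end;
    end;
  end;

  /* Write one row per series with its final cluster label */
  rc = lab.output(dataset: 'clusters');
  stop;
run;

proc print data=clusters noobs;
run;
```

With cutoff=0.8 this labels a, b, c as cluster a, e and f as cluster e, and leaves d on its own, which matches the grouping found by eye. Note that series are chained: a and c end up together because each is linked to b, even if a and c themselves were weakly correlated.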

6 REPLIES

Very interesting question.

You could use the R-square of an OLS regression to check whether two time series are similar.

And @Rick_SAS might have a better idea.

```
data have;
set sashelp.stocks;
keep stock close date;
run;
/*First check the time series by stocks*/
proc sgpanel data=have;
panelby stock/onepanel columns=1;
series x=date y=close;
run;
/*Calculate the RSquare of OLS for checking the difference between two series*/
proc sort data=have;by date stock;run;
proc transpose data=have out=have2(drop=_NAME_);
by date;
var close;
id stock;
run;
proc reg data=have2 noprint rsquare outest=outest;
IBM_Intel: model IBM=Intel;
IBM_Microsoft: model IBM=Microsoft;
Intel_Microsoft: model Intel=Microsoft;
quit;
proc print data=outest noobs;run;
```

You can see that Intel and Microsoft have the highest RSquare, which means they are the most similar series.

You also need to set a CUTOFF value to cluster these time series.

I would like to know whether there is a different method, because with an R-squared cutoff around 0.7 I get 8 clusters. Is this perhaps because the series cannot be separated into 3 clusters?

The method I demonstrated is different from PROC CLUSTER or PROC FASTCLUS.

Once the cutoff value is set, the number of clusters is fixed.

For example:

```
a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86
```

If you set cutoff=0.9, then

```
a b 1
b c 0
d b 0
e d 0
e f 0
```

a,b is one cluster

c is one cluster

d is one cluster

e is one cluster

f is one cluster

That is five clusters, unlike the three clusters I showed above.

If you really want to decide the number of clusters, you could try Principal Component Analysis. See Rick_SAS's blog here:

https://blogs.sas.com/content/iml/2014/11/07/distribution-of-blood-types.html

But you still need to decide it yourself.

Deciding the number of clusters is an **unsolved** statistical question.

P.S. If you want to use SAS/ETS to solve this problem, I suggest posting your question in the Forecasting forum:

https://communities.sas.com/t5/SAS-Forecasting-and-Econometrics/bd-p/forecasting_econometrics

Experts in time series analysis there could give you constructive advice.

BTW, if you want to "separate the series with 3 clusters", you could try k-means clustering with PROC FASTCLUS and its maxclusters= option.
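A minimal sketch of the PROC FASTCLUS route, using sashelp.stocks as a stand-in for your data. It assumes every series is observed on the same dates with no missing values, and it standardizes each series first so that clusters reflect the shape of the series rather than the price level (that scaling is my choice here, not something required by the procedure). The data set names are illustrative, and maxclusters=2 is used only because this sample has three series; with 20 series you would set maxclusters=3.

```sas
/* Scale each series to mean 0 and standard deviation 1 */
proc sort data=sashelp.stocks out=long(keep=stock date close);
  by stock date;
run;
proc standard data=long mean=0 std=1 out=long_std;
  by stock;
  var close;
run;

/* One row per series, one column per time point (t1, t2, ...) */
proc transpose data=long_std out=wide(drop=_NAME_) prefix=t;
  by stock;
  var close;
run;

/* k-means with a fixed maximum number of clusters */
proc fastclus data=wide maxclusters=2 out=clus noprint;
  var t:;
run;

/* CLUSTER and DISTANCE are added to the OUT= data set by PROC FASTCLUS */
proc print data=clus noobs;
  var stock cluster distance;
run;
```

Unlike the cutoff approach, the number of clusters is an input here, so you always get at most the number you ask for, whether or not the series separate well.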

Another way is nonparametric (density-based) clustering with PROC MODECLUS and its r= option (or k= for a nearest-neighbor version).
