Hello community,
I would like to apply clustering to my 20 time series, which cover a three-year range.
For instance, I would like to compare the distributions, and not just graphically.
With this view, maybe I can put Distributions 1, 3 and 4 in Cluster A and Distribution 2 in Cluster B.
Is there a statistical method that is more accurate than this graphical example?
Very interesting question.
You could use the R-square of an OLS regression to check whether two time series are similar.
And @Rick_SAS could have a better idea.
data have;
set sashelp.stocks;
keep stock close date;
run;
/*First check the time series by stocks*/
proc sgpanel data=have;
panelby stock/onepanel columns=1;
series x=date y=close;
run;
/*Calculate the RSquare of OLS for checking the difference between two series*/
proc sort data=have;by date stock;run;
proc transpose data=have out=have2(drop=_NAME_);
by date;
var close;
id stock;
run;
proc reg data=have2 noprint rsquare outest=outest;
IBM_Intel: model IBM=Intel;
IBM_Microsoft: model IBM=Microsoft;
Intel_Microsoft: model Intel=Microsoft;
quit;
proc print data=outest noobs;run;
You can see that Intel and Microsoft have the largest R-square, which means they are the most similar series.
You also need to set a CUTOFF value to cluster these time series.
To cluster 20 time series, should I compare all pairs of series? And how could I choose the number of clusters with this method, given that it only calculates the similarity between two series?
If you have lots of variables to compare, you could also try the Pearson correlation coefficient from PROC CORR.
Since corr**2=RSquare for an OLS model with a single regressor, that is a lot easier to code.
data have;
set sashelp.stocks;
keep stock close date;
run;
proc sort data=have;by date stock;run;
proc transpose data=have out=have2(drop=_NAME_);
by date;
var close;
id stock;
run;
proc corr data=have2 outp=outp noprint;
var IBM Intel Microsoft;
run;
Intel and Microsoft are still the most similar pair.
As for how to choose the number of clusters, that is hard to deal with.
You need to set a cutoff value to cluster.
E.g.
Here, if you decide that corr>0.6 means two series are the same, then Intel and Microsoft form one cluster and IBM is another cluster.
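The cutoff step can be sketched in a DATA step. This is only a rough sketch, assuming the OUTP data set from the PROC CORR call above; the cutoff of 0.6, the data set name PAIRS, and the variable names SERIES1, SERIES2 and LINKED are made up for illustration.

```sas
%let cutoff = 0.6;   /* hypothetical cutoff, pick your own */

/* Turn the correlation matrix into one row per pair of series */
data pairs;
  set outp(where=(_TYPE_='CORR'));
  array s{*} IBM Intel Microsoft;
  length series1 series2 $32;
  series1 = _NAME_;
  do j = 1 to dim(s);
    series2 = vname(s{j});
    /* keep each unordered pair once and skip the diagonal */
    if upcase(series1) < upcase(series2) then do;
      corr   = s{j};
      linked = (corr > &cutoff);   /* 1 = treat the two series as similar */
      output;
    end;
  end;
  keep series1 series2 corr linked;
run;

proc print data=pairs noobs; run;
```

With 20 series you would list all 20 names in the ARRAY statement, which gives 190 pairs.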
If you have more stocks, it would be hard to code the step that gets the CLUSTER.
E.g.
a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86
If you set cutoff=0.8, then
a b 1
b c 1
d b 0
e d 0
e f 1
So scan them by eye:
a,b,c is one cluster
e,f is another cluster
d is another cluster
You could write code to do this automatically, but that is another story (it is a tree-searching problem).
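One possible way to automate the grouping, offered as a sketch rather than as the method described above: single-linkage clustering on the distance 1 - corr, cut at a height of 1 - cutoff, links two series whenever a chain of pairs above the cutoff connects them, which is the same rule as the eyeball scan. The code assumes the OUTP data set from PROC CORR above and SAS/STAT; the names DIST, TREE, CLUSTERS and SERIES and the cutoff of 0.6 are made up for illustration, and I have not run it on your data.

```sas
%let cutoff = 0.6;   /* hypothetical cutoff on the correlation */

/* Convert the correlation matrix into a distance matrix (1 - corr) */
data dist(type=distance);
  set outp(where=(_TYPE_='CORR'));
  array s{*} IBM Intel Microsoft;
  length series $32;
  series = _NAME_;
  do j = 1 to dim(s);
    s{j} = 1 - s{j};
  end;
  drop j _NAME_ _TYPE_;
run;

/* Single linkage; NONORM keeps the heights on the 1 - corr scale */
proc cluster data=dist method=single nonorm outtree=tree noprint;
  var IBM Intel Microsoft;
  id series;
run;

/* Cut the tree at 1 - cutoff to get the cluster membership */
proc tree data=tree level=%sysevalf(1 - &cutoff) out=clusters noprint;
run;

proc print data=clusters noobs; run;
```

If negative correlations should also count as similar, as they do with R-square, use 1 - corr**2 as the distance instead.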
I already used PROC TSMODEL to calculate the Ward distance and a triangular matrix, which I passed to PROC CLUSTER and PROC TREE; this is the result.
I would like to know if there is a different method, because for an R-square around 0.7 I get 8 clusters. Is this maybe because it is not possible to separate the series into 3 clusters?
The method I demonstrated is different from PROC CLUSTER or PROC FASTCLUS.
Once the cutoff value is set, the number of clusters is fixed.
E.g.
a b 0.98
b c 0.89
d b 0.2
e d 0.3
e f 0.86
If you set cutoff=0.9 then
a b 1
b c 0
d b 0
e d 0
e f 0
a,b is one cluster
c is one cluster
d is one cluster
e is one cluster
f is one cluster
That gives five clusters, unlike the three clusters I showed above.
If you really want to decide the number of clusters, you could try Principal Component Analysis.
See Rick_SAS's blog here:
https://blogs.sas.com/content/iml/2014/11/07/distribution-of-blood-types.html
But you still need to decide it yourself.
Deciding the number of clusters is an open statistical question.
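A minimal sketch of the PCA idea, assuming the wide HAVE2 data set built above (one column per series) and SAS/STAT. The eigenvalue table and the scree plot show how many components carry most of the variance, which is only a rough hint at how many distinct patterns the series share, not a rule for the number of clusters. The output data set name SCORES is made up for illustration.

```sas
ods graphics on;

/* PCA on the correlation matrix of the series (one column per stock) */
proc princomp data=have2 out=scores plots=scree;
  var IBM Intel Microsoft;
run;
```

With 20 series, list all 20 columns in the VAR statement and look for the point where the eigenvalues level off.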
P.S. If you want to use SAS/ETS to solve this problem, I suggest you post your question at the Forecasting forum:
https://communities.sas.com/t5/SAS-Forecasting-and-Econometrics/bd-p/forecasting_econometrics
Experts in time series analysis there could give you constructive advice.