I'm attaching a document that explains a promising new strategy for finding the number of clusters, along with commented sample SAS code.
I would be grateful for any comments.
Ulderico Santarelli
Many of us will not (or cannot) download Microsoft Office documents because they are security threats. Can you provide a link to a web page that has this information?
I uploaded both documents to Google Drive. Here they are:
https://drive.google.com/file/d/19maSnXSdXtql61tshGGLnvwgvF6W8c04/view?usp=sharing
https://drive.google.com/file/d/1lJd9f96w_J41BmvW8CQWNTZ5gPph4Q9P/view?usp=sharing
Thank you for your interest.
Ulderico.
Access denied.
And if they are Word (or other Microsoft Office) documents on Google Drive, I still won't download them.
here are the links for pdf documents.
https://drive.google.com/file/d/14gt5AjNdmyAwKz5Tul00RIrZy0O3n1-S/view?usp=sharing
https://drive.google.com/file/d/1ReFpJUGSAzz2xPVxbRjtYa-HxkKAPupz/view?usp=sharing
they should be virus free. I'm using Malware software that seems very powerful.
Ulderico.
@Bootsy1 wrote:
here are the links for pdf documents.
https://drive.google.com/file/d/14gt5AjNdmyAwKz5Tul00RIrZy0O3n1-S/view?usp=sharing
https://drive.google.com/file/d/1ReFpJUGSAzz2xPVxbRjtYa-HxkKAPupz/view?usp=sharing
Access denied
they should be virus free. I'm using Malware software that seems very powerful.
With regards to computer security (of my computer), why should I believe you? However, PDF is an acceptable form of document, but I still can't access it.
I'm going to upload the PDF documents to the Community's workspace.
You should be able to get them right away.
There is no single right answer to this question.
But you could check the CCC option of PROC CLUSTER,
or use Principal Component Analysis and plot the first two principal components.
@Rick_SAS wrote a blog post about it for race and blood relationship.
I find that clustering poses two main challenges:
1. One works on a sample, and this has major consequences. Two different samples share no points with probability close to 1, so you can never claim replicability if you follow any of the many existing algorithms that proceed sequentially. You can claim replicability only if you work on "central points", which are actually local means.
2. Sequential methods do reach a solution, of course. However, you never know how far that solution is from the optimal one.
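The second challenge can be seen on a toy example. The sketch below assumes that a "sequential" method behaves like plain Lloyd-style k-means, which may not be exactly what is meant above; the function name `kmeans_1d` and the nine data points are made up for illustration. The same data and the same number of clusters are used twice, and only the starting centers differ.

```python
def kmeans_1d(points, start_centers, max_iter=100):
    """Plain Lloyd-style k-means on 1-D data, started from the given centers."""
    centers = list(start_centers)

    def nearest(p):
        # Index of the closest current center (ties go to the lowest index).
        return min(range(len(centers)), key=lambda j: abs(p - centers[j]))

    for _ in range(max_iter):
        labels = [nearest(p) for p in points]
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous center.
            new_centers.append(sum(members) / len(members) if members else old)
        if new_centers == centers:
            break  # no center moved: converged
        centers = new_centers

    labels = [nearest(p) for p in points]
    # Within-cluster sum of squared errors of the final solution.
    sse = sum((p - centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, sse


# Illustrative data: three well-separated groups of three points each.
points = [0, 1, 2, 10, 11, 12, 20, 21, 22]

centers_a, sse_a = kmeans_1d(points, [0, 10, 20])  # one start per group
centers_b, sse_b = kmeans_1d(points, [0, 1, 2])    # all starts in one group

print("start A:", centers_a, "SSE =", sse_a)
print("start B:", centers_b, "SSE =", sse_b)
```

Both runs converge, but the second one stops at a solution with a much larger sum of squared errors. Nothing in the run itself signals that a better solution exists.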
Going parallel has two advantages:
1. You find "central points", that is, points with many surrounding points, so that they do not move during iterations. Central points are local means, each with a surrounding subsample (its cluster). Their standard error is therefore much smaller than the standard deviation, which measures the variability of single points. If you follow the "any point is good" approach, where all points are equivalent, you are exposed to the full variability sigma; if you work on central points, you face a much smaller variability, a fraction of the sample standard deviation. That is why central points remain stable during iterations.
2. You avoid the worst problem of sequential methods, where the solution varies with the starting point. Because those methods act on single points with high variability, the first point decides what the solution will be.
In my opinion, the method should be parallel.
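The claim about standard error can be checked with a small simulation. This is only a sketch under assumed values: normal data with sigma = 1, local means built from 25 points each, and 2000 replications. None of these numbers come from the attached documents. For independent points, the standard error of a mean of n points is sigma / sqrt(n), so the ratio printed at the end should be close to 1/5.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

SIGMA = 1.0     # assumed standard deviation of single points
N_LOCAL = 25    # assumed number of points behind each local mean
N_REPS = 2000   # number of simulated replications

# Variability of single points across replications.
single_points = [random.gauss(0.0, SIGMA) for _ in range(N_REPS)]

# Variability of local means, each computed from N_LOCAL fresh points.
local_means = [
    statistics.fmean([random.gauss(0.0, SIGMA) for _ in range(N_LOCAL)])
    for _ in range(N_REPS)
]

sd_points = statistics.stdev(single_points)
sd_means = statistics.stdev(local_means)
ratio = sd_means / sd_points

print(f"sd of single points: {sd_points:.3f}")
print(f"sd of local means:   {sd_means:.3f}")
print(f"ratio: {ratio:.3f} (theory: {1 / N_LOCAL ** 0.5:.3f})")
```

A local mean backed by 25 points therefore moves about five times less from sample to sample than any single point does, which is the stability argument made in point 1.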
Continuing my research on the number of clusters, I found that the gravitational approach is better described as in the attachment. The gravitational force field accurately describes where the mass is laid out in space, so the gravitational force can be viewed as an indicator of the local distribution density.
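To make the link between gravity and density concrete, here is a minimal sketch. It is not the method from the attachment, which is not reproduced in this thread. It assumes every sample point carries unit mass and evaluates a softened gravitational potential (a sum of 1/distance terms) at a probe location; the function name `softened_potential`, the softening length `eps`, and the six points are all hypothetical.

```python
import math


def softened_potential(probe, masses, eps=0.1):
    """Sum of 1 / sqrt(d^2 + eps^2) over unit masses; larger means more mass nearby."""
    px, py = probe
    return sum(
        1.0 / math.sqrt((px - x) ** 2 + (py - y) ** 2 + eps ** 2)
        for x, y in masses
    )


# Illustrative sample: five points packed near the origin, one isolated point.
sample = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0), (0.0, -0.1),
    (5.0, 5.0),
]

dense_probe = (0.05, 0.05)   # inside the packed group
sparse_probe = (3.0, 3.0)    # in the empty region between the groups

potential_dense = softened_potential(dense_probe, sample)
potential_sparse = softened_potential(sparse_probe, sample)

print(f"potential inside the dense group: {potential_dense:.2f}")
print(f"potential in the empty region:    {potential_sparse:.2f}")
```

The potential is used here rather than the net force because, by symmetry, the net force near the middle of a cluster is close to zero even though the density there is high. The softening term only prevents division by zero when a probe coincides with a sample point.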
Comments are very welcome.
Ulderico.