Bootsy1
Calcite | Level 5

I'm attaching a document that explains a promising new strategy for finding the number of clusters, along with commented sample SAS code.

I would be grateful for any comments.

Ulderico Santarelli

9 REPLIES
PaigeMiller
Diamond | Level 26

Many of us will not (or cannot) download Microsoft Office documents because they are security threats. Can you provide a link to a web page that has this information?

--
Paige Miller
PaigeMiller
Diamond | Level 26

Access denied.

 

And if they are Word (or other Microsoft Office) documents on Google Drive, I still won't download them.

--
Paige Miller
Bootsy1
Calcite | Level 5

Here are the links to the PDF documents.

 

https://drive.google.com/file/d/14gt5AjNdmyAwKz5Tul00RIrZy0O3n1-S/view?usp=sharing

https://drive.google.com/file/d/1ReFpJUGSAzz2xPVxbRjtYa-HxkKAPupz/view?usp=sharing

 

They should be virus-free. I'm using anti-malware software that seems very powerful.

 

Ulderico.

PaigeMiller
Diamond | Level 26

@Bootsy1 wrote:

Here are the links to the PDF documents.

 

https://drive.google.com/file/d/14gt5AjNdmyAwKz5Tul00RIrZy0O3n1-S/view?usp=sharing

https://drive.google.com/file/d/1ReFpJUGSAzz2xPVxbRjtYa-HxkKAPupz/view?usp=sharing

 


Access denied

 

They should be virus-free. I'm using anti-malware software that seems very powerful.

 

With regard to computer security (of my computer), why should I believe you? PDF is an acceptable document format, but I still can't access the files.

--
Paige Miller
Bootsy1
Calcite | Level 5

I'm going to upload the PDF documents to the Community's workspace.

You should be able to get them right away.

Ksharp
Super User

There is no single right answer to this question.

But you could check the CCC option of PROC CLUSTER,

or use Principal Component Analysis and check it by plotting the first two principal components.

@Rick_SAS wrote a blog post about this for race and blood relationship.

Bootsy1
Calcite | Level 5

I find that clustering has two main challenges:

1. One works on a sample, and this has major consequences. Two different samples share no points with probability almost 1, so you can never claim replicability if you follow any of the many existing algorithms that proceed sequentially through individual points. Only if you act on "central points", which are actually local means, can you claim replicability.

2. Sequential methods do reach a solution, of course. However, you never know how far that solution is from the optimal one.
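The second point can be seen on a toy example. Below is a minimal Python sketch (not the SAS code from the attachment): plain Lloyd-style k-means iterations on nine 1-D points, run from two different starting configurations. The helper name `lloyd_1d`, the data, and the starting centers are all made up for illustration.

```python
def lloyd_1d(points, centers, max_iter=100):
    """Plain Lloyd (k-means) iterations on 1-D data from the given starting centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assign every point to its nearest current center.
        groups = [[] for _ in centers]
        for x in points:
            nearest = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            groups[nearest].append(x)
        # Move each center to the mean of its group (keep it if the group is empty).
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    # Within-cluster sum of squared errors of the final solution.
    sse = sum(min((x - c) ** 2 for c in centers) for x in points)
    return centers, sse


# Illustrative data: three well-separated groups of three points each.
points = [0, 1, 2, 10, 11, 12, 20, 21, 22]

good_centers, good_sse = lloyd_1d(points, [1, 11, 21])  # one start per group
bad_centers, bad_sse = lloyd_1d(points, [0, 1, 2])      # all starts in one group

print("start [1, 11, 21]:", good_centers, "SSE =", good_sse)
print("start [0, 1, 2]:  ", bad_centers, "SSE =", bad_sse)
```

Both runs converge, but to different solutions with very different within-cluster error, and nothing inside a single run tells you which of the two you got.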

Going parallel has two advantages:

1. You find "central points", that is, points with so many surrounding points that they do not move during the iterations. Central points are local means, each with a surrounding subsample, aka cluster. Their standard error is therefore much smaller than the standard deviation that measures the variability of single points. If you follow the "any point is good" approach, where all points are equivalent, you are exposed to the full variability sigma; if you act on central points, you face only a fraction of the sample standard deviation (sigma/sqrt(n) for the mean of n independent points). That is why central points remain stable during the iterations.

2. You avoid the worst problem of sequential methods, where the solution varies with the starting point. Because they act on individual points, which have high variability, the first point decides which solution is reached.
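The variability argument in the first advantage can be checked with a small simulation. This is only a sketch under simple assumptions (independent Gaussian points, a "local mean" taken over a fixed number N of points); the names `SIGMA`, `N` and `REPS` are illustrative and do not come from the attachment.

```python
import random
import statistics

random.seed(0)   # fixed seed so the run is repeatable

SIGMA = 1.0      # standard deviation of single points (assumed)
N = 25           # number of points behind each local mean (assumed)
REPS = 2000      # number of simulated replications

# Variability of single points: one draw per replication.
single_points = [random.gauss(0.0, SIGMA) for _ in range(REPS)]

# Variability of local means: the mean of N draws per replication.
local_means = [
    statistics.mean([random.gauss(0.0, SIGMA) for _ in range(N)])
    for _ in range(REPS)
]

sd_points = statistics.stdev(single_points)
sd_means = statistics.stdev(local_means)

print("sd of single points:", round(sd_points, 3))
print("sd of local means:  ", round(sd_means, 3))
print("sigma / sqrt(N):    ", SIGMA / N ** 0.5)
```

The spread of the local means should come out close to sigma/sqrt(N), which is a fifth of the single-point spread here. Real clusters are not independent Gaussian draws, so this only illustrates the order of magnitude.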

 

In my opinion, the method should be parallel.

            

Bootsy1
Calcite | Level 5

Continuing my research on the number of clusters, I found that the gravitational approach can be better described as in the attachment. The gravitational force field accurately describes where the mass is laid out in the space, so the gravitational force can be viewed as an indicator of the local distribution density.
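To give a rough idea of gravity as a density indicator, here is a minimal Python sketch. It is not the method from the attachment: as a stand-in it uses a softened gravitational potential with unit masses, and the function name `softened_potential`, the softening length `EPS`, and the data are assumptions made for illustration.

```python
EPS = 0.05  # softening length, avoids division by zero (assumed value)


def softened_potential(points, i, eps=EPS):
    """Sum of 1 / sqrt(r^2 + eps^2) over all other points, with unit masses."""
    xi = points[i]
    return sum(
        1.0 / ((xi - xj) ** 2 + eps ** 2) ** 0.5
        for j, xj in enumerate(points)
        if j != i
    )


# Illustrative 1-D data: four points packed together and one isolated point.
points = [0.0, 0.1, 0.2, 0.3, 5.0]

potential = [softened_potential(points, i) for i in range(len(points))]

for x, p in zip(points, potential):
    print(f"x = {x:4.1f}   potential = {p:7.3f}")
```

Points inside the dense group get a much larger value than the isolated point, which is the sense in which a gravitational quantity can stand in for local density.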

Comments are very welcome.

Ulderico.   
