## What is the sas code for testing significance in the kth nearest neighbor clustering method?

Solved
Occasional Contributor
Posts: 8

I would like to know the SAS code for testing significance in kth-nearest-neighbor clustering.

Accepted Solutions
Solution
‎03-26-2016 08:28 AM
Super User
Posts: 10,778

## Re: What is the sas code for testing significance in the kth nearest neighbor clustering method?

Some additional information taken from the SAS/STAT documentation.

```
One useful descriptive approach to the number-of-clusters problem is provided by Wong and Schaack (1982)
based on a kth-nearest-neighbor density estimate. The kth-nearest-neighbor clustering method developed by
Wong and Lane (1983) is applied with varying values of k. Each value of k yields an estimate of the number
of modal clusters. If the estimated number of modal clusters is constant for a wide range of k values, there
is strong evidence of at least that many modes in the population. A plot of the estimated number of modes
against k can be highly informative. Attempts to derive a formal hypothesis test from this diagnostic plot
have met with difficulties, but a simulation approach similar to Silverman (1986) does seem to work (Girman
1994). The simulation, of course, requires considerable computer time.
```
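As a rough sketch of the "vary k and watch the number of modal clusters" idea in the passage above, something like the following could be tried. It assumes a hypothetical data set `survey` with Likert items `q1-q10` (neither name comes from this thread), and it swaps in PROC MODECLUS, which accepts a list of `K=` values, instead of rerunning the Wong and Lane method (`PROC CLUSTER METHOD=DENSITY K=`) once per k. The clustering algorithm is therefore not identical to the one in the quote, so treat the output as a descriptive diagnostic only.

```sas
/* Sketch only: SURVEY and q1-q10 are hypothetical names.                  */
/* PROC MODECLUS runs one clustering per K= value; the summary table it    */
/* prints lists the number of clusters found for each k.                   */
proc modeclus data=survey method=1 k=5 10 15 20 25 30 short out=modeout;
   var q1-q10;
run;

/* For comparison, a single run of kth-nearest-neighbor density linkage    */
/* (the method named in the quote) at one assumed value, k=10.             */
proc cluster data=survey method=density k=10 outtree=dtree;
   var q1-q10;
run;
```

If the cluster count stays the same over a wide range of k, that matches the evidence the quoted passage describes. PROC MODECLUS also documents a `TEST` option for approximate significance tests; check its documentation for which density options it requires before relying on it.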

You could also check the CCC statistic, keeping its limitations in mind.

```
Sarle (1983) used extensive simulations to develop the cubic clustering criterion (CCC), which can be used
for crude hypothesis testing and estimating the number of population clusters. The CCC is based on the
assumption that a uniform distribution on a hyperrectangle will be divided into clusters shaped roughly like
hypercubes. In large samples that can be divided into the appropriate number of hypercubes, this assumption
gives very accurate results. In other cases the approximation is generally conservative. For details about the
interpretation of the CCC, consult Sarle (1983).
```
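A minimal sketch of how the CCC could be requested, again assuming a hypothetical data set `survey` with items `q1-q10`. Ward's method is my choice for illustration, not something the documentation excerpt prescribes.

```sas
/* Sketch only: SURVEY and q1-q10 are hypothetical names.                  */
/* CCC adds the cubic clustering criterion to the cluster history;         */
/* PSEUDO adds the pseudo F and pseudo t-squared statistics.               */
/* STD standardizes the variables; PRINT=15 limits the history to the      */
/* last 15 merges, i.e. solutions with 15 clusters down to 1.              */
proc cluster data=survey method=ward std ccc pseudo print=15 outtree=tree;
   var q1-q10;
run;
```

The usual reading is to look for a number of clusters where the CCC peaks, but as the quote says this is only a crude test, so it is best compared against the other criteria rather than used alone.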

All Replies
Super User
Posts: 10,778

## Re: What is the sas code for testing significance in the kth nearest neighbor clustering method?

You can use ANOVA (PROC GLM) to check whether these clusters are significant.
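A sketch of what that could look like, assuming a hypothetical data set `survey` with items `q1-q10` and a 3-cluster k-means solution from PROC FASTCLUS (the data set and variable names are illustrative only). One caveat worth stating: because the clusters were formed to separate these same variables, the F tests tend to look significant by construction and are better read as descriptive than as a formal test.

```sas
/* Sketch only: SURVEY and q1-q10 are hypothetical names.                  */
/* Step 1: k-means with 3 clusters; OUT= adds the variable CLUSTER.        */
proc fastclus data=survey maxclusters=3 out=clusout noprint;
   var q1-q10;
run;

/* Step 2: one-way ANOVA of each item on cluster membership.               */
proc glm data=clusout;
   class cluster;
   model q1-q10 = cluster;
run;
quit;
```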

Frequent Contributor
Posts: 90

## Re: What is the sas code for testing significance in the kth nearest neighbor clustering method?

We need much more background.  What is the study design, model, and hypothesis?

-Kevin

Occasional Contributor
Posts: 8

## Re: What is the sas code for testing significance in the kth nearest neighbor clustering method?

Survey data were collected from respondents about their shopping orientation using Likert-scaled questions. Using the k-means method, I am getting 3 clusters of shoppers. However, I want to cross-check my results using the kth-nearest-neighbor clustering method, as it is an unbiased method. My interest is in the kth-nearest-neighbor clustering method and how a significance test is conducted with it to determine the true number of clusters.

Super User
Posts: 10,778

## Re: What is the sas code for testing significance in the kth nearest neighbor clustering method?

That is a very hard question. As far as I know, there is no well-known or uniform criterion for the number of clusters you should get.

But I think you can perform a principal component analysis to get a rough idea of how many clusters there will be.
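Something along these lines might do as a starting point. It assumes a hypothetical data set `survey` with items `q1-q10`, names that are not from the original post.

```sas
/* Sketch only: SURVEY and q1-q10 are hypothetical names.                  */
/* N=2 keeps the first two components; OUT= adds the scores Prin1, Prin2.  */
proc princomp data=survey out=pcscores n=2;
   var q1-q10;
run;

/* Scatter plot of the first two component scores; visible clumps of       */
/* points hint at the number of clusters.                                  */
proc sgplot data=pcscores;
   scatter x=Prin1 y=Prin2;
run;
```

This is only a visual check. With many Likert items, the first two components may hide structure that lives in later components.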

Occasional Contributor
Posts: 8

## Re: What is the sas code for testing significance in the kth nearest neighbor clustering method?

I will try principal components analysis. I think plotting the objects along the first two canonical variates will give me some idea about the possible number of clusters. I am new to SAS, so please let me know the proper SAS code for this analysis. Thank you in advance.
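For reference, a canonical-variate plot of the kind described here could be sketched as follows. It assumes a hypothetical data set `survey` with items `q1-q10` and uses the 3-cluster k-means solution as the grouping variable; all names are illustrative.

```sas
/* Sketch only: SURVEY and q1-q10 are hypothetical names.                  */
/* Step 1: k-means with 3 clusters; OUT= adds the variable CLUSTER.        */
proc fastclus data=survey maxclusters=3 out=clusout noprint;
   var q1-q10;
run;

/* Step 2: canonical discriminant analysis on the cluster labels.          */
/* With 3 groups there are at most 2 canonical variables (Can1, Can2).     */
proc candisc data=clusout out=canscores ncan=2;
   class cluster;
   var q1-q10;
run;

/* Step 3: plot the observations on the two canonical variates.            */
proc sgplot data=canscores;
   scatter x=Can1 y=Can2 / group=cluster;
run;
```

Canonical variates are chosen to separate the given clusters as much as possible, so this plot shows how distinct an existing solution is. It does not suggest the number of clusters on its own.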
