partha1
Fluorite | Level 6

What is the SAS code for testing significance in kth-nearest-neighbor clustering?

1 ACCEPTED SOLUTION

Ksharp
Super User

Some additional information, taken from the SAS/STAT documentation:

One useful descriptive approach to the number-of-clusters problem is provided by Wong and Schaack (1982)
based on a kth-nearest-neighbor density estimate. The kth-nearest-neighbor clustering method developed by
Wong and Lane (1983) is applied with varying values of k. Each value of k yields an estimate of the number
of modal clusters. If the estimated number of modal clusters is constant for a wide range of k values, there
is strong evidence of at least that many modes in the population. A plot of the estimated number of modes
against k can be highly informative. Attempts to derive a formal hypothesis test from this diagnostic plot
have met with difficulties, but a simulation approach similar to Silverman (1986) does seem to work (Girman
1994). The simulation, of course, requires considerable computer time.
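
As a rough sketch of the "vary k and watch the number of modal clusters" idea described above, PROC MODECLUS is one SAS procedure that does kth-nearest-neighbor density clustering. The data set name SURVEY, the variable list Q1-Q10, and the particular K= values are assumptions made up for illustration, not anything from the original post. The TEST and JOIN= options give approximate significance tests only, so treat the p-values as a guide rather than a formal answer.

```sas
/* Sketch only: SURVEY and Q1-Q10 are hypothetical names for the data set
   and the Likert items; the K= values are arbitrary starting points. */
proc modeclus data=survey method=1 k=10 15 20 25 30 40 50
              standard short outsum=ksum;
   var q1-q10;
run;

/* One summary row per K= value: look for a cluster count that stays
   constant over a wide range of k. */
proc print data=ksum;
run;

/* Approximate significance tests for one chosen k. JOIN=0.05 merges
   clusters that are not significant at the 0.05 level. */
proc modeclus data=survey method=1 k=20 standard test join=0.05 out=knnout;
   var q1-q10;
run;
```

If the number of clusters in the summary stays the same across many K= values, that is the kind of stability the passage above describes as evidence for at least that many modes.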


 

 

You can also check the CCC statistic and its limitations:

Sarle (1983) used extensive simulations to develop the cubic clustering criterion (CCC), which can be used
for crude hypothesis testing and estimating the number of population clusters. The CCC is based on the
assumption that a uniform distribution on a hyperrectangle will be divided into clusters shaped roughly like
hypercubes. In large samples that can be divided into the appropriate number of hypercubes, this assumption
gives very accurate results. In other cases the approximation is generally conservative. For details about the
interpretation of the CCC, consult Sarle (1983).
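
For reference, one way to get the CCC printed is the CCC option of PROC CLUSTER. This is a minimal sketch, again assuming a hypothetical data set SURVEY with items Q1-Q10; Ward's method is just one reasonable choice here, not something the documentation excerpt prescribes.

```sas
/* Sketch only: SURVEY and Q1-Q10 are hypothetical names. */
proc cluster data=survey method=ward ccc pseudo outtree=tree;
   var q1-q10;   /* CCC prints the cubic clustering criterion,     */
run;             /* PSEUDO adds the pseudo F and t-squared values. */
```

Peaks in the CCC column of the cluster history are the usual thing to look for; see Sarle (1983) for how to read them.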



6 REPLIES
Ksharp
Super User

You can use ANOVA (PROC GLM) to check whether these clusters are significant.
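
A minimal sketch of that suggestion, assuming a hypothetical data set SURVEY with Likert items Q1-Q10 and a 3-cluster k-means solution from PROC FASTCLUS. Keep in mind that the clusters were built to separate the observations on these same variables, so the F tests will tend to look significant even when there is no real cluster structure; they are better read as descriptive.

```sas
/* Sketch only: SURVEY and Q1-Q10 are hypothetical names. */
proc fastclus data=survey maxclusters=3 out=clus;
   var q1-q10;            /* OUT= adds the CLUSTER membership variable */
run;

proc glm data=clus;
   class cluster;
   model q1-q10 = cluster;   /* one-way ANOVA per item across clusters */
run;
quit;
```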

KevinViel
Pyrite | Level 9

We need much more background.  What is the study design, model, and hypothesis?

 

-Kevin

partha1
Fluorite | Level 6

The survey data were collected from respondents about their shopping orientation using Likert-scaled questions. Using the k-means method, I get 3 clusters of shoppers. However, I want to cross-check my results with the kth-nearest-neighbor clustering method, as it is an unbiased method. My interest is in the kth-nearest-neighbor clustering method and how a significance test is conducted in it to determine the true number of clusters.

Ksharp
Super User

That is a very hard question. As far as I know, there is no well-known or uniform criterion for determining how many clusters you should get.

But I think you can perform a principal component analysis to get a rough idea of how many clusters there will be.

partha1
Fluorite | Level 6

I will try principal component analysis. I think plotting the objects along the first two canonical variates will give me some idea of the possible number of clusters. I am new to SAS, so please let me know the proper SAS code for this analysis. Thank you in advance.
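
A rough sketch of the kind of code being asked for here, assuming a hypothetical data set SURVEY with Likert items Q1-Q10 and the 3-cluster k-means solution mentioned earlier. The first part plots the first two principal components, the second plots the first two canonical variates of the cluster solution; all data set and variable names other than the ones the procedures create (Prin1, Prin2, Can1, Can2, CLUSTER) are made up for illustration.

```sas
/* Sketch only: SURVEY and Q1-Q10 are hypothetical names. */

/* Principal components: scores Prin1 and Prin2 are written to PC. */
proc princomp data=survey out=pc n=2;
   var q1-q10;
run;

proc sgplot data=pc;
   scatter x=prin1 y=prin2;
run;

/* Canonical variates need a grouping variable, so cluster first. */
proc fastclus data=survey maxclusters=3 out=clus noprint;
   var q1-q10;
run;

proc candisc data=clus out=can noprint;
   class cluster;
   var q1-q10;
run;

proc sgplot data=can;
   scatter x=can1 y=can2 / group=cluster;
run;
```

The principal component plot does not depend on any cluster solution, so it is the better one for judging how many groups there might be; the canonical plot mainly shows how well a given solution separates.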

