Mathis1
Quartz | Level 8

Hi,

I would like to run a KNN procedure and be able to display the clusters on a 2-dim plot.

I usually run the k-means algorithm with:

 

Proc fastclus DATA=CORR_ACC maxclusters=8 maxiter=100 outseed=Mathis out=resultats;
VAR dim: ;
ID id ;
RUN ;

 

 

proc sgplot data=resultats noautolegend ;
scatter x=dim1 y=dim2 / Group= CLUSTER /*datalabel=CLUSTER*/ name="ACM"
legendlabel="ACM";
keylegend "ACM";
run;

 

where CORR_ACC is the output of a PROC CORRESP. But I'm really struggling with the KNN.

 

Is there a simple way to do this?

PaigeMiller
Diamond | Level 26

@Mathis1 wrote:
I would like to run a KNN procedure and be able to display the clusters on a 2-dim plot.


Which output(s) of the KNN do you want to plot in two dimensions?

--
Paige Miller
PaigeMiller
Diamond | Level 26

@Mathis1 wrote:
This kind : https://www.mathworks.com/help/examples/stats/win64/ClassifyingQueryDataUsingKnnsearchExample_01.png

I'm afraid you haven't answered the question. I asked "which outputs" and you showed me "what the plot should look like". So ... which outputs from the KNN do you want to cluster? Any outputs from a KNN will be stored in a SAS data set, or printed to the output. Please be specific. Show me.

--
Paige Miller
PGStats
Opal | Level 21

Are you referring to the kth-Nearest-Neighbor method of proc cluster?

PG
Mathis1
Quartz | Level 8
Yes, I'm referring to the kth-nearest-neighbor method, sorry.
Ksharp
Super User
PG,
I think KNN is under PROC MODECLUS.
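
Something like the following sketch might be a starting point. I have not run it on your data: DIMS, DIM1 and DIM2 are assumed names for the data set and the two coordinates, MODEOUT is just a name I picked for the output, and K=10 is an arbitrary value you would need to tune.

```sas
/* Sketch only: kth-nearest-neighbor density clustering with PROC MODECLUS. */
/* DIMS, DIM1, DIM2 and MODEOUT are assumed names; K=10 is arbitrary.       */
/* If DIMS comes from PROC CORRESP, you may want to keep only the rows      */
/* with _type_ = "OBS" before clustering.                                   */
proc modeclus data=dims method=1 k=10 out=modeout;
   var dim1 dim2;
run;

/* The OUT= data set carries a CLUSTER variable that can be used for grouping. */
proc sgplot data=modeout;
   scatter x=dim1 y=dim2 / group=cluster;
run;
```

METHOD= is required by PROC MODECLUS, and K= sets the number of neighbors used for the density estimate.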
PGStats
Opal | Level 21

You must combine PROC CLUSTER and PROC TREE, like this (simplified example data from the PROC CORRESP documentation):

 

 

title 'United States Population, 1920-1970';

data USPop;

input Region $14. y1920 y1930 y1940 y1950 y1960 y1970;

label y1920 = '1920'    y1930 = '1930'    y1940 = '1940'
      y1950 = '1950'    y1960 = '1960'    y1970 = '1970';

datalines;
New England        7401  8166  8437  9314 10509 11842
NY, NJ, PA        22261 26261 27539 30146 34168 37199
Great Lakes       21476 25297 26626 30399 36225 40252
Midwest           12544 13297 13517 14061 15394 16319
South Atlantic    13990 15794 17823 21182 25972 30671
KY, TN, AL, MS     8893  9887 10778 11447 12050 12803
AR, LA, OK, TX    10242 12177 13065 14538 16951 19321
Mountain           3336  3702  4150  5075  6855  8282
Pacific            5567  8195  9733 14486 20339 25454
;

* Perform Simple Correspondence Analysis;
proc corresp data=uspop out=dims plots=none;
   var y1920 -- y1970;
   id Region;
run;

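* Density linkage with a kth-nearest-neighbor density estimate (k=3);
proc cluster data=dims method=density k=3 outtree=tree;
   where _type_ = "OBS";   /* keep only the row points from PROC CORRESP */
   var dim1 dim2;
   id region;
run;

* Cut the tree at 3 clusters and carry the coordinates along;
proc tree data=tree nclusters=3 out=treegraph noprint;
   copy dim1 dim2;
run;

* Plot the points in the first two dimensions, colored by cluster;
proc sgplot data=treegraph noautolegend;
   scatter x=dim1 y=dim2 / group=CLUSTER datalabel=_name_
      name="ACM" legendlabel="ACM";
   keylegend "ACM" / title="Cluster";
run;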

[Attached plot: CorrespClusterExample.png]

 

PG
Mathis1
Quartz | Level 8

Hello PG and thank you very much for your reply 🙂
I had tried something like this, but the issue is that I get far too many clusters, even when specifying "nclusters=5" in PROC TREE.

Please look at the table "TreeGraph": you'll see there are at least 70 clusters (attachment: TreeGraph.PNG).

Do you have any idea how to remedy this?

PGStats
Opal | Level 21

There are exactly 5 clusters defined in that dataset. They are identified as 70, 3, 71, 59 and 22.

PG
Mathis1
Quartz | Level 8
No, this is only the end of the dataset; there are many more.
Mathis1
Quartz | Level 8

Look, this is the table

PGStats
Opal | Level 21

I am not sure that meaningful clusters can be defined on these coordinates. I suspect that the problem you encountered with kth-Nearest-Neighbor method is due to ties, a problem which is discussed in the documentation but that I don't fully understand.

 

Anyway, increasing the number of neighbors (k) can bring the number of identifiable clusters down, but I doubt this is very useful. For reference, starting from your dim1 and dim2 values and using k=36:

 

proc cluster data=sasforum.knn method=density k=36 outtree=tree plots=none;
var dim1 dim2;
run;

proc tree data=tree nclusters=5 out=treegraph noprint; 
copy dim1 dim2;
run;

proc sql;
select cluster, count(*) as n
from treegraph
group by cluster;
quit;

proc sgplot data=treegraph;
scatter x=dim1 y=dim2 / Group=CLUSTER;
run;

Output of the PROC SQL step:

CLUSTER      n
      1   1080
      2    124
      3      1
      4      1
      5      1

[Attached plot: CorrespClusterExample23.png]

It is hard to tell visually what meaningful clusters would look like for these data.
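
If you want to see how the cluster sizes react to k, a rough sketch along these lines may help. The macro name try_k, the data set names tree_k and tg_k, and the list of k values are my own choices for illustration; only sasforum.knn, dim1 and dim2 come from the code above.

```sas
/* Sketch only: rerun the density-linkage clustering for several values of k */
/* and print the size of each of the 5 clusters. try_k is a hypothetical     */
/* helper, and the k values in KLIST are arbitrary.                          */
%macro try_k(klist=10 20 36);
   %local i k;
   %do i = 1 %to %sysfunc(countw(&klist));
      %let k = %scan(&klist, &i);

      proc cluster data=sasforum.knn method=density k=&k
                   outtree=tree_&k plots=none noprint;
         var dim1 dim2;
      run;

      proc tree data=tree_&k nclusters=5 out=tg_&k noprint;
      run;

      title "Cluster sizes for k=&k";
      proc freq data=tg_&k;
         tables cluster / nocum nopercent;
      run;
   %end;
   title;
%mend try_k;

%try_k()
```

If most observations keep landing in one or two clusters whatever the value of k, that would support the suspicion that these coordinates do not separate into clear groups.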

PG
