Statistical Procedures

Mathis1 · Posted 06-28-2020 11:15 AM

Hi,

I would like to perform a KNN procedure and be able to display the clusters on a 2-dimensional plot.

I'm used to running the k-means algorithm with:

Proc fastclus DATA=CORR_ACC maxclusters=8 maxiter=100 outseed=Mathis out=resultats;
VAR dim: ;
ID id ;
RUN ;

proc sgplot data=resultats noautolegend ;
scatter x=dim1 y=dim2 / Group= CLUSTER /*datalabel=CLUSTER*/ name="ACM"
legendlabel="ACM";
keylegend "ACM";
run;

Here CORR_ACC is the output of a PROC CORRESP. But I'm really struggling with the KNN.

Is there a simple way to do this?

PaigeMiller · Posted 06-28-2020 11:56 AM

I would like to perform a KNN procedure and be able to display the clusters on a 2-dimensional plot.

Which output(s) of the KNN do you want to plot in two dimensions?

--
Paige Miller

Mathis1 · Posted 06-28-2020 04:21 PM

This kind of plot: https://www.mathworks.com/help/examples/stats/win64/ClassifyingQueryDataUsingKnnsearchExample_01.png

PaigeMiller · Posted 06-28-2020 05:22 PM

@Mathis1 wrote:
This kind of plot: https://www.mathworks.com/help/examples/stats/win64/ClassifyingQueryDataUsingKnnsearchExample_01.png

I'm afraid you haven't answered the question. I asked "which outputs" and you showed me "what the plot should look like". So ... which outputs from the KNN do you want to cluster? Any outputs from a KNN will be stored in a SAS data set, or printed to the output. Please be specific. Show me.

--
Paige Miller

PGStats · Posted 06-28-2020 03:57 PM

Are you referring to the kth-Nearest-Neighbor method of proc cluster?

PG

Mathis1 · Posted 06-28-2020 04:18 PM

Yes, I'm referring to the kth-nearest-neighbor method, sorry.

Ksharp · Posted 06-29-2020 07:39 AM

PG,
I think KNN is under PROC MODECLUS.
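
For what it's worth, here is a minimal sketch of what a PROC MODECLUS call with a kth-nearest-neighbor density estimate might look like. The data set pts, its values, and the variable pt are made up purely for illustration (they stand in for the dim1/dim2 coordinates from PROC CORRESP), and METHOD=1 with K=3 is just one possible choice, not a recommendation. I have not tuned any of this.

```sas
/* Hypothetical toy coordinates, for illustration only */
data pts;
   input pt $ dim1 dim2;
   datalines;
a 0.10 0.20
b 0.15 0.25
c 0.12 0.18
d 0.90 0.85
e 0.95 0.80
f 0.88 0.90
;

/* K=3 requests kth-nearest-neighbor density estimation with k=3; */
/* METHOD= is required and selects the clustering method          */
proc modeclus data=pts method=1 k=3 out=modeout;
   var dim1 dim2;
   id pt;
run;

/* OUT= holds one row per observation with a CLUSTER variable */
proc sgplot data=modeout;
   scatter x=dim1 y=dim2 / group=cluster datalabel=pt;
run;
```

With real data you would replace pts with the OUT= data set from PROC CORRESP (restricted to the row coordinates) and experiment with K=.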

PGStats · Posted 06-28-2020 05:59 PM

You must combine PROC CLUSTER and PROC TREE, like this (simplified example data from the PROC CORRESP documentation):

title 'United States Population, 1920-1970';

data USPop;

input Region $14. y1920 y1930 y1940 y1950 y1960 y1970;

label y1920 = '1920'    y1930 = '1930'    y1940 = '1940'
      y1950 = '1950'    y1960 = '1960'    y1970 = '1970';

datalines;
New England        7401  8166  8437  9314 10509 11842
NY, NJ, PA        22261 26261 27539 30146 34168 37199
Great Lakes       21476 25297 26626 30399 36225 40252
Midwest           12544 13297 13517 14061 15394 16319
South Atlantic    13990 15794 17823 21182 25972 30671
KY, TN, AL, MS     8893  9887 10778 11447 12050 12803
AR, LA, OK, TX    10242 12177 13065 14538 16951 19321
Mountain           3336  3702  4150  5075  6855  8282
Pacific            5567  8195  9733 14486 20339 25454
;

* Perform Simple Correspondence Analysis;
proc corresp data=uspop out=dims plots=none;
   var y1920 -- y1970;
   id Region;
run;

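* Density linkage on the row coordinates, with kth-nearest-neighbor density estimates (k=3);
proc cluster data=dims method=density k=3 outtree=tree;
   where _type_ = "OBS";
   var dim1 dim2;
   id region;
run;

* Cut the tree into 3 clusters and carry the coordinates along for plotting;
proc tree data=tree nclusters=3 out=treegraph noprint;
   copy dim1 dim2;
run;

* Plot the clusters in the plane of the first two dimensions;
proc sgplot data=treegraph noautolegend;
   scatter x=dim1 y=dim2 / group=cluster datalabel=_name_
      name="ACM" legendlabel="ACM";
   keylegend "ACM" / title="Cluster";
run;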

PG

Mathis1 · Posted 06-29-2020 05:21 AM

Hello PG and thank you very much for your reply 🙂
I had tried something like this, but the issue is that I get far too many clusters, even when specifying "nclusters=5" in PROC TREE.

Please look at the table "TreeGraph"; you'll see there are at least 70 clusters.

Do you have any idea how to remedy this?

PGStats · Posted 06-29-2020 01:39 PM

There are exactly 5 clusters defined in that dataset. They are identified as 70, 3, 71, 59 and 22.

PG

Mathis1 · Posted 06-29-2020 01:54 PM

No, this is only the end of the dataset; there are many more.

Mathis1 · Posted 06-29-2020 02:04 PM

Look, this is the table

PGStats · Posted 06-29-2020 05:22 PM

I am not sure that meaningful clusters can be defined on these coordinates. I suspect that the problem you encountered with the kth-nearest-neighbor method is due to ties, a problem that is discussed in the documentation but that I don't fully understand.

Anyway, increasing the number of neighbors (k) can bring the number of identifiable clusters down, but I doubt this is very useful. For reference, starting from your dim1 and dim2 values and k=36:

proc cluster data=sasforum.knn method=density k=36 outtree=tree plots=none;
var dim1 dim2;
run;

proc tree data=tree nclusters=5 out=treegraph noprint; 
copy dim1 dim2;
run;

proc sql;
select cluster, count(*) as n
from treegraph
group by cluster;
quit;

proc sgplot data=treegraph;
scatter x=dim1 y=dim2 / Group=CLUSTER;
run;

CLUSTER      n
      1   1080
      2    124
      3      1
      4      1
      5      1

It is hard to see visually what meaningful clusters would look like for these data.
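
If you want to see how the cluster sizes change with k, a rough sketch along these lines might help. The macro name try_k, its parameters, and the k values other than 36 are my own illustrative choices, and I have not run this against your data, so treat it as a starting point only.

```sas
/* Hypothetical helper: rerun density linkage for several k values */
/* and tabulate the resulting cluster sizes after cutting the tree */
%macro try_k(data=sasforum.knn, klist=10 20 36 50, nclus=5);
   %local i k;
   %let i = 1;
   %let k = %scan(&klist, &i, %str( ));
   %do %while (%length(&k) > 0);

      proc cluster data=&data method=density k=&k outtree=_tree noprint;
         var dim1 dim2;
      run;

      proc tree data=_tree nclusters=&nclus out=_cut noprint;
      run;

      title "Cluster sizes for k=&k";
      proc freq data=_cut;
         tables cluster / nocum nopercent;
      run;

      /* Move on to the next value in the list */
      %let i = %eval(&i + 1);
      %let k = %scan(&klist, &i, %str( ));
   %end;
   title;
%mend try_k;

%try_k()
```

Each pass prints one frequency table, so you can compare how many observations land in each of the requested clusters as k grows.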

PG

How to perform a KNN clustering after Proc Corresp
