How to replicate KNN results from Proc Discrim

kisumsam — Thu, 13 Aug 2020 13:28:46 GMT

Hi there, I'm learning KNN. I found that Proc Discrim gives me much better results than my manual implementation of the KNN algorithm. I'm wondering if any expert here can explain why Proc Discrim does so much better.

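For example, below is the code that I use to classify the fish species in the SASHelp.fish data set. It standardizes the measurements, splits the data roughly in half into training and test sets, and runs Proc Discrim with k = 9.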

** Standardize Columns **;
proc standard data=sashelp.fish out=fish mean=0 std=1;
var weight length1 length2 length3 height width;
run;

data fish_train fish_test;
set fish;
rand = ranuni(100);
if rand <= 0.5 then output fish_train;
else output fish_test;
run;

** Using Built-in Proc Discrim **;
proc discrim data = fish_train test = fish_test 
  testout = _score1 method = npar k = 9 testlist;
  class species;
  var weight height length1 length2 length3 width;
run;

The error rate is very low.

Now I'm doing it manually: I calculate the distance between the points and find the K nearest neighbors (k=9).

** Manually build KNN **;
data train1 train2 (drop=num);
set fish_train;
num = _n_;
run;

proc sql;
create table train_combine as
select a.num, 
       a.species as species_a,
       b.species as species_b,
       sqrt((a.weight - b.weight)**2 + 
            (a.height - b.height)**2 +
            (a.length1 - b.length1)**2 +
            (a.length2 - b.length2)**2 +
            (a.length3 - b.length3)**2 +
            (a.width - b.width)**2       
            ) as distance
from train1 a, train2 b
order by a.num, distance;
quit;

data train_combine2;
set train_combine;
by num distance;
if first.num then i = 0;
i + 1;
if i <= 9;
run;
       
proc freq data=train_combine2 noprint;
table species_b / out = fish_freq;
by num species_a;
run;

proc sort data=fish_freq; by num count; run;

data fish_freq2;
set fish_freq;
by num count;
if last.num;

if species_a = species_b then match = "Y";
else match = "N";
run;

proc sql;
select species_a, match, count(*) as cnt
from fish_freq2
group by species_a, match
order by species_a, match;
quit;

I used Euclidean distance, and the results are not even close to being as good as Proc Discrim's.

For example, my manual model misclassified every Parkki and got only one Roach right.

In contrast, Proc Discrim classifies 4 Parkki and 9 Roach correctly.

How does the Proc Discrim algorithm work, and why does it give better classification results?

Re: How to replicate KNN results from Proc Discrim

Ksharp — Thu, 13 Aug 2020 13:46:48 GMT

Did you try PROC MODECLUS to run KNN?

Re: How to replicate KNN results from Proc Discrim

kisumsam — Thu, 13 Aug 2020 14:07:43 GMT

No. Is that a better procedure than Proc Discrim for KNN?

Re: How to replicate KNN results from Proc Discrim

WarrenKuhfeld — Thu, 13 Aug 2020 17:18:45 GMT

>I found that Proc Discrim gives me much better results than my manual implementation of the KNN algorithm.

>I'm wondering if there is any expert here who can explain why Proc Discrim does so much better.

It is simple. Proc discrim was written by an expert. You are not replicating what discrim does.
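As a rough summary of what the documentation page linked in this reply describes for METHOD=NPAR with K= (paraphrased, so please check the symbols against the page itself): by default the neighbors are found with a Mahalanobis distance based on the pooled covariance matrix, not plain Euclidean distance, and the class is chosen by posterior probability rather than by a raw majority vote. Here S is the pooled covariance matrix, k_t is the number of the k nearest neighbors that belong to group t, n_t is the training size of group t, and q_t is the prior for group t (equal priors unless a PRIORS statement says otherwise).

```latex
d^2(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^{\top} \mathbf{S}^{-1} (\mathbf{x} - \mathbf{y})

p(t \mid \mathbf{x}) = \frac{q_t \, k_t / n_t}{\sum_u q_u \, k_u / n_u}
```

With equal priors, each neighbor count is divided by its group size, so a small group such as Parkki can win with fewer than half of the 9 neighbors. A majority vote corresponds to priors proportional to group size, where the posterior reduces to k_t / k.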

https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_discrim_details02.htm&docsetVersion=15.1&locale=en
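One way to check whether the gap comes from the distance metric and the priors is to ask Proc Discrim for the settings that the manual code implies. The following is a minimal sketch, not a verified replication: it assumes the fish_train and fish_test data sets from the question, and the output name _score2 is made up for illustration. METRIC=IDENTITY requests Euclidean distance, and PRIORS PROPORTIONAL makes the posterior proportional to the neighbor counts, which amounts to a majority vote.

```sas
** Sketch: Proc Discrim set up to mimic Euclidean, majority-vote KNN **;
** Assumes fish_train and fish_test exist as built in the question **;
** _score2 is a hypothetical output data set name **;
proc discrim data = fish_train test = fish_test
  testout = _score2 method = npar k = 9 metric = identity testlist;
  class species;
  var weight height length1 length2 length3 width;
  ** Priors proportional to group size turn the posterior into k_t / k **;
  priors proportional;
run;
```

Results may still differ from the manual code in how ties are broken. Note also that the manual code joins fish_train to itself, so each fish is scored against a neighbor set that includes itself at distance 0, while the TEST= results are for fish_test. A like-for-like comparison would join fish_test to fish_train.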
