kisumsam
Quartz | Level 8

Hi there, I'm learning KNN. I found that Proc Discrim gives me much better results than my manual calculation of the KNN algorithm. I'm wondering if any expert here can explain why Proc Discrim does so much better.

 

For example, below is the code that I use to classify the fish species in the SASHelp.fish data set.

 

** Standardize Columns **;
proc standard data=sashelp.fish out=fish mean=0 std=1;
var weight length1 length2 length3 height width;
run;

data fish_train fish_test;
set fish;
rand = ranuni(100);
if rand <= 0.5 then output fish_train;
else output fish_test;
run;

** Using Built-in Proc Discrim **;
proc discrim data = fish_train test = fish_test 
  testout = _score1 method = npar k = 9 testlist;
  class species;
  var weight height length1 length2 length3 width;
run; 

The error rate is very low:

 

i1001.png

 

Now I'm doing it manually by calculating the distance between the points and finding the K nearest neighbors (k=9).

 

** Manually build KNN **;
data train1 train2 (drop=num);
set fish_train;
num = _n_;
run;

proc sql;
create table train_combine as
select a.num, 
       a.species as species_a,
       b.species as species_b,
       sqrt((a.weight - b.weight)**2 + 
            (a.height - b.height)**2 +
            (a.length1 - b.length1)**2 +
            (a.length2 - b.length2)**2 +
            (a.length3 - b.length3)**2 +
            (a.width - b.width)**2       
            ) as distance
from train1 a, train2 b
order by a.num, distance;
quit;

data train_combine2;
set train_combine;
by num distance;
if first.num then i = 0;
i + 1;
if i <= 9;
run;
       
proc freq data=train_combine2 noprint;
table species_b / out = fish_freq;
by num species_a;
run;

proc sort data=fish_freq; by num count; run;

data fish_freq2;
set fish_freq;
by num count;
if last.num;

if species_a = species_b then match = "Y";
else match = "N";
run;

proc sql;
select species_a, match, count(*) as cnt
from fish_freq2
group by species_a, match
order by species_a, match;
quit;

I used the Euclidean distance, and the results are not even close to being as good as Proc Discrim's.

 

For example, my manual model classified every Parkki wrong, and it got only one Roach right.

 

i1002.png 

 

In contrast, Proc Discrim classifies 4 Parkki and 9 Roach correctly.

 

i1003.png

 

How does the Proc Discrim algorithm work, and why does it give better classification results?

3 REPLIES
Ksharp
Super User
Did you try PROC MODCLUS to run KNN ?
kisumsam
Quartz | Level 8
No. Is that a better procedure than Proc Discrim for KNN?
WarrenKuhfeld
Rhodochrosite | Level 12

 >I found that my Proc Discrim procedure gives me a much better results than me doing the manual calculation for the KNN algorithm.

>I'm wondering if there is any expert here who can explain why Proc Discrim does so much better.

 

It is simple. Proc discrim was written by an expert. You are not replicating what discrim does. 
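To make that concrete: as far as the PROC DISCRIM documentation describes it, two defaults differ from the manual calculation. With METHOD=NPAR, the default METRIC=FULL measures Mahalanobis distance based on the pooled covariance matrix rather than Euclidean distance. The default equal priors also weight each group's neighbor count by the group's training size (k_t / n_t) instead of taking a plain majority vote, which favors small groups such as Parkki and Roach. Here is a sketch of a call that should come closer to the manual method, reusing fish_train and fish_test from the question (the output name _score2 is made up for this example):

```sas
** PROC DISCRIM set up to resemble a plain Euclidean, majority-vote KNN **;
proc discrim data = fish_train test = fish_test
  testout = _score2 method = npar k = 9
  metric = identity      /* Euclidean distance instead of the default metric */
  testlist;
  class species;
  var weight height length1 length2 length3 width;
  priors proportional;   /* posterior becomes proportional to the neighbor count */
run;
```

If the error rates from this run move toward the manual results, the gap probably comes from these defaults rather than from a mistake in the distance formula.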

 

https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_discrim_details02.htm&docsetVersi...
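
The manual code also differs in what it scores. It pairs fish_train with itself, so every fish counts as its own nearest neighbor at distance 0, whereas PROC DISCRIM classifies fish_test against fish_train. In addition, a row with a missing variable gives a missing distance, which sorts first in PROC SQL, while PROC DISCRIM excludes such observations from the training data. Here is a minimal sketch of the first steps rewritten under those assumptions (test1 and test_combine are names invented for this example):

```sas
** Score fish_test against fish_train instead of train against itself **;
data test1;
  set fish_test;
  num = _n_;   /* row id for each test fish */
run;

proc sql;
  create table test_combine as
  select a.num,
         a.species as species_a,   /* actual species of the test fish */
         b.species as species_b,   /* species of a training neighbor */
         sqrt((a.weight - b.weight)**2 +
              (a.height - b.height)**2 +
              (a.length1 - b.length1)**2 +
              (a.length2 - b.length2)**2 +
              (a.length3 - b.length3)**2 +
              (a.width - b.width)**2
             ) as distance
  from test1 a, fish_train b
  where calculated distance is not missing   /* drop pairs with missing values */
  order by a.num, distance;
quit;
```

The later steps from the question (keeping the 9 nearest rows, PROC FREQ, and the match flag) can then read test_combine in place of train_combine. Ties in the vote are still broken arbitrarily by the sort order, so small differences from PROC DISCRIM may remain.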
