Contributor
Posts: 21

K-Nearest Neighbors for Zip Code Digits

I've just begun working through the exercises in The Elements of Statistical Learning. Exercise 2.8 asks you to use k-nearest neighbors to classify scanned ZIP code digits from their greyscale values (gs1-gs256). Here is the part of my code that fits the classifier for k = 1, 3, 5, 7, and 15.

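```sas
%macro knn;
  /* Fit one k-nearest-neighbor classifier per value of k */
  %do i = 1 %to 5;
    %let k = %scan(1 3 5 7 15,&i);
    /* out= holds the scored training data, testout= the scored test data; */
    /* _into_ is the class each observation was assigned to                */
    proc discrim data=train method=npar k=&k out=train_k&k._out(keep=digit _into_)
                 testdata=test testout=test_k&k._out(keep=digit _into_);
      class digit;
      var gs1-gs256;
    run;
  %end;
%mend;
%knn;
```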

Using just digits 2 and 3, I get error rates on the test dataset (available at the book website) ranging from 6% (k = 1) to 10% (k = 15). Those don't agree with a couple of solutions on the web: Andrew Tulloch shows error rates between 2% and 4%, while Weatherwax and Epstein have error rates between 9% and 11%.
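
For reference, the test error rates can be computed from the scored datasets roughly as follows. This is a minimal sketch, assuming the test_k&k._out datasets created by the macro above and counting every observation where _into_ differs from digit as an error; the macro name knn_err is only illustrative.

```sas
%macro knn_err;
  %do i = 1 %to 5;
    %let k = %scan(1 3 5 7 15,&i);
    title "Test error rate, k=&k";
    proc sql;
      /* (digit ne _into_) evaluates to 1 for a misclassified row, 0 otherwise, */
      /* so its mean is the proportion of misclassified test observations       */
      select mean(digit ne _into_) as error_rate format=percent8.2
        from test_k&k._out;
    quit;
  %end;
  title; /* clear the title when done */
%mend;
%knn_err;
```

Pointing the same query at train_k&k._out instead would give the corresponding training error rates.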

Is there anyone else who has done the exercise and can confirm which of the three answers (if any) is correct?

Martin

Super User
Posts: 20,715

Re: K-Nearest Neighbors for Zip Code Digits

Is there a reason you used proc discrim instead of proc cluster or proc fastclus?

Contributor
Posts: 21

Re: K-Nearest Neighbors for Zip Code Digits

I haven't used either of those before, and after a quick look at the documentation, I couldn't figure out how to make them do what I want.

Can you explain how clustering lets me classify digits?  I assume I would cluster the training dataset and then somehow use the output to score the test dataset, but I don't understand the details.  Specifically, how would I use the known value of the digit in the training dataset?
