<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic K-Nearest Neighbors for Zip Code Digits in Statistical Procedures</title>
    <link>https://communities.sas.com/t5/Statistical-Procedures/K-Nearest-Neighbors-for-Zip-Code-Digits/m-p/283883#M14961</link>
    <description>&lt;P&gt;I've just begun working my way through the exercises in &lt;A href="http://statweb.stanford.edu/~tibs/ElemStatLearn/" target="_self"&gt;The Elements of Statistical Learning&lt;/A&gt;.&amp;nbsp; Exercise 2.8 asks you to use k-nearest neighbors to classify scanned&amp;nbsp;zipcode digits from greyscale values (gs1-gs256).&amp;nbsp; Here's some of my code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%macro knn;
%do i = 1 %to 5;
	/* k cycles through 1, 3, 5, 7, 15 */
	%let k = %scan(1 3 5 7 15,&amp;amp;i);
	proc discrim data=train method=npar k=&amp;amp;k out=train_k&amp;amp;k._out(keep=digit _into_)
				testdata=test testout=test_k&amp;amp;k._out(keep=digit _into_);	
				class digit;
				var gs1-gs256;
	run;
%end;
%mend;
%knn;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Using just digits 2 and 3, I get error rates on the test datasets (available at the book website) between 6% (for k=1) and 10% (for k=15).&amp;nbsp; Those don't agree with a couple of solutions on the web.&amp;nbsp; &lt;A href="http://tullo.ch/articles/elements-of-statistical-learning/" target="_self"&gt;Andrew Tulloch&lt;/A&gt;&amp;nbsp;shows error rates between 2% and 4%, while &lt;A href="http://waxworksmath.com/Authors/G_M/Hastie/WriteUp/weatherwax_epstein_hastie_solutions_manual.pdf" target="_self"&gt;Weatherwax and Epstein&lt;/A&gt;&amp;nbsp;have error rates between 9% and 11%.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is there anyone else who has done the exercise and can confirm which of the three&amp;nbsp;answers (if any)&amp;nbsp;is correct?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Martin&lt;/P&gt;</description>
    <pubDate>Wed, 13 Jul 2016 15:12:05 GMT</pubDate>
    <dc:creator>mcs</dc:creator>
    <dc:date>2016-07-13T15:12:05Z</dc:date>
    <item>
      <title>K-Nearest Neighbors for Zip Code Digits</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/K-Nearest-Neighbors-for-Zip-Code-Digits/m-p/283883#M14961</link>
      <description>&lt;P&gt;I've just begun working my way through the exercises in &lt;A href="http://statweb.stanford.edu/~tibs/ElemStatLearn/" target="_self"&gt;The Elements of Statistical Learning&lt;/A&gt;.&amp;nbsp; Exercise 2.8 asks you to use k-nearest neighbors to classify scanned&amp;nbsp;zipcode digits from greyscale values (gs1-gs256).&amp;nbsp; Here's some of my code.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE class=" language-sas"&gt;%macro knn;
%do i = 1 %to 5;
	/* k cycles through 1, 3, 5, 7, 15 */
	%let k = %scan(1 3 5 7 15,&amp;amp;i);
	proc discrim data=train method=npar k=&amp;amp;k out=train_k&amp;amp;k._out(keep=digit _into_)
				testdata=test testout=test_k&amp;amp;k._out(keep=digit _into_);	
				class digit;
				var gs1-gs256;
	run;
%end;
%mend;
%knn;&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Using just digits 2 and 3, I get error rates on the test datasets (available at the book website) between 6% (for k=1) and 10% (for k=15).&amp;nbsp; Those don't agree with a couple of solutions on the web.&amp;nbsp; &lt;A href="http://tullo.ch/articles/elements-of-statistical-learning/" target="_self"&gt;Andrew Tulloch&lt;/A&gt;&amp;nbsp;shows error rates between 2% and 4%, while &lt;A href="http://waxworksmath.com/Authors/G_M/Hastie/WriteUp/weatherwax_epstein_hastie_solutions_manual.pdf" target="_self"&gt;Weatherwax and Epstein&lt;/A&gt;&amp;nbsp;have error rates between 9% and 11%.&lt;/P&gt;
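&lt;P&gt;As a language-independent cross-check, here is a minimal sketch of the majority vote that PROC DISCRIM with method=npar performs for each test point (pure Python on toy two-class data; the function name and toy points are illustrative stand-ins for the gs1-gs256 features, not part of the exercise):&lt;/P&gt;

```python
# Minimal k-nearest-neighbors classifier sketch (pure Python, toy data).
# Each test point is assigned the majority class among its k closest
# training points, which is the rule PROC DISCRIM method=npar k=... applies.
from collections import Counter

def knn_classify(train, labels, point, k):
    # Squared Euclidean distance from every training row to the test point.
    dists = [(sum((a - b) ** 2 for a, b in zip(row, point)), lab)
             for row, lab in zip(train, labels)]
    dists.sort()                          # nearest first
    votes = [lab for _, lab in dists[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy stand-in for the greyscale features: class "2" near the origin,
# class "3" near (1, 1).
train = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
labels = ["2", "2", "3", "3"]
print(knn_classify(train, labels, (0.05, 0.05), 3))   # prints 2
```

&lt;P&gt;Running the same vote over the real zip.train/zip.test rows would give an independent error-rate count to compare against the three published answers.&lt;/P&gt;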
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is there anyone else who has done the exercise and can confirm which of the three&amp;nbsp;answers (if any)&amp;nbsp;is correct?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Martin&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2016 15:12:05 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/K-Nearest-Neighbors-for-Zip-Code-Digits/m-p/283883#M14961</guid>
      <dc:creator>mcs</dc:creator>
      <dc:date>2016-07-13T15:12:05Z</dc:date>
    </item>
    <item>
      <title>Re: K-Nearest Neighbors for Zip Code Digits</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/K-Nearest-Neighbors-for-Zip-Code-Digits/m-p/283929#M14965</link>
      <description>&lt;P&gt;Is there a reason you used PROC DISCRIM instead of PROC CLUSTER or PROC FASTCLUS?&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2016 00:53:36 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/K-Nearest-Neighbors-for-Zip-Code-Digits/m-p/283929#M14965</guid>
      <dc:creator>Reeza</dc:creator>
      <dc:date>2016-07-13T00:53:36Z</dc:date>
    </item>
    <item>
      <title>Re: K-Nearest Neighbors for Zip Code Digits</title>
      <link>https://communities.sas.com/t5/Statistical-Procedures/K-Nearest-Neighbors-for-Zip-Code-Digits/m-p/284075#M14990</link>
      <description>&lt;P&gt;I haven't used either of those before, and after a quick look at the documentation, I couldn't figure out how to make them do what I want.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Can you explain how clustering lets me classify digits?&amp;nbsp; I assume I would cluster the training dataset and then somehow use the output to score the test dataset, but I don't understand the details.&amp;nbsp; Specifically, how would I use the known value of the digit in the training dataset?&lt;/P&gt;</description>
      <pubDate>Wed, 13 Jul 2016 15:10:11 GMT</pubDate>
      <guid>https://communities.sas.com/t5/Statistical-Procedures/K-Nearest-Neighbors-for-Zip-Code-Digits/m-p/284075#M14990</guid>
      <dc:creator>mcs</dc:creator>
      <dc:date>2016-07-13T15:10:11Z</dc:date>
    </item>
  </channel>
</rss>