Programming the statistical procedures from SAS

K-Nearest Neighbors for Zip Code Digits

Contributor mcs
Posts: 21

I've just begun working my way through the exercises in The Elements of Statistical Learning. Exercise 2.8 asks you to use k-nearest neighbors to classify scanned ZIP code digits from greyscale values (gs1-gs256). Here's the core of my code.


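%macro knn;
%do i = 1 %to 5;
	%let k = %scan(1 3 5 7 15,&i);
	/* k-NN classification: out= scores the training data, testout= scores the test data */
	proc discrim data=train method=npar k=&k out=train_k&k._out(keep=digit _into_)
				testdata=test testout=test_k&k._out(keep=digit _into_);
				class digit;
				var gs1-gs256;
	run;
%end;
%mend knn;
%knn;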

Using just digits 2 and 3, I get test-set error rates ranging from 6% (k=1) to 10% (k=15); the datasets are available at the book website. Those don't agree with a couple of solutions on the web: Andrew Tulloch shows error rates between 2% and 4%, while Weatherwax and Epstein report error rates between 9% and 11%.
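For reference, here is a minimal sketch of how an error rate like the ones quoted can be computed from one of the scored datasets. It assumes test_k1_out exists with digit (the true class) and _into_ (the class assigned by PROC DISCRIM); the names test_k1_err and miss are made up for illustration. Any observation left unclassified (missing _into_) is counted as an error here.

```sas
/* Sketch: test error rate for k=1, assuming test_k1_out was created
   by PROC DISCRIM with testout=. test_k1_err and miss are hypothetical names. */
data test_k1_err;
	set test_k1_out;
	miss = (digit ne _into_);   /* 1 if misclassified, 0 if correct */
run;

proc means data=test_k1_err n mean;
	var miss;                   /* mean of miss = error rate */
run;

/* Confusion matrix: true digit by assigned digit */
proc freq data=test_k1_err;
	tables digit*_into_ / norow nocol nopercent;
run;
```

The same steps apply to the training output (train_k1_out) and to the other values of k.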



Is there anyone else who has done the exercise and can confirm which of the three answers (if any) is correct?
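One thing that may be worth ruling out when comparing numbers: with method=npar, PROC DISCRIM defaults to metric=full (Mahalanobis distance based on the pooled covariance matrix) and to equal priors, whereas a textbook k-NN usually means Euclidean distance with a plain majority vote. The following is only a sketch of a run with those two settings changed, for a single k. The output name test_k1_euclid is made up for illustration, and I don't know whether this reproduces either of the published solutions.

```sas
/* Sketch only: one value of k, with two settings changed from the defaults.
   metric=identity     -> Euclidean distance instead of Mahalanobis distance
   priors proportional -> priors equal to the training class proportions, so
                          classification follows the share of neighbors per class
   test_k1_euclid is a hypothetical dataset name. */
proc discrim data=train method=npar k=1 metric=identity
		testdata=test testout=test_k1_euclid(keep=digit _into_);
	class digit;
	var gs1-gs256;
	priors proportional;
run;
```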



Super User
Posts: 18,580

Re: K-Nearest Neighbors for Zip Code Digits

Is there a reason you used proc discrim instead of proc cluster or proc fastclus?

Contributor mcs
Posts: 21

Re: K-Nearest Neighbors for Zip Code Digits

I haven't used either of those before, and after a quick look at the documentation, I couldn't figure out how to make them do what I want.


Can you explain how clustering lets me classify digits?  I assume I would cluster the training dataset and then somehow use the output to score the test dataset, but I don't understand the details.  Specifically, how would I use the known value of the digit in the training dataset?
