07-12-2016 04:34 PM - edited 07-13-2016 11:12 AM
I've just begun working my way through the exercises in The Elements of Statistical Learning. Exercise 2.8 asks you to use k-nearest neighbors to classify scanned zipcode digits from greyscale values (gs1-gs256). Here's some of my code.
%macro knn;
   %do i = 1 %to 5;
      %let k = %scan(1 3 5 7 15, &i);
      proc discrim data=train method=npar k=&k
                   out=train_k&k._out(keep=digit _into_)
                   testdata=test
                   testout=test_k&k._out(keep=digit _into_);
         class digit;
         var gs1-gs256;
      run;
   %end;
%mend;
%knn;
Using just the digits 2 and 3, I get error rates on the test data (available at the book's website) ranging from 6% (for k=1) to 10% (for k=15). Those don't agree with a couple of solutions on the web: Andrew Tulloch shows error rates between 2% and 4%, while Weatherwax and Epstein report error rates between 9% and 11%.
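In case it matters, here's how I computed those error rates from the scored output (shown for k=1; PROC DISCRIM writes its predicted class to the _INTO_ variable):

proc freq data=test_k1_out;
   tables digit*_into_ / norow nocol nopercent;
run;

data _null_;
   set test_k1_out end=last;
   retain miss 0 n 0;
   n + 1;
   if digit ne _into_ then miss + 1;
   if last then do;
      rate = miss / n;
      put 'Test error rate: ' rate percent8.2;
   end;
run;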
Is there anyone else who has done the exercise and can confirm which of the three answers (if any) is correct?
07-13-2016 11:10 AM
I haven't used either of those before, and after a quick look at the documentation, I couldn't figure out how to make them do what I want.
Can you explain how clustering lets me classify digits? I assume I would cluster the training dataset and then somehow use the output to score the test dataset, but I don't understand the details. Specifically, how would I use the known value of the digit in the training dataset?
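For what it's worth, here's the workflow I'm imagining — but this is just a guess on my part, and the number of clusters (25) is arbitrary: cluster the training data ignoring digit, label each cluster with its majority digit, then assign each test observation to the nearest cluster seed and score it with that cluster's label.

/* Step 1: cluster the training data (digit is not used here). */
proc fastclus data=train maxclusters=25 out=train_clus outseed=seeds;
   var gs1-gs256;
run;

/* Step 2: label each cluster with the most common digit among its members. */
proc sql;
   create table cluster_labels as
   select cluster, digit as maj_digit
   from (select cluster, digit, count(*) as n
         from train_clus
         group by cluster, digit)
   group by cluster
   having n = max(n);
quit;

/* Step 3: assign test observations to the nearest seed, without reclustering. */
proc fastclus data=test seed=seeds maxclusters=25 maxiter=0 replace=none
              out=test_clus;
   var gs1-gs256;
run;

/* Step 4: score each test observation with its cluster's majority digit. */
proc sql;
   create table scored as
   select t.*, l.maj_digit
   from test_clus t left join cluster_labels l
     on t.cluster = l.cluster;
quit;

Is that roughly what you had in mind, or am I off track?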