mcs
Obsidian | Level 7

I've just begun working my way through the exercises in The Elements of Statistical Learning.  Exercise 2.8 asks you to use k-nearest neighbors to classify scanned zipcode digits from greyscale values (gs1-gs256).  Here's some of my code.

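%macro knn;
    /* Run one k-NN classification for each k in the list 1 3 5 7 15 */
    %do i = 1 %to 5;
        %let k = %scan(1 3 5 7 15,&i);
        /* out= holds the classified training data, testout= the classified test data */
        proc discrim data=train method=npar k=&k out=train_k&k._out(keep=digit _into_)
                testdata=test testout=test_k&k._out(keep=digit _into_);
            class digit;
            var gs1-gs256;
        run;
    %end;
%mend;
%knn;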
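One setting worth ruling out when comparing against other solutions is the distance metric. As far as I understand the PROC DISCRIM documentation, nearest neighbors are found by Mahalanobis distance on the pooled covariance matrix by default, and prior probabilities default to equal, whereas a textbook k-NN uses Euclidean distance and a simple majority vote. The sketch below is a variant for a single k under that assumption. The output name test_k3_euclid is made up for illustration, and this is not necessarily what the other solutions did.

```sas
/* Sketch only: one k, plain Euclidean distance, majority-style vote.      */
/* metric=identity requests Euclidean distance; priors proportional sets   */
/* the priors to the class frequencies in the training data.               */
/* test_k3_euclid is a hypothetical output name.                           */
proc discrim data=train method=npar k=3 metric=identity
        testdata=test testout=test_k3_euclid(keep=digit _into_);
    class digit;
    var gs1-gs256;
    priors proportional;
run;
```

If the error rates move noticeably with these two settings, the metric or the priors would be a likely reason the three sets of answers differ.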

Using just digits 2 and 3, I get error rates on the test dataset (available at the book website) ranging from 6% (for k=1) to 10% (for k=15).  Those don't agree with a couple of solutions on the web.  Andrew Tulloch shows error rates between 2% and 4%, while Weatherwax and Epstein have error rates between 9% and 11%.
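For reference, a test error rate of this kind can be computed from the testout= datasets roughly as follows. This is a minimal sketch, assuming the test_k&k._out datasets created by the macro above exist and that train and test already contain only digits 2 and 3. The macro name knn_err and the column names n_test and error_rate are made up for illustration.

```sas
%macro knn_err;
    %do i = 1 %to 5;
        %let k = %scan(1 3 5 7 15,&i);
        title "Test error rate for k=&k";
        proc sql;
            /* (digit ne _into_) evaluates to 1 for a misclassified row,   */
            /* so its mean is the error rate. A row with a missing _into_  */
            /* (left unclassified, if any) is counted as an error here.    */
            select count(*) as n_test,
                   mean(digit ne _into_) as error_rate format=percent8.1
            from test_k&k._out;
        quit;
    %end;
    title;
%mend;
%knn_err;
```

Pointing the same query at the train_k&k._out datasets gives the training (resubstitution) error rates.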

Is there anyone else who has done the exercise and can confirm which of the three answers (if any) is correct?

Martin

2 REPLIES
Reeza
Super User

Is there a reason you used PROC DISCRIM instead of PROC CLUSTER or PROC FASTCLUS?

mcs
Obsidian | Level 7

I haven't used either of those before, and after a quick look at the documentation, I couldn't figure out how to make them do what I want.

Can you explain how clustering lets me classify digits?  I assume I would cluster the training dataset and then somehow use the output to score the test dataset, but I don't understand the details.  Specifically, how would I use the known value of the digit in the training dataset?
