05-11-2018 11:59 AM - edited 05-11-2018 12:17 PM
I'm taking a grad-level multivariate stats class, and we've got a very interesting (at least to me) project where we're supposed to try to predict whether or not eight baseball players who are about to become Hall of Fame eligible will get elected. The focus is on position players (no pitchers), and only offense-based stats are being used. The dataset includes every player who has ever received at least one BBWAA vote and indicates whether or not they're in the HOF (see attached).
Stats are: Yrs WAR WAR7 JAWS Jpos G AB R H HR RBI SB BB BA OBP SLG OPS OPSadj
Here are the definitions of the not-so-obvious of these:
WAR7: The sum of the seven best seasons of WAR in the player’s career. It may not be seven seasons in a row.
JAWS: Developed by Baseball Prospectus. It combines career WAR and 7-year peak WAR, allowing comparison to the average Hall of Fame player at each position.
Jpos: The average JAWS score for all Hall of Fame players at this position, blended with the overall Hall of Fame average for positions with fewer inducted players.
OPSadj: OPS adjusted for the player's ballpark. 100 is an average hitter.
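For what it's worth, JAWS is usually described as the simple average of career WAR and WAR7, so you can sanity-check that column against the others in your dataset. This is an illustrative sketch only (it assumes the test dataset from the data step later in the thread, and that the standard JAWS definition applies to this file):

/* Hypothetical consistency check: JAWS should roughly equal (WAR + WAR7)/2 */
data jaws_check;
  set test;
  JAWS_calc = (WAR + WAR7) / 2;   /* standard JAWS definition */
  JAWS_diff = abs(JAWS - JAWS_calc);
run;

proc means data=jaws_check max mean;
  var JAWS_diff;   /* large differences would suggest a data problem */
run;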
I took out a handful of PED-linked players who would definitely have gotten in otherwise but haven't and/or won't (Bonds, McGwire, Sosa, Manny, Canseco, Palmeiro, Sheffield), and I took out Pete Rose.
I'm a fan of baseball, so I'm familiar with most of these stats. Still, I ran proc corr to look for correlated variables. I've then been running proc discrim, starting with all variables and removing one or two at a time to reduce my error count proportions.
The best I could do is 12.87% (which is great, statistically), but that's after I took out HR. While the BBWAA is getting younger writers/voters who look at the more advanced stats, and while many of these stats are highly correlated, I feel like certain stats shouldn't be taken out because they're so fundamental to HOF players, like HR.
So, long story long: any ideas on variables I should leave in because they're so fundamental, even if they increase my error count? For reference, the error count with all variables is 14.38%. Taking out 8 variables got me down to 12.99%, and taking out HR after that reduced the error by only 0.12%, to 12.87%.
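One thing worth checking as you compare subsets: PROC DISCRIM's CROSSVALIDATE option reports leave-one-out error rates, which are less optimistic than resubstitution error and a fairer way to compare variable subsets. A minimal sketch, assuming the test dataset from the code later in the thread; the VAR list here is just one hypothetical candidate subset, not a recommendation:

/* Leave-one-out error for a reduced variable set (illustrative subset only) */
proc discrim data=test crossvalidate;
  class HoF;
  var WAR JAWS H HR OPSadj;   /* keep "fundamental" stats like HR in the mix */
  priors proportional;        /* match priors to the class frequencies */
run;

Comparing the cross-validation error across subsets, rather than the resubstitution error, guards against a subset looking good only because it overfits this particular sample.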
Any advice is much-appreciated.
05-11-2018 01:49 PM
This is a common problem in any modeling situation.
Do you want an empirical model, regardless of whether "expected" variables are left out? Do you want a first-principles model, based on subject-matter expertise (like E = mc^2)? Or do you want a combination of the two approaches?
There is no right or wrong here. You pick the path that you think is best for your situation.
05-13-2018 12:15 AM
Does your project require that you use DA? You can get a decent decision tree model involving only 2 variables (I'll let you find out which) with:
data test;
  length Name $16 HoF $3;
  infile "&sasforum.\datasets\baseballHOF without known or severly suspected PED users AND without Pete Rose.csv"
    truncover dsd firstobs=2;
  input Name HoF Yrs WAR WAR7 JAWS Jpos G AB R H HR RBI SB BB BA OBP SLG OPS OPSadj;
run;

proc hpsplit data=test maxdepth=5 maxbranch=4;
  class HoF;
  model HoF (event="Yes") = Yrs -- OPSadj;
  id Name;
  grow entropy;
  prune costcomplexity;
run;