Hi everyone,
When trying to predict the final ranking of a football competition (the dependent variable) based on multiple independent variables, I wanted to use an ordered logit/probit model.
I have therefore split my data into training data and test data, as I want the model to learn from the training data and make predictions on the test data.
My code is the following:
proc logistic data=trainingdata;
   /* ranking has more than two ordered levels, so PROC LOGISTIC fits a cumulative (ordinal) logit model;
      adding / link=probit to the MODEL statement would give an ordered probit instead */
   model ranking = &inputvariables;
   /* score the held-out seasons; the output gets one predicted-probability column per rank level */
   score data=testdata out=work.ologitoutput;
run;
The problem is that the predicted rankings are not unique: not every rank is assigned exactly once.
For example, the model places 3 teams in first position in the final ranking, no team in second place, and so on. I think the problem is that the model simply assigns each team the ranking with the highest predicted probability for that team alone, without looking at the other teams' probabilities.
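For reference, the scored output data set contains one predicted-probability column per rank level (P_1, P_2, ... when the rankings are coded 1, 2, 3, ...) plus I_ranking, which is simply the level with the largest probability for that club considered in isolation; that is where the duplicate rank-1 assignments come from. A minimal way to inspect it (the VAR list assumes integer-coded rankings starting at 1):

proc print data=work.ologitoutput (obs=5);
   /* I_ranking = per-club argmax level; P_1-P_3 = predicted probabilities of ranks 1 to 3 */
   var I_ranking P_1 P_2 P_3;
run;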
Could anyone help me out? I would really appreciate it!
Best regards,
Simon
Hello,
You touch on an interesting problem.
I don't think I have ever had to make an ordered (ordinal) prediction where each 'category' could only be predicted (assigned) once.
Your diagnosis of why you get three rank-1 predictions and no rank-2 prediction seems correct to me.
Although an ordered logit model is definitely a good choice,
I doubt that it can all be done with a simple extension to the (simple) code that you provide.
I say that an ordered logit model is a good choice because it outperformed other (more complex) models in this study:
Forecasting the FIFA World Cup – Combining result- and goal-based team ability parameters
Pieter Robberechts and Jesse Davis
KU Leuven, Department of Computer Science
(It's the university where I studied, by the way 😇)
I haven't read the article yet, and hence cannot offer a ready-to-consume answer.
Also have a look at this interesting blog post (it does not answer your question directly, though):
Basketball tournaments, Moneyball, and sports analytics
By Robert Allison on SAS Learning Post March 21, 2013
https://blogs.sas.com/content/sastraining/2013/03/21/march-madness-moneyball-and-sports-analytics/
Somebody will surely provide an appropriate answer; I will follow up with great interest.
Cheers,
Koen
Could you come up with a tie-breaker among the current tied ranks (say previous year's ranks?), and then use the new "synthesized ranks" as the outcome measure for your training data set?
Of course, you would want a tie-breaking rule that would make the new ranks completely distinct.
I say the following as someone who once knew a tiny bit about model estimation but was never involved in rank prediction:
Really, tied ranks in the training data set just tell you that there's not much difference between adjacent ranks, yes?
If so, isn't there some justification for randomly breaking ties, with the knowledge that adjacent ranks in the subsequent test data set won't be meaningfully distinct? Maybe this is a case where some form of random resampling over repeated randomized tie-breaks of the training data would make sense.
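A minimal sketch of one such randomized tie-break, assuming the training data are in trainingdata with columns year and ranking (names taken from the thread; the seed and the intermediate data set names are placeholders):

/* jitter tied ranks with a uniform draw, then re-rank within each year */
data train_jitter;
   set trainingdata;
   if _n_ = 1 then call streaminit(12345);   /* fixed seed so the draw is reproducible */
   tiebreak = rand("uniform");
run;

proc sort data=train_jitter;
   by year ranking tiebreak;                 /* tied ranks now fall in random order */
run;

data train_ranked;
   set train_jitter;
   by year;
   if first.year then new_rank = 0;
   new_rank + 1;                             /* distinct integer ranks within each year */
run;

Repeating this with different seeds and refitting the model each time would give the kind of random resampling described above.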
Hi @mkeintz ,
Thank you for your reply, a tie-breaker is an interesting idea!
PROC LOGISTIC always estimates the probability of a club achieving ranking x, and the ranking with the highest probability is then assigned to the club. Maybe an interesting tie-breaker would be the probability of obtaining the first ranking? That would be an intuitive way to give the stronger team the higher ranking in case of a tie.
Do you think this makes sense?
@Simon123 wrote:
Hi @mkeintz ,
Thank you for your reply, a tie-breaker is an interesting idea!
PROC LOGISTIC always estimates the probability of a club achieving ranking x, and the ranking with the highest probability is then assigned to the club. Maybe an interesting tie-breaker would be the probability of obtaining the first ranking? That would be an intuitive way to give the stronger team the higher ranking in case of a tie.
Do you think this makes sense?
So you want to run two logistic models on the training data set: (1) one to generate the probability of achieving first rank, which provides the tie-break scores, and (2) one using the adjusted ranks (i.e., with ties broken by those scores) as the response for the main model estimation.
Frankly, I have no opinion, but it does offer the advantage of avoiding the use of some external source of tie-breaking data. Fair warning: I've never done this, so I offer no experience.
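A rough sketch of that two-stage idea, reusing the names from the thread (trainingdata, testdata, ranking, year, &inputvariables); the intermediate data set and variable names (train_bin, firstprob, train_adj, is_first, p_first, adj_rank) are made up for illustration:

/* Stage 1: binary model for the probability of finishing first */
data train_bin;
   set trainingdata;
   is_first = (ranking = 1);
run;

proc logistic data=train_bin;
   model is_first(event="1") = &inputvariables;
   output out=firstprob p=p_first;           /* P(rank = 1) for each club-year */
run;

/* break ties in the training response with p_first, then re-rank within each year */
proc sort data=firstprob;
   by year ranking descending p_first;
run;

data train_adj;
   set firstprob;
   by year;
   if first.year then adj_rank = 0;
   adj_rank + 1;                             /* distinct integer ranks within each year */
run;

/* Stage 2: ordinal model on the adjusted ranks, scored on the test data */
proc logistic data=train_adj;
   model adj_rank = &inputvariables;
   score data=testdata out=work.ologitoutput2;
run;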
Data from sports competitions tend to be presented as results from pairs of competitors and are often modeled using the Bradley-Terry model, as discussed and illustrated in this note. It would help if you described the data you have. It's not clear what the observations in your data represent or how you already have a ranking variable to use as a response, but it suggests that your data are not like those in the note above.
Hi @StatDave ,
Thank you for your reply, my data looks as follows for each of the entities (football clubs):
Football club | Year | Ranking obtained (= dependent variable) | Turnover | Net income | ...
The goal is to predict the final ranking directly rather than to model it indirectly through pairs of competitors, because of the nature of the independent variables (Turnover, Net income, ...). The independent variables are financial figures from the football clubs' annual reports prior to the season in which the ranking is obtained.
By using the first years of my study period as training data and keeping the last years as test data, I hoped the ordered probit model would pick up that, in each year, each ranking is given exactly once.
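For what it's worth, the split itself is just a DATA step on the year variable; the combined data set name (alldata) and the cutoff year below are made up for illustration:

data trainingdata testdata;
   set alldata;
   if year <= 2016 then output trainingdata;   /* first years of the study period */
   else output testdata;                       /* last years held out for testing */
run;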
So an observation is the ranking of one club in one year, presumably with sets of observations for each of several years. If you assume independent observations, you could fit a binary logistic model (not including Year as a predictor) to a dichotomized version of your response, where each club either was ranked 1 in a given year or was not. You could then use the average predicted probability of rank = 1 for each club to order the clubs. If the predictors in the model are truly continuous, I wouldn't expect any ties; that is, use actual values if available, and don't use categorized versions of continuous variables (like income categorized into, say, 5 levels), as that could cause ties.
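A minimal sketch of that approach, reusing the names from this thread (trainingdata, testdata, ranking, &inputvariables); the club identifier club and the intermediate data set names are assumptions:

/* dichotomize the response: did the club finish first in that year or not? */
data train_bin1;
   set trainingdata;
   first_place = (ranking = 1);
run;

/* binary logistic model; Year deliberately excluded as a predictor */
proc logistic data=train_bin1;
   model first_place(event="1") = &inputvariables;
   score data=testdata out=scored_test;        /* adds P_1 = predicted P(first_place = 1) */
run;

/* average P(rank = 1) per club across the scored years */
proc means data=scored_test noprint nway;
   class club;
   var P_1;
   output out=club_avg mean=avg_p_first;
run;

/* order the clubs by that average: highest average probability = predicted rank 1 */
proc sort data=club_avg;
   by descending avg_p_first;
run;

data predicted_order;
   set club_avg;
   predicted_rank = _n_;
run;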