Re: Ranking prediction | ordered logit/probit

Simon123 · Posted 04-11-2021 05:52 AM

Hi everyone,

When trying to predict the final ranking of a football competition (dependent variable) based on multiple independent variables, I wanted to use an ordered logit/probit model.

Therefore I have split my data into training data and test data as I wanted the model to learn from the training data and make predictions about the test data.

My code is the following:

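proc logistic data=trainingdata;
   /* ranking has more than two levels, so a cumulative (ordered) logit model is fitted by default; */
   /* adding "/ link=probit" to the MODEL statement would fit an ordered probit model instead */
   model ranking = &inputvariables;
   /* score the held-out data: the output holds the predicted probabilities and the predicted rank */
   score data=testdata out=work.ologitoutput;
run;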

The problem is that the predicted rankings are not unique: not every rank is assigned exactly once.
For example, the model places 3 teams in first position in the final ranking, no team in second, and so on. I think the problem is that the model simply assigns each team to the rank with the highest probability for that team alone, without taking the other teams' probabilities into account.

Could anyone help me out? I would really appreciate it!
Best regards,
Simon

sbxkoenk · Posted 04-11-2021 09:50 AM

Hello,

You touch on an interesting problem.

I don't think I have ever had to make an ordered (ordinal) prediction where each 'category' could only be predicted (assigned) once.

Your diagnosis of why you get a rank 1 prediction three times and never a rank 2 prediction seems correct to me.
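
A quick way to check this in the scored output is to count how often each rank is predicted per season. This is only a sketch: it assumes the test data contain a season variable called year (a hypothetical name), and it relies on the SCORE output storing the predicted level in a variable named I_ranking.

```sas
/* Assumption: work.ologitoutput comes from the SCORE statement in the original post */
/* and carries a season variable named year (hypothetical name).                    */
/* I_ranking is the predicted ("into") response level for each club.                */
proc freq data=work.ologitoutput;
   tables year*I_ranking / norow nocol nopercent;
run;
```

Any cell count above 1 means several clubs were assigned the same rank in that season, and a zero means a rank was never assigned.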

Although an ordered logit model is definitely a good choice, I doubt that it can all be done with a simple extension to the (simple) code that you provided.

I say that an ordered logit model is a good choice because it outperformed other (more complex) models in this study:

Forecasting the FIFA World Cup – Combining result- and goal-based team ability parameters
Pieter Robberechts and Jesse Davis
KU Leuven, Department of Computer Science

(It's the university where I have studied by the way 😇)

I haven't read the article yet, and hence cannot come up with a ready-to-use answer.

Also look at this interesting blog post (it does not provide an answer to your question, though):

Basketball tournaments, Moneyball, and sports analytics
By Robert Allison on SAS Learning Post March 21, 2013
https://blogs.sas.com/content/sastraining/2013/03/21/march-madness-moneyball-and-sports-analytics/

Somebody will surely provide an appropriate answer. I will follow up with great interest.

Cheers,

Koen

Simon123 · Posted 04-11-2021 09:57 AM

Hi Koen,
Thank you for your reply.

I have also read some interesting papers that try to predict a final ranking and in which the ordered probit model outperforms the alternatives. However, none of them mentioned my problem, so I guess there must be a solution.

Regards,
Simon

mkeintz · Posted 04-11-2021 12:45 PM

Could you come up with a tie-breaker among the current tied ranks (say previous year's ranks?), and then use the new "synthesized ranks" as the outcome measure for your training data set?

Of course, you would want a tie-breaking rule that would make the new ranks completely distinct.

I say the following as a person who once knew a tiny bit about model estimation, but was never involved in rank prediction:

Really, tied ranks in the training data set just tell you that there's not much difference between adjacent ranks, yes?

If so, isn't there some justification for randomly breaking ties, with the knowledge that adjacent ranks in the subsequent test data set won't be distinct? Maybe this is a case where some kind of random resampling of the randomized tie-breaking of the training data would make sense.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Simon123 · Posted 04-13-2021 03:35 AM

Hi @mkeintz ,

Thank you for your reply, a tie-breaker is an interesting idea!

PROC LOGISTIC estimates, for each club, the probability of achieving each rank x, and the rank with the highest probability is then assigned to the club. Maybe an interesting tie-breaker would be the probability of finishing first? This would be an intuitive way to give the stronger teams the higher rank in case of a tie.

Do you think this makes sense?
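
The tie-breaker described above could be sketched roughly as follows. It is an untested illustration under several assumptions: the scored data set is work.ologitoutput from the original post, it contains a season variable named year (hypothetical), the response levels are the numbers 1, 2, 3, ... so that the probability of rank 1 is stored as P_1, and the predicted rank is stored as the character variable I_ranking.

```sas
/* Convert the predicted rank (character variable I_ranking) to a number for sorting */
data work.tiebreak;
   set work.ologitoutput;
   pred_rank = input(strip(I_ranking), 32.);
run;

/* Within each season: order by predicted rank, and break ties by the       */
/* higher predicted probability of finishing first (P_1).                   */
proc sort data=work.tiebreak;
   by year pred_rank descending P_1;
run;

/* Number the clubs 1, 2, 3, ... within each season in that order */
data work.uniqueranks;
   set work.tiebreak;
   by year;
   if first.year then final_rank = 0;
   final_rank + 1;   /* sum statement: the counter is retained across rows */
run;
```

The resulting final_rank is unique within each season, but it is a post-processing step on the predictions and not something the ordered model itself enforces.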

mkeintz · Posted 04-13-2021 12:32 PM

So you want to run two logistic models on the training data set: (1) one that estimates the probability of achieving first rank, to generate tie-break scores, and (2) one that uses the adjusted ranks (i.e. with ties broken by those scores) as the outcome for the main model estimation.

Frankly, I have no opinion, but it does offer the advantage of avoiding the use of some external source of tie-breaking data. Fair warning: I've never done this, so I offer no experience.

StatDave · Posted 04-11-2021 04:04 PM

Data from sports competitions tend to be presented as results from pairs of competitors and are often modeled using the Bradley-Terry model, as discussed and illustrated in this note. It would help if you would describe the data you have. It's not clear what the observations in your data represent and how you already have a ranking variable to use as a response, but that suggests your data are not like those in the note above.
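
For reference, the Bradley-Terry model in its standard form gives each competitor $i$ a latent ability $\beta_i$ and models the outcome of a single pairing. The symbols below are generic notation, not taken from the note.

```latex
\[
P(i \text{ beats } j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}},
\qquad
\operatorname{logit} P(i \text{ beats } j) = \beta_i - \beta_j .
\]
```

Fitting it therefore requires one observation per pairing (who played whom, and who won), which is why the structure of the data matters here.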

Simon123 · Posted 04-13-2021 03:44 AM

Hi @StatDave ,

Thank you for your reply. For each entity (football club), my data look as follows:
Football club | Year | Ranking obtained (= dependent variable) | Turnover | Net income | ....

The goal is to predict the final ranking directly rather than to model it indirectly through pairs of competitors, because of the 'nature' of the independent variables (Turnover, Net income, ...). The independent variables are financial information from the football clubs' annual reports prior to the season for which the ranking is obtained.

By splitting the first years of my study period into training data and keeping the last years as test data, I hoped that the ordered probit model would notice that in each year, each ranking is given exactly once.

StatDave · Posted 04-13-2021 01:28 PM

So an observation is the ranking of one club in a given year, and presumably you have sets of such observations for each of several years. If you assume independent observations, you could fit a binary logistic model (not including Year as a predictor) to a dichotomized version of your response, where each club was either ranked 1 in a year or not. You could then use the average predicted probability of rank=1 for each club to order the clubs. If the predictors in the model are really continuous, I wouldn't think there would be any ties. That is, use actual values if available; don't use categorized versions of continuous variables (like income categorized into, say, 5 levels), as that could cause ties.
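
A minimal, untested sketch of this approach, reusing trainingdata, testdata, ranking and &inputvariables from the original post. The variable club and all work.* data set names are hypothetical, and P_1 is the name the SCORE output is expected to give the predicted probability of the level 1.

```sas
/* Dichotomize the response: 1 if the club finished first that year, else 0 */
/* (a missing ranking would also be coded 0 here)                           */
data work.train_bin;
   set trainingdata;
   first_place = (ranking = 1);
run;

/* Binary logistic model for finishing first; Year is not used as a predictor */
proc logistic data=work.train_bin;
   model first_place(event='1') = &inputvariables;
   score data=testdata out=work.binoutput;
run;

/* Average the predicted probability of finishing first for each club */
proc means data=work.binoutput noprint nway;
   class club;
   var P_1;
   output out=work.clubavg mean=avg_p_first;
run;

/* Order the clubs: the highest average probability gets rank 1 */
proc rank data=work.clubavg out=work.clubranked descending ties=low;
   var avg_p_first;
   ranks pred_rank;
run;
```

To order the clubs within each season instead of overall, one option would be to sort work.binoutput by the season variable and run PROC RANK on P_1 with a BY statement for that variable.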
