Good Evening,
I am required to do some analysis of Football (Soccer) teams. In a dataset of match history (sorted ascending by date), I have the following fields:
HomeTeam
AwayTeam
FTR
FTR being the full time result and can take the value of "H" (Home Win) "D" (Draw) or "A" (Away Win)
For each match, I need to count the number of Wins/Draws/Losses for each teams last 36 games.
So for instance, say an observation in the dataset is Manchester United v Arsenal, I want to count number of Win/Draw/Losses for Manchester United in their previous 36 games (consisting of both their home & away games) and also the same for Arsenal.
So the output would look like the following (sample data used):
HomeTeam AwayTeam FTR H_Form_W H_Form_D Home_Form_L A_Form_W A_Form_D A_Form_L
Man UTD Arsenal H 25 5 6 19 6 10
This would seemingly be easy if only home games were needed for the home team and away games for the away team, however, I suspect a more complex macro is needed to handle the fact that when querying the dataset, the object we are looking for can appear in both the home & away team field.
Many Thanks for any help,
cxkev182
Assuming you need the last 36 games, no matter whether they are home or away:
data split;
set have;
team = HomeTeam;
if FTR='H' them result='W';
else if FTR='A' then result='L';
else result='D';
output;
team = AwayTeam;
if FTR='A' then result='W';
else if FTR = 'H' then result='L';
else result='D';
output;
keep team date result;
run;
proc sort data=split;
by team descending date;
run;
data want;
set split;
by team;
if first.team then do;
wins = 0;
losses = 0;
draws = 0;
end;
if wins + losses + draws < 36 then do;
if result='W' then wins + 1;
else if result='L' then losses + 1;
else draws + 1;
end;
if last.team;
run;
Does this meet all the requirements?
The 'brute force' method would be to use the 'point' option on the set statement to loop back through the file identifying whether the team in question was in the match, recording the result, and stopping once you have reached 36 matches or the start of the file. No macros required.
BruceBrad, first of all, thanks very much for the reply. I am not familiar with the point option. I have googled to get some examples but can't seem to get my head round it. Could I ask for a sample piece of code how I could code this?
Thanks again for your help, I really appreciate it.
Reasonably simple solutions should be possible. Is there an additional variable in your data set that tells you the date of the match? How do you know that the observations are in order? Start there, and I'll be able to post something later this afternoon.
Astounding, yes sorry there is a variable simply called 'Date', It's from this that I know and can keep the dataset in order. Look forward to any help you can give.
Have you considered changing your data structure?
Team Location FTR form_w form_d form_l
Manchester Away H 25 5 6
Arsenal Home AL 19 6 10
I think this may be more conducive to data analysis.
Actually, I think Reeza's approach is the neatest way to go.
Step 1. Create two records from each of your records - one for each team as Reeza suggests
Step 2. Sort by date within team, and create a variable COUNT that counts the number of times the team has competed. That is, starting at 1 for the first occurence of the team etc.
Step 3. Create a dataset which only has the last record for each team.
Step 3. Merge this dataset with the step 2 dataset, selecting records where count is greater than the current value of count minus 16.
Step 4. Summarise this merged data, counting the number of wins etc.
Bruce,
This sounds about right, and was along the lines of what I was thinking as well. One suggestion: If you sort within team by DESCENDING date, you can accomplish the rest in a single additional DATA step.
Good luck.
Thanks again for the reply all. I think this approach would only get me the last x home games and last x away games for each teams, however, because teams don't always play home, away, home I wouldn't be capturing the right data in some instances for example if a team plays 2 home games in a row. I will definitely go with this approach however, if there is no alternative.
Assuming you need the last 36 games, no matter whether they are home or away:
data split;
set have;
team = HomeTeam;
if FTR='H' them result='W';
else if FTR='A' then result='L';
else result='D';
output;
team = AwayTeam;
if FTR='A' then result='W';
else if FTR = 'H' then result='L';
else result='D';
output;
keep team date result;
run;
proc sort data=split;
by team descending date;
run;
data want;
set split;
by team;
if first.team then do;
wins = 0;
losses = 0;
draws = 0;
end;
if wins + losses + draws < 36 then do;
if result='W' then wins + 1;
else if result='L' then losses + 1;
else draws + 1;
end;
if last.team;
run;
Does this meet all the requirements?
Thanks to everyone for the help as you have all contributed. I wish I could give you all the correct answer as Reeza had the initial idea of structuring the data in this way, Bruce elaborated further on it, but I have set the correct answer to Astounding for taking the ideas and putting them into code for me.
Many thanks to all.
SAS Innovate 2025 is scheduled for May 6-9 in Orlando, FL. Sign up to be first to learn about the agenda and registration!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.