I'm working with a team for the upcoming season, and have been tasked with compiling statistics tracking data after every session (training and/or game), and creating cumulative variables based on the data. In addition to creating the cumulative variables, I need to append the relevant variables from the previous 3 sessions that each player has attended, in order to track progression.
The variables that I need to keep from each sessions are called : Distance, Sprint_Distance & Top_Speed. The session data is extracted in CSV format for any player that has attended, and I append each sessions data to a master session dataset, which I need to be in order to create a count of how many sessions each player has attended. When reading in the session data, I append a numeric value to the end of each field, which indicates the session number. So session 1 will have field names:
Distance_1
Sprint_Distance_1
Top_Speed_1
Session 2 will have '_2' appended at the end, session 3 has '_3' appended and so on for each subsequent session.
While I've created the cumulative variables using the MEAN function, I'm having trouble extracting the data from the previous 3 sessions for each player. What I've tried is the following, extracting the last record and then deleting it so the next step will not extract the same record.
data previous1
sessions1;
set path.raw_session_data;
by Player_Name;
if last.Player_Name then output previous1;
if last.Player_Name then delete;
output sessions1;
run;
data previous2
sessions2;
set sessions1;
by Player_Name;
if last.Player_Name then output previous2;
if last.Player_Name then delete;
output sessions2;
run;
data previous3
sessions3;
set sessions2;
by Player_Name;
if last.Player_Name then output previous3;
if last.Player_Name then delete;
output sessions3;
run;
data prev_sessions;
merge previous1
previous2
previous3;
by Player_Name;
run;
What I'm looking to do is to keep only the variables where data is available when identifying the last 3 sessions attended, instead of appending all session variables to each of the previous1-3 datasets being created. The problem is that I don't know the variable names that will be extracted, as players won't have attended every session so far. For example, player 1 has attended sessions 1, 2, 4 and 6, player 2 has attended sessions 1, 3, 5 and 6, player 3 has attended all sessions 1, 2, 3, 4, 5, 6 etc. so these numbers will be appended to the variable names as outlined previously.
I will then need to rename the variables to something like :
Previous_Distance_1 for the most recent previous session;
Previous_Distance_2 for the second most recent previous session;
Previous_Distance_3 for the third most recent previous session;
Hopefully I'm making sense!! Any help will be greatly appreciated!!
Just make a vertical dataset instead. Assuming that you already have the data in this format where the variable names in each dataset is different then here is a quick way to get it into a more normal structure. This assumes that the individual datasets are already sorted by player_name.
data want ;
length Player_Name $30 session_no Distance Sprint_Distance Top_Speed 8;
set session_1 - session_6 ;
by player_name;
if first.player_name then session_no=0;
session_no+1;
distance = coalesce(of distance_:);
sprint_distance = coalesce(of sprint_distance_:);
top_speed = coalesce(of top_speed_:);
keep player_name -- top_speed;
run;
Post test data in the form of a datastep:
Show what the output you want from that test data is.
Due to restrictions with the connection where I am working from at present, I'm unable to create the dataset using the linked instructions. Below is a manual creation of the input session data - hope this works!
data WORK.CLASS(label='Session Data');
infile datalines dsd truncover;
input Name:$10. Distance_1:$10. Sprint_Distance_1:$10. Top_Speed_1:$10.;
datalines;
Player1 5.5397 1259.38 8.009
Player2 5.9846 1009.018 7.5334
Player3 6.1655 913.654 7.7067
Player4 6.2424 881.055 7.9001
Player5 6.1148 1010.969 8.089
;;;;
This will be repeated for subsequent sessions, but if any of players 1-5 didn't attend, then no data will be appended for them.
When trying to extract the last 3 sessions attended per player, the steps listed on the initial steps will append all fields to date i.e. Distance_1-6 etc. I only need to keep the final 3 that are populated.
What I'm ultimately looking for would be something like:
Player Name | Number of Sessions attended | Average Distance (km) | Previous Distance 1 | Previous Distance 2 | Previous Distance 3 | Average Sprint Distance (m) | Previous Sprint Distance 1 | Previous Sprint Distance 2 | Previous Sprint Distance 3 | Average Top Speed | Previous Top Speed 1 | Previous Top Speed 2 | Previous Top Speed 3 |
Player1 | 5 | 6.44056 | A | B | C | 1350.05 | D | E | F | 8.01428 | G | H | I |
Maxim 19: Don't keep Data in Structure
Which means: don't put the session number in column names. Use the same three columns throughout, and put the session number in it's own column. This "long" dataset will be much easier to work with.
How would this work when I need to merge the needed variables together, and then ultimately trying to create the averages?
@ciaranhillery wrote:
How would this work when I need to merge the needed variables together, and then ultimately trying to create the averages?
You use by-group processing in the relevant procedures. The "wide" format prevents this effectively, and is therefore not used for processing, only for display when data is prepared for human consumption.
See this as a short example:
data WORK.CLASS1;
input Name :$10. Distance Sprint_Distance_1 Top_Speed_1;
datalines;
Player1 5.5397 1259.38 8.009
Player2 5.9846 1009.018 7.5334
Player3 6.1655 913.654 7.7067
Player4 6.2424 881.055 7.9001
Player5 6.1148 1010.969 8.089
;
run;
data WORK.CLASS2;
input Name :$10. Distance Sprint_Distance Top_Speed;
datalines;
Player1 5.5397 1259.38 8.009
Player2 5.9846 1009.018 7.5334
Player3 6.1655 913.654 7.7067
Player5 6.1148 1010.969 8.089
;
run;
data WORK.CLASS3;
input Name :$10. Distance Sprint_Distance Top_Speed;
datalines;
Player1 5.5397 1259.38 8.009
Player2 5.9846 1009.018 7.5334
Player4 6.2424 881.055 7.9001
Player5 6.1148 1010.969 8.089
;
run;
data all;
set work.class: indsname=inds;
session = input(substr(scan(inds,2,'.'),6),best.);
run;
proc sort data=all;
by name;
run;
proc means data=all mean;
by name;
var Distance Sprint_Distance_1 Top_Speed_1;
output out=want mean()=;
run;
Just make a vertical dataset instead. Assuming that you already have the data in this format where the variable names in each dataset is different then here is a quick way to get it into a more normal structure. This assumes that the individual datasets are already sorted by player_name.
data want ;
length Player_Name $30 session_no Distance Sprint_Distance Top_Speed 8;
set session_1 - session_6 ;
by player_name;
if first.player_name then session_no=0;
session_no+1;
distance = coalesce(of distance_:);
sprint_distance = coalesce(of sprint_distance_:);
top_speed = coalesce(of top_speed_:);
keep player_name -- top_speed;
run;
Thanks Tom, I've been able to work around this code to get what I need.
Thank you to everyone for getting back to me!
Good news: We've extended SAS Hackathon registration until Sept. 12, so you still have time to be part of our biggest event yet – our five-year anniversary!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.