vignettes/extract-understat-data.Rmd
This package lets users extract world football results and player statistics from several popular football (soccer) data sites, including Understat.
You can install the CRAN version of worldfootballR with:
install.packages("worldfootballR")
You can install the development version of worldfootballR from GitHub with:
# install.packages("devtools")
devtools::install_github("JaseZiv/worldfootballR")
Package vignettes have been built to help you get started with the package.
This vignette covers the functions that extract data from understat.com.
To get a list of all season team URLs for selected teams, use the understat_team_meta() function. Team names must match Understat's spelling, so it is advisable to check how Understat.com spells a team's name before passing it to the function:
team_urls <- understat_team_meta(team_name = c("Liverpool", "Manchester City"))
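The team URLs used later in this vignette follow a visible pattern: the base address, the team name with spaces replaced by underscores, and the season start year. As a rough offline sketch, assuming that pattern holds for every team (an assumption drawn only from the example URLs in this vignette), a hypothetical helper `build_understat_team_url()`, which is not part of worldfootballR, could assemble such URLs:

```r
# Hypothetical helper, NOT part of worldfootballR.
# Assumes the URL pattern seen in this vignette's examples:
#   https://understat.com/team/<Team_Name>/<season_start_year>
build_understat_team_url <- function(team_name, season_start_year) {
  # Replace spaces with underscores, e.g. "Manchester City" -> "Manchester_City"
  slug <- gsub(" ", "_", team_name, fixed = TRUE)
  paste0("https://understat.com/team/", slug, "/", season_start_year)
}

team_urls_guess <- build_understat_team_url(c("Liverpool", "Manchester City"), 2020)
print(team_urls_guess)
```

In practice understat_team_meta() is likely the safer source, since it works from the names as Understat lists them rather than from a guessed pattern.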
This section will cover the functions to aid in the extraction of season league statistics from Understat.
Understat supports a fixed set of leagues. Their names (for example "EPL" and "Ligue 1", both used in this vignette) can be passed to the league argument of most understat_ functions.
Match results from Understat include not only scores and expected goals (xG) but also forecast probabilities for each outcome (win, draw and loss). To extract the data, use the understat_league_match_results() function:
# to get the EPL results:
epl_results <- understat_league_match_results(league = "EPL", season_start_year = 2020)
dplyr::glimpse(epl_results)
#> Rows: 380
#> Columns: 18
#> $ league <chr> "EPL", "EPL", "EPL", "EPL", "EPL", "EPL", "EPL", "EPL", …
#> $ season <chr> "2020/2021", "2020/2021", "2020/2021", "2020/2021", "202…
#> $ match_id <chr> "14086", "14087", "14090", "14091", "14092", "14093", "1…
#> $ isResult <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
#> $ home_id <chr> "228", "78", "87", "81", "76", "82", "238", "220", "72",…
#> $ home_team <chr> "Fulham", "Crystal Palace", "Liverpool", "West Ham", "We…
#> $ home_abbr <chr> "FLH", "CRY", "LIV", "WHU", "WBA", "TOT", "SHE", "BRI", …
#> $ away_id <chr> "83", "74", "245", "86", "75", "72", "229", "80", "76", …
#> $ away_team <chr> "Arsenal", "Southampton", "Leeds", "Newcastle United", "…
#> $ away_abbr <chr> "ARS", "SOU", "LED", "NEW", "LEI", "EVE", "WOL", "CHE", …
#> $ home_goals <dbl> 0, 1, 4, 0, 0, 0, 0, 1, 5, 4, 1, 2, 2, 0, 0, 4, 1, 1, 2,…
#> $ away_goals <dbl> 3, 0, 3, 2, 3, 1, 2, 3, 2, 3, 3, 1, 5, 3, 2, 2, 0, 3, 3,…
#> $ home_xG <dbl> 0.126327, 1.395690, 3.154120, 0.861445, 0.352997, 0.8229…
#> $ away_xG <dbl> 2.162870, 1.262670, 0.269813, 1.659110, 2.955810, 1.2679…
#> $ datetime <chr> "2020-09-12 11:30:00", "2020-09-12 14:00:00", "2020-09-1…
#> $ forecast_win <dbl> 0.0037, 0.3916, 0.9658, 0.1506, 0.0070, 0.2200, 0.1683, …
#> $ forecast_draw <dbl> 0.0476, 0.3022, 0.0296, 0.2480, 0.0358, 0.2977, 0.2906, …
#> $ forecast_loss <dbl> 0.9487, 0.3062, 0.0046, 0.6014, 0.9572, 0.4823, 0.5411, …
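To make the forecast columns more concrete, here is a small offline sketch that rebuilds the first three rows of the output above by hand and derives a few extra columns. It assumes that forecast_win, forecast_draw and forecast_loss are expressed from the home side's point of view, which is consistent with the sample rows but is not stated by the source. The column names forecast_total, xG_diff and most_likely are illustrative additions.

```r
# First three rows of the glimpse() output above, typed in by hand.
results <- data.frame(
  home_team     = c("Fulham", "Crystal Palace", "Liverpool"),
  away_team     = c("Arsenal", "Southampton", "Leeds"),
  home_xG       = c(0.126327, 1.395690, 3.154120),
  away_xG       = c(2.162870, 1.262670, 0.269813),
  forecast_win  = c(0.0037, 0.3916, 0.9658),
  forecast_draw = c(0.0476, 0.3022, 0.0296),
  forecast_loss = c(0.9487, 0.3062, 0.0046),
  stringsAsFactors = FALSE
)

# The three probabilities should sum to (approximately) 1 for each match.
results$forecast_total <- with(results, forecast_win + forecast_draw + forecast_loss)

# Expected-goals margin from the home side's perspective.
results$xG_diff <- results$home_xG - results$away_xG

# Most likely outcome, ASSUMING the forecasts are from the home side's view.
outcomes <- c("home win", "draw", "away win")
probs <- as.matrix(results[, c("forecast_win", "forecast_draw", "forecast_loss")])
results$most_likely <- outcomes[max.col(probs, ties.method = "first")]

print(results[, c("home_team", "away_team", "forecast_total", "xG_diff", "most_likely")])
```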
To get shooting locations for a whole season in supported leagues, use the understat_league_season_shots() function:
ligue1_shot_location <- understat_league_season_shots(league = "Ligue 1", season_start_year = 2020)
The following section outlines the functions available to extract data at the per-match level.
To get shooting locations for an individual match, use the understat_match_shots() function:
wba_liv_shots <- understat_match_shots(match_url = "https://understat.com/match/14789")
dplyr::glimpse(wba_liv_shots)
#> Rows: 36
#> Columns: 20
#> $ id <chr> "422440", "422441", "422442", "422450", "422456", "422…
#> $ minute <dbl> 9, 11, 14, 35, 46, 47, 50, 61, 70, 77, 2, 3, 5, 23, 26…
#> $ result <chr> "MissedShots", "MissedShots", "Goal", "BlockedShot", "…
#> $ X <dbl> 0.869, 0.965, 0.881, 0.883, 0.957, 0.712, 0.767, 0.942…
#> $ Y <dbl> 0.441, 0.460, 0.356, 0.336, 0.590, 0.403, 0.590, 0.626…
#> $ xG <dbl> 0.03135269, 0.14474539, 0.23826571, 0.28253871, 0.0260…
#> $ player <chr> "Semi Ajayi", "Okay Yokuslu", "Hal Robson-Kanu", "Hal …
#> $ home_away <chr> "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "a",…
#> $ player_id <chr> "4490", "6932", "1738", "1738", "964", "7153", "7153",…
#> $ situation <chr> "SetPiece", "SetPiece", "OpenPlay", "OpenPlay", "FromC…
#> $ season <chr> "2020", "2020", "2020", "2020", "2020", "2020", "2020"…
#> $ shotType <chr> "Head", "Head", "LeftFoot", "LeftFoot", "Head", "LeftF…
#> $ match_id <chr> "14789", "14789", "14789", "14789", "14789", "14789", …
#> $ home_team <chr> "West Bromwich Albion", "West Bromwich Albion", "West …
#> $ away_team <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", "L…
#> $ home_goals <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ away_goals <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
#> $ date <chr> "2021-05-16 15:30:00", "2021-05-16 15:30:00", "2021-05…
#> $ player_assisted <chr> "Matheus Pereira", "Darnell Furlong", "Matheus Pereira…
#> $ lastAction <chr> "Cross", "Chipped", "Pass", "HeadPass", "Aerial", "Sta…
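Shot-level data of this shape is usually summarised before it is analysed. The sketch below types in the first four rows of the output above and totals shots, goals and xG per player_id using base R. It assumes that a result value of "Goal" marks a scored shot, as the sample suggests; other result values that might also count as goals are not handled. The helper columns shot and goal are illustrative.

```r
# First four shots from the glimpse() output above, typed in by hand.
shots <- data.frame(
  player_id = c("4490", "6932", "1738", "1738"),
  result    = c("MissedShots", "MissedShots", "Goal", "BlockedShot"),
  xG        = c(0.03135269, 0.14474539, 0.23826571, 0.28253871),
  stringsAsFactors = FALSE
)

# Indicator columns so that sum() can count shots and goals.
shots$shot <- 1L
shots$goal <- as.integer(shots$result == "Goal")  # ASSUMPTION: "Goal" = scored

# One row per player with total shots, goals and xG.
player_summary <- aggregate(cbind(shot, goal, xG) ~ player_id, data = shots, FUN = sum)
names(player_summary) <- c("player_id", "shots", "goals", "total_xG")

print(player_summary)
```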
This section covers the functions to get team-level data from Understat.
To get all shots taken and conceded by a team during a season, use the understat_team_season_shots() function:
# for one team:
man_city_shots <- understat_team_season_shots(team_url = "https://understat.com/team/Manchester_City/2020")
dplyr::glimpse(man_city_shots)
#> Rows: 886
#> Columns: 20
#> $ id <chr> "378528", "378533", "378537", "378538", "378539", "378…
#> $ minute <dbl> 15, 40, 53, 55, 58, 59, 64, 73, 77, 86, 7, 10, 19, 29,…
#> $ result <chr> "BlockedShot", "MissedShots", "MissedShots", "BlockedS…
#> $ X <dbl> 0.789, 0.892, 0.860, 0.811, 0.822, 0.886, 0.869, 0.803…
#> $ Y <dbl> 0.564, 0.409, 0.501, 0.496, 0.398, 0.473, 0.259, 0.467…
#> $ xG <dbl> 0.034228612, 0.036804341, 0.103135295, 0.053397592, 0.…
#> $ player <chr> "Pedro Neto", "Raúl Jiménez", "Daniel Podence", "Rúben…
#> $ home_away <chr> "h", "h", "h", "h", "h", "h", "h", "h", "h", "h", "a",…
#> $ player_id <chr> "6382", "4105", "8291", "6853", "8291", "4105", "6853"…
#> $ situation <chr> "OpenPlay", "FromCorner", "OpenPlay", "OpenPlay", "Ope…
#> $ season <chr> "2020", "2020", "2020", "2020", "2020", "2020", "2020"…
#> $ shotType <chr> "LeftFoot", "Head", "LeftFoot", "LeftFoot", "RightFoot…
#> $ match_id <chr> "14105", "14105", "14105", "14105", "14105", "14105", …
#> $ home_team <chr> "Wolverhampton Wanderers", "Wolverhampton Wanderers", …
#> $ away_team <chr> "Manchester City", "Manchester City", "Manchester City…
#> $ home_goals <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ away_goals <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
#> $ date <chr> "2020-09-21 19:15:00", "2020-09-21 19:15:00", "2020-09…
#> $ player_assisted <chr> "Daniel Podence", "Adama Traoré", "Adama Traoré", "Ped…
#> $ lastAction <chr> "Pass", "Cross", "Pass", "Pass", "Chipped", "Cross", "…
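Because the result mixes shots taken and shots conceded, a common first step is to label each row. The sketch below assumes that home_away identifies the side of the shooting team ("h" for home, "a" for away), which matches the sample rows above, where the "h" shots belong to home-team players. The function label_shot_side() is hypothetical, and the third data row is invented purely for illustration.

```r
# Two rows taken from the output above, plus one INVENTED away-side row.
team_shots <- data.frame(
  home_team = rep("Wolverhampton Wanderers", 3),
  away_team = rep("Manchester City", 3),
  home_away = c("h", "h", "a"),
  xG        = c(0.034228612, 0.036804341, 0.10),  # last value is made up
  stringsAsFactors = FALSE
)

# Hypothetical helper: "for" if the shot was taken by `team`, else "against".
# ASSUMPTION: home_away records which side took the shot.
label_shot_side <- function(shots, team) {
  taken <- (shots$home_away == "h" & shots$home_team == team) |
    (shots$home_away == "a" & shots$away_team == team)
  ifelse(taken, "for", "against")
}

team_shots$side <- label_shot_side(team_shots, "Manchester City")

# Total xG created and conceded.
xg_by_side <- tapply(team_shots$xG, team_shots$side, sum)
print(xg_by_side)
```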
To get a more granular breakdown of team shooting data for whole seasons, use the understat_team_stats_breakdown() function. This function returns team shooting data broken down by stat group; the group is recorded in the stat_group_name column (for example, "situation"):
#----- Can get data for single teams at a time: -----#
team_breakdown <- understat_team_stats_breakdown(team_urls = "https://understat.com/team/Liverpool/2020")
dplyr::glimpse(team_breakdown)
#> Rows: 34
#> Columns: 11
#> $ team_name <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", …
#> $ season_start_year <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
#> $ stat_group_name <chr> "situation", "situation", "situation", "situation", …
#> $ stat_name <chr> "OpenPlay", "FromCorner", "SetPiece", "DirectFreekic…
#> $ shots <int> 466, 94, 23, 22, 6, 532, 33, 31, 13, 2, 302, 135, 10…
#> $ goals <int> 49, 11, 2, 0, 6, 59, 6, 1, 2, 0, 32, 15, 8, 11, 2, 6…
#> $ xG <dbl> 59.4529072, 9.1182850, 1.8527942, 1.3437831, 4.56701…
#> $ against.shots <int> 252, 40, 21, 12, 8, 296, 20, 9, 7, 1, 161, 80, 45, 3…
#> $ against.goals <int> 28, 6, 3, 1, 4, 38, 2, 1, 0, 1, 17, 12, 5, 3, 5, 8, …
#> $ against.xG <dbl> 33.1091679, 4.2281565, 3.9210227, 0.6303308, 6.08935…
#> $ time <int> NA, NA, NA, NA, NA, 3147, 216, 134, 81, 12, 1914, 73…
#----- Or for multiple teams: -----#
# team_urls <- c("https://understat.com/team/Liverpool/2020",
# "https://understat.com/team/Manchester_City/2020")
# team_breakdown <- understat_team_stats_breakdown(team_urls = team_urls)
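The breakdown lends itself to simple derived metrics. As a sketch, the block below types in the first three "situation" rows from the Liverpool output above and computes a conversion rate and the difference between goals and xG. The column names conversion_rate and goals_minus_xG are illustrative and are not returned by the package.

```r
# First three "situation" rows from the output above, typed in by hand.
breakdown <- data.frame(
  stat_group_name = rep("situation", 3),
  stat_name       = c("OpenPlay", "FromCorner", "SetPiece"),
  shots           = c(466L, 94L, 23L),
  goals           = c(49L, 11L, 2L),
  xG              = c(59.4529072, 9.1182850, 1.8527942),
  stringsAsFactors = FALSE
)

# Share of shots that became goals.
breakdown$conversion_rate <- breakdown$goals / breakdown$shots

# Positive values mean more goals than xG predicted; negative means fewer.
breakdown$goals_minus_xG <- breakdown$goals - breakdown$xG

print(breakdown[, c("stat_name", "conversion_rate", "goals_minus_xG")])
```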
This section will cover the functions available to aid in the extraction of player data.
To get shooting locations for every game a player has appeared in (as far back as Understat has data), use the understat_player_shots() function:
raheem_sterling_shots <- understat_player_shots(player_url = "https://understat.com/player/618")
dplyr::glimpse(raheem_sterling_shots)
#> Rows: 592
#> Columns: 20
#> $ id <chr> "14490", "14491", "14496", "14497", "14779", "15104", …
#> $ minute <dbl> 20, 22, 47, 53, 8, 7, 69, 74, 65, 81, 19, 25, 47, 50, …
#> $ result <chr> "SavedShot", "Goal", "SavedShot", "MissedShots", "Miss…
#> $ X <dbl> 0.853, 0.856, 0.816, 0.745, 0.857, 0.959, 0.940, 0.968…
#> $ Y <dbl> 0.695, 0.496, 0.377, 0.443, 0.470, 0.615, 0.524, 0.646…
#> $ xG <dbl> 0.04070334, 0.31140882, 0.05760121, 0.02548105, 0.0726…
#> $ player <chr> "Raheem Sterling", "Raheem Sterling", "Raheem Sterling…
#> $ home_away <chr> "h", "h", "h", "h", "a", "a", "a", "a", "h", "h", "a",…
#> $ player_id <chr> "618", "618", "618", "618", "618", "618", "618", "618"…
#> $ situation <chr> "OpenPlay", "OpenPlay", "OpenPlay", "OpenPlay", "OpenP…
#> $ season <chr> "2014", "2014", "2014", "2014", "2014", "2014", "2014"…
#> $ shotType <chr> "LeftFoot", "RightFoot", "RightFoot", "RightFoot", "Ri…
#> $ match_id <chr> "4756", "4756", "4756", "4756", "4768", "4777", "4777"…
#> $ home_team <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", "M…
#> $ away_team <chr> "Southampton", "Southampton", "Southampton", "Southamp…
#> $ home_goals <dbl> 2, 2, 2, 2, 3, 0, 0, 0, 0, 0, 3, 3, 3, 3, 1, 1, 1, 1, …
#> $ away_goals <dbl> 1, 1, 1, 1, 1, 3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
#> $ date <chr> "2014-08-17 13:30:00", "2014-08-17 13:30:00", "2014-08…
#> $ player_assisted <chr> "Philippe Coutinho", "Jordan Henderson", "Jordan Hende…
#> $ lastAction <chr> "Pass", "Throughball", "Pass", "Pass", "Chipped", "Pas…
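Note that the date column comes back as character. If you want to filter or group by time, one option is to parse it first. The sketch below uses base R and the date string shown in the output above; the time zone is not stated by the source, so "UTC" here is an assumption chosen only to make parsing deterministic.

```r
# Date strings in the format shown in the output above.
shot_dates <- c("2014-08-17 13:30:00", "2014-08-17 13:30:00")

# Parse to date-times. ASSUMPTION: the time zone is unknown, so UTC is used.
parsed <- as.POSIXct(shot_dates, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Derive calendar fields that are convenient for grouping.
shot_year  <- as.integer(format(parsed, "%Y"))
shot_month <- as.integer(format(parsed, "%m"))

print(data.frame(date = shot_dates, year = shot_year, month = shot_month))
```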
To get stats for all players of selected teams, run the understat_team_players_stats() function.
Note: Team URLs can be extracted using understat_team_meta().
team_players <- understat_team_players_stats(team_url = c("https://understat.com/team/Liverpool/2020", "https://understat.com/team/Manchester_City/2020"))
dplyr::glimpse(team_players)
#> Rows: 52
#> Columns: 19
#> $ season <chr> "2020/2021", "2020/2021", "2020/2021", "2020/2021", "2020…
#> $ player_id <dbl> 1250, 838, 482, 6854, 771, 1791, 229, 332, 605, 833, 966,…
#> $ player_name <chr> "Mohamed Salah", "Sadio Mané", "Roberto Firmino", "Diogo …
#> $ games <dbl> 37, 35, 36, 19, 38, 36, 24, 10, 21, 5, 13, 33, 38, 24, 17…
#> $ time <dbl> 3085, 2805, 2882, 1114, 2961, 3040, 1865, 701, 1710, 370,…
#> $ goals <dbl> 22, 11, 9, 9, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0…
#> $ xG <dbl> 20.25084664, 14.82854776, 12.86021341, 7.05772054, 2.8174…
#> $ assists <dbl> 5, 7, 7, 0, 0, 7, 0, 2, 1, 0, 1, 0, 7, 2, 1, 0, 0, 1, 0, …
#> $ xA <dbl> 6.52852610, 7.78775485, 6.11686340, 1.76252108, 1.6629218…
#> $ shots <dbl> 126, 94, 83, 46, 31, 55, 22, 5, 14, 4, 8, 1, 19, 19, 15, …
#> $ key_passes <dbl> 55, 61, 44, 12, 21, 77, 30, 3, 14, 0, 2, 0, 65, 12, 7, 0,…
#> $ yellow_cards <dbl> 0, 3, 2, 2, 1, 2, 4, 2, 0, 1, 0, 1, 2, 2, 2, 0, 0, 3, 0, …
#> $ red_cards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
#> $ position <chr> "F M S", "F M S", "F M S", "F M S", "M S", "D S", "M S", …
#> $ team_name <chr> "Liverpool", "Liverpool", "Liverpool", "Liverpool", "Live…
#> $ npg <dbl> 16, 11, 9, 9, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0…
#> $ npxG <dbl> 15.68383364, 14.82854776, 12.86021341, 7.05772054, 2.8174…
#> $ xGChain <dbl> 28.9682335, 24.9989232, 25.2714578, 10.9729747, 13.922181…
#> $ xGBuildup <dbl> 9.8002357, 6.0576563, 10.1985501, 4.0760965, 10.4762768, …
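Season totals are easier to compare across players once they are scaled by playing time. The sketch below types in the first three rows of the output above (identified by player_id) and derives per-90-minute rates. It assumes that time is minutes played and that npg means non-penalty goals; both readings fit the sample but are not confirmed by the source. The helper per_90() and the new column names are illustrative.

```r
# First three players from the output above, typed in by hand.
players <- data.frame(
  player_id = c(1250, 838, 482),
  time      = c(3085, 2805, 2882),   # ASSUMPTION: minutes played
  goals     = c(22, 11, 9),
  xG        = c(20.25084664, 14.82854776, 12.86021341),
  npg       = c(16, 11, 9)           # ASSUMPTION: non-penalty goals
)

# Hypothetical helper: scale a season total to a per-90-minute rate.
per_90 <- function(x, minutes) x / minutes * 90

players$goals_per_90 <- per_90(players$goals, players$time)
players$xG_per_90    <- per_90(players$xG, players$time)

# Under the npg assumption, the remainder is goals scored from penalties.
players$penalty_goals <- players$goals - players$npg

print(players[, c("player_id", "goals_per_90", "xG_per_90", "penalty_goals")])
```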