Philip Pallmann, Analyzing Baseball Data with R, Journal of the Royal Statistical Society Series A: Statistics in Society, Volume 178, Issue 4, October 2015, Page 1099, https://doi.org/10.1111/rssa.3_12138
Analyzing Baseball Data with R is at the same time a quick guide to R programming, a primer on baseball data analysis (‘sabermetrics’) and an expedition into some highlights in professional baseball's history. The authors present a potpourri of well-conceived case-studies that give an insight into both the game's complexity and R’s simplicity. Virtually no previous knowledge of statistical theory and software is required to master the data analyses and to follow the explications in this book; however, some familiarity with baseball jargon is clearly helpful. It goes without saying that a large amount of sports enthusiasm is necessary to enjoy this entertaining read fully.
The authors accomplish a wide range of tasks with basic statistical tools: moving averages to explore streaks of hits and outs; linear modelling of players’ career batting trajectories; LOESS smoothing to depict a batter's tendency to swing at pitches depending on their location. They show how surprisingly easy it is to simulate a half-inning with a Markov chain, or a whole Major League Baseball season with the Bradley–Terry model. In addition, they introduce a few sabermetric specialities such as the Pythagorean expectation formula, run values and run expectancy matrices. All this results in a convenient toolbox for tackling questions that have bothered most baseball fans at one time or another: what effects do certain ballparks have on players’ performances? Do umpires change their behaviour according to the current count of balls and strikes? How exceptional (or improbable) was Joe DiMaggio's famous 56-game hitting streak in 1941?
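For readers unfamiliar with two of the tools named above, the following is a minimal sketch, written in Python rather than the book's R and not taken from the book. It assumes the classic form of the Pythagorean expectation with exponent 2 and the standard logistic form of the Bradley–Terry model; the function names and the run totals are hypothetical and serve only as illustration.

```python
import math


def pythagorean_expectation(runs_scored, runs_allowed, exponent=2.0):
    """Estimate a team's winning fraction from runs scored and allowed.

    Assumes the classic formula RS^k / (RS^k + RA^k) with k = 2;
    other exponents are sometimes used.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


def bradley_terry_win_prob(talent_a, talent_b):
    """Probability that team A beats team B under a Bradley-Terry model.

    Talents are on a log scale, so the probability depends only on
    their difference (a logistic function of talent_a - talent_b).
    """
    return 1.0 / (1.0 + math.exp(-(talent_a - talent_b)))


# Hypothetical season totals: 800 runs scored, 700 runs allowed.
print(round(pythagorean_expectation(800, 700), 3))  # prints 0.566

# Two equally talented teams are a coin flip.
print(bradley_terry_win_prob(0.0, 0.0))  # prints 0.5
```

Simulating a whole season along these lines would amount to drawing each game's winner from the Bradley–Terry probability for the two teams involved.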
The authors’ style of writing is pleasurable and bespeaks their passion for the game. Narratives and R commands are so smoothly intermingled that the source code hardly disturbs the flow of reading, and a wealth of graphs break up the grey. In fact, two entire chapters are dedicated to plotting data and the R graphic packages lattice and ggplot2. A great asset of the book is that it encourages readers to learn the ropes of sabermetrics by actually running the example analyses on their own computers. All data are available from public on-line repositories that hold information at different levels of detail (records per season, game, play or pitch). The authors provide step-by-step guidance on how to access and trawl these databases, how to download portions of data and handle them in MySQL, and how to reshape data sets directly within R. More than 60 well-wrought exercises further whet the readers’ appetite for launching into their own analyses, and maybe even for transferring some of the ideas to other sports.