Analyzing Baseball Data with R (3e) Book Club
Welcome
Book club meetings
Pace
1 The Baseball Datasets
Overview
Baseball terms
Lahman Database
Lahman from R
Example Uses for Lahman
Retrosheet Game-by-Game Data
Retrosheet Play-by-Play Data
Accessing the data
Pitch-by-Pitch Data
Statcast
Other data on baseballr
Data used in the book
Exercises
Exercise 1:
2 Introduction to R
2.1 Downloading and using R
2.2 Tidyverse
2.3 Data Frames
2.3.1 Manipulations with Data
2.4 Vectors
2.5 Objects and Containers in R
2.6 Collection of R Commands
2.7 Reading and Writing Data
2.8 Packages
2.9 Splitting, Applying, and Combining Data
2.10 Getting Help
2.11 Further Reading
3 Graphics
SLIDE 1
4 The Relation Between Runs and Wins
4.1 Recency
4.2 Shortened Seasons
4.2.1 Fun Fact!
4.3 Rate Statistics
4.3.1 Win Percentage
4.4 Correlation
4.4.1 Offense
4.4.2 Defense
4.4.3 Run Differential
4.5 Linear Regression
4.5.1 Residuals
4.5.2 Discussion
4.6 Pythagorean Formula
4.6.1 What should the exponent be?
4.6.2 Luck
4.7 Case Studies
4.7.1 2011 Red Sox
4.7.2 Clutch Performance
4.7.3 Great Relievers
4.8 How Many Runs for a Win?
4.8.1 Calculus
4.8.2 Incremental Runs per Win
4.9 Exercises
4.9.1 Exercise 4.1
4.9.2 Exercise 4.2
5 Value of Plays Using Run Expectancy
5.1 Run Expectancy Matrix
5.2 Runs Scored in the Remainder of the Inning
5.3 Creating the Matrix
5.4 Measuring Success of a Batting Play
5.5 José Altuve
5.6 Opportunity and Success for All Hitters
5.7 Position in the Batting Lineup
5.8 Value of a home run
5.9 Value of a single
5.10 Value of Base Stealing
6 Balls and Strikes Effects
SLIDE 1
7 Catcher Framing
7.1 Background
7.2 Framing Examples
7.3 Getting the data
7.4 Where is the Strike Zone?
7.5 Modeling Called Strike Percentage
7.6 Modeling Catcher Framing
7.7 Further Reading
8 Career Trajectories
8.1 Mickey Mantle’s trajectory - Warm up
Fit to parabola
Fit (Mantle)
Plot it
Full fit summary
8.2 Comparing Trajectories
Setting up the data
Compute Similarity Score
Example
Compute Age and OPS for all players / seasons
Fit and plot trajectories
Mickey Mantle
Derek Jeter
Summarize by peak Age and curvature
8.3 General Patterns of Peak Age
Data preparation
Patterns of peak age over time
Peak age and career at-bats
8.4 Fielding Position
8.5 Discussion points
9 Simulation
9.1 Setup
9.1.1 Retrieve situation states
9.1.2 Sum runs and ID half innings
9.1.3 Meaningful plays
9.1.4 End of innings
9.2 Transition Matrices
9.2.1 Transition states
9.2.2 Absorbing states
9.2.3 Examples
9.3 Tracking Runs Scored
9.4 Simulate Half-Inning
9.4.1 Many Iterations
9.4.2 All baserunner-outs states
9.5 Stochastic Processes
9.5.1 Multiple Transitions
9.5.2 Fundamental Matrix
9.5.3 Visit Frequency
9.6 For Individual Teams
9.6.1 Toward NOBLETIGER
9.6.2 Smoothing Operation
9.7 Team Talent
9.7.1 Bill James \(\log_{5}\) model
9.7.2 Bradley-Terry Model
9.8 Make a Schedule
9.9 Compute Win Probabilities
9.10 Simulate Season
9.10.1 Standings
9.10.2 Simulate World Series
9.11 Simulate Many Seasons
9.11.1 Parity
10 Exploring Streaky Performances
Introduction
The Great Streak
10.0.1 Moving Batting Averages
Streaks in Individual At-Bats
10.1 Moving batting averages
10.2 Finding slumps for all players
10.3 Were Ichiro and Mike Trout unusually streaky?
Local Patterns of Statcast Launch Velocity
11 Using a Database to Compute Park Factors
11.1 Introduction
11.2 Connecting R with MySQL using PostgreSQL
11.3 Filling a MySQL Game Log Database from R
11.4 From R to MySQL
11.5 Downloading retrosheet files from 1995 to 2017
11.6 SQL
11.7 Querying Data from R
11.8 Data cleaning
11.9 Coors Field and run scoring
11.10 Calculating Basic Park Factors
11.11 Home run park factor
11.12 Applying park factors
11.13 Exercises
11.13.1 1. Runs Scored at the Astrodome
11.13.2 2. Draw a plot to visually compare through the years the runs scored (both teams combined) in games played at the Astrodome and in other ballparks.
12 Working with Large Data
12.1 Introduction
12.2 Acquiring a Year’s Worth of Statcast Data
12.3 Storing Large Data Efficiently
12.4 Using R’s internal data format
12.5 Using Apache Arrow and Apache Parquet
12.6 Using DuckDB
12.7 Performance Comparison
12.7.1 Computational speed
12.7.2 Memory footprint
12.7.3 Disk storage footprint
12.7.4 Overall guidelines
12.8 Launch Angles and Exit Velocities, Revisited
12.8.1 Launch angles over time
12.9 Further reading
13 Home Run Hitting
Getting the Data
Code for creating the data file
Read in data
Home runs and launch variables
Plot of model
Optimal launch angle?
Temperature effects
Spray angle
Home runs spray vs batting side
Ball park effects (reprise)
Park Factor Plot
Pitcher or batter?
Fitting the model
Comparing Home run hitting across seasons
Definitions
Binning launch variables
Function to plot results
Example 2023 HR by bin
2023 HR Rate by Bin
Compare between seasons
Difference in BIP logits
Changes in carry?
Interpretation
14 Making a Scientific Presentation using Quarto
14.1 Sections
14.2 Math (LaTeX)
14.3 Columns
14.4 Tabsets
14.5 Callout Boxes
14.5.1 Collapsible Callout Boxes
14.6 Slideshows
15 Using Shiny for Baseball Applications
Why Shiny
Basics of Shiny
Examples
More resources
Chapter 3 Graphics
Learning objectives:
THESE ARE NICE TO HAVE BUT NOT ABSOLUTELY NECESSARY