Loading Retrosheet and Lahman Data Into MySQL

Every year around this time I want to query some baseball data and realize that I haven’t added data from the prior season into my database. In past years, I’d built the databases in a non-systematic manner using non-scripting tools, often with tutorials that were useful but not designed to be repeated as necessary without a lot of pain. I couldn’t find scripts to do this in one place online–there was always some form of point and click, having to go back to sites to download and unzip files, or some other form of annoyance.

I was a bit surprised that I couldn’t find a full scripted solution to rebuilding the databases. Since I have gotten much better with databases and scripting in the past year or so, I threw some files together for others to use, which you can find here on my GitHub–if you stumble across this, hopefully they will save you some time. They are minimally commented, but some basic descriptions are below:

  • The Retrosheet game logs are pretty easy to import, because the text files are all on GitHub. Chadwick Bureau deserve a ton of credit for having moved their tools onto a central repository that far surpasses most other baseball sites in terms of being usable and reproducible. The script gamelogs.R downloads the files for each year and uploads them into a MySQL table.
  • The Retrosheet event files are not as easy to import, so they are split into two scripts. The first, extract_events, downloads the event files and runs the Chadwick utility on them — this is almost entirely taken from a function Jim Albert put on GitHub, though I did parallelize it so it would take a little bit less time. The second takes the extracted files and pushes them through to MySQL (which is where having a greater familiarity with the infile part of that language might be nice, but oh well).
  • The Lahman/Baseball Databank files (I’m not sure of the difference, to be honest) are the easiest to import, since it’s one text file per table. The lahman_import script downloads the most recent version from Chadwick and loads it into the database as well.

One other note of thanks goes to Alex Reisner, whose table creation scripts for Retrosheet I borrowed and modified slightly; not having to play with a couple hundred variable types was very useful.

One of the funny things about this whole process is that it led me to realize that I probably want to move these databases off my machine and into the cloud, but that’s for a different evening or weekend that I decide to spend fighting with databases and APIs. (If MLBAM really wants to advertise for AWS, they can spin up a rich public database that any old schmo can query. Pretty please?)

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s