March 5, 2007

Back to baseball with Retrosheet and OSS

Author: Mat Kovach

One out and the bases are full of Indians. Grady Sizemore steps up to the plate in the bottom of the ninth inning with the Tribe on the wrong end of a 3-1 game against the Kansas City Royals. The first pitch he sees is a ball. After fouling off a pitch and watching the next ball blocked by the catcher he slices a fly ball to left field, clearing the bases of teammates to win the game. As I drove myself home that night I wondered how unusual it was for a game to end on a bases-loaded triple. I figured the best way to look it up was to visit Retrosheet and look through their data files.

Retrosheet is a terrific project that stores play-by-play information for nearly every Major League Baseball (MLB) game played since 1957, excluding 1999. That comes to more than 7,000,000 play-by-play events for more than 93,000 games -- enough data to give me an idea if a bases-loaded triple to end the game is unique. Retrosheet stores its information in event files that describe the game using the Retrosheet scoring system, a markup language to describe all the information related to a baseball game. Event files contain individual files describing a team's home games, with information including the time of the game, weather conditions, umpires, players, plays, substitutions, and even comments about injuries and significant plays.

Armed with Wine, TCL, and my lack of programming skills, I whipped up a script to download all the event files from Retrosheet and convert them into files that I could load into Postgres. I call it CapNKipe, after Duane Kuiper -- a.k.a. Captain Kipe -- a former player for the Cleveland Indians and the San Francisco Giants who hit just one home run in his 3,379 major league at-bats. Since the script would run only once a year, when Retrosheet updated its event files, I figured the name fit.

CapNKipe is a simple script. It uses TCLLIB, a series of extensions for TCL, and TCLVFS, a virtual file system for TCL that provides the ability to mount network services, such as HTTP and FTP URLs and even ZIP files, allowing TCL to access them. Retrosheet provides the event files in ZIP archives. TCLVFS allows the script to work on the files without the user having to worry whether the system has an external program for them.

The only other programs involved, outside the script and TCL, are three tools from Retrosheet. BEVENT.EXE reads the event files and outputs all the events information. BGAME.EXE outputs specific game information from the event files. BOX.EXE prints out the familiar box scores for you. I copy all three tools into a single directory. Retrosheet provides documentation on how to use the tools.

Because Retrosheet's tools are designed for Windows, you must run them under Wine if you're using Linux. An alternative project called Chadwick does a good job of emulating BGAME.EXE and BEVENT.EXE and is open source, but unfortunately, it doesn't support all of Retrosheet's fields yet. I have sent a patch to the author but it has not made it into the code yet.

CapNKipe itself offers many options:

  -eventfiles value Location to store files that are created <eventfiles>
  -zipfiles value Location to store zip files <zipfiles>
  -csvfiles value Location to store csv/psv files <csv>
  -psv Create psv files
  -start_year value Year to start <1957>
  -last_year value Year to stop <2006>
  -retro_url value Retrosheet URL <http://www.retrosheet.org>
  -http_timeout value Http timeout <100000>
  -wine value Location of Wine binary </usr/bin/wine>
  -bevent value Location of BEVENT.EXE <BEVENT.EXE>
  -bgame value Location of BGAME.EXE <BGAME.EXE>
  -debug Debug logging?
  -help Print this message
  -? Print this message

What does CapNKipe do? After downloading an event file, it checks whether a file of the same name exists and the MD5 sums match. If so, it assumes the file has not changed, and skips to the next file. If the file is new or has changed, it will mount the ZIP file using TCLVFS and copy all the files in the ZIP file to the eventfiles directory, as determined by the -eventfiles option, emulating the unzip command. For each year there will be a TEAM<YEAR> file that contains all the teams in the league for that year, and a roster file, <TEAM><YEAR>.ROS, that contains players that played for the team.

The script will also produce the individual files describing each team's home games, seperated by year and team. Individual events files contain the game markups and named based on the year, team, and league -- <YEAR><TEAM>.EVA for American League teams and<YEAR><TEAM>.EVN for National League teams. If you want to look at the Retrosheet scoring for the 2006 Cleveland Indians, the file would be in the 2006 ZIP file named 2006CLE.EVA.

After copying the files out of the ZIP archive, the script calls BEVENT.EXE and BGAME.EXE, using Wine, to create comma-separated files (CSV) and parse the roster and team files into them. It gives the files a label of year-type.csv and stores them in a CSV directory, determined by the -csv option. The produced CSV files contain quoting that can be problematic for programs and database importer utilities. Using TCLLIB's CSV extension, the -psv switch for CapNKipe creates a better CSV solution. The -psv switch tells the script to read the CSV files and split the CSV lines, creating a copy of the CSV files as year-type.psv in the CSV directory. In the PSV files a pipe (|) replaces the comma, and the utility fixes up the quoting, at the cost of a great deal more time to complete. The pipe also helps correct errors where a comma in a number may affect some importers.

The script then repeats the steps for each file in each year. The output looks something like this:

[12/26/06 21:49:11] retrosheet notice Working on roster SLN1957.ROS.
[12/26/06 21:49:11] retrosheet notice Working on TEAM1957.
[12/26/06 21:49:11] retrosheet notice Converting 1957-events.csv to a psv file.
[12/26/06 21:51:48] retrosheet notice Converting 1957-games.csv to a psv file.
[12/26/06 21:51:49] retrosheet notice Converting 1957-rosters.csv to a psv file.
[12/26/06 21:51:50] retrosheet notice Converting 1957-teams.csv to a psv file.
[12/26/06 21:51:50] retrosheet notice Done with year 1957.
[12/26/06 21:51:50] retrosheet notice Year 1958
[12/26/06 21:51:50] retrosheet notice League al
[12/26/06 21:51:50] retrosheet notice Attempting to get http://www.retrosheet.org/1958/1958al.zip
[12/26/06 21:52:01] retrosheet notice Wrote /home/mek/retrosheet/zip/1958al.zip
[12/26/06 21:52:01] retrosheet notice Extracted 1958BAL.EVA

Getting the PSV files into Postgres is easy. I examined the data and generated, by hand, SQL to create a handful of database tables:

I stored the SQL files in the currect directory CapNKipe is working from.

I wrote the SQL to be functional, but each table could use improvement. The events and games tables that the SQL files create are large, and all the tables are a bit ugly. The events table contains 97 columns and the games table has 84 columns. The events table in particular should really be broken down into smaller tables. However, one advantage of the functional but simple SQL code for the data is that it allows the SQL to be used with other database programs. I've tested the code with not only Postgres but also MySQL and SQLite3.

With the table creation SQL code and the data ready, we need to create the Postgres database using the program createdb, which is part of Postgres. After the database is created, loading the SQL routines will create the tables, after which importing the year-type.psv files enters the data. If you have a fresh install of Postgres on your system, you'll have to add yourself as a Postgres user and then create the database:

$ sudo su - postgres
# createuser --no-superuser --createdb --no-createrole mek
# exit
$ createdb -Umek retrosheet

Then, using a small script, create a single SQL file to create the tables and load the data using PostgreSQL's COPY syntax:

$ cat > load-db.tcl << END
#!/usr/bin/tclsh
foreach type {events games teams rosters} {
    puts "\\i $type-create.sql"
}
for {set year 1957} {$year <= 2006} {incr year} {
    foreach type {events games teams rosters} {
        foreach ext {psv} {
            set f [file join [pwd ] csv ${year}-${type}.${ext}]
            if [file exists $f] {
                puts "copy $type from \'$f\' delimiters \'|\' csv;"
            }
        }
    }
}
END

[mek@olin retrosheet]$ tclsh load-db.tcl > load-db.sql.

Running Postgres' command-line program will do the heavy lifting now:

$ psql -Umek retrosheet -f load-db.sql

You are loading over 7,500,000 rows of data, so this will take a while.

Once the load finishes, you might do a few checks to make sure the number of rows are the same in the PSV files and the database tables:

$ for a in events games rosters teams; do
echo "$a lines $(cat csv/*-$a.psv|wc -l)"; done
events lines 7597238
games lines 95764
rosters lines 48808
teams lines 1210

Now check the database counts:

$ for a in events games rosters teams; do
echo "$a lines from db $(echo "select count(*) from $a;" | psql -qAt -Umek retrosheet)" ; done
events lines from db 7597238
games lines from db 95764
rosters lines from db 48808
teams lines from db 1210

Now let's do one more check to make sure the number of games in the games table equals the number of games in the events table:

$ psql -Umek retrosheet -c "select count(distinct gameid) from events;"
 count
-------
 95764
(1 row)

$ psql -Umek retrosheet -c "select count(distinct gameid) from games;"
 count
-------
 95764
(1 row)

Once you have the data loaded and checked, you can enter the database and start looking for data.

$ psql -Umek retrosheet
Welcome to psql 8.1.4, the PostgreSQL interactive terminal.
Type: \copyright for distribution terms
      \h for help with SQL commands
      \? for help with psql commands
      \g or terminate with semicolon to execute query
      \q to quit

retrosheet=# \dt

List of relations
 Schema | Name       | Type | Owner
--------+------------+-------+-------
 public | events     | table | mek
 public | eventtypes | table | mek
 public | games      | table | mek
 public | rosters    | table | mek
 public | teams      | table | mek
(5 rows)

Now we can finally find out about game-ending bases-loaded triples!

There are many fields in the events table. To dig out the right information, the following columns will help:

  • gameid - We need to know the game we are looking at.
  • endgameflag - a flag pointing to the last event in a game.
  • rbionplay - If the bases were loaded, then three runners scored.
retrosheet=# select gameid,eventtype,endgameflag,rbionplay from events
where eventtype=22 and endgameflag='T' and rbionplay=3;

       gameid | eventtype | endgameflag | rbionplay
--------------+-----------+-------------+-----------
 CHA196207011 | 22        | T           | 3
 SLN197105290 | 22        | T           | 3
 LAN197109030 | 22        | T           | 3
 CIN198004140 | 22        | T           | 3
 TOR198709260 | 22        | T           | 3
 CLE200608110 | 22        | T           | 3
(6 rows)

This query shows that game-ending bases-loaded triples don't happen often. Just for comparison let's find all the game-ending grand slams (bases loaded home run: eventtype 23 for home run, and RBIs on play will be four).

retrosheet=# select gameid,eventtype,endgameflag,rbionplay from events
where eventtype=23 and endgameflag='T' and rbionplay=4;

       gameid | eventtype | endgameflag | rbionplay
--------------+-----------+-------------+-----------
 SLN195707300 | 23        | T           | 4
 PHI195807272 | 23        | T           | 4
 BOS195907110 | 23        | T           | 4
 MLN195909090 | 23        | T           | 4
 . . .
 SEA200608080 | 23        | T           | 4
 CIN200606300 | 23        | T           | 4
 SLN200608270 | 23        | T           | 4
(155 rows)

The game-ending triple I witnessed was relatively uncommon. Let's pull up a bit more information about each of those triples. Let's get all the teams involved, the batter, the score, outs, the balls and strikes on the better, who was on base, the pitch sequence (if available), and the Retrosheet event text that describes that play.

retrosheet=# \f ,
Field separator is ",".
retrosheet=# \a
Output format is unaligned.
retrosheet=# select gameid, visitingteam, inning, battingteam,homescore, visscore, outs, balls, strikes, batter,
batterhand, pitcher, pitcherhand, firstrunner, secondrunner, thirdrunner, endgameflag, eventtext,
pitchsequencefrom events where eventtype=22 and rbionplay=3 and endgameflag='T';
gameid,visitingteam,inning,battingteam,homescore,visscore,outs,balls,strikes,batter,batterhand,pitcher,
pitcherhand,firstrunner,secondrunner,thirdrunner,endgameflag,eventtext,pitchsequence
CHA196207011,CLE,9,1,2,4,2,0,0,carrc104,R,bellg102,R,smita102,robif102,cunnj101,T,T8.3-H(UR);2-H(UR);1-H(UR),"
SLN197105290,ATL,9,1,5,7,0,0,0,torrj101,R,upshc101,R,simmt001,aloum101,brocl102,T,T9.1-H(UR);2-H;3-H,"
LAN197109030,CIN,9,1,3,5,1,0,0,motam101,R,gibbj102,L,willm102,valeb102,sudab101,T,T8.3-H;2-H;1-H,"
CIN198004140,SFN,9,1,3,5,2,0,0,colld001,R,laveg001,L,grifk001,kennj101,dried001,T,T8/89.3-H;2-H;1-H,"
TOR198709260,DET,9,1,7,9,0,0,0,benij001,R,noled001,R,thorl001,upshw001,barfj001,T,T7.3-H;2-H;1-H,"
CLE200608110,KCA,9,1,1,3,1,2,1,sizeg001,L,burga001,R,boona001,vazqr001,garkr001,T,T7/F.3-H;2-H;1-H,BF*BX
(6 rows)

Note, I switched the format to unaligned in an attempt to make it readable

By looking at the date in the first field, I can tell that I had seem something that had not occured since 1987. And not just me -- I was one of:

retrosheet=# select attendance,timeofgame from games where gameid = 'CLE200608110';
 attendance | timeofgame
------------+------------
      30929 |        162
(1 row)

30,929 people who spent two hours and 42 minutes at the park that night.

Now let's just see how games end in general:

retrosheet=# select distinct e.eventtype as event_type,
retrosheet-# et.eventname as event_name,
retrosheet-# count(e.eventtype) as end_game_events
retrosheet-# from events e
retrosheet-# inner join eventtypes et on et.eventtype = e.eventtype
retrosheet-# where e.endgameflag='T'
retrosheet-# group by e.eventtype,et.eventname;

  event_type | event_name       | end_game_events
 ------------+------------------+-----------------
           2 | Generic out      | 65162
           3 | Strikeout        | 21589
           4 | Stolen base      | 19
           6 | Caught stealing  | 97
           8 | Pickoff          | 58
           9 | Wild pitch       | 131
          10 | Passed ball      | 25
          11 | Balk             | 11
          12 | Other advance    | 16
          13 | Foul error       | 1
          14 | Walk             | 388
          15 | Intentional walk | 6
          16 | Hit by Pitch     | 41
          17 | Interference     | 1
          18 | Error            | 261
          19 | Fielder's choice | 165
          20 | Single           | 4511
          21 | Double           | 719
          22 | Triple           | 107
          23 | Home run         | 2456
(20 rows)

You might look at this and wonder what the heck is wrong -- six games ended in an intentional walk? What? Was Pete Rose involved? Well, I can point you in the right direction. The SQL code to pull out the gameids would be:

select gameid from events where endgameflag = 'T' and eventtype = 15

It is a logical answer, but I'll keep it secret for now. Feel free to post a comment or email me if you think you know the answer.

So what do you have now? You have a large database that has room for improvement and can be imported into serveral different databases. If you want to learn how to improve database performance or stress test your database server, you have a set of data with which to work. If you're a teacher, this data would also work well for a class on SQL.

Category:

  • Open Source
Click Here!