Overtake & Feedback Follow-Up

Dr. Dobb's Journal May 2004

Seeing how it performed in 2003

By John A. Trono

John is a professor of Computer Science at Saint Michael's College. You can contact him at jtrono@smcvt.edu.

As software developers, we realize that deterministic behavior is expected of the artifacts we produce unless, of course, you're creating something like a solitaire-style card game, where a different arrangement of the deck is required each time a new game starts. With some types of software, it may not be possible to describe the desired behavior more exactly than, say, "predict when this stock will go up." If you are writing a program that is supposed to simulate a real-world system, you must first design a model that (you hope) reacts much as the real system would. Anyone who has ever built such a model in hopes of predicting the unknowable future behavior of a complicated system knows how hard it is to make the model respond appropriately to situations it has never been specifically tested against.

In my article "Applying the Overtake & Feedback Algorithm" (DDJ, February 2004), I described a new model for ranking athletic teams. The ordering produced by the Overtake and Feedback (OAF) algorithm correlates well with the two major human polls that decide which team is officially recognized as the best after all the games have been played. (You can see how the OAF rankings compare with the end-of-season polls from 1974-2003 at http://academics.smcvt.edu/jtrono/OAF_CollegeFootballRatings/college_OAF_Page.htm.) Near the end of that article, I reported how well OAF predicted the winners of the 28 postseason bowl games played after the 2002 NCAA regular season, with a promise to report its 2003 prediction accuracy. That summary can be found in Table 2, but I think the data in Tables 1(a) and 1(b) is also worthy of further investigation.

Tables 1(a) and 1(b) record the number of games where two systems disagreed about who would win the bowl games (those played between two Division I-A NCAA football teams) in 2002 and 2003. In 2002, most of the seven systems used by the Bowl Championship Series (BCS) showed small variations in their predictions over those 28 games, as indicated in Table 1(a), whereas OAF disagreed with each system on roughly 25 percent of the games to be played. (The seven systems were devised by Jeff Anderson & Chris Hester, Richard Billingsley, Wes Colley, Kenneth Massey, The New York Times, Jeff Sagarin, and Peter Wolfe. I include a column for the team selected by the majority of these seven systems, BCS-7, and one for the Las Vegas point betting line.) However, the bowl games in 2003 exhibited the opposite behavior.

If every system agrees that one team is better than another, nothing is learned about which system more accurately captures the strength of a team via the ratings it produces. As you can see in Table 1(b), the systems disagreed with each other quite frequently, especially when you consider that for 10 of those 28 games they all agreed, so all the disagreements were concentrated in the remaining 18. The matchups in the 2003 postseason provided significantly more "close calls" to predict than the previous year. For example, Billingsley's system disagreed with the predictions made by Colley's system 12 times, or roughly 66.7 percent of the time (12/(28-10)), once you exclude those 10 games where no information is gained about the relative performance between systems. (You can find the system rankings for teams in those games at http://academics.smcvt.edu/jtrono/OAF_BCS/Ranks2003.htm.)
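The disagreement-rate calculation above can be sketched in a few lines of Python. This is not code from the article, only a minimal illustration of the arithmetic: the function name and the placeholder picks are mine, and the real per-game predictions would come from the tables on the site referenced above.

```python
# Sketch: pairwise disagreement rate between two prediction systems,
# excluding games where every system agreed (those games reveal nothing
# about the systems' relative accuracy). Hypothetical helper, not the
# article's code; the picks below are placeholders, not the 2003 data.

def disagreement_rate(picks_a, picks_b, unanimous_games):
    """Fraction of non-unanimous games where systems A and B disagree."""
    informative = [(a, b) for i, (a, b) in enumerate(zip(picks_a, picks_b))
                   if i not in unanimous_games]
    if not informative:
        return 0.0
    disagreements = sum(1 for a, b in informative if a != b)
    return disagreements / len(informative)

# Toy setup mirroring the numbers in the text: 28 games, the first 10
# unanimous, and 12 disagreements among the remaining 18 games.
picks_a = ["X"] * 28
picks_b = ["X"] * 28
for i in range(10, 22):          # systems differ on 12 non-unanimous games
    picks_b[i] = "Y"
rate = disagreement_rate(picks_a, picks_b, unanimous_games=set(range(10)))
```

With those inputs, `rate` works out to 12/18, the roughly 66.7 percent figure quoted for the Billingsley/Colley pair.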

OAF began the 2003 bowl season by correctly predicting 9 of the first 10 games played, but finished exactly as it had the year before: 18 right and 10 wrong. One other system also scored 18; two others had 19 correct (as did the betting line); and Richard Billingsley's system finished on top this year, with 20 right. Considering all 56 games over the two years, OAF still has the highest number of correct predictions, with 36, but The New York Times is only one behind OAF, and Billingsley only one behind the Times. (The lowest total, at 29 right and 27 wrong, was only slightly above 50 percent.)

Evaluating all of these systems has been an interesting experience, and this year's bowl pairings really put all the systems to the test, answering the question every modeler wants answered: How well does the model work when forced to make tough decisions? Over both years, there were 25 games where all the systems agreed on the "better team," yet only 18 of those were predicted correctly and 7 were not. It just goes to show that there is no such thing as a "sure thing" when it comes to predicting who will emerge victorious in any athletic contest!

DDJ