Dr. Dobb's Journal February 2004
Playing or watching sports is a popular pastime, although sports fans sometimes disagree about which team is best. Consequently, most amateur and professional sports leagues settle this controversy by placing qualifying teams into postseason tournaments, so that the outcomes of playoff games decide the league champion. You could argue, however, that in any given athletic contest, the better team doesn't always emerge victorious. Still, the final score is the measure used to identify which team played better on a given day, as there is no known objective method to truly determine the better team.
There are many evaluation techniques that attempt to quantitatively determine the relative strengths of athletic teams using only game scores as input. One such simple rating system is the power rating system. Essentially, all game scores are examined to calculate each team's average offensive and defensive point totals per game. Subtracting the defensive average from the offensive average yields a team's average point differential per game (OD). (A complete description can be found in The Hidden Game of Football, by Bob Carroll, Pete Palmer, and John Thorn; Warner Press, 1988.) OD is a reasonable measure of how good a team is. However, it is unlikely that all teams have played the same schedule of opponents, so the relative strength of schedule (SOS) for each team is computed and added to OD, producing a more accurate team rating. Here is a small example to illustrate the merits of this power rating system.
Let's create six fictitious teams: A, B, C, D, E, and F. Team A always scores 50 points, B scores 40 points, C 30, D 20, E 10, and F never scores. For this example, each team plays every other team once, at a neutral site; Table 1 shows how this system rates these six teams. The differences between any two teams' power ratings (SOS+OD) are exactly the relative differences that were used to create the game scores in this example; for instance, team A is rated as 10 points stronger than B, which it defeated 50-40, 20 points stronger than C, and so on. This example validates why this is a popular rating system, though there are many variations on how the SOS component is computed.
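The OD component can be computed directly from game scores. Here is a minimal sketch in Python (the function and variable names are my own, not taken from the article's electronic source code). It computes each team's average offensive and defensive totals, takes the difference to get OD, and adds one simple SOS variant: the average OD of each team's opponents.

```python
from collections import defaultdict

def power_ratings(games):
    """games: list of (team1, score1, team2, score2) tuples."""
    points_for = defaultdict(list)
    points_against = defaultdict(list)
    opponents = defaultdict(list)
    for t1, s1, t2, s2 in games:
        points_for[t1].append(s1); points_against[t1].append(s2)
        points_for[t2].append(s2); points_against[t2].append(s1)
        opponents[t1].append(t2); opponents[t2].append(t1)
    # OD: average offensive points minus average defensive points
    od = {t: sum(points_for[t]) / len(points_for[t])
             - sum(points_against[t]) / len(points_against[t])
          for t in points_for}
    # One simple SOS variant: the average OD of each team's opponents
    sos = {t: sum(od[o] for o in opponents[t]) / len(opponents[t]) for t in od}
    return {t: od[t] + sos[t] for t in od}, od

# Round robin among the six fictitious teams; each team always scores
# its fixed total (A=50, B=40, C=30, D=20, E=10, F=0).
scores = {"A": 50, "B": 40, "C": 30, "D": 20, "E": 10, "F": 0}
teams = list(scores)
games = [(t1, scores[t1], t2, scores[t2])
         for i, t1 in enumerate(teams) for t2 in teams[i + 1:]]
ratings, od = power_ratings(games)
```

Note that this single-pass SOS does not exactly reproduce the uniform 10-point spacing shown in Table 1 (team A comes out 9.6 points above B); the SOS variant used for the table may iterate or differ, as the article notes there are many variations.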
Because college football does not schedule a postseason tournament, the Bowl Championship Series (BCS) formula is used to identify that year's two best teams. The formula incorporates seven different computer-based rating systems (along with other criteria) to rank the teams. Then, the top two teams are matched in the final postseason bowl game of the year, with the winner recognized as the Consensus National Champion (CNC).
However, some systems do not provide ratings that necessarily reflect which teams have had the best season. For instance, let's examine what transpired at the end of the 2000 season. Oklahoma (OK) was undefeated and Florida State (FSU) had suffered one loss. Using the BCS formula, OK and FSU were selected to play in that year's Orange Bowl for the National Championship. After applying the aforementioned power rating system to all the games between two division I-A teams that year, FSU was ranked highest, whereas OK was ranked third, almost 10 points behind FSU and almost 4 points lower than Miami of Florida (who had defeated FSU, but had also lost a game earlier in the year to Washington). Therefore, it isn't surprising that FSU was actually favored to beat OK by 10 points. FSU lost 13-2. However, after adding all such bowl game results to the regular season scores, then recalculating the new set of team power ratings, FSU is still rated higher than OK by a little more than 5.5 points.
Unfortunately, most rating systems can experience this type of anomaly. My idea to prevent this from occurring was to have OK overtake FSU after OK defeated FSU, assuming they had been assigned similar ratings before the game. This idea is incorporated into the ranking system called "Overtake and Feedback" (OAF). Sometimes, systems produce ratings that are overly biased by the margin of victory, especially when the stronger teams run up the score against the weaker ones. The power ratings were recalculated for 2000 with the largest margin of victory capped at 17, 7, and finally 1 point, the smallest possible margin of victory. After calibrating the game scores in 2000 to accommodate those three caps, Oklahoma evaluated as #3 (2.7 rating points behind FSU), as #3 (0.58 points behind), and as #4 (0.16 points behind), respectively, using the power rating system described. Therefore, it is hoped that the OAF ranking system can eliminate such anomalies. Source code that implements OAF is available electronically; see "Resource Center," page 5.
The basic idea underlying the OAF ranking algorithm is as follows: Each team earns its rating, and this is used to determine that team's unique position, which is known as its "index." The team with the largest rating is the highest ranked team, and is assigned the index equal to the number of teams (N), whereas the lowest rated team is assigned the index 1. After all games played during one week have been used to adjust each contestant's rating, the teams are sorted according to their new ratings, then assigned new indexes (see the text box entitled "Guidelines For Rating/Ranking Methodologies").
This process occurs for every week's games until the final ratings have been computed. These final ratings become the teams' initial ratings after a simple normalization step that adjusts them all (N for the #1 team, and proportionally down to 1 for the Nth team), and the entire season's games are processed again and again until the teams converge to their final positions. Because of this feedback loop, the initial indexes are essentially irrelevant, as the best teams work their way to the top very quickly. So initially, the teams are randomly shuffled and given consecutive integer ratings of N down to 1, from the team starting as #1 down to the #N team. (The idea for the initial ratings resulted from the article "Predictions for National Football League Games Via Linear-Model Methodology," by David Harville, Journal of the American Statistical Association, Volume 75, pages 516-524, 1980.)
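The overall iteration structure can be sketched as follows. This is a schematic of the feedback loop only; the update_ratings callable is a stand-in for OAF's per-game rating rules (described later in the article), the linear rescale used for the normalization step is my assumption about "proportionally down to 1," and all names are mine.

```python
import random

def oaf_season(teams, weeks, update_ratings, passes=10):
    """Schematic of the OAF feedback loop.

    teams: list of team names.  weeks: list of week schedules, where each
    week is a list of (team1, score1, team2, score2) games.
    update_ratings: callable applying the per-game rating updates in place
    (a stand-in here; OAF's actual rules are described in the article).
    """
    n = len(teams)
    random.shuffle(teams)                               # random initial order
    ratings = {t: n - i for i, t in enumerate(teams)}   # N down to 1
    for _ in range(passes):                 # feedback loop over the season
        for week in weeks:
            # index: N for the highest-rated team, 1 for the lowest
            order = sorted(ratings, key=ratings.get)
            index = {t: i + 1 for i, t in enumerate(order)}
            for game in week:
                update_ratings(ratings, index, game)
        # normalization: N for the #1 team, proportionally down to 1
        top, bottom = max(ratings.values()), min(ratings.values())
        span = (top - bottom) or 1
        ratings = {t: 1 + (n - 1) * (r - bottom) / span
                   for t, r in ratings.items()}
    return ratings
```

In a real implementation the loop would run until the rankings stop changing rather than for a fixed number of passes.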
For every game, one team's index is guaranteed to be higher than the other's. If the team whose index is larger defeats its opponent, its rating increases by a simple function (FW) of the score and the opponent's index ratio; for example, opponent's index/(N+1). The loser's rating decreases using a simple function (FL) of the score and the inverse index ratio; for instance, (N+1-opponent's index)/(N+1). However, if the team whose index is larger loses (or ties) the game, an overtake rule is invoked.
If the difference between the two teams' indexes is within a certain range, the pure overtake rule is applied: the two teams' ratings are averaged, and the victor acquires that average plus a small additional rating increase of 0.4 as its new rating. The loser's rating becomes the same average minus the 0.4, and so the lower-rated team has overtaken its opponent with that victory. On a tie, the 0.4 is subtracted from the lower-indexed team and added to the higher-indexed team, so the team that was rated higher before the tie is still rated higher than the opponent that tied it. Ties can no longer occur in college football because of the overtime system that was put into place starting in 1996, though ties are present in older data sets. When an upset occurs and the teams' indexes are not that close, the amount the ratings change is reduced in proportion to how large the index difference is.
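The pure overtake rule fits in a few lines of code. This sketch (function and parameter names are mine) handles both the win and tie cases just described:

```python
def pure_overtake(rating_low, rating_high, tie=False):
    """Pure overtake rule.

    rating_low: rating of the lower-rated (higher-ranked-number) team,
    which is the upset winner unless tie=True.
    rating_high: rating of the higher-rated team.
    Returns the new (lower-rated team, higher-rated team) ratings.
    """
    if tie:
        # On a tie, 0.4 moves from the lower-rated team to the
        # higher-rated one, preserving their relative order.
        return rating_low - 0.4, rating_high + 0.4
    # On an upset win, both ratings move to the average, with the
    # victor getting 0.4 above it and the loser 0.4 below it.
    avg = (rating_low + rating_high) / 2
    return avg + 0.4, avg - 0.4
```

For example, a 10-rated team upsetting a 20-rated team leaves them at 15.4 and 14.6, so the winner has overtaken its opponent by exactly 0.8 rating points.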
One small exception to when the pure overtake rule is applied is as follows: Say two teams are ranked as the #16 and #17 teams, respectively. (Remember, a team's rank is inversely proportional to its index.) If #17 defeats #16, there may be only a small change of each team's rating using the pure overtake rule, and this amount might be smaller than if FW and FL had been used, as illustrated in the following example. Let's say that the pure overtake rule would add 0.6 to team #17's rating and subtract 0.6 from #16's, but FW would have added 3.0 to #17 and FL would've subtracted 0.5 from #16. In this example, applying FW and FL would have resulted in a larger rating difference between the winning and losing teams, and so they are used instead of the pure overtake rule when this occurs. The formulas for FW and FL are as follows (diff is the margin of victory for that game):
FW = diff×(indexL/(N+1))²   [indexL is the index of the lower rated team that lost]

FL = diff×((N+1-indexW)/(N+1))²   [indexW is the index of the higher rated team that won]
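Expressed in Python (names mine), with n as the number of teams N:

```python
def fw(diff, index_l, n):
    """Winner's rating increase: FW = diff * (indexL/(N+1))**2."""
    return diff * (index_l / (n + 1)) ** 2

def fl(diff, index_w, n):
    """Loser's rating decrease: FL = diff * ((N+1-indexW)/(N+1))**2."""
    return diff * ((n + 1 - index_w) / (n + 1)) ** 2
```

With N=100, a 10-point win over the strong team at index 88 earns the winner about 7.59 rating points, while losing by 10 to the strong team at index 95 costs the loser only about 0.035. The squared index ratios thus reward beating good opponents far more than they punish losing to them.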
For the pure overtake rule to be applied, a team's index can be no more than R positions lower than the team it defeated, where R is the integer portion of √N after rounding. For example, say that N=100, so R=10. If the team with the index of 98 (currently ranked as the #3 team) loses to any team whose index is between 97 and 88, the pure overtake rule will apply as previously described. Any team with an index smaller than 88 is not guaranteed to overtake team #3, and the amount that both teams' ratings change is smaller the further away the winner is from the higher-ranked team. The modified overtake rule is used in these instances; this example illustrates how it works. If the losing team (the one with the larger index) is at indexL, and the victor is at indexW, then the integer portion of K=(indexL-indexW)/R is added to the denominator (2) that the pure overtake rule uses when computing the updated ratings. If the team that is ranked #80 (index 21) has a rating of 18 before it defeats team #3, whose rating is 90, then K=(98-21)/10=7, and the team with index 98 will have its rating changed to 90-(90-18)/(2+7)-0.4=81.6, while team #80's rating increases to 26.4.
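The modified overtake rule, using the numbers from this example (function and parameter names are my own):

```python
def modified_overtake(rating_w, index_w, rating_l, index_l, r):
    """Modified overtake rule for an upset where the winner sits more
    than R index positions below the loser.  K widens the denominator,
    so the rating swing shrinks as the index gap grows."""
    k = (index_l - index_w) // r            # integer portion of the ratio
    delta = (rating_l - rating_w) / (2 + k)
    # Winner gains delta plus the 0.4 bonus; loser gives up the same.
    return rating_w + delta + 0.4, rating_l - delta - 0.4

# Team ranked #80 (index 21, rating 18) upsets team #3 (index 98, rating 90):
new_w, new_l = modified_overtake(18, 21, 90, 98, 10)
# new_w is 26.4 and new_l is 81.6, matching the worked example
```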
There is just one more consideration to take into account: how to maintain continuity when applying the modified overtake rule as you move down the list of ranked teams who could be upsetting a higher-rated team. For instance, team #14 defeating team #3 will have K=1, but team #12 would have K=0 because it is within range for the pure overtake rule to be applied. Such a small difference in placement would produce a reasonably large discrepancy in the rating update that is used: the teams' rating difference divided by 2, or by 3. To avoid this, any team farther away than R uses the larger of two rating adjustments: the one derived in the previous example, or what I call the "previous milepost update." In the previous example (team #3 lost to team #80), the milepost team is #73 who, let's say, has a rating of 30. The index of the milepost team is indexL-R×K, so 98-10×7=28, in this case. If team #73 had beaten #3, then the ratings would be modified by (90-30)/(2+6)+0.4=7.9. You compute both the modified overtake calculation and the milepost update, and apply the larger rating update. Careful observers will have noticed that teams whose indices are at the milepost always use the rating difference divided by (2+K-1), which lets team #13 use a denominator of 2, not 3, exactly as the pure overtake rule is described, if they defeated the #3 team.
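This continuity fix amounts to computing both candidate adjustments and taking the larger. The sketch below (names mine) returns only the adjustment magnitude; with the article's numbers, the modified overtake amount (8.4) beats the milepost amount (7.9), so 8.4 would be applied.

```python
def overtake_adjustment(rating_l, rating_w, rating_milepost,
                        index_l, index_w, r):
    """Rating adjustment for an upset beyond R index positions: the
    larger of the modified overtake amount and the previous milepost
    update.  rating_milepost is the rating of the team at the milepost
    index (indexL - R*K)."""
    k = (index_l - index_w) // r
    # Modified overtake amount for the actual winner
    modified = (rating_l - rating_w) / (2 + k) + 0.4
    # Milepost update: the team at the milepost uses denominator 2+K-1
    milepost = (rating_l - rating_milepost) / (2 + k - 1) + 0.4
    return max(modified, milepost)

# Team #3 (index 98, rating 90) loses to team #80 (index 21, rating 18);
# the milepost team #73 (index 28) is assumed to have rating 30.
adj = overtake_adjustment(90, 18, 30, 98, 21, 10)
# adj is 8.4: max of (90-18)/9+0.4 = 8.4 and (90-30)/8+0.4 = 7.9
```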
Every week, a select group of national media professionals cast their ballots after games are completed, as part of the Associated Press (AP) college football poll. A different group, made up of college coaches, votes in the ESPN/USA Today poll. Assuming these voters are extremely knowledgeable about college football, a good measure of any system would be how well it matches those two polls.
Using information found in the Official 2002 NCAA Football Records book, I constructed Table 2, which lists all the known major rating systems that typically select only one champion each year and have been active for a large number of recent years. (Matthews's system listed "co-champions" one year.) Systems developed by Berryman, Billingsley, Devold, Dunkel, Massey, Matthews, and Sagarin are listed in the first seven columns, followed by The New York Times's rating system (which began a little later than those seven), OAF, and finally a cumulative score for the first seven columns.
The table lists all the years when both major polls have published their final rankings after all of the bowl games have been completed (except the first, when the #1 team in the AP poll was on probation and, therefore, was ineligible for the coaches' poll). Since 1991, there has been fairly strong agreement by the nine systems listed as to which team was the best, and should have been ranked #1. (Except for the controversy in 1994. In "Who's Number 1 in College Football...And How Might We Decide," by Hal Stern, Chance, Volume 8, #3, Springer-Verlag, pages 7-14, 1995, the author argues that perhaps the polls didn't necessarily choose the best undefeated team that year as the CNC. There are many years when one team does not stand out above the rest, as the four years in the '90s with no clear CNC indicate.)
The bottom row in Table 2 summarizes the overall performance of these rating systems. The total listed represents one point for each year the system matched the CNC, and one point for each of the four years (1978, 1990, 1991, and 1997) when the system chose either of the polls' #1 teams. As far as which system has best modeled who the polls have selected as the #1 team, Billingsley, Massey, and the OAF algorithm are tied with the highest total: 22 matches in the 27 years. Devold, Dunkel, and Matthews would be in the next group, and the other three would be in the final group.
One common approach to determine whether one system is better than another is to see which one predicts future games more accurately. The one that does is usually considered to represent the teams' actual strengths more realistically. Many of the systems used in the BCS formula compute ratings that allow the home-field advantage to be included in predicting the outcome of a game, though OAF does not. Therefore, games played at a neutral site (like most postseason bowl games played from mid-December to early January) provide a method to evaluate OAF against the seven systems used in the BCS formula.
OAF was created during July and August 2002, and Table 3 lists the results from the 28 bowl games that were played after the 2002 season ended. The seven BCS systems (devised by Jeff Anderson and Chris Hester, Richard Billingsley, Wes Colley, Kenneth Massey, The New York Times, Jeff Sagarin, and Peter Wolfe), the Las Vegas point spread line, and the OAF algorithm all predicted the same team to win in 15 of these games, with a record of 10 right and 5 wrong. As the rightmost column in Table 3 indicates, more of these nine systems usually preferred one team over the other, but this majority was less effective than three of the systems, and more effective than four others. As you can see, none of the systems made exactly the same picks over the other 13 games listed. For this year, OAF had two more right than the next highest system's total, but one year's performance is not enough to say it's actually better, simply that it outperformed the others for that year. (The seven systems' ratings were found at http://www.masseyratings.com/cf/compare.htm after the season ended, along with ratings from over 60 other systems. If similar data were available for other years, this comparison could be expanded.) I'll provide a followup in DDJ for the 2003 season after the final game is played in early January 2004.
The OAF algorithm is by no means perfect, but if you wish to see how all division I-A teams are ranked/rated from 1975 to the present, go to http://academics.smcvt.edu/jtrono/OAF.html to examine the correlation between the two major polls' rankings and those generated by OAF. Some situations do arise that violate the spirit of the OAF algorithm, and cause it not to perform as well as I would hope. One situation that can occur is after team A has beaten team B (and overtaken them). If both A and B then win their games the following week, B can move back ahead of A if B defeats a highly ranked team and A defeats a weaker team, even though A demonstrated it was superior to B the previous week.
The 1999 final rankings illustrate another shortcoming. That year, Illinois (8-4, #24 in the end-of-season polls) was ranked as the #5 team by the OAF algorithm, right ahead of Michigan (10-2, #5 in the polls), which it defeated that season. Illinois's four losses are too many for it to be ranked so highly in the polls, but because those losses were to other highly ranked teams, that important win increased Illinois's rating more than the four losses decreased it.
To avoid such occurrences, perhaps FW and FL should not be symmetric; maybe FL could use the inverse index ratio in a more linear fashion. Several other symmetric formulas were tested as the FW and FL functions, but the current ones worked best on the original two test years (1983 and 1975) and have continued to generate acceptable results for subsequent data sets. The value of 0.4 was also chosen early on, and by varying it or the formulas slightly, better results may be achievable. However, the current implementation of OAF has done amazingly well at agreeing with the polls about which team had the best season, and so I hesitate to change it much right now. (Perhaps it should even be considered for the BCS formula!)
One other concern is how many iterations the algorithm takes before the rankings/ratings converge. Some years require fewer than N iterations (for all teams to stabilize to a final position), whereas others still have a few teams switching places even after N×N iterations. Because the final rankings/ratings are somewhat sensitive to the initial team order, 1000 random initial placements are used, with N×N iterations for each placement, and the final ranking uses the mean rating over all the placements. From 1975 to the present, the highest teams not to stabilize using those parameters were ranked by the OAF algorithm as #14/#15 in 1984. Most fluctuations seem to occur with teams below the highest 25 teams (where N is typically between 100 and 120). To determine one year's rating usually takes over eight hours of CPU time, which is why it is difficult to evaluate subtle changes to the basic OAF algorithm over a large number of seasons.
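The averaging over random placements can be sketched as a simple driver loop. Here run_season is a stand-in for processing one full season (the N×N feedback iterations) from a given initial order; the function name and structure are mine, and only the 1000-placement count comes from the article.

```python
import random

def converge(teams, run_season, placements=1000):
    """Average final ratings over many random initial placements.

    run_season(order): runs the full OAF feedback loop from the given
    initial team order and returns a {team: rating} dict (stand-in here).
    """
    totals = {t: 0.0 for t in teams}
    for _ in range(placements):
        order = teams[:]
        random.shuffle(order)           # new random initial placement
        final = run_season(order)
        for t, r in final.items():
            totals[t] += r
    # Final ranking uses the mean rating over all placements
    return {t: s / placements for t, s in totals.items()}
```

Running the season 1000 times is what drives the eight-plus hours of CPU time per year mentioned above.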
It is human nature to always be looking for a better mousetrap. Some people feel that there is too much personal bias in the current polling system, and they hope that an objective method could be adopted to determine the CNC because no one can watch all the games of all the teams before casting a vote. No system will ever be perfect, but that won't stop people from searching. (The historical results for the other eight rating systems are not readily available, which is why I settled on tracking the #1 teams in this article. Perhaps some future article/report will have a more extensive evaluation of how well such systems correlate with the intuition captured in the polls.)
DDJ