Over at Matlab Geeks Headquarters, we’ve been busy analyzing the AP and Coaches’ Polls to see how the data look, and interestingly enough, some of my preconceived notions about the Oregon Ducks were confirmed. The final results showed that the top 5 under-rated programs in the country were Boise State, Utah, TCU, Iowa and Oregon. The top 5 over-rated schools? Florida State, Tennessee, Michigan, Florida and Clemson. Don’t just take my word for it; let’s see how we got these results…
While we had previously posted a spreadsheet with just the rankings, I decided to gather the total votes among all AP voters from the 2000-2009 college football seasons. The spreadsheet with this data can be found here and is courtesy of cnnsi.com and espn.com. Once again, load the data as we did in the first tutorial of this series and follow along to see how we performed the analysis using Matlab. My variable names are the same in Matlab as in the Excel spreadsheet.
First we want to generate a preseason table that lists each team’s vote count by year. This can be accomplished by running the following code:
AP_pre_table = zeros(AP_numteams,10);
for i = 1:AP_numteams
    temp_year = year(strcmp(AP_pre,AP_teams(i)));
    temp_year = temp_year - 1999;
    temp_vote = AP_pre_vote(strcmp(AP_pre,AP_teams(i)));
    for j = 1:length(temp_year)
        AP_pre_table(i,temp_year(j)) = temp_vote(j);
    end
end
The zeros command initializes the matrix to be all zeros. This is done for two reasons. First, it is computationally faster for Matlab to have the matrix size pre-defined. Second, it gives each team a value of 0 for any year in which it received no votes. The rest of the code steps through the teams in alphabetical order and finds each team’s vote count for every season from 2000 to 2009. The vote counts are stored in the matching year column, 1-10 (for 2000-2009), which is why we subtract 1999 from the year. A similar procedure can be done for the postseason AP voting results.
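To make the table-building step concrete, here is a minimal sketch on toy data. The four poll entries below are hypothetical and are not taken from the spreadsheet, and the inner loop is replaced by a single indexed assignment, which is an alternative to the loop shown earlier rather than the code we originally ran.

```matlab
% Toy data (hypothetical): one row per preseason poll entry.
year        = [2000; 2000; 2001; 2009];
AP_pre      = {'Oregon'; 'Iowa'; 'Oregon'; 'Iowa'};
AP_pre_vote = [900; 120; 1100; 300];

AP_teams    = unique(AP_pre);            % sorted alphabetically: Iowa, Oregon
AP_numteams = numel(AP_teams);

% Rows are teams, columns are seasons 2000..2009; unlisted seasons stay 0.
AP_pre_table = zeros(AP_numteams, 10);
for i = 1:AP_numteams
    is_team = strcmp(AP_pre, AP_teams{i});   % logical mask of this team's entries
    cols    = year(is_team) - 1999;          % 2000 -> column 1, 2009 -> column 10
    AP_pre_table(i, cols) = AP_pre_vote(is_team).';
end
disp(AP_pre_table)
```

Because each team appears at most once per season, assigning all of a team's columns at once gives the same table as filling them one by one.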
Now how do you calculate over/under-rating?
AP_diff = AP_post_table - AP_pre_table;
AP_diffsum = sum(AP_diff,2);
[AP_sorted, index_sorted] = sort(AP_diffsum);
overratedAP = AP_teams(index_sorted);
The AP_post_table and AP_pre_table should be the same size, so simply subtracting the preseason votes from the postseason votes gives us the year-by-year discrepancies in votes. A negative difference means a team was overrated that season and a positive difference means it was underrated. Summing these differences for each team tells us how teams are perceived during the preseason versus how they are ranked at the end of the season. Here we use the sum function with 2 as the second input, which tells Matlab to sum across each row instead of down each column. Finally, we sort the totals in ascending order, so the most overrated teams come first and the most underrated come last. Again, the whole voting process is subjective, but this gives us a glimpse into how the voters perceive teams.
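As a quick sanity check of the sign convention, the same steps can be run on a tiny made-up example. The three teams and two seasons below are hypothetical values chosen only to make the arithmetic easy to follow.

```matlab
% Hypothetical vote tables: 3 teams (rows) by 2 seasons (columns).
AP_teams      = {'A'; 'B'; 'C'};
AP_pre_table  = [1000 800;   0 100; 500 500];
AP_post_table = [ 200 300; 900 700; 500 450];

AP_diff    = AP_post_table - AP_pre_table;     % negative = overrated that season
AP_diffsum = sum(AP_diff, 2);                  % row sums: [-1300; 1500; -50]
[AP_sorted, index_sorted] = sort(AP_diffsum);  % ascending: most overrated first
overratedAP = AP_teams(index_sorted);          % {'A'; 'C'; 'B'}
disp(overratedAP)
```

Team A lost 1300 votes overall and lands first in the list, while team B gained 1500 and lands last.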
The results are shown in the figure above. To generate this plot we ran the following commands:
plot(AP_sorted,'.')
set(gca, 'XTickLabelMode', 'Manual')
set(gca, 'XTick', [])
ylabel('Over-rated Under-rated')
for i=1:97
    if mod(i,2)==0
        text(1,AP_sorted(i),overratedAP(i),...
            'rotation',90,'position',[i,AP_sorted(i)+100])
    else
        text(1,AP_sorted(i),overratedAP(i),...
            'rotation',90,'position',[i,AP_sorted(i)-100],...
            'HorizontalAlignment','Right')
    end
end
The set commands allow us to remove the x-axis labels and ticks. (Many other properties for plotting can be set here as well. See our tutorial on plotting for more on this.) We then used the text function, along with the rotation and HorizontalAlignment properties, and the mod function to place the team labels alternately above and below the points on the graph. (We also rotated the entire figure 90 degrees clockwise for ease of reading.) From these results we see just how far off the extreme teams can be. In fact, Boise State and Florida State saw swings of almost 5000 points from preseason to postseason polls over the last 10 years. To be fair, a drop within the ranks of 1-10 will induce greater changes than one between 20-30, but there is still a lot of consistency in the way “knowledgeable” voters vote. To investigate the overall relationship of preseason to postseason scores, we can run a correlation:
[R p] = corrcoef(AP_pre_table,AP_post_table)
The results indicate that there is a significant, fairly strong correlation in the voting, with R=.628 (P < .001). This seems to indicate that, by and large, the teams that are favored in the preseason remain in favor among the voters throughout the year. So it’s possible that teams such as Utah in 2008, which began the season in 29th place with only 53 votes, just have too much ground to make up among all the voters to ever have a chance at the championship.
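A note on reading the output: corrcoef returns 2-by-2 matrices, so the correlation between the two variables and its p-value sit in the off-diagonal entries. The sketch below uses small hypothetical tables rather than the real poll data, and flattens both tables with (:) to make explicit that every team-season pair, including the zeros for unranked teams, is treated as one observation.

```matlab
% Hypothetical vote tables: 3 teams (rows) by 2 seasons (columns).
AP_pre_table  = [1000 800;   0 100; 500 500];
AP_post_table = [ 200 300; 900 700; 500 450];

% Flatten both tables into column vectors of paired observations.
[R, p] = corrcoef(AP_pre_table(:), AP_post_table(:));

r_value = R(1,2);   % correlation between preseason and postseason votes
p_value = p(1,2);   % p-value for the null hypothesis of no correlation
fprintf('R = %.3f, p = %.3f\n', r_value, p_value);
```

The diagonal entries of R are always 1 (each variable correlated with itself), so only R(1,2), or equivalently R(2,1), is of interest.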
In fact, looking at all the data, what is the highest ranking achieved by a team that received zero votes in the preseason?
high_vote = max(AP_post_table(AP_pre_table == 0));
[high_team, high_year] = find(AP_post_table==high_vote);
AP_teams(high_team)
high_year+1999
Iowa in 2002. The find function is used here with two outputs, which provide the row (team) and the column (year) of the maximum entry. The 2002 Iowa team eventually finished the season ranked 8th with 1334 votes, which is quite the accomplishment considering their preseason perception. This rise also factors significantly into their 2000-2009 underrating. Among the other underrated teams, everyone probably knows about Boise State, TCU and Utah’s consistent rise in the rankings, yet each year these teams begin in the middle or at the bottom of the pack.
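One caveat with the lookup above: find searches the whole postseason table, so it would also return any team that happened to finish with the same vote total after receiving preseason votes. A slightly more defensive variant, sketched here on hypothetical data built to contain such a tie, masks out those entries before taking the maximum. This is an alternative approach, not the code we ran for the results above.

```matlab
% Hypothetical tables: team A's 700 postseason votes (season 1) tie with
% team B's 700 (season 2), but only B had zero preseason votes.
AP_teams      = {'A'; 'B'; 'C'};
AP_pre_table  = [900 0;  0   0; 300   0];
AP_post_table = [700 100; 50 700;  0 400];

masked = AP_post_table;
masked(AP_pre_table ~= 0) = -Inf;   % ignore entries that had preseason votes

[high_vote, linear_idx] = max(masked(:));                    % best zero-vote finish
[high_team, high_year]  = ind2sub(size(masked), linear_idx); % row = team, col = year
fprintf('%s in %d with %d votes\n', AP_teams{high_team}, high_year + 1999, high_vote);
```

Here max returns a single linear index, and ind2sub converts it back to the row (team) and column (year), so the answer is unique even when vote totals repeat.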
Next week we’ll do a similar analysis for conferences and try to settle the Pac-10/SEC feud, but for now we leave you with this xkcd comic, which nicely summarizes our findings.