College Football Rankings and How to use Database Queries in Matlab

This week at Matlabgeeks, seeing as one of us is a huge Oregon Ducks fan, we are going to investigate the AP and Coaches polls for college football. As the BCS only reports in midseason and does not do a ranking following the bowl games, we are throwing it out for this analysis. (Note: the BCS is more appropriately named BS, but as this is not a sports blog, I’ll hold off on the playoffs spiel and how the BCS/6 big conferences are just money grubbing the system :). I just want an opportunity to watch the 2001 Ducks team take out Miami).  Instead we want to utilize this opportunity to demonstrate how powerful Matlab is, and how you can use this program to analyze any data set.

The excel sheet we have uploaded includes the preseason and postseason AP and coaches polls from 2000-2009 (college_football).  Since WordPress doesn’t like me uploading .dat files, I’m posting this as an .xls file, but do make sure to re-save the file as tab delimited text.  Big thanks to collegefootballpoll.com and Wikipedia for much of the data as well as several other sources on the web.

While there is no real substitute for an actual database program such as SQL or MySQL, we can still use Matlab to perform searches and queries on a set of data. In this tutorial we will show you some examples of displaying data of interest and the concept of logical indexing.

Here is the code we use to open and load the data of interest:

% College Football Rankings
% Matlabgeeks.com
clear all
close all

fpath = 'C:\Documents and Settings\Vipul\Desktop\MatlabGeeks\';
fname = 'college_football.dat';

fid = fopen([fpath,fname]);
temp_data = textscan(fid,'%f%f%s%s%s%s','HeaderLines',1,'Delimiter','\t');
fclose(fid);

year = temp_data{1};
rank = temp_data{2};
AP_preseason=strtrim(temp_data{3});
AP_postseason=strtrim(temp_data{4});
Coach_preseason=strtrim(temp_data{5});
Coach_postseason=strtrim(temp_data{6});

While textscan is just one way to load data into Matlab, we have decided to use that here since we have some flexibility through its many options. We have chosen to ignore the first line of the college_football.dat file as this is the header, and we have indicated that a tab (\t) will delimit between columns of data.   While there are many snazzier ways to capture the header information, for now we will just save each column of temp_data into the variables year, rank, etc. Finally, the strtrim function is used to remove any unnecessary spaces at the beginning and end of any entry.

Now onto the fun stuff!  Let’s say you wanted to know how many teams are represented by the polls during the preseason and postseason.  Fortunately we have a great built-in Matlab command called unique for this very purpose.

teams = unique([AP_preseason AP_postseason Coach_preseason Coach_postseason]);
teams(1,:)=[];

The results of this query tell us that 76 teams are represented. In 2003, Colorado State was tied for 25th in the coaches preseason poll, and this generates spaces in the remaining columns of that row, which are also counted as an un-named team by Matlab. In order to fix this incorrect report of 77 teams, we can remove this un-named team by assigning the first row to be empty or []. Assignment to null such as this is a great way to remove rows or columns in any Matlab array.

What if you are curious about which team finished the season 2nd in the coaches poll during these 10 years:

Coach_postseason(rank==2)

This is an example of logical indexing in Matlab.  The rank==2, will provide a 0 or a 1 to Coach_postseason, with only the columns that match the true (1) values returned.  Amazingly, USC finished 2nd in the Coaches poll four times during this period, with Matlab returning the following results:

'Miami (FL)'
'Oregon'
'Miami (FL)'
'USC'
'Auburn'
'USC'
'Ohio State'
'USC'
'USC'
'Texas'

Now, while that might be an interesting query of the data, maybe we want to output multiple columns of interest to a particular query.  In this case, maybe we want to know both the final AP ranking of Oregon as well as the year in which that ranking was held.  In this case we would perform the following:

[year(strcmp('Oregon',AP_postseason)) rank(strcmp('Oregon',AP_postseason))]

In this instance we utilize an array as output ([]) and use the command strcmp to find instances of Oregon in the AP_postseason variable. Again the strcmp will return the boolean true or false for each row, and the final output uses this as a logical index to only return the year and rank of Oregon in the postseason AP poll.  Matlab should return the following:

2000           7
2001           2
2005          13
2007          23
2008          10
2009          11

Finally, we’ll leave you with a for loop that allows you to find the total number of times each team shows up in the AP preseason rankings. We will use this information as we continue into our next post:

numteams = length(teams);
AP_pre_occurence = cell(numteams,2);
for i=1:numteams
    AP_pre_occurence(i,1) = teams(i);
    AP_pre_occurence(i,2) = {num2str(sum(strcmp(teams(i),AP_preseason)))};
end

As you can see, we preallocate our cell array using the zeros function. We convert the number into a string for the sake of keeping the AP_pre_occurence as a string array and we’re done. I won’t display the output here, but only Oklahoma, Ohio State and Texas showed up in all 10 preseason polls. Until next time, Yay Matlab.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.