Analyzing Spread Football Picks With R

I’ve been making an effort to learn R for about a year. I have experimented with it on and off over the years, but this is the first serious effort I’ve made.

Whenever I am learning something, rather than just focusing on book examples, I try to come up with an example that is relevant and interesting to me. Doing that helps keep me motivated, and drives me to pick up the things I want to know that are useful, and not just focus on the things that are revealed through examples. I would liken this to an experiment I heard a Khan Academy engineer talking about, where students are exposed to various Logo drawings, some of which have the source code available and some of which don’t. The ones without source serve as motivation and focus on principles that build on and challenge what the student already knows.

In my case this means the following: if I’m going to work with linear models in R, I’m not just going to work with an example whose data lends itself to that. I want to be challenged to evaluate which variables might make a valid model, and then test that fit with a critical eye.

In my case, I decided to try to not be the worst in my NFL Pick’em league this year. I usually do ok in the league, but I’m having a particularly bad year. The premise of the league is as follows. This league is only about picking game results, not like fantasy football. You pick every game each week, and you pick against the spread. The most correct picks win.

For those who don’t know what a spread is: it’s a gambling mechanism to get people to bet on both sides of a game. Bets are like stock purchases: you may not think about it every time you make a transaction, but there needs to be someone taking a position on the other side. Many people assume the casino (or bookmaker) is taking the other position. In fact, bookmakers try not to take a position; they are really just market-makers. The bookmaker attempts to make money by having a small profit margin (sometimes called overround) on the bets. In order to not take a position, they want as close as possible to 50% of the betting population on each side of the bet. That way, each winner is paid using the losses of a loser. To accomplish that, they adjust either a spread or the payout odds. In the case of a spread, they subtract a certain number of points from the favorite’s score to entice people to bet on the underdog. If Denver plays Oakland and the spread is Denver -10, the bookmaker is saying that by subtracting 10 from Denver’s score, they think they will get an even market. If Denver wins by more than 10, the Denver bettors win. If Denver wins by less than 10 or loses the game outright, the Oakland bettors win. If Denver wins by exactly 10, the bet is a push, and both sides get their original bet back.
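The cover, push, and loss rules above can be sketched in a few lines of R. This is only an illustration: the function name spread_result and its arguments are made up for this example and are not part of the project code.

```r
# Classify the outcome of a game against the spread.
# `spread` is the number of points the favorite gives (10 for "Denver -10").
# All names here are hypothetical, chosen for this sketch.
spread_result <- function(favorite_score, underdog_score, spread) {
  margin <- favorite_score - underdog_score
  ifelse(margin > spread, "favorite covers",
         ifelse(margin == spread, "push", "underdog covers"))
}

# Four made-up final scores, all with the favorite giving 10 points.
spread_result(c(27, 24, 20, 17), c(10, 14, 13, 20), 10)
```

Because ifelse works on whole vectors, the same function handles a single game or an entire season of results at once.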

Our league is not a gambling league in the sense of betting per game. You just pick all the games, and there is a prize at the end of the year for the most correct picks. It is run by a friend, and I have been in the league for over 10 years now. So needless to say, I know the domain, which is a huge leg up when doing analysis: you can intuit pretty quickly whether numbers look correct, or whether a stat has meaning.

So the first step was to track results. I used Google Docs to keep track of my picks. It has an option to download spreadsheets as a CSV file, which is a very friendly format for R to work with. If you want to try this with your picks, you can make a copy from here.

Now comes the R work. All the code is up on my GitHub account. The first step was to get the data into a data frame, one of R’s most common structures. Picks.R does just that, adds some calculated columns, and calculates some general league trends. I wrote two functions, condition_frequency and condition_percentage, that can calculate almost all the required stats. They count the number of occurrences of some condition, or the percentage of rows that meet it. Both take a function for the condition, and can either look at all picks or be passed another function that determines a subset to analyze. For instance, you can calculate the percentage of home teams that cover the spread when they are favored by passing a subset function that selects games where the home team is favored, and a condition function that checks whether the home team won by more than the spread.
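To make the idea concrete, here is a minimal sketch of how two such functions could look. It is not the code from the repository: the signatures, the sample data, and the column names (home_score, away_score, home_spread) are all assumptions made for illustration.

```r
# Hypothetical picks data; column names and values are assumptions.
picks <- data.frame(
  home_score  = c(27, 20, 14, 31, 17),
  away_score  = c(10, 17, 21, 24, 20),
  home_spread = c(-10, -7, 3, -3, -6)  # negative means the home team is favored
)

# Count the rows that meet `condition`, optionally within a subset.
# Both `condition` and `subset_condition` are functions that take the
# data frame and return a logical vector with one entry per row.
condition_frequency <- function(data, condition, subset_condition = NULL) {
  if (!is.null(subset_condition)) {
    data <- data[subset_condition(data), , drop = FALSE]
  }
  sum(condition(data))
}

# Same idea, but as a percentage of the rows considered.
condition_percentage <- function(data, condition, subset_condition = NULL) {
  if (!is.null(subset_condition)) {
    data <- data[subset_condition(data), , drop = FALSE]
  }
  100 * mean(condition(data))
}

home_favored <- function(d) d$home_spread < 0
home_covers  <- function(d) (d$home_score - d$away_score) + d$home_spread > 0

# How often do home favorites cover?
condition_frequency(picks, home_covers, home_favored)
condition_percentage(picks, home_covers, home_favored)
```

Passing functions rather than precomputed columns keeps each stat a one-liner, and leaving out the subset function gives the league-wide number instead.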


Describe.R writes a markdown file that can produce html to show league trends and personal trends. The result looks like:

Next I decided to plot my results by week. The results are:

My results by week

You can see I tried to fit a simple linear model to the results, using the number of weeks played as the predictor, to project how much better my picking would get. That’s a questionable model to try, but it at least shows the general trend.
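A fit like that takes only a few lines of base R. The sketch below uses made-up weekly totals and assumes a 17-week season; the data frame and column names are hypothetical and not taken from the project.

```r
# Hypothetical number of correct picks per week; the values are invented.
weekly <- data.frame(
  week    = 1:8,
  correct = c(6, 7, 5, 8, 7, 9, 8, 10)
)

# Simple linear model: correct picks as a function of week number.
fit <- lm(correct ~ week, data = weekly)
coef(fit)  # intercept and per-week trend

# Project the remaining weeks (assuming the season runs through week 17).
future <- data.frame(week = 9:17)
projected <- predict(fit, newdata = future)
projected
```

With only a handful of noisy points the projection should be read as a rough trend line, which is exactly why the model is questionable for anything more.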

In Teams.R there is a function unplayed_games that will give you relevant stats about each team in the games that don’t have scores yet.

So what did I learn?

I learned to use functions very effectively in R, and to take advantage of the way you can operate on entire vectors at the same time (data frame columns are vectors). I learned to work with Hadley Wickham’s dplyr and ggplot2 libraries, which are great for productivity once you understand the philosophy of how to work with them.

A lot of the visual and transformation work was helped by a workshop the Cleveland R User Group held with Robert Kabacoff. He was a very good instructor and it really put a lot of pieces together for me about working with R.

What Next?

I’d like to get into clustering the data, and seeing how results vary by spread size. In addition, I’d like to try some machine learning. Train up models and see if the machine can predict better.

In particular, I’d like to bring team popularity into the model. Why? Remember the long-winded discussion of how and why bookmakers make spreads? Did you notice that the bookmaker isn’t trying to predict the most accurate line, but to get 50% of the bettors on each side? That means that there are opportunities for exploitation. The common example is large market (or popular) teams. Consider the Pittsburgh Steelers (which as a Bengals fan, I of course loathe, but that is not the point…): the Steelers have backer groups across the country and a huge following. If they were to play a team like Jacksonville that struggles to sell out its tickets, it is likely that a certain base is going to bet on the Steelers simply because they are fans. In order to achieve that 50% balance, bookmakers are likely to skew the spread to overly favor Jacksonville, to make the less popular team a more attractive bet. Savvy data-driven pickers end up taking the mathematical advantage at the expense of bettors just playing favorites.

Also, I’d like to investigate ways to make the entire app more approachable. Could this be a Shiny app that takes a URL to a CSV and presents the user with results?

It’s been a fun project, and I’ve seen some improvement over the year. That said, I’ve had a rough picking year and certainly won’t finish in the money. But it’s kept my R learning journey moving along, and I’ve enjoyed it.

Running a Local ElasticSearch Cluster for Development

ElasticSearch is a document database built on Lucene, a full-text search engine. It clusters and is useful in a variety of scenarios. If you want to run it locally and test some of the clustering features, here are some things I learned from my experience.

Install with your preferred package manager, or from source. In my case I use homebrew, so install is as easy as:

brew install elasticsearch

You can run multiple nodes from one install. First, you will want to tweak the config. For me, the elasticsearch.yml config file was located in /usr/local/Cellar/elasticsearch/1.4.0/config/, but that may vary based on your OS, package manager, or version.

Mainly, I set values for the following:

cluster.name: some_cluster_name_of_your_choosing
marvel.agent.enabled: false

By default, elasticsearch joins any cluster with the same name, so you do not want to run on the default or you will be syncing with other local developers on your network.

Then you want to configure your startup mechanism. For some users, that would mean configuring the elasticsearch file in /etc/init.d, but for OS X homebrew users like me, the startup file is ~/Library/LaunchAgents/homebrew.mxcl.elasticsearch.plist. First, set up extra nodes. In OS X, that means making copies of the LaunchAgent file in the same directory with unique names. It’s worth noting that if you’re using homebrew, the original LaunchAgent plist file is a symlink, and you’ll want to copy the contents to a new file. I numbered my extra nodes, so the filenames were homebrew.mxcl.elasticsearch2.plist, and so on. I altered the plist file to have a couple of extra arguments, specified in the ProgramArguments node. I removed the XML nodes that specify keeping the worker alive. The results were:

I set custom node names and ports via the plist file. Note that on a Linux-based machine, you would alter the service to run multiple ElasticSearch nodes with those custom parameters.

Since I didn’t want to run those scripts individually each time, I created a script to launch all nodes at once. Note that you’ll want to add execute permissions and put it somewhere in your path.

Finally, I recommend installing Marvel, particularly for Sense, a nice tool for running commands on ElasticSearch. You can install Marvel with this command in the ElasticSearch root directory (for me, /usr/local/Cellar/elasticsearch/1.4.0/):

bin/plugin -i elasticsearch/marvel/latest

Now you can bring the cluster up with es_cluster


Thoughts On Leadership and Groups

Building software teams is hard. Fostering culture, improvement, learning and community in a group of individuals who have other options is a difficult thing to do. Fortunately, it’s not all that different from building many other kinds of groups. Yet too often, we fail to look around at other successful groups and learn from their example. There are two challenges in particular that hold back leaders in their quest to build a highly functioning group.


First, creating success is not about the composition of the group alone. Don’t get me wrong, you should look for A-Players, but in any environment where there are other groups, people will be more or less evenly distributed by the draw of money, opportunity, ego-stroking, etc. What will differentiate your group is what it can do with its B and C-Players. Your A-Players are already good, and mostly know how to handle themselves. Leadership might provide them some marginal returns. But compare that to the return of turning a C-Player into a B-Player with proper support. Yet how many of us see leaders trying to run off everyone but A-Players in the naive belief that they can build a team of only A-Players? Those teams don’t exist, and if they do, they don’t need leadership.

This is why some really successful leaders are viewed as simplistic, or unintelligent. Let me cite some examples. Jim Tressel had great regular season success with the OSU Football team, but struggled at times to win the big game. This was often chalked up to inferior coaching strategy. Critics said he was too conservative with play-calling, and that he harped on the basics instead of explosive plays. He was playing the numbers. He made some flawed teams better by reducing mistakes. There may be some truth to the idea that he could have opened up the playbook more and found some new strategies for the big games, but ultimately I think the key differences between the conferences, which have plagued most teams outside of the SEC, showed up in those games. Urban Meyer never had that reputation at Florida, but is now turning out mostly the same results.

Another example is the financial advisor Dave Ramsey. He is trying to lead financial change across huge groups of readers and people attending classes in their community. His system is very simple and is critiqued for that lack of sophistication. But he has succeeded in helping millions of people (who span a wide range of intelligence and financial knowledge) out of debt. I challenge the best Wall Street financial consultant to do the same. He tuned the message to the B and C-Players. The folks who were the most capable were on their way to success when they picked up a book and started thinking about fixing their finances and doing some basic tracking and planning. That was the nudge they needed. And the folks who wouldn’t understand a financially complicated plan got something they could digest that was better than what they were doing before.


Second, creating success is not a set of linear steps that reaches a done phase. Leaders lay out short-term plans that are focused on fixing all the problems, and then thriving in some utopian state, as if creating a group were a construction problem, when in fact it’s more like owning the property. You have to maintain the building. You have to weed the garden. You have to pay the bills…

Leadership is a repetitive job, where you will fix the same problem more than once, and you can’t get impatient about that. You will tread over the same ground, sometimes with the same people. If this doesn’t make sense, go ask a minister when his or her church will be “done”. Ask them when the last time they will need to do a baptism is, or when the last time is they will need to comfort a grieving family. Ask a coach when his or her team will be finished and perfect.

Who Should Lead

Now we come to one of the biggest sources of confusion: the mistaken belief that the best welder should lead the welders. The best burger flipper should manage the restaurant. The best programmer should lead the team. How many times have you seen that tried and failed?

Leadership is a service job. It is taking responsibility and solving problems. It is building community and motivating growth. Leading is about understanding people, and more importantly group interaction. Understanding the domain is usually the easy part.

Facebook Gender Analysis With R

I like working with social APIs, and have been working with R more lately. So I combined the two.

I saved the JSON results of asking Facebook for all of my friends using the Graph API.

Using the rjson package for R, I loaded the data into R and broke down the count of friends by gender. I then created a simple bar plot.
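The counting step might look roughly like the sketch below. To keep it runnable without a saved response, a hand-built list stands in for the parsed JSON; the structure (a data element holding one entry per friend, each with an optional gender field) and the file name in the comment are assumptions, not something taken from the original script.

```r
# Stand-in for the parsed response. With the rjson package this would come
# from something like rjson::fromJSON(file = "friends.json"), where the file
# name and the response layout are assumptions for this sketch.
friends <- list(data = list(
  list(name = "Friend A", gender = "female"),
  list(name = "Friend B", gender = "male"),
  list(name = "Friend C", gender = "female"),
  list(name = "Friend D")  # gender can be missing from a profile
))

# Pull out one gender value per friend, labeling missing ones explicitly.
genders <- vapply(
  friends$data,
  function(f) if (is.null(f[["gender"]])) "unspecified" else f[["gender"]],
  character(1)
)

# Tabulate the counts by gender.
gender_counts <- table(genders)
gender_counts
```

Passing gender_counts to barplot() then produces the simple bar plot.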

It wasn’t rocket science, but it was a fun project to toy around with and a way to get to know manipulating data in R.


VicinityBuzz Update: Windows Phone 8 & More

VicinityBuzz on Win Phone 8

While attending Codemash a few weeks ago, I ended up in a Windows Phone development precompiler (Codemash’s name for a training session). It was my plan to hit mostly mobile and analytics sessions, but I was not originally planning on attending this session. With Windows Phone still struggling for market share, I wasn’t in a rush to work with it. However, other sessions were cancelled because weather had delayed some presenters, so I ended up in this session. Microsoft’s Jeff Blankenburg was teaching the session, and I have enjoyed some of his presentations and a Silverlight fire-starter event in the past. It’s one of my rules of conferences to attend sessions based more on good speakers, rather than based solely on topic.

With regard to market share, Jeff made the point during the session that with a less crowded app store, you do have a bit more discoverability. Even if that doesn’t hold up, the platform shares enough similarity with Windows 8 that a port to the Windows Store will be trivial. The Windows App Store isn’t exactly setting the world on fire either, but I’d like to see my app on all of these platforms, and as Windows 8 adoption rises with new machine sales, that marketplace should see constant upticks.

Having worked with Silverlight in the past, I found it pretty easy to get going on Win Phone 8 development. There was some definite rust on my XAML skills, but it came back to me fairly quickly. One thing to keep in mind is that you want to keep things relatively simple on a mobile platform. I have worked on some WPF projects in enterprise settings with MVVM frameworks, dependency injection frameworks, and more. While I followed an MVVM pattern, I just rolled my own with a simple base class.

My project was to do a version of an app I already have in the iOS App Store, VicinityBuzz. It does location-based searches of Twitter. You can search around you, or by entering an address. The radius is a configurable setting. I like using the app at conferences like Codemash to catch all the chatter that may not have a hashtag. One catch is that only tweets that include a location will be found. If folks have that feature turned off in their Twitter app, their tweets won’t show up.

Since I had written the app before (in PhoneGap for iOS), I knew the feature set and domain cold. The challenge was just getting up to speed with the latest APIs for search and geolocation, and then implementing them on a new platform. One of the biggest benefits of this project was getting up to date with the latest Twitter API. I still need to update the version for iOS, as it’s currently non-functional because of API changes over the last several years. I plan on doing that very soon now that I know the latest version.

Anyway, I won’t go into the development details too much here, but I finished a version 1 of VicinityBuzz, and it is now in the Windows Phone Store here, and it’s free. So go check it out. If you like it, I’d love to have some more reviews.

Also, if you are inspired to do any Windows Phone development yourself, you may be interested in a device to do some real testing. I recently found there are some prepaid phones new on Amazon that are dirt cheap for that purpose. Check out the Nokia Lumia 520 and Nokia Lumia 521 on Amazon.

Watch this blog for upcoming posts about working with the Twitter API, and some of the things I learned working with Windows Phone 8. And more mobile in general. I have the bug again…