Monday, July 27, 2015

21st Century NBA Basketball Prediction Program - Setting up the Data

  Two weeks ago I wrote that I had finished creating a self-signup mechanism for my youth chess tournaments three weeks ahead of schedule, and that my next project was to re-create the basketball prediction program I wrote in 1987 using dBASE III. I still have my old database and source code, but I don’t have a computer that can read the floppy disks they are stored on, a copy of any version of dBASE, or the inclination to even try to use any of these three-decades-old technologies. What I do have is knowledge of what I did 28 years ago, three more decades of experience writing software, computers that can do in a second what took 1980s computers hours, and, last but not least, the Internet.

  My first step was to create the database to hold game scores and betting lines, which was simple enough. I have fields for the season, the date, the home team and their score, the away team and their score, the betting line, and two extra fields to note when the game was played at a neutral site (in which case both teams could be considered away teams) or when the NBA franchises in New York or Los Angeles were playing each other (in which case both teams could be considered home teams). Since the 2015-2016 schedule won’t be released until next month, I intend to playtest using the 2014-2015 schedule. I quickly found the entire schedule and scores on basketball-reference.com, which provided a convenient CSV (comma-separated values) export that I was able to massage and plug into my data table in minutes.
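  For the curious, the table I described could be sketched along these lines in SQLite. The field and table names here are my own invention for illustration, not the actual schema:

```python
import sqlite3

# Hypothetical sketch of the games table; names and types are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE games (
        season       TEXT,     -- e.g. '2014-2015'
        game_date    TEXT,     -- ISO date, e.g. '2014-10-28'
        home_team    TEXT,
        home_score   INTEGER,
        away_team    TEXT,
        away_score   INTEGER,
        line         REAL,     -- closing betting line
        neutral_site INTEGER DEFAULT 0,  -- 1: both teams effectively away
        shared_city  INTEGER DEFAULT 0   -- 1: NY or LA teams hosting each other
    )
""")
conn.execute(
    "INSERT INTO games (season, game_date, home_team, home_score, "
    "away_team, away_score, line) VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("2014-2015", "2014-10-28", "LAL", 90, "HOU", 108, 4.0),
)
row = conn.execute("SELECT home_team, away_score FROM games").fetchone()
print(row)  # → ('LAL', 108)
```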

  Having saved hours of data entry thanks to the power of the internet, my next job was to get the betting lines for each of the 1,230 NBA games played last season. This was no easy task since gamblers are a forward-looking bunch, and past betting recommendations and lines are not generally saved. I couldn’t find them on any of the sites I used when I was making my basketball picks during the season. Just as I was ready to start searching through old newspapers and microfiche at the library, I stumbled upon freeplays.com, which has closing betting lines for NBA games going back to the 2002 season! There was no handy download feature like basketball-reference.com’s, so I brought up each day of the 2014-2015 NBA season, entered the betting lines into my spreadsheet, and updated the database from the spreadsheet. Entering each day’s worth of games took between one and three minutes, and there is no guarantee my data entry was 100% accurate. After almost seven hours spread out over the past week, I have a database of the 2014-2015 NBA season complete with betting lines.

  Thanks to freeplays.com I could enter the past 15 years of NBA schedules and betting lines, but the data entry would take me around 100 hours. If I ever need that information in my database, I’ll write a program to scrape the data from the freeplays.com website. For now, one season’s worth of data should be enough to test the prediction program that I haven’t written yet.

  My program from 30 years ago assigned home and road power rankings to teams using a weighted average of four point differentials: the season as a whole, the last x games, the season’s home or road games (depending on the current game), and the last x home or road games. I then compared the two teams’ rankings to get my point spread and compared it to the spread in the paper. There will be some games where my rankings differ greatly from the published betting line and many more that are very close to it. My goal is to discover a formula whose prediction percentage has a positive correlation with the difference between my projections and the betting line. I needed at least one month of data to have a large enough sample to create rankings, and running calculations against even that small amount of data would have taken hours on the computers of the 1980s.
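  That weighted average can be sketched in a few lines. The weights and sample differentials below are placeholders of my own choosing; the original 1987 weightings aren’t recorded here:

```python
# Sketch of the weighted-average power ranking: four average point
# differentials combined with assumed weights (not the original values).
def power_ranking(season_diff, recent_diff, venue_diff, recent_venue_diff,
                  weights=(0.4, 0.2, 0.3, 0.1)):
    """Combine the four point differentials into a single ranking."""
    parts = (season_diff, recent_diff, venue_diff, recent_venue_diff)
    return sum(w * p for w, p in zip(weights, parts))

# Predicted spread is the home ranking minus the away ranking.
home = power_ranking(3.5, 5.0, 6.1, 7.2)    # home team's differentials
away = power_ranking(-1.0, 0.5, -2.2, -1.8)  # away team's differentials
predicted_spread = home - away
print(round(predicted_spread, 2))  # → 6.09
```

Games where this predicted spread differs sharply from the published line are the candidates for picks.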

  Last century I wanted to add variables to account for back-to-back games, four games in five nights, days of rest, the last game of a long road trip, and the start of a home stand. I had the data and the know-how but not the hours these variables would have added to the tests. The computers of 2015 can easily handle all these variables and more. I spent a few hours on Sunday afternoon adding views to my database to show me point differentials for home games, road games, and the season in total, as well as selected increments of past games for trending purposes. Over the next week I’ll add the activity variables (back-to-back games, four games in five nights, etc.). Once that’s done I’ll work on a rough sketch of an application to run tests using different weightings of these variables. The ideal application will let me create a script of multiple scenarios that I can start before I go to bed and check on in the morning. In the late 1980s, running tests on a few months of data took hours on the standard personal computers of the time. The processing time convinced me to use the first formula I found that gave me a positive correlation (and results better than the magic 52.4% break-even mark). In the 21st century these same tests will take minutes if not seconds. That increase in speed gives me the luxury of doing more testing, and at the same time puts a priority on designing my testing platform to take advantage of the opportunities the extra speed provides.
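  The activity variables are straightforward to derive from a team’s schedule. Here is one possible sketch (function name and exact definitions are my own, not the planned implementation):

```python
from datetime import date

# Hypothetical activity variables: given a team's game dates in order,
# derive days of rest and flag back-to-backs and four-games-in-five-nights.
def activity_flags(game_dates):
    feats = []
    for i, d in enumerate(game_dates):
        rest = (d - game_dates[i - 1]).days - 1 if i else None
        feats.append({
            "date": d,
            "rest": rest,                    # full days off before this game
            "back_to_back": rest == 0,       # played the night before
            # 4th game within a 5-night span
            "four_in_five": i >= 3 and (d - game_dates[i - 3]).days <= 4,
        })
    return feats

games = [date(2014, 11, 1), date(2014, 11, 2),
         date(2014, 11, 4), date(2014, 11, 5)]
flags = activity_flags(games)
```

As for the “magic 52.4%”: at standard -110 pricing a bettor risks $110 to win $100, so the break-even win rate is 110/210, about 52.38%.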

  Once I have the variables and the testing application in place, the last step will be to create an application that determines the actual picks. This should be a simple offshoot of the testing program. I’m amazed at how fast this project is coming together compared to my first effort 30 years ago. At the same time, I’m not too surprised. Technology has made most things easier, and creating a basketball prediction program is no exception.