Friday, August 14, 2015

21st Century NBA Basketball Prediction Program - An Unexpected Detour

  Last month I wrote about the start I had made in recreating the 21st century version of my basketball prediction program. Having entered all the scores and betting lines for the 2014-2015 season, I fleshed out my data views to show season-to-date point differential averages for all games, all home games, and all road games, as well as averages over the last three to ten games for the same three categories. I then added data fields to indicate whether a team was beginning or ending a series of back-to-back games, how many days off a team had, and whether a team was playing four games in five days.

  I finished these tasks last week. I was going to start writing a program to test these variables and see how my predictions compared against the official betting lines. The more I thought about the program, the more I realized I could write a SQL (Structured Query Language) stored procedure to run against my data. A stored procedure is nothing more than a program that works with data without any user interface considerations. My stored procedure takes parameters that set the weighting of the different variables behind my version of ‘power rankings’. It uses those rankings to create my own betting line for each game and compares that line to the official betting line. My stored procedure then ‘picks’ a team depending on how the two lines compare. If I have a team favored by more than the official line I pick the favorite, and when I have the underdog winning or getting fewer points than the official line I pick the underdog. My betting line, the betting line difference, the point spread winner, and my pick are appended to a fresh copy of the data for that season and returned from my stored procedure.
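  The pick rule above can be sketched in a few lines. This is only an illustration in Python, not the stored procedure itself: the function name `pick_side` is hypothetical, both lines are assumed to be expressed as the expected home-team margin (positive means the home team is favored), and skipping pick'em games and games where the two lines agree is an assumption.

```python
from typing import Optional


def pick_side(my_line: float, official_line: float) -> Optional[str]:
    """Return "favorite", "underdog", or None when there is nothing to pick.

    Both arguments are expected home-team margins (an assumed convention):
    +5 means the home team is favored by 5, -5 means the road team is.
    """
    if official_line == 0:
        return None  # pick'em game: no favorite to side with (assumption)

    # Re-express both lines from the official favorite's point of view.
    sign = 1 if official_line > 0 else -1
    official_spread = sign * official_line  # always positive
    my_spread = sign * my_line              # negative if I have the underdog winning

    if my_spread > official_spread:
        return "favorite"   # I favor them by more than the official line does
    if my_spread < official_spread:
        return "underdog"   # underdog wins outright or gets fewer points
    return None             # the lines agree, so no pick (assumption)


print(pick_side(7, 5))    # home favored by 5, I say 7
print(pick_side(-2, 5))   # home favored by 5, I have the road team winning
```

  Folding both lines into the favorite's point of view keeps the rule the same whether the favorite is at home or on the road.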

  I then wrote a second stored procedure to analyze the results from my first stored procedure. I take the results and summarize them by the difference between my betting line and the official betting line, along with my betting record. In a perfect world there will be a correlation between the size of the difference between the two betting lines and the winning percentage of my bets. There are 1,230 games in a season, and my betting line is within three points of the official line over 65% of the time and within five points in more than 90% of the games. The bookmakers’ betting lines are usually pretty close to the actual game result, so when my betting lines are close to the bookmakers’ I take that as a sign that my formulas are on the right track. When my lines are very different I hope it is because I’ve uncovered some bias on the public’s or bookmaker’s part and not found a flaw in my formula.
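  The summary step can be sketched as a small grouping routine. This is a Python illustration rather than the actual stored procedure: the bucket boundaries, the function names, and the toy input rows are all assumptions made up for the example, and games that land exactly on the point spread are assumed to have been filtered out already.

```python
from collections import defaultdict


def bucket_label(line_diff: float) -> str:
    """Group a game by how far my line is from the official line (assumed buckets)."""
    d = abs(line_diff)
    if d < 3:
        return "0-3"
    if d < 5:
        return "3-5"
    if d < 7:
        return "5-7"
    if d < 9:
        return "7-9"
    return "9+"


def summarize(results):
    """results: iterable of (line_diff, pick_won) pairs, one per game bet.

    Returns {bucket: (wins, bets, win_pct)} for every bucket that has bets.
    """
    tally = defaultdict(lambda: [0, 0])  # bucket -> [wins, bets]
    for line_diff, pick_won in results:
        counts = tally[bucket_label(line_diff)]
        counts[1] += 1
        if pick_won:
            counts[0] += 1
    return {b: (w, n, w / n) for b, (w, n) in tally.items()}


# Toy rows invented for illustration only, not real results.
toy = [(1.0, True), (-2.5, False), (6.0, True), (9.5, True)]
for bucket, (wins, bets, pct) in summarize(toy).items():
    print(bucket, wins, bets, f"{pct:.0%}")
```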

  I changed my ‘analysis’ stored procedure to read the parameters from a table, call the data gathering stored procedure, and save the parameters, raw results, and summarized results in a new table. That way I have the results handy and can look for common patterns in the games I’m missing or getting right.

  I created 30 sets of parameters and fired up my procedures. Within a few minutes I had a renewed appreciation of the power of Structured Query Language and weeks of data for me to peruse. Looking at the results, I saw that the worst percentages belonged to the tests that gave preference to the most recent results. The best results came from a 50-50 mix of season-to-date and most recent games, with a 40% preference given to the home game point differential for the home team and the away game differential for the team playing on the road.

  Two of the result sets gave me a 63% and 64% winning percentage when my point spread differed from the betting line by more than five points and a 50% winning percentage when my point spread was within five points of the betting line. In the world of sports predictions 50% is losing big time because of the 10 percent penalty (or vigorish) on losing bets. It takes 11 winning bets to offset 10 losing bets, which makes the betting break-even percentage 11 out of 21, or 52.4%. I could expect a 50% success rate on basketball games by flipping a coin. Getting 63% on the subset of my picks where my point spread is more than 5 points away from the established line is about as likely as finding a pot of gold at the end of the rainbow.
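  The break-even arithmetic is easy to check. The sketch below assumes the standard arrangement of risking 11 units to win 10; the function names are hypothetical and the bet counts in the usage lines are round numbers chosen for illustration, not my actual record.

```python
def break_even(risk: float = 11.0, win: float = 10.0) -> float:
    """Winning percentage needed to break even when risking `risk` to win `win`."""
    return risk / (risk + win)


def units_won(wins: int, losses: int, risk: float = 11.0, win: float = 10.0) -> float:
    """Net result after collecting `win` per winner and paying `risk` per loser."""
    return wins * win - losses * risk


print(round(100 * break_even(), 1))  # 52.4
print(units_won(50, 50))             # -50.0: a coin flipper loses over 100 bets
print(units_won(63, 37))             # 223.0: a 63% record over 100 bets
```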

  The only disturbing feature of my 63% winning predictions was that I was at 50 percent or slightly below when my point spreads differed by 7 to 9 points. This was offset by the higher percentages when the difference was 5 to 7 points and a stellar 80%+ percentage when my point spread was greater than 9 points from the official line. I looked at the games in question and didn’t see anything obvious that would make me doubt the formula but the discrepancy still makes me wonder if I am just experiencing random luck in my predictions.
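  One way to put a number on the "random luck" worry is to ask how often a coin flipper would do at least as well in a bucket of the same size. The sketch below uses the exact binomial tail probability; the function name is hypothetical, and the 8-of-10 figures are placeholders for illustration since the per-bucket bet counts are not listed here.

```python
from math import comb


def chance_of_at_least(wins: int, bets: int, p: float = 0.5) -> float:
    """Probability of `wins` or more winners in `bets` tries when each wins with chance p."""
    return sum(
        comb(bets, k) * p**k * (1 - p) ** (bets - k)
        for k in range(wins, bets + 1)
    )


# Placeholder counts: 8 winners in a bucket of only 10 bets.
print(round(chance_of_at_least(8, 10), 3))  # 0.055
```

  With a small bucket, even an 80% record shows up by chance more than one time in twenty, which is why another season or two of data matters.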

  I expected to find a working formula sometime in September and have October and November to verify it, but here I am in the middle of August with not just a working formula but two spectacularly working formulas. Is this a stroke of good fortune or a promising start to a dead end? What I can’t say is whether the winning formulas are only good for the 2014-2015 season or whether the success rate would carry on into other seasons. There is only one way to make this determination, and that is to enter another season or two in the database and run the same tests over again. I wasn't planning on entering data for different seasons, but the promise of having a super formula makes it worth the time investment. If I can get a formula that is consistent over different seasons I’ll feel comfortable using my program for the 2015-2016 NBA season that is still 10 weeks away.
