Using AI to make sure I don’t lose my Fantasy Basketball League next year

Aryan Jha
8 min read · Oct 13, 2024

I would say that I have above-average basketball knowledge, especially compared to my friends.

However, this didn’t stop me from coming 5th out of 12 in our fantasy basketball league last year. While some of it was outside my control (like Ja Morant playing a grand total of 9 games), I feel like I could have played the free-agent market better with the revolving door of players that was the 2023–24 Memphis Grizzlies.

A picture of slightly more than half of the 33(!) players that suited up for the Grizzlies last season. (credits: u/loooeeee on Reddit)

Part of this was that my friends were too scared to drop players who were consistently average for a player who could go for 40 points or 0 points on any given night. However, since I had fallen behind at the start, I was willing to take on any risk, especially when the reward could be so high.

Given a list of 5–10 players on the free agent market that would play on a given night, how would I know which ones would do the best?

I knew the solution: I would use AI.

Unfortunately, with school and other things getting in the way, I thought I didn’t have the time to create a machine learning model that could predict which players would do well, especially by the end of the season. Even though it was a bit late, I decided to try it anyway this summer, just to see if the idea would work.

First, I started with the data collection. I knew I couldn’t manually compile the dataset, so I set out to find an API that could do that for me. I eventually found nba_api, which allowed me to access the box scores for any game with a rate limit large enough to work without large delays.

from nba_api.stats.endpoints import leaguegamelog
games = leaguegamelog.LeagueGameLog(player_or_team_abbreviation='T', season="2023-24", sorter="DATE")
gdict = games.get_dict()
game_id = gdict['resultSets'][0]['rowSet'][0][4]
games.get_data_frames()[0]

The output is a log of every team’s games for the season, along with each team’s stats for that game. The Grizzlies’ rows get picked out of it in the next step.

We only need the game_id here, as we will use it to find each player’s stats.
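The numeric positions in each row (such as 4 for the game ID in the snippet above) are easy to mix up. As a hedged alternative, the response dictionary also carries a headers list next to rowSet, so a column's position can be looked up by name. The sketch below uses a tiny hand-made dictionary in the same shape instead of a live API call; sample_response and column_index are names made up for illustration, and the rows are not real data.

```python
# Hand-made stand-in for games.get_dict(); the real response has many more columns.
sample_response = {
    "resultSets": [
        {
            "headers": ["SEASON_ID", "TEAM_ID", "TEAM_ABBREVIATION", "TEAM_NAME", "GAME_ID"],
            "rowSet": [
                ["22023", 1, "MEM", "Memphis Grizzlies", "G001"],
                ["22023", 2, "XYZ", "Some Other Team", "G001"],
                ["22023", 1, "MEM", "Memphis Grizzlies", "G002"],
            ],
        }
    ]
}


def column_index(response, column_name):
    """Return the position of column_name in the first result set's headers."""
    return response["resultSets"][0]["headers"].index(column_name)


team_col = column_index(sample_response, "TEAM_NAME")
game_col = column_index(sample_response, "GAME_ID")

# Keep the game IDs of rows whose team name matches.
memphis_game_ids = [
    row[game_col]
    for row in sample_response["resultSets"][0]["rowSet"]
    if row[team_col] == "Memphis Grizzlies"
]
print(memphis_game_ids)
```

Looking columns up by name costs one extra line, but it means a wrong index shows up as an error instead of a silently wrong column.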

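getting = True
game_ids = []
i = 0
# Walk the rows until we run off the end, keeping only the Grizzlies' games.
while getting:
    try:
        if gdict['resultSets'][0]['rowSet'][i][3] == "Memphis Grizzlies":
            game_ids.append(gdict['resultSets'][0]['rowSet'][i][4])
        i += 1
    except IndexError:
        getting = False

# Save the IDs so later runs can skip this API call.
with open("game_ids.txt", "w") as f:
    f.write(str(game_ids)[1:-1])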

This creates a list of all the game_ids and writes it to game_ids.txt. That way I don’t have to repeat the API call every time I run the program, which helps me avoid hitting the rate limit in a later step.

from nba_api.stats.endpoints import boxscoretraditionalv2
BoxScore = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id="0022300071")
BoxScore.get_data_frames()[0]
BoxScore.get_dict()

We can use the BoxScoreTraditionalV2 endpoint to get the individual stats of the players who played during the game. This is very useful for me, as it allows me to access each player's stats for each game played by the Grizzlies.

with open("game_ids.txt", "r") as f:
    game_id_values = f.read()

# Strip the quotes left over from str(list), then split on the commas.
game_id_values = game_id_values.replace("'", '').replace('"', '')
# Splitting (rather than searching for each comma) also keeps the last ID,
# which has no comma after it.
game_ids = [g.strip() for g in game_id_values.split(",") if g.strip()]

Since the game_ids.txt file is plain text, I have to do some processing to make it into a Python list.
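One way to avoid the manual parsing, sketched here as an alternative and not what the original script does, is to store the list as JSON so it comes back as a Python list directly. The file name game_ids.json and the second ID below are placeholders for illustration.

```python
import json
import os
import tempfile

# The first ID appears elsewhere in this article; the second is a made-up placeholder.
example_ids = ["0022300071", "0000000000"]

# Write to a temporary folder so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "game_ids.json")

with open(path, "w") as f:
    json.dump(example_ids, f)

with open(path, "r") as f:
    loaded_ids = json.load(f)

print(loaded_ids)
```

Because JSON keeps the strings quoted, leading zeros in the IDs survive the round trip.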

from nba_api.stats.endpoints import boxscoretraditionalv2
import time

total_stats = []

for game_id in game_ids:
    time.sleep(1)  # pause between requests to stay under the rate limit
    BoxScore = boxscoretraditionalv2.BoxScoreTraditionalV2(game_id=game_id)
    rows = BoxScore.get_dict()["resultSets"][0]["rowSet"]
    stats = []
    for row in rows:
        # Keep Grizzlies players only, and skip anyone whose comment field
        # contains "-" (a did-not-play note), since they have no stats.
        if "MEM" in row[2] and "-" not in row[8]:
            box = [row[5], list(row[9:29])]  # player name, then the stat columns
            # Minutes arrive as a string; keep the whole number before the ".".
            box[1][0] = int(box[1][0][0:box[1][0].index('.')])
            stats.append(box)
    total_stats.append([stats])

Using the list, I can now use the API to get the stats for each game and separate them by player. It creates a list of player names and their stats for each game.
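Since this step makes one request per game, a single failed request can interrupt the whole run. A minimal sketch of one way to handle that is to retry with a growing delay. Everything here is hypothetical: fetch_with_retry is not part of nba_api, and flaky_fetch only simulates a request that fails twice before succeeding.

```python
import time


def fetch_with_retry(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, waiting longer after each failure."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Stand-in for a real request: fails twice, then returns a value.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "box score"


# Record the waits in a list instead of actually sleeping.
waits = []
result = fetch_with_retry(flaky_fetch, sleep=waits.append)
print(result, waits)
```

In the loop above, the box score request could be wrapped in a lambda and passed in as fetch, though I have not tested how the API behaves when it is actually rate limited.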

names = ['Santi Aldama', 'Timmy Allen', 'Desmond Bane', 'Bismack Biyombo', 'Brandon Clarke', 'Tosan Evbuomwan',
'Wenyen Gabriel', 'Jacob Gilyard', 'Jordan Goodwin', 'Shaquille Harrison', 'Matt Hurt', 'GG Jackson',
'Jaren Jackson Jr.', 'DeJon Jarreau', 'Trey Jemison', 'Luke Kennard', 'John Konchar', 'Jake LaRavia',
'Kenneth Lofton Jr.', 'Ja Morant', 'Jaylen Nowell', 'Maozinha Pereira', 'Scotty Pippen Jr.',
'David Roddy', 'Derrick Rose', 'Zavier Simpson', 'Marcus Smart', 'Lamar Stevens', 'Xavier Tillman',
'Yuta Watanabe', 'Jack White', 'Vince Williams Jr.', 'Ziaire Williams']
team_outs = [[1]*82 for _ in range(33)]

I want to track which players are out, so I create 33 lists of 82 “1”s, where a 1 symbolizes that the player did not play in that game.

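# enumerate gives the game number directly, which also stays correct
# if two games happen to have identical stats.
for j, game in enumerate(total_stats):
    for k in range(len(game[0])):
        for l in names:
            if l in game[0][k][0]:
                team_outs[names.index(l)][j] = 0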

I check the total_stats to find the players from each game. If the name of a player is found, then I update the list to show that the player was present in the game.
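To make the idea concrete, here is a small sketch of the same bookkeeping on made-up data: three players, three games, everyone assumed out until their name shows up. The toy_ names and the players are placeholders, not part of the real dataset.

```python
toy_names = ["Player A", "Player B", "Player C"]

# One inner list per game, holding the names that appeared in that box score.
toy_games = [
    ["Player A", "Player B"],
    ["Player B"],
    ["Player A", "Player C"],
]

# Start by assuming everyone is out (1), then flip to 0 when a name shows up.
toy_outs = [[1] * len(toy_games) for _ in toy_names]
for game_number, played in enumerate(toy_games):
    for player_number, name in enumerate(toy_names):
        if name in played:
            toy_outs[player_number][game_number] = 0

for name, row in zip(toy_names, toy_outs):
    print(name, row)
```

Each row reads left to right as one player's season: Player C, for example, is out for the first two games and plays the third.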

def AddOuts(data, game):
    out_data = [data]
    for player_outs in team_outs:
        out_data.append(player_outs[game])
    return out_data

player_data = []
total_data = []
count = 0

for i in total_stats:
    for j in i:
        for k in j:
            player_data = [count]
            player_data.append(k)
            player_data = AddOuts(player_data, count)
            total_data.append(player_data)
    count += 1  # count is the game number, so it advances once per game

We now have a list of each player’s stats for each game as well as the playing status of every other player.

def flatten(data):
    flat = []
    for i in data:
        if isinstance(i, list):
            flat.extend(flatten(i))
        else:
            flat.append(i)
    return flat

t_data = []
for i in range(len(total_data)):
    t_data.append(flatten(total_data[i]))

To make the list easier to use, it is flattened so there aren’t multiple layers of lists (ex. [0, [1, 2, 3, [4, 5, 6]], 7] becomes [0, 1, 2, 3, 4, 5, 6, 7]).

for i in range(len(t_data)):
    for j in names:
        if t_data[i][1] == j:
            t_data[i][1] = names.index(j)

For the machine learning model to understand which player is which, we can’t keep their names as strings. The loop above finds each name in the names list and replaces it with its position.
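One caveat with a single position number is that it suggests an ordering between players (player 30 is "more" than player 2) that doesn't mean anything. A common alternative, shown here only as a sketch and not what this project uses, is one-hot encoding, where each player gets their own 0/1 column. The names one_hot, toy_names, and name_to_index are made up for this example.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vector = [0] * size
    vector[index] = 1
    return vector


toy_names = ["Player A", "Player B", "Player C"]

# A dictionary lookup avoids scanning the list for every row.
name_to_index = {name: position for position, name in enumerate(toy_names)}

encoded = one_hot(name_to_index["Player B"], len(toy_names))
print(encoded)
```

With 33 players this would add 33 input columns, so it is a trade-off between input size and not implying an order.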

Currently, the issue is that it would be hard to predict a player’s exact stats given the players that are not playing. Instead, it would be better if we could compare their stats to their averages, so we can see if they do better or worse.

from nba_api.stats.static import players
from nba_api.stats.endpoints import commonplayerinfo
import json

def get_averages(name):
    averages = []
    d = players.find_players_by_full_name(name)
    player_id = d[0]["id"]  # renamed from id to avoid shadowing the built-in
    jsvalues = json.loads(commonplayerinfo.CommonPlayerInfo(player_id).get_json())
    # Columns 3 to 5 of the second result set hold the headline averages.
    for i in range(3, 6):
        averages.append(jsvalues['resultSets'][1]['rowSet'][0][i])
    return averages

team_averages = []
for i in names:
    team_averages.append(get_averages(i))
team_averages

This uses the CommonPlayerInfo endpoint to get the averages of each player and put them in a list. This endpoint only allows us to get the average points, rebounds, and assists for a player, so that is what we will be working with. Unfortunately, this does mean that players with more defensive impact will be overlooked, but that might be something to explore at a later time.

def compare(stats, average):
    comp = []
    for i in range(3):
        if stats[i] > average[i]:
            comp.append(1)
        else:
            comp.append(0)
    return comp

This takes the stats of one game and the player’s averages and creates a new list, with a 1 symbolizing a better performance in that category compared to their average, and a 0 symbolizing the opposite.
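As a quick worked example with made-up numbers, here is the same comparison written compactly. It checks the lists position by position, so both lists need to list the categories in the same order, and a stat that exactly equals the average counts as 0.

```python
def compare(stats, average):
    """Return 1 for each category where the game stat beats the average, else 0."""
    return [
        1 if game_value > avg_value else 0
        for game_value, avg_value in zip(stats, average)
    ]


# Made-up numbers; both lists use the order rebounds, assists, points.
game_stats = [12, 4, 20]
season_avg = [8.5, 4.0, 22.1]

print(compare(game_stats, season_avg))
```

Here only the rebounds beat the average: the assists tie it and the points fall short.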

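a_data = []
for i in t_data:
    # Flattened row layout: 14 = REB, 15 = AST, 20 = PTS; the out flags start at 22.
    avg = team_averages[i[1]]  # i[1] is already the 0-based position in names
    temp_data = []
    temp_data.append(i[0])
    temp_data.append(i[1])
    # get_averages reads the averages as points, assists, rebounds (assumed order),
    # so reorder them here to line up with REB, AST, PTS.
    temp_data.append(compare([i[14], i[15], i[20]], [avg[2], avg[1], avg[0]]))
    temp_data.append(i[22:])
    a_data.append(flatten(temp_data))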

This creates a similar list to the t_data one from before but replaces the points, assists, and rebounds with the comparison numbers from above.

import pandas as pd

labels = ["Game", "Name", 'REB', 'AST', 'PTS', "Santi Aldama Out",
"Timmy Allen Out", "Desmond Bane Out", "Bismack Biyombo Out", "Brandon Clarke Out", "Tosan Evbuomwan Out",
"Wenyen Gabriel Out", "Jacob Gilyard Out", "Jordan Goodwin Out", "Shaquille Harrison Out", "Matthew Hurt Out",
"GG Jackson II Out", "Jaren Jackson Jr Out", "DeJon Jarreau Out", "Trey Jemison Out", "Luke Kennard Out",
"John Konchar Out", "Jake LaRavia Out", "Kenneth Lofton Jr Out", "Ja Morant Out", "Jaylen Nowell Out",
"Maozinha Pereira Out", "Scotty Pippen Jr Out", "David Roddy Out", "Derrick Rose Out", "Zavier Simpson Out",
"Marcus Smart Out", "Lamar Stevens Out", "Xavier Tillman Sr Out", "Yuta Watanabe Out", "Jack White Out",
"Vince Williams Jr Out", "Ziaire Williams Out"]

df = pd.DataFrame(data=a_data, columns=labels)
df.to_csv('data.csv')

Finally, we output the dataset with the correct labels. Now, we’re done with the preprocessing step and we can move on to the actual machine learning part.

import pandas as pd
import numpy as np

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.model_selection import train_test_split

df = pd.read_csv("data.csv")
labels = df.iloc[:, [2,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]]
data_reb = df.iloc[:, [3]]
data_ast = df.iloc[:, [4]]
data_pts = df.iloc[:, [5]]

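Here, we import the relevant packages and split the dataset into its parts. Despite its name, labels holds the model inputs (the player’s ID and the 33 out-indicators), while data_reb, data_ast, and data_pts hold the targets. Since we are trying to predict 3 different things (points, rebounds, and assists), I’ve decided to create 3 different models, which is why there are 3 different sets of data.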

X_train, X_test, y_train, y_test = train_test_split(labels, data_pts, test_size=0.2)

For the points model, the data and labels are split with 80% as the training data and 20% as the testing data.
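Because the split is random, each run trains and tests on different rows unless a seed is fixed (train_test_split accepts a random_state argument for this). The sketch below shows the underlying idea in plain Python on toy data; seeded_split is a made-up helper, not a replacement for the scikit-learn function.

```python
import random


def seeded_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed, then cut off the test portion."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = int(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]


toy_rows = list(range(10))
train_a, test_a = seeded_split(toy_rows, seed=42)
train_b, test_b = seeded_split(toy_rows, seed=42)

print(len(train_a), len(test_a))
print(test_a == test_b)  # same seed, same split
```

Fixing the seed makes runs easier to compare, since any change in accuracy then comes from the model and not from which rows landed in the test set.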

model_pts = Sequential([
    Dense(64, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.05),
    Dense(256, activation='relu'),
    Dropout(0.05),
    Dense(256, activation='relu'),
    Dropout(0.05),
    Dense(1, activation='sigmoid')
])

model_pts.compile(optimizer='adadelta', loss='binary_crossentropy', metrics=['accuracy'])
model_pts.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=1)

The model is then defined, compiled, and trained on the dataset. 20% of the training data is held out for validation, so Keras can report the loss and accuracy on unseen data after every epoch. The validation data is not used to update the weights; it is there to show whether the model is overfitting while it trains.

test_loss, test_accuracy = model_pts.evaluate(X_test, y_test, verbose=1)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")

Finally, we can see how the points model fares on the test set. Accuracy varies from run to run, between about 75% and 81%, but I have saved a version of the model that achieved 80.4% accuracy on my GitHub.

X_train, X_test, y_train, y_test = train_test_split(labels, data_ast, test_size=0.2)

model_ast = Sequential([
    Dense(64, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.01),
    Dense(64, activation='relu'),
    Dropout(0.01),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

Again, we do the same with assist data.

model_ast.compile(optimizer='adagrad', loss='binary_crossentropy', metrics=['accuracy'])
history = model_ast.fit(X_train, y_train, epochs=150, batch_size=64, validation_split=0.2, verbose=1)

The assist model is less accurate, mostly because assists can be very volatile. Most players don’t focus on assists, and their totals can rise or fall by chance. However, the model I saved still had an accuracy of 62%. This means it’s slightly better than random chance, but it’s nothing to be proud of.

X_train, X_test, y_train, y_test = train_test_split(labels, data_reb, test_size=0.2)

model_reb = Sequential([
    Dense(64, input_dim=X_train.shape[1], activation='relu'),
    Dropout(0.1),
    Dense(128, activation='relu'),
    Dropout(0.1),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')
])

model_reb.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
history = model_reb.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2, verbose=1)

Finally, we define and train the rebounds model in the same way.

test_loss, test_accuracy = model_reb.evaluate(X_test, y_test, verbose=1)
print(f"Test Loss: {test_loss}, Test Accuracy: {test_accuracy}")

The rebound model is much more accurate, and my model has an accuracy of 90.2%. This is the most reliable model by far.
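Accuracy numbers like these are easier to judge next to a baseline. If most rows share the same label, a model can score well just by always predicting that label. I haven't measured whether that applies to this dataset, so the sketch below only shows how such a check could look, with made-up labels and a hypothetical helper called majority_baseline_accuracy.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy from always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Made-up labels: seven 0s and three 1s.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
print(majority_baseline_accuracy(toy_labels))
```

A model is only adding something if its test accuracy is clearly above this number for the same labels.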

So, in summary:

  1. I got the game IDs for each game played by the Grizzlies.
  2. I got the box score for each of these games.
  3. I separated each player’s stats.
  4. I found which Grizzlies players were playing in that game.
  5. I created a data set with the players, their stats from that game, and indicators of whether or not each player on the roster had played that game.
  6. I trained 3 different machine learning models to predict whether the points, assists, and rebounds were higher than the player’s average given the data about which players did not play.
  7. The points model had an accuracy of 80.4%. The assists model had an accuracy of 62%. The rebounds model had an accuracy of 90.2%.

If you want to run a web version of the model, check out this project’s page on my website.

Overall, while these models may not help me directly in next year’s fantasy basketball league, I’ll try a similar method as the season goes on. I hope you guys learned something, and I’ll see you again when I discuss my next project.
