Building an NFL Data Question Answering Bot
It’s Fantasy Football season. I know I could just say it’s Football Season, but that doesn’t really capture why I’m excited. I’m mainly excited to look at a bunch of statistics and pretend I’m Jonah Hill from Moneyball.
For historical reasons, my league uses Yahoo for Fantasy Football (which I assume is the only reason why anyone uses Yahoo). Yahoo has an okay UI for statistics but it’s limited. For example you can’t look at the best players by “targets + carries”, a metric I care about a lot. You also can’t evaluate across more than one season.
So we have a website built for researching statistics but it’s developers didn’t realize that their user is the Michael Burry of Fantasy Football (how many Michael Lewis references am I allowed?). Is it possible that I’ve found the first non-contrived reason to use a large language model? To build myself a stat companion to answer NFL data questions, and free me from my button clicking past?
Admittedly this is feeling pretty contrived, but I’m fully bought in on this AI movement, so I’m going ahead with this project.
Example Questions
- Who are the top 5 players this week by Receiving Targets + Rushing Attempts?
- Which Defense lets up the most receiving yards?
- Which QBs have the most passing yards this week?
- What was Brady and Jalen Hurtss average fantasy points last year?
- Has any QB ever gotten over 500 yards in a game?
- What was the average points scored of the top 30 QBs last year?
Step 1: Can I Code It?
Answering these 6 questions with code took me about 8 hours. Admittedly about 5 of those hours were spent trying to scrape https://www.pro-football-reference.com/. Thankfully I couldn’t get around their throttling and had to explore other options, which led me to nflverse-data. Bless.
With the nflverse data, and a helper library written by cooperdff — who, based on the GitHub heatmap only codes during the NFL season — I can cleanly write code to answer all 6 questions. Here’s an example for my core use case:
- Who are the top 5 players this week by Receiving Targets + Rushing Attempts?
import nfl_data_py as nfl
# Get the current season data
current_season = 2023
df_current_season = nfl.import_weekly_data([current_season])
current_week = df_current_season["week"].max()
df_current_week = df_current_season[df_current_season["week"] == current_week]
df_current_week["targets+carries"] = df_current_week["targets"] + df_current_week["carries"]
display_columns = ["player_display_name", "position", "recent_team", "targets", "carries", "targets+carries"]
df_current_week.sort_values(by="targets+carries", ascending=False).head(5)[display_columns]p
I did the other 5 questions as well which you can run in Google Colab: 5 NFL Questions (Hand Coded).ipynb.
Part 2: Can ChatGPT Code it?
All of my answers were just a few lines of code and were simple to write. I expect we can get ChatGPT to answer these questions. I wrote a bit of helper code, and came up with this prompt describing the use case and the helper code:
You are a question answering agent designed to use python code and pandas to answer questions about NFL statistics. You have access to the following functions.
```python
# Returns the current season
def get_current_season() -> int
# Returns the current week. Only relevant if looking at data in current season.
def get_current_week() -> int
# Returns weekly player stats for a list of seasons. This is the output of a few rows of get_player_stats([2021])
# player_id player_name player_display_name position team opponent_team season week season_type completions attempts passing_yards passing_tds interceptions sacks sack_yards sack_fumbles sack_fumbles_lost passing_air_yards passing_yards_after_catch passing_first_downs passing_epa passing_2pt_conversions pacr carries rushing_yards rushing_tds rushing_fumbles rushing_fumbles_lost rushing_first_downs rushing_epa rushing_2pt_conversions receptions targets receiving_yards receiving_tds receiving_fumbles receiving_fumbles_lost receiving_air_yards receiving_yards_after_catch receiving_first_downs receiving_epa receiving_2pt_conversions racr target_share air_yards_share wopr special_teams_tds fantasy_points fantasy_points_ppr
# 13 00-0030513 L.Murray Latavius Murray RB BUF NYJ 2023 1 REG 0 0 0.0 0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 NaN 0 NaN 2 8.0 0 0.0 0.0 0.0 -0.030657 0 1 2 9.0 0 0.0 0.0 4.0 8.0 1.0 2.01752 0 2.25 0.051282 0.012658 0.085784 0.0 1.7 2.7
# 302 00-0039064 Z.Flowers Zay Flowers WR BAL HOU 2023 1 REG 0 0 0.0 0 0.0 0.0 0.0 0 0 0.0 0.0 0.0 NaN 0 NaN 2 9.0 0 0.0 0.0 0.0 1.086545 0 9 10 78.0 0 0.0 0.0 28.0 51.0 5.0 2.617200 0 2.785714 0.476190 0.269231 0.902747 0.0 8.7 17.700001
# 5436 00-0037834 B.Purdy Brock Purdy QB SF TB 2022 14 REG 16 21 185.0 2 0.0 0.0 -0.0 0 0 110.0 72.0 8.0 8.555145 0 1.681818 2 3.0 1 0.0 0.0 2.0 1.245572 0 0 0 0.0 0 0.0 0.0 0.0 0.0 0.0 NaN 0 NaN NaN NaN NaN 0.0 21.700001 21.700001
#
# This is the output of get_player_stats([2022])["position"].value_counts()
# position
# WR 2237
# RB 1441
# TE 1156
# QB 661
# FB 94
# P 15
# CB 6
# FS 5
# ILB 4
# SS 4
# T 3
# OLB 3
# Name: count, dtype: int64
#
# This is the output of get_player_stats([2022])["team"].value_counts()
# team
# KC 232
# MIA 190
# SF 188
# BAL 187
# PHI 185
# LAC 183
# HOU 183
# JAX 183
# BUF 183
# TB 183
# SEA 181
# DEN 180
# DET 180
# MIN 178
# NYG 177
# ATL 175
# NYJ 174
# DAL 174
# GB 173
# PIT 172
# NO 171
# CIN 169
# LA 167
# CAR 167
# IND 166
# WAS 166
# CHI 165
# TEN 164
# ARI 161
# NE 161
# CLE 158
# LV 153
# Name: count, dtype: int64
def get_offensive_player_stats(seasons: list[str]) -> pd.DataFrame
```
Given a question, please output python code which runs display() on a pandas dataframe to answer the question. For example:
Question: Which QBs have the most passing yards this week?
Answer:
# Get data for this week
df_current_season = get_offensive_player_stats([get_current_season()])
df_current_week = df_current_season[df_current_season["week"] == get_current_week()]
# Get a few relevant columns to QBs
display_columns = ["player_display_name", "position", "team", "attempts", "completions", "passing_tds", "passing_yards"]
# Display the top 10 QBs by passing yards
display(df_current_week[df_current_week["position"] == "QB"].sort_values(by="passing_yards", ascending=False)[display_columns].head(10))
Question: Which Defense lets up the most receiving yards?
Answer:
Without much Re-Work, this prompt is already getting 3 out of the 5 Questions Correct 🎉. This may not seem like grounds for celebration, but I grew up a Bears fan, we take what we can get.
The mistake in Question 2 is thinking that it should tally up the receiving_yards of defensive players. The mistake in Question 5 is not grouping by player.
Next Steps
These outputs show very minimal data, which is actually a bad user experience, because I can’t validate the answer for myself, and because there’s only one insight that can be gleamed from the resulting table. No follow up questions can be answered. What I’d really like to see in response is the supporting data needed to answer my question alongside a selection of other interesting columns, and possibly a concise answer to my question and how it is getting that answer from the data.
I think that ChatGPT may do better at writing SQL queries than python code, and these questions could all be answered with SQL queries. Next step may be to re-do this same process but for SQL.
Takeaways
Prompt Engineering needs better support. It’s difficult to test each new prompt on my test set.
I need to find a good way to tell the model what’s in the dataset. It pretty much always writes working code, but it misunderstands the dataset. I’ve seen other people print the R data output version but that doesn’t do any better experimentally. Possibly showing it a more complicated in prompt example could help.