I made Claude, GPT and Gemini predict the World Cup. I'm keeping score live.

I had a dumb question stuck in my head: if I ask the three best models to predict the World Cup, which one does best? Not in theory, not on some lab benchmark — on real matches, with real results, and a score that lands in real time.

So I built it. It’s called AI Prono Battle: Claude vs GPT vs Gemini on the 2026 World Cup group stage, live. Demo here.

The rules

Deliberately simple, because simplicity is what makes the thing honest:

All three AIs predict the 72 group-stage matches — home win, draw, or away win — before the tournament starts.
Predictions are frozen: committed to the repo, timestamped. No editing after the fact.
Minimal scoring: 1 point per correct outcome. No bonus for the exact score. Everyone plays by the same rule.
Three screens: Dashboard (matches + results), Leaderboard (who’s ahead), Rules (the methodology).
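To make the scoring rule concrete, here is a minimal sketch of it in Python. The names (Outcome, outcome_from_score, points) and the "home" / "draw" / "away" labels are my own illustration, not necessarily what the repo uses.

```python
from typing import Literal

# Hypothetical labels for the three possible outcomes of a match.
Outcome = Literal["home", "draw", "away"]


def outcome_from_score(home_goals: int, away_goals: int) -> Outcome:
    """Reduce a final score to the only thing the rules care about."""
    if home_goals > away_goals:
        return "home"
    if home_goals < away_goals:
        return "away"
    return "draw"


def points(predicted: Outcome, home_goals: int, away_goals: int) -> int:
    """1 point for the correct outcome, 0 otherwise. No exact-score bonus."""
    return 1 if predicted == outcome_from_score(home_goals, away_goals) else 0
```

With this rule, calling a home win is worth the same point whether the match ends 1-0 or 4-0, which is exactly why nobody can argue about the weighting.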

Why it’s smarter than a classic benchmark

Public AI benchmarks tend to share one problem: the longer they exist, the more they leak into training data. By the time you run them, the model may already have seen the answers.

Here, that can’t happen. The matches haven’t been played yet when the models predict, so none of them can “cheat”: there’s no right answer to memorize, because it gets decided on the pitch, after the cutoff. Same question for all three, same rule, result verifiable by anyone. This is the kind of evaluation I like: impossible to game.

Does it measure a model’s “intelligence”? No. Football is noisy, and the matches that are guessable (roughly three out of four, since a favorite usually wins) are guessable for everyone. But on the margin, in the tight games and the traps, that’s where the models separate. And that’s fun to watch.

How it’s wired (the build part)

It’s a mini-app, not a product. The goal was to ship it fast and have it run on its own:

Vanilla JS + CSS front (neo-brutalist, of course), zero framework, hosted on GitHub Pages. The client fetches two JSON files — predictions.json and results.json — and computes the leaderboard in the browser.
No database. Predictions and results live in versioned JSON files. Git is the database, and the history doubles as proof of when each prediction was made.
A Flue agent (on Gemini 2.5 Flash) automates the chore: pull the results of played matches, update the JSON, recompute, commit. I touch nothing — the agent does the rounds each matchday.
Dual scoring logic: a Python version (score.py) and a TypeScript version, intentionally identical. The TS one runs inside the agent; the Python one is the control. If the two diverge, there’s a bug — a two-line safety net.
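For the curious, the leaderboard computation and the divergence check could look roughly like the sketch below. Everything about the data shape is an assumption on my part: I'm pretending predictions.json maps each model to its picks per match, and results.json maps each match to its outcome (or null while unplayed). The model names, match IDs and values are made-up placeholders, not real predictions or results.

```python
import json

# Made-up sample data standing in for predictions.json and results.json.
# The schema here is hypothetical; the real files may be organized differently.
PREDICTIONS = json.loads("""
{
  "model_a": {"m1": "home", "m2": "draw", "m3": "away"},
  "model_b": {"m1": "home", "m2": "home", "m3": "away"},
  "model_c": {"m1": "away", "m2": "draw", "m3": "home"}
}
""")
RESULTS = json.loads('{"m1": "home", "m2": "draw", "m3": null}')


def leaderboard(predictions: dict, results: dict) -> list[tuple[str, int]]:
    """Total 1 point per correct outcome, ignoring matches not played yet."""
    totals = {model: 0 for model in predictions}
    for model, picks in predictions.items():
        for match_id, pick in picks.items():
            actual = results.get(match_id)
            if actual is not None and pick == actual:
                totals[model] += 1
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


def assert_same_totals(control: list, agent: list) -> None:
    """Fail loudly if two scoring implementations disagree."""
    if dict(control) != dict(agent):
        raise AssertionError(f"scoring diverged: {control} vs {agent}")


board = leaderboard(PREDICTIONS, RESULTS)
print(board)  # [('model_a', 2), ('model_b', 1), ('model_c', 1)]
```

The second function is the safety net in miniature: feed it the totals from the control implementation and the totals the agent produced, and any mismatch stops the run instead of quietly landing in a commit. How the agent's totals actually get handed over is something I'm leaving out here.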

What I take from it

Honestly? First, that it’s addictive to watch. But mostly that this format — freezing a public prediction and confronting it with reality — is a far better way to judge a model than a number pulled from a marketing PDF.

The live leaderboard plays out here: AI Prono Battle (the code is open). It’s an experiment from my lab: built fast, dropped into the wild, and left to run.