← Back to Card Game Hustler

How the AI Was Built

A plain-English guide to how we taught a computer to play Hearts—from knowing nothing to beating most humans. No PhD required.

The Big Picture

When you play Hearts against the AI on this site, you're playing against a neural network—a small mathematical brain that looks at the current state of the game (your cards, what's been played, the score) and decides which card to play.

But that brain didn't start smart. It started knowing absolutely nothing about Hearts. It's now on its 28th generation—each one trained longer, on harder opponents, with more features and human feedback baked in. Here's the journey:

But first, the elephant in the room: The AI cannot see your cards. It cannot see anyone's cards except its own. It doesn't deal itself good hands. It doesn't peek at the deck. It plays by the exact same rules you do, with the exact same information you'd have sitting in its chair. Everything you're about to read is how it learned to be good without cheating. We promise. Pinky swear. Would we lie to you?

Rule-Based
Expert

→

Watch &
Imitate

→

PPO Mixed-Play
(200K games)

→

Human
Feedback

→

Play in
Your Browser

Step by Step

Write down the rules

Before any machine learning, we wrote a heuristic player—basically a long list of if/then rules that a decent Hearts player would follow. Things like:

If you can dump the Queen of Spades on someone, do it.
If you're void in a suit and someone leads it, throw your highest hearts.
If someone might be shooting the moon, stop them.
Lead low cards from short suits to create voids.

We've actually written two versions of this player. The original wins about 91% of games against random players. Version 2 is a major rewrite with proper card counting, smart void creation during passing, and sophisticated dump-card prioritization. It's significantly stronger—the neural network only beats it about 59% of the time, compared to 70%+ against the original.

Both versions serve as training opponents. The AI plays about 65-75% of its training games against these heuristic players—they're its toughest sparring partners.

Analogy: This is like writing a recipe. It works, but it can't taste the food and adjust the seasoning.

Build a brain (Neural Network)

We created a neural network—think of it as a scoring machine. You feed it a description of the current game state, and it outputs a score for every card in your hand, saying "how good would it be to play this card right now?"

What the brain sees (328 numbers)

Every time it's the AI's turn, we convert the entire game situation into a list of 328 numbers. These include:

Which cards are in my hand (52 yes/no values)
Which cards have already been played this round
What's in the current trick, with detailed per-card features (suit, rank, points)
How many points each player has taken
Whether hearts have been broken
Which suits each opponent appears to be void in (inferred from their plays)
Whether anyone is threatening to "shoot the moon," and how strong their hand looks
Card counting: which suits I dominate, threats remaining per suit, high cards still outstanding
Tactical features: can I duck this trick? How many aces and high hearts do I hold? How many points are on the table right now?

There's also a separate pass network with 72 features that evaluates which cards to pass at the start of each round, considering suit lengths, void creation potential, and dangerous high cards.

Notice what's not in those lists: the other players' hands. The AI has zero idea what cards you're holding. It knows what's been played (public information) and what's in its own hand. That's it. It's guessing about your cards just like you're guessing about its cards. If it just played the perfect counter to your brilliant strategy, that's pattern recognition, not x-ray vision.

How the brain thinks (residual blocks)

Those 328 numbers first pass through an input projection that expands them to 512 dimensions. Then they flow through two residual blocks—each one normalizes the input (LayerNorm), runs it through a linear layer with ReLU activation, and adds the result back to the original. This "skip connection" is the same trick used in the networks behind modern image recognition and language models. It lets the network learn refinements on top of what it already knows, instead of having to reconstruct useful information at every layer.

After the residual blocks, a final output layer produces 52 numbers—one score per possible card. We mask out illegal moves and pick from the remaining options.

The whole thing has about 780,000 parameters (adjustable dials). That's small by modern AI standards—ChatGPT has billions—but it's more than enough for a card game. Earlier generations used plain feed-forward layers with 256 neurons (~135K parameters), but we found that residual blocks with 512 neurons play noticeably better and train more stably.

Analogy: The neural network is like a new hire at a poker table. It can see the cards, but it hasn't played a single hand yet. It needs training.

Learn by watching (Supervised Learning)

The first training phase is called supervised learning, or more specifically, imitation learning. It's exactly what it sounds like:

We let the rule-based expert from Step 1 play thousands of games
We record every decision: "In this situation, the expert played this card"
We show the neural network the same situation and ask "what would you play?"
If it disagrees with the expert, we nudge its parameters to make it more likely to pick the expert's choice next time

After watching about 10,000 games, the neural network plays almost as well as the expert it learned from. This gives us a solid starting point for the real training that follows.

Analogy: This is like a student driver sitting next to an instructor. "When you see a stop sign, hit the brakes." After enough examples, they get the idea.

Learn by doing (PPO Reinforcement Learning)

Supervised learning has a ceiling: the student can only be as good as the teacher. To go beyond the rule-based expert, we use Proximal Policy Optimization (PPO)—a modern reinforcement learning algorithm.

The basic idea: let the AI play a ton of games, and after each game, tell it the score. Good score? Whatever you did, do more of it. Bad score? Do less of it. But the devil is in the details.

Why PPO instead of basic RL?

Simple policy gradients are noisy and unstable. PPO adds several things that make training reliable:

A value function that learns to predict "how good is this position?" This gives a baseline—instead of asking "did I win?", we ask "did I do better or worse than expected?" Much cleaner signal.
Clipped updates that prevent any single batch of games from changing the network too drastically. No wild swings.
GAE (Generalized Advantage Estimation) that figures out which specific decisions in a round contributed to the outcome, not just whether the round went well overall.
Dense per-trick rewards instead of only a score at the end of the round. After every single trick, the AI gets immediate feedback: "you just took 13 points, that was bad" or "clean trick, nice." This makes credit assignment much sharper—combined with a placement-based reward at the end (1st place is great, 4th is terrible), the AI knows exactly which decisions hurt it.
Entropy regulation with KL early stopping that keeps the AI exploring enough different plays without collapsing into a rigid "always play the same card" strategy. If the policy changes too fast in one training step, we stop early and move on.

All-seat training

A key improvement: the AI now collects training data from all four seats at the table, not just one. In every training game, when the AI plays as any of the four players, it learns from that seat's perspective. This gives up to 4x more training data per game and means the AI learns to play well from every position.

Training against tough opponents

We don't have the AI play only against copies of itself. That leads to weird strategies that only work against itself (we learned this the hard way—see Step 5). Instead, we use a mixed opponent pool that shifts as training progresses:

Diverse heuristic players (~50%) — four different styles (balanced, aggressive, defensive, and moon-hunting) that force the AI to handle varied strategies. The strongest training signal.
Self-play (~30-35%) — playing against copies of the current version pushes it beyond its current level
Older neural network generations (~13%) — 28 snapshots from prior training runs, so it doesn't forget how to beat past strategies
Random players (~2-10%) — to stay sharp against unpredictable moves

Each generation of training runs 200,000 games, taking about 2 hours. The learning rate starts high and drops as training progresses—big adjustments early, fine-tuning later.

Pass network reinforcement learning

Passing cards is one of the most strategic moments in Hearts, and historically our biggest weakness. We now also train the pass network with self-imitation learning: when the AI has a good round, the passes it made at the start get reinforced. Over time, it learns to pass dangerous high cards and create useful voids, instead of wasting pass slots on harmless low cards.

Analogy: PPO is like learning to cook with a food critic who scores every dish individually (dense rewards), not just your overall meal. You're cooking against four different expert chefs with different styles (diverse opponents), and you're watching from every seat in the kitchen (all-seat training). And you're also learning which ingredients to buy (pass RL).

The AlphaZero detour (and what we learned)

We also tried AlphaZero-style training—the same family of techniques DeepMind used to master Chess and Go. The key idea is Monte Carlo Tree Search (MCTS): before each move, the AI mentally simulates hundreds of possible futures, guided by the neural network.

What happened

Our first attempt ran 256 iterations of pure self-play with MCTS. The AI actually got worse—dropping from 96% vs random to 81%. It fell into an echo chamber, developing bizarre strategies that only worked against itself.

Version 2 fixed this by starting from our best PPO model, adding heuristic opponents (30% of games), lowering the randomness, and using stronger search (750 simulated games per move). This showed more promise but was much slower to train—MCTS is computationally expensive.

For now, PPO with mixed opponents produces the strongest players faster. MCTS remains our most promising avenue for future improvement—especially at play time rather than training time. Imagine the AI doing a quick 100-simulation lookahead before every move, on top of its trained instincts.

Analogy: MCTS is like a chess player thinking "if I move here, they'll probably move there, and then I can..." The neural network provides the intuition that tells them which moves are worth thinking about. But sometimes too much thinking in a vacuum makes you weird. You need real opponents to stay grounded.

Learn from humans (The Feedback Loop)

This is where it gets interesting. The AI doesn't just learn from self-play—it learns from you.

Disagreement tracking

Every time you play a card, the AI quietly asks itself "what would I have done here?" If your choice differs from the AI's, we save that disagreement—your choice vs. the AI's choice, plus the full game state. We've collected over 700 of these so far.

Expert evaluation

We review every single disagreement by hand. For each one, we reconstruct the game state, reason through the strategic context (what's been played, who's void in what, is anyone threatening to shoot the moon), and decide who was right: the human, the AI, or whether it was a toss-up.

Out of 733 reviewed disagreements: humans were right 67% of the time, the AI was right 11%, and 22% were reasonable either way.

Feeding it back into training

The 490 cases where the human was right become direct training signal. During PPO training, we mix in a small disagreement regularization loss—a gentle nudge saying "in this specific situation, the human's move was better." The AI learns to fix its exact weaknesses without forgetting everything else it knows.

Game log analysis

We also log every trick of every game played on this site—a complete play-by-play. We periodically review these logs to spot patterns of bad play that the disagreement system might miss (because disagreements only capture moments where the human actively chose differently). This has caught things like the AI dumping the Queen of Spades on a player who's clearly shooting the moon, or playing an Ace when it could easily duck under.

Analogy: Disagreements are like film study in sports. You watch the tape, identify the mistakes, and drill specifically on those situations in practice. The AI's "practice" is 200,000 more games, but with a coach whispering "remember, this is one of those situations where you screwed up last time."

Ship it to your browser

All the training happens in Python with PyTorch on a development machine. But you play in a web browser. So we:

Export the weights — dump every number from the trained network into a compact binary file (~3 MB, down from 16 MB in the old JSON format). The binary format (v2) includes the residual block structure—linear layer weights, LayerNorm parameters, and architecture flags in a compact header.
Rebuild the math in TypeScript — matrix multiplication, ReLU activation, layer normalization, residual skip connections, softmax. Just basic arithmetic, no ML framework needed. The browser code auto-detects whether the model uses residual blocks or the older plain architecture.
Compress over the wire — the binary format compresses to about 2.7 MB with standard web compression. On a typical connection, the AI loads in a few seconds.
Run it in your browser — each move takes less than 10 milliseconds to compute. No server needed, no internet lag

The AI runs entirely on your device. Your cards never leave your browser. There's no server call happening where a sneaky algorithm peeks at everyone's hand and sends back the perfect move. The neural network sitting in your browser tab gets the same information you would get if you were playing at a kitchen table: its own 13 cards and whatever has been played face-up.

Analogy: Training is like going to cooking school for years. Exporting is writing down your favorite recipes on an index card. Playing in the browser is cooking from that card—fast and portable.

By the Numbers

328

Play Features

780K

Parameters

Generations

2.7 MB

Download Size

200K

Games per Gen

733

Disagreements Reviewed

<10ms

Per Move

96%

Win Rate vs Random

OK But Is the AI Cheating Though?

No. We know. You just lost three games in a row and the AI dumped the Queen of Spades on you every single time. It feels personal. We get it. But here's the full breakdown of why the AI is playing fair:

It can't see your cards

The AI receives the same 328-number game state described above. That state includes its own hand, the cards played so far, and the current trick. It does not include anyone else's hand. If you don't believe us, the code is the proof: the feature encoder literally doesn't have access to other players' cards. There is no secret "peek" flag.

It can't rig the deal

The deck is shuffled using a standard random shuffle before every round. The AI has absolutely no influence over which cards go where. It gets whatever 13 cards fate (well, Math.random) gives it, same as you. If it got dealt all low cards and you got stuck with the Ace-King-Queen of Spades, that's the universe being rude, not us.

It can't coordinate with the other AI players

Each AI player makes its decisions independently, based only on its own hand and the public game state. They don't share notes. They don't wink at each other. They don't have a group chat called "Ruin the Human's Day." If two AI players happen to gang up on you in the same trick, that's just good (independent) pattern recognition converging on the same obvious play.

So why does it feel like it's cheating?

Because the AI has played millions of games and you haven't. It has seen every possible situation enough times to develop extremely good instincts. When it ducks your Ace at the perfect moment or dumps the Queen on you right when you're vulnerable, it's not because it saw it coming—it's because it's been in that situation a million times before and learned what works. That's not cheating. That's practice. (Annoying, ungodly amounts of practice.)

And honestly? It still makes dumb mistakes sometimes. It'll play an Ace to win a trick full of hearts when it could have ducked. It'll pass you low cards instead of dangerous ones. It's getting better every generation, but it's far from perfect. If anything, the fact that it makes mistakes should reassure you that it's not omniscient.

Still skeptical? Turn on "Teach me" mode in the game. You'll see what the AI would play in your position, using only the information available to you. If it were cheating, its advice wouldn't be useful—because it would be based on information you don't have. But it is useful. Because it's playing fair. We rest our case.

Learning From You (Anonymously)

Every time you play a card, the AI quietly asks itself "what would I have done here?" If your choice is different from what the AI would have picked, we save that disagreement—your choice vs. the AI's choice—so we can study it later and figure out who was right. This happens whether or not "Teach me" mode is on. (Teach me mode just lets you see the AI's recommendation before you play.)

We also log a complete play-by-play of every game—every trick, every card, every score. This lets us study the AI's behavior across entire games, not just individual decisions.

This is how the AI keeps getting better: by learning from the moments where humans and the AI see things differently, and by reviewing its own game tapes.

What we collect

The game state at the moment of disagreement: which cards were in your hand, what's on the table, the score, and what you each picked. Plus a trick-by-trick replay of the full game. That's it. No names, no emails, no accounts. Your identity is a random ID that gets thrown away when you close the tab. We couldn't figure out who you are even if we wanted to.

What we don't collect

Your name, your location, your browser fingerprint, your deepest secrets, or anything else about you as a person. We literally do not have a user account system. You're just a sequence of card plays to us, and a beautiful one at that.

Works offline too

The entire game runs in your browser—no internet required after the initial page load. If you lose connectivity mid-game, nothing changes except the disagreement data quietly doesn't get sent. No error messages, no interruptions, no sad spinners. You'll see a small "offline" indicator in the toolbar, and the game keeps going like nothing happened. Because, for the game, nothing did.

Glossary

Neural Network

A mathematical function with hundreds of thousands of adjustable numbers ("parameters") that transforms an input (game state) into an output (card scores). It learns by adjusting those numbers to improve its performance.

Supervised Learning

Training by showing correct examples. "Here's the situation, here's what an expert did." The network learns to mimic the expert.

Reinforcement Learning (RL)

Training by trial and error. The network plays games, receives a score at the end, and gradually figures out which actions lead to good outcomes.

PPO (Proximal Policy Optimization)

A modern RL algorithm that uses clipped updates and a learned value baseline to make training stable. Think of it as "RL with guardrails"—the AI can't change too much from any single batch of games, preventing wild strategy swings.

Policy

The AI's strategy—a mapping from "what I see" to "what I do." The neural network is the policy.

Value Function

A second output of the neural network that predicts "how good is this position for me?" Used during training to compute advantages (did I do better or worse than expected?) and by MCTS to evaluate game states without playing them out to the end.

GAE (Generalized Advantage Estimation)

A technique that figures out which specific decisions in a round contributed to the outcome. Without it, every decision in a winning round gets equal credit—even the bad ones. GAE gives credit where it's due.

Monte Carlo Tree Search (MCTS)

An algorithm that searches through possible future moves to find the best one. "Monte Carlo" because it uses random sampling to explore possibilities. "Tree" because the branching possibilities form a tree structure.

Determinization

A technique for handling hidden information. Since we can't see opponents' cards, we randomly fill in what they might have (respecting what we know), search as if we could see everything, and average the results across multiple guesses.

AlphaZero

A training approach (pioneered by DeepMind for Chess and Go) that combines neural networks with MCTS. The network guides the search, and the search generates better training data for the network. Both improve together.

Generation

A saved snapshot of the AI at a point in training. Older generations become opponents for the current version, ensuring it keeps improving and doesn't forget how to beat earlier strategies. We're currently on Generation 28.

Residual Connection (Skip Connection)

Instead of passing data straight through each layer, a residual block adds the layer's output back to its input: output = layer(x) + x. This lets the network learn small refinements at each layer rather than having to reconstruct all information from scratch. It's the same idea behind ResNets, Transformers, and most modern deep learning architectures.

Layer Normalization

A technique that normalizes the values flowing through the network to have zero mean and unit variance at each layer. This stabilizes training when reward signals are noisy—which they are in Hearts, where the randomness of the deal means sometimes you lose no matter how well you play.

Dense Rewards

Instead of only scoring at the end of a round ("you got 15 points, bad"), the AI receives a small reward after every single trick. This makes it much easier to figure out which specific decisions were good or bad, rather than spreading blame (or credit) across all 13 tricks.

Entropy Bonus

A small reward for being unpredictable. Without it, the AI might always play the same card in a given situation, making it exploitable. The entropy bonus encourages exploration—it starts high and decreases as training progresses.

Disagreement Regularization

During training, we mix in examples of human-correct disagreements as an additional learning signal. This directly targets the AI's known weaknesses without requiring it to stumble onto those situations randomly during self-play.

The Tech Stack

Game engine: TypeScript — immutable state machine, shared between training and browser
Training: Python + PyTorch — PPO with multiprocessing, ~37 games/second on a Mac Mini
Browser AI: Vanilla TypeScript — hand-written matrix math (linear layers, LayerNorm, residual connections), no ML framework, binary weight loading
Model format: Custom binary v2 — float32 weights with architecture-aware header (residual blocks, LayerNorm params), ~3 MB (2.7 MB compressed)
Database: Turso (libSQL) — stores disagreements and game logs for analysis
Bundler: esbuild — fast builds, tiny output
Hosting: Vercel — static files + serverless API, builds via GitHub Actions