A plain-English guide to how we taught a computer to play Hearts—from knowing nothing to beating most humans. No PhD required.
When you play Hearts against the AI on this site, you're playing against a neural network—a small mathematical brain that looks at the current state of the game (your cards, what's been played, the score) and decides which card to play.
But that brain didn't start smart. It started knowing absolutely nothing about Hearts. It's now on its 28th generation—each one trained longer, on harder opponents, with more features and human feedback baked in. Here's the journey:
Before any machine learning, we wrote a heuristic player—basically a long list of if/then rules that a decent Hearts player would follow: dump the Queen of Spades at the first safe opportunity, lead low cards early, duck under the highest card when you don't want the trick, and so on.
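To make that concrete, here's a toy sketch of what rules like these look like in code. The specific rules and the card representation are illustrative, not the site's actual heuristic player (which, per the description below, also does card counting and pass planning).

```python
# Toy heuristic player. Card = (rank, suit), rank 2..14 (14 = Ace).
# NOT the real rule set—just the flavor of it.

QS = (12, 'S')  # Queen of Spades

def choose_card(legal, trick):
    """Pick a card from `legal` given the cards already on the `trick`."""
    if trick:  # following: try to duck under the highest card of the led suit
        led = trick[0][1]
        high = max(r for r, s in trick if s == led)
        ducks = [c for c in legal if c[1] == led and c[0] < high]
        if ducks:
            return max(ducks)                    # highest card that still loses
        off = [c for c in legal if c[1] != led]
        if QS in off:
            return QS                            # void in led suit: dump the Queen
        if off:
            return max(off, key=lambda c: c[0])  # shed the biggest liability
        return min(legal)                        # forced to win: win cheaply
    return min(legal, key=lambda c: c[0])        # leading: open low, stay safe
```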
We've actually written two versions of this player. The original wins about 91% of games against random players. Version 2 is a major rewrite with proper card counting, smart void creation during passing, and sophisticated dump-card prioritization. It's significantly stronger—the neural network only beats it about 59% of the time, compared to 70%+ against the original.
Both versions serve as training opponents. The AI plays about 65-75% of its training games against these heuristic players—they're its toughest sparring partners.
We created a neural network—think of it as a scoring machine. You feed it a description of the current game state, and it outputs a score for every card in your hand, saying "how good would it be to play this card right now?"
Every time it's the AI's turn, we convert the entire game situation into a list of 328 numbers. These include things like which cards are in the AI's own hand, every card that's been played so far, what's sitting on the current trick, and the running scores.
There's also a separate pass network with 72 features that evaluates which cards to pass at the start of each round, considering suit lengths, void creation potential, and dangerous high cards.
Notice what's not in those lists: the other players' hands. The AI has zero idea what cards you're holding. It knows what's been played (public information) and what's in its own hand. That's it. It's guessing about your cards just like you're guessing about its cards. If it just played the perfect counter to your brilliant strategy, that's pattern recognition, not x-ray vision.
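Here's a simplified sketch of what that encoding step looks like. The real encoder produces 328 numbers; this toy version uses three 52-card one-hot blocks plus a few scalars, so the exact feature list and scaling are illustrative assumptions.

```python
# Simplified game-state encoder: hand + played cards + current trick + scores.
# The real feature vector has 328 entries; this smaller sketch shows the idea.

SUITS = 'CDHS'

def card_index(rank, suit):              # rank 2..14, suit in 'CDHS'
    return SUITS.index(suit) * 13 + (rank - 2)

def encode(hand, played, trick, scores, my_seat):
    v = [0.0] * 52                       # block 1: my own 13 cards
    for c in hand:
        v[card_index(*c)] = 1.0
    block2 = [0.0] * 52                  # block 2: cards played this round
    for c in played:
        block2[card_index(*c)] = 1.0
    block3 = [0.0] * 52                  # block 3: cards on the current trick
    for c in trick:
        block3[card_index(*c)] = 1.0
    v += block2 + block3
    v += [s / 26.0 for s in scores]      # scaled score for each player
    v.append(my_seat / 3.0)              # which seat I'm sitting in
    return v
```

Note what's absent, matching the point above: there is no input for the other players' hands, so the network physically cannot condition on them.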
Those 328 numbers first pass through an input projection that expands them to 512 dimensions. Then they flow through two residual blocks—each one normalizes the input (LayerNorm), runs it through a linear layer with ReLU activation, and adds the result back to the original. This "skip connection" is the same trick used in the networks behind modern image recognition and language models. It lets the network learn refinements on top of what it already knows, instead of having to reconstruct useful information at every layer.
After the residual blocks, a final output layer produces 52 numbers—one score per possible card. We mask out illegal moves and pick from the remaining options.
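The architecture described in the last two paragraphs can be sketched in PyTorch (which the project uses for training). The layer sizes match the text—328 inputs, a 512-dimensional projection, two LayerNorm+Linear+ReLU residual blocks, 52 output scores with illegal moves masked—but everything else here is an illustrative reconstruction, not the project's actual source.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, dim)

    def forward(self, x):
        # LayerNorm -> Linear -> ReLU, then add back to the input (skip connection)
        return x + torch.relu(self.fc(self.norm(x)))

class HeartsNet(nn.Module):
    def __init__(self, n_features=328, hidden=512, n_cards=52):
        super().__init__()
        self.proj = nn.Linear(n_features, hidden)
        self.blocks = nn.Sequential(ResidualBlock(hidden), ResidualBlock(hidden))
        self.out = nn.Linear(hidden, n_cards)

    def forward(self, x, legal_mask):
        scores = self.out(self.blocks(self.proj(x)))
        # mask out illegal moves before choosing a card
        return scores.masked_fill(~legal_mask, float('-inf'))

net = HeartsNet()
state = torch.randn(1, 328)
legal = torch.zeros(1, 52, dtype=torch.bool)
legal[0, [3, 17, 40]] = True          # pretend only three cards are legal
best = net(state, legal).argmax(dim=-1)
```

With these sizes the sketch lands a bit above 720K parameters, the same ballpark as the ~780K figure below (the real network presumably has extra pieces, such as the value head mentioned in the glossary).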
The whole thing has about 780,000 parameters (adjustable dials). That's small by modern AI standards—ChatGPT has billions—but it's more than enough for a card game. Earlier generations used plain feed-forward layers with 256 neurons (~135K parameters), but we found that residual blocks with 512 neurons play noticeably better and train more stably.
The first training phase is called supervised learning, or more specifically, imitation learning. It's exactly what it sounds like: the network watches the rule-based expert play and learns to predict, for each game state, the card the expert chose.
After watching about 10,000 games, the neural network plays almost as well as the expert it learned from. This gives us a solid starting point for the real training that follows.
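The imitation step boils down to ordinary classification: treat the expert's chosen card as the "correct label" and minimize cross-entropy. A minimal sketch, using a single linear layer as a stand-in for the real network and random tensors as stand-ins for logged (game state, expert move) pairs:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(328, 52)                   # stand-in for the real network
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

states = torch.randn(256, 328)               # fake "game states"
expert_moves = torch.randint(0, 52, (256,))  # fake "the expert chose card i"

start_loss = loss_fn(model(states), expert_moves).item()
for _ in range(20):                          # a few passes over the batch
    opt.zero_grad()
    loss = loss_fn(model(states), expert_moves)
    loss.backward()
    opt.step()
final_loss = loss_fn(model(states), expert_moves).item()
```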
Supervised learning has a ceiling: the student can only be as good as the teacher. To go beyond the rule-based expert, we use Proximal Policy Optimization (PPO)—a modern reinforcement learning algorithm.
The basic idea: let the AI play a ton of games, and after each game, tell it the score. Good score? Whatever you did, do more of it. Bad score? Do less of it. But the devil is in the details.
Simple policy gradients are noisy and unstable. PPO adds several guardrails that make training reliable: clipped updates so a single batch can't lurch the policy too far, a learned value baseline, advantage estimation (GAE) for credit assignment, and an entropy bonus that keeps the AI exploring.
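The heart of PPO—the clipped objective—is small enough to show directly. A sketch in numpy, where `ratio` is how much more likely the new policy makes each action than the old one, and `advantages` come from the value baseline; the 0.2 clip range is the standard default, not necessarily the project's setting:

```python
import numpy as np

def ppo_policy_loss(ratio, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: ratio = pi_new(a|s) / pi_old(a|s)."""
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # take the more pessimistic of the two, so no batch can move the policy
    # outside the clip range "for free"—this is the guardrail
    return -np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5, 3.0])
adv = np.array([1.0, 1.0, 1.0, 1.0])
loss = ppo_policy_loss(ratio, adv)   # the 3.0 ratio gets clipped to 1.2
```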
A key improvement: the AI now collects training data from all four seats at the table, not just one. In every training game, when the AI plays as any of the four players, it learns from that seat's perspective. This gives up to 4x more training data per game and means the AI learns to play well from every position.
We don't have the AI play only against copies of itself. That leads to weird strategies that only work against itself (we learned this the hard way—see Step 5). Instead, we use a mixed opponent pool—heuristic players and snapshots of earlier generations alongside the current version—with the mix shifting as training progresses.
Each generation of training runs 200,000 games, taking about 2 hours. The learning rate starts high and drops as training progresses—big adjustments early, fine-tuning later.
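One way to picture the mixed pool is as a weighted draw before each training game. The exact percentages below are made up for illustration—the text only pins heuristic players at roughly 65-75% of games—but the shape of the idea is this:

```python
import random

def sample_opponent(progress, rng=random):
    """progress in [0, 1]: fraction of the training run completed."""
    p_heuristic = 0.75 - 0.10 * progress      # heuristic games taper slightly
    p_past = 0.15 + 0.10 * progress           # older generations ramp up
    r = rng.random()
    if r < p_heuristic:
        return 'heuristic'
    if r < p_heuristic + p_past:
        return 'past_generation'
    return 'current_self'                     # some pure self-play remains
```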
Passing cards is one of the most strategic moments in Hearts, and historically our biggest weakness. We now also train the pass network with self-imitation learning: when the AI has a good round, the passes it made at the start get reinforced. Over time, it learns to pass dangerous high cards and create useful voids, instead of wasting pass slots on harmless low cards.
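The self-imitation part amounts to a filter: keep only the passes from rounds that went well, then reuse them as imitation targets, just like the expert-imitation step earlier. A sketch with an illustrative threshold and made-up data shapes:

```python
# Keep passes from good rounds as (input, target) pairs for imitation.
# The max_points cutoff and data layout are illustrative, not the real pipeline.

def good_pass_examples(rounds, max_points=3):
    """rounds: list of (pass_features, passed_cards, points_taken_that_round)."""
    return [(feats, cards)
            for feats, cards, pts in rounds
            if pts <= max_points]

rounds = [([0.1], ['QS', 'AH', 'KH'], 0),    # great round: keep these passes
          ([0.2], ['2C', '3D', '4C'], 19),   # bad round: discard
          ([0.3], ['AS', 'KS', 'QH'], 2)]    # good round: keep
examples = good_pass_examples(rounds)
```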
We also tried AlphaZero-style training—the same family of techniques DeepMind used to master Chess and Go. The key idea is Monte Carlo Tree Search (MCTS): before each move, the AI mentally simulates hundreds of possible futures, guided by the neural network.
Our first attempt ran 256 iterations of pure self-play with MCTS. The AI actually got worse—dropping from 96% vs random to 81%. It fell into an echo chamber, developing bizarre strategies that only worked against itself.
Version 2 fixed this by starting from our best PPO model, adding heuristic opponents (30% of games), lowering the randomness, and using stronger search (750 simulated games per move). This showed more promise but was much slower to train—MCTS is computationally expensive.
For now, PPO with mixed opponents produces the strongest players faster. MCTS remains our most promising avenue for future improvement—especially at play time rather than training time. Imagine the AI doing a quick 100-simulation lookahead before every move, on top of its trained instincts.
This is where it gets interesting. The AI doesn't just learn from self-play—it learns from you.
Every time you play a card, the AI quietly asks itself "what would I have done here?" If your choice differs from the AI's, we save that disagreement—your choice vs. the AI's choice, plus the full game state. We've collected over 700 of these so far.
We review every single disagreement by hand. For each one, we reconstruct the game state, reason through the strategic context (what's been played, who's void in what, is anyone threatening to shoot the moon), and decide who was right: the human, the AI, or whether it was a toss-up.
Out of 733 reviewed disagreements: humans were right 67% of the time, the AI was right 11%, and 22% were reasonable either way.
The 490 cases where the human was right become direct training signal. During PPO training, we mix in a small disagreement regularization loss—a gentle nudge saying "in this specific situation, the human's move was better." The AI learns to fix its exact weaknesses without forgetting everything else it knows.
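Mechanically, that "gentle nudge" can be sketched as a small extra cross-entropy term pulling the policy toward the human's move in those reviewed situations, added on top of the usual PPO objective. The 0.1 weight here is an illustrative assumption:

```python
import torch
import torch.nn.functional as F

def disagreement_loss(policy_logits, human_moves, weight=0.1):
    """policy_logits: (batch, 52); human_moves: the card the human (correctly) chose."""
    return weight * F.cross_entropy(policy_logits, human_moves)

logits = torch.zeros(4, 52, requires_grad=True)  # stand-in policy outputs
human = torch.tensor([12, 0, 51, 7])             # stand-in human choices
extra = disagreement_loss(logits, human)
# total_loss = ppo_loss + extra   # mixed into the usual PPO objective
```

Because the weight is small, the term corrects specific known weaknesses without overwhelming what the policy learned from millions of self-play games.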
We also log every trick of every game played on this site—a complete play-by-play. We periodically review these logs to spot patterns of bad play that the disagreement system might miss (because disagreements only capture moments where the human actively chose differently). This has caught things like the AI dumping the Queen of Spades on a player who's clearly shooting the moon, or playing an Ace when it could easily duck under.
All the training happens in Python with PyTorch on a development machine. But you play in a web browser. So we export the trained network's weights into a form the browser can load and run locally.
The AI runs entirely on your device. Your cards never leave your browser. There's no server call happening where a sneaky algorithm peeks at everyone's hand and sends back the perfect move. The neural network sitting in your browser tab gets the same information you would get if you were playing at a kitchen table: its own 13 cards and whatever has been played face-up.
No. We know. You just lost three games in a row and the AI dumped the Queen of Spades on you every single time. It feels personal. We get it. But here's the full breakdown of why the AI is playing fair:
The AI receives the same 328-number game state described above. That state includes its own hand, the cards played so far, and the current trick. It does not include anyone else's hand. If you don't believe us, the code is the proof: the feature encoder literally doesn't have access to other players' cards. There is no secret "peek" flag.
The deck is shuffled using a standard random shuffle before every round. The AI has absolutely no influence over which cards go where. It gets whatever 13 cards fate (well, Math.random) gives it, same as you. If it got dealt all low cards and you got stuck with the Ace-King-Queen of Spades, that's the universe being rude, not us.
Each AI player makes its decisions independently, based only on its own hand and the public game state. They don't share notes. They don't wink at each other. They don't have a group chat called "Ruin the Human's Day." If two AI players happen to gang up on you in the same trick, that's just good (independent) pattern recognition converging on the same obvious play.
Because the AI has played millions of games and you haven't. It has seen every possible situation enough times to develop extremely good instincts. When it ducks your Ace at the perfect moment or dumps the Queen on you right when you're vulnerable, it's not because it saw it coming—it's because it's been in that situation a million times before and learned what works. That's not cheating. That's practice. (Annoying, ungodly amounts of practice.)
And honestly? It still makes dumb mistakes sometimes. It'll play an Ace to win a trick full of hearts when it could have ducked. It'll pass you low cards instead of dangerous ones. It's getting better every generation, but it's far from perfect. If anything, the fact that it makes mistakes should reassure you that it's not omniscient.
Every time you play a card, the AI quietly asks itself "what would I have done here?" If your choice is different from what the AI would have picked, we save that disagreement—your choice vs. the AI's choice—so we can study it later and figure out who was right. This happens whether or not "Teach me" mode is on. (Teach me mode just lets you see the AI's recommendation before you play.)
We also log a complete play-by-play of every game—every trick, every card, every score. This lets us study the AI's behavior across entire games, not just individual decisions.
This is how the AI keeps getting better: by learning from the moments where humans and the AI see things differently, and by reviewing its own game tapes.
The game state at the moment of disagreement: which cards were in your hand, what's on the table, the score, and what you each picked. Plus a trick-by-trick replay of the full game. That's it. No names, no emails, no accounts. Your identity is a random ID that gets thrown away when you close the tab. We couldn't figure out who you are even if we wanted to.
Your name, your location, your browser fingerprint, your deepest secrets, or anything else about you as a person. We literally do not have a user account system. You're just a sequence of card plays to us, and a beautiful one at that.
The entire game runs in your browser—no internet required after the initial page load. If you lose connectivity mid-game, nothing changes except the disagreement data quietly doesn't get sent. No error messages, no interruptions, no sad spinners. You'll see a small "offline" indicator in the toolbar, and the game keeps going like nothing happened. Because, for the game, nothing did.
A mathematical function with hundreds of thousands of adjustable numbers ("parameters") that transforms an input (game state) into an output (card scores). It learns by adjusting those numbers to improve its performance.
Training by showing correct examples. "Here's the situation, here's what an expert did." The network learns to mimic the expert.
Training by trial and error. The network plays games, receives a score at the end, and gradually figures out which actions lead to good outcomes.
A modern RL algorithm that uses clipped updates and a learned value baseline to make training stable. Think of it as "RL with guardrails"—the AI can't change too much from any single batch of games, preventing wild strategy swings.
The AI's strategy—a mapping from "what I see" to "what I do." The neural network is the policy.
A second output of the neural network that predicts "how good is this position for me?" Used during training to compute advantages (did I do better or worse than expected?) and by MCTS to evaluate game states without playing them out to the end.
A technique that figures out which specific decisions in a round contributed to the outcome. Without it, every decision in a winning round gets equal credit—even the bad ones. GAE gives credit where it's due.
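The computation itself is a short backward pass over the round. A sketch, where `rewards` are the per-trick rewards, `values` are the value head's predictions, and gamma/lam are typical GAE settings rather than the project's:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one round's decisions."""
    advantages = [0.0] * len(rewards)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        acc = delta + gamma * lam * acc                       # decayed running sum
        advantages[t] = acc
    return advantages

# trick 2 cost a point; GAE pushes most of the blame onto the decisions near it
adv = gae(rewards=[0.0, -1.0, 0.0], values=[0.1, 0.2, 0.0])
```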
An algorithm that searches through possible future moves to find the best one. "Monte Carlo" because it uses random sampling to explore possibilities. "Tree" because the branching possibilities form a tree structure.
A technique for handling hidden information. Since we can't see opponents' cards, we randomly fill in what they might have (respecting what we know), search as if we could see everything, and average the results across multiple guesses.
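A toy sketch of one such guess: deal the unseen cards randomly to the opponents while respecting known voids, by rejection sampling (shuffle, check legality, retry). The real system would repeat this for many guesses and average the search results; note that this naive version loops forever if the constraints are infeasible.

```python
import random

def determinize(unseen, hand_sizes, voids, rng=random):
    """unseen: cards not in our hand and not yet played.
    hand_sizes: how many cards each opponent still holds.
    voids: for each opponent, the set of suits we know they can't have."""
    cards = unseen[:]
    while True:                               # rejection-sample until legal
        rng.shuffle(cards)
        hands, i, ok = [], 0, True
        for size, void in zip(hand_sizes, voids):
            hand = cards[i:i + size]
            i += size
            if any(suit in void for _, suit in hand):
                ok = False                    # gave a void suit: reshuffle
                break
            hands.append(hand)
        if ok:
            return hands
```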
A training approach (pioneered by DeepMind for Chess and Go) that combines neural networks with MCTS. The network guides the search, and the search generates better training data for the network. Both improve together.
A saved snapshot of the AI at a point in training. Older generations become opponents for the current version, ensuring it keeps improving and doesn't forget how to beat earlier strategies. We're currently on Generation 28.
Instead of passing data straight through each layer, a residual block adds the layer's output back to its input: output = layer(x) + x. This lets the network learn small refinements at each layer rather than having to reconstruct all information from scratch. It's the same idea behind ResNets, Transformers, and most modern deep learning architectures.
A technique that normalizes the values flowing through the network to have zero mean and unit variance at each layer. This stabilizes training when reward signals are noisy—which they are in Hearts, where the randomness of the deal means sometimes you lose no matter how well you play.
Instead of only scoring at the end of a round ("you got 15 points, bad"), the AI receives a small reward after every single trick. This makes it much easier to figure out which specific decisions were good or bad, rather than spreading blame (or credit) across all 13 tricks.
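A minimal sketch of what a per-trick reward might look like: zero unless the AI took the trick, otherwise a penalty proportional to the points in it. Scaling by 26 (the maximum points in a round) is an illustrative choice, not necessarily the project's.

```python
def trick_reward(took_trick, cards_in_trick):
    """Small per-trick reward: 0 if we ducked, negative if we ate points."""
    if not took_trick:
        return 0.0
    points = sum(1 for r, s in cards_in_trick if s == 'H')           # each heart
    points += 13 * sum(1 for r, s in cards_in_trick if (r, s) == (12, 'S'))
    return -points / 26.0

# eating the Queen of Spades plus two hearts: 15 of the round's 26 points
r = trick_reward(True, [(12, 'S'), (5, 'H'), (2, 'C'), (9, 'H')])
```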
A small reward for being unpredictable. Without it, the AI might always play the same card in a given situation, making it exploitable. The entropy bonus encourages exploration—it starts high and decreases as training progresses.
During training, we mix in examples of human-correct disagreements as an additional learning signal. This directly targets the AI's known weaknesses without requiring it to stumble onto those situations randomly during self-play.