HANABI

```
                .
   .  .  .'.         \ /
  \ /    .'. .' '.' '   -= o =-
-= o =-     .' '           / \
  / \         '
              '
```
In this post I will go through how I implemented a cooperative multi-agent environment using Prime Intellect’s stack as part of their RL Residency. My objective is two-fold:
- To show how multi-agent environments can already be designed using the verifiers library and how training can be done on such environments using both prime-rl and hosted training.
- To propose and discuss abstractions that could be included into verifiers to allow for more ergonomic multi-agent designs in the future.
To do this, I’ve chosen to work with Hanabi, a simple cooperative card game, which is already a well-known environment in multi-agent RL (MARL) research.
Motivation
Many tasks involve some degree of exchange with other agents (human or AI). In traditional (non-LLM-based) RL, multi-agent setups are common and fairly well studied. More broadly, multi-agency connects to fundamental research on the nature of intelligence itself — programs like Blaise Agüera y Arcas’ What is Intelligence? initiative frame social interaction, communication, and coordination as central to understanding cognition, not peripheral to it. However, in public/open-source LLM-based RL, multi-agent setups are arguably still very under-explored. Having said that, the field has recently shown much more interest in the topic: examples range from Anthropic’s multi-agent research system last year and this year’s Claude Code teams to Kimi’s Agent Swarms, and even things like Moltbook and, perhaps less obviously, multi-agent setups like RLMs or speculative forking. While most of these efforts focus on multi-agent orchestration at inference time, the question of how to train in multi-agent settings is still quite nascent, at least out in the open. In the spirit of Prime Intellect’s inspiring work, I hope this project contributes to the open, shared research around these topics, especially the latter.
The game
I first heard about Hanabi through the work of Jakob Foerster. I believe it was either through The Hanabi Challenge or the BAD for Deep MARL papers. A friend of mine, sci_sic, bought the game shortly after and encouraged me to do the same. It’s a very fun game! And ever since then it’s actually been my go-to whenever we have friends or family over at home.
I searched through my photos for a Hanabi game-night. Here’s a cool one I found of my brother and me playing a few years ago:
In short:
Hanabi is a cooperative card game for 2-5 players where everyone holds their cards facing outward—you can see everyone's cards except your own. Players work together to build five firework piles (one per color) by playing cards in ascending order from 1 to 5. Communication is strictly limited: you can only give clues about a single color or number in another player's hand, and the team shares a limited pool of clue tokens. A clue token is regained when a 5 is successfully played, and discarding cards also restores tokens but risks losing critical cards. The game ends when all fireworks are complete (perfect score of 25), the deck runs out, or the team makes three mistakes.
So Hanabi is simultaneously like team solitaire + blind-man’s bluff insofar as it has an inherent objective of arranging the cards in a specific order (i.e. the fireworks) but in a collaborative way, with the main constraint being the fact that each player has imperfect information about the game state. I’ll expand on the things I find particularly interesting.
Emergent communication
From my experience, the richness of the game comes from the emergent communication conventions that arise naturally from the interaction between players. The clearest example of this might be when, during play, a participant wants to communicate something that is, so to speak, under-determined. For example, assume the fireworks show:
And you see the other player has the following hand:
Imagine this player knows nothing else about their hand from previous interactions. Evidently, it would be good to communicate to them that they have a B1, which is playable. However, hints can only convey information about numbers or colors. Would it be best to tell them that the card is a 1, hoping they take it as a hint that it’s a playable 1? Or is it best to tell them it’s blue, hoping they interpret the hint as saying it’s a playable blue? Both are valid paths, and the success or failure of communication along either could set a standard for subsequent interactions.
The philosophical tenet behind this runs deep and beyond the scope of this post, but it seems to me this is the sort of environment that allows us to explore how communication rules are emergent phenomena arising from social interaction. Wittgenstein argued that meaning cannot exist in isolation but only through shared social use; Hanabi makes this concrete by forcing players to build shared interpretive conventions from scratch.
Theory of mind
Another important aspect that Hanabi emphasizes is theory of mind (ToM), which refers to “the capacity to understand other individuals by ascribing mental states to them”. In Hanabi, ToM is not just useful, it is structurally necessary in the sense that good performance follows from each player’s ability to infer what the others are trying to communicate via partial hints, and also from knowing what and how to communicate to others in order to maximize the probability of them playing well.
It is particularly noticeable how, near the end of a game, players can start counting remaining cards more easily in order to infer their hand. Communication follows not only from what players can infer directly from counting, but also from trying to count from the other players’ perspectives. An example:
Imagine it’s near the end of the game, and fireworks show:
The other player has the following hand:
There are still some cards in the deck, and the rest are in the discard pile. You can see that all blue cards are in the discard pile, except for that B4 the other player has, plus one B5 and one B4 which, from your perspective, could be either in the deck or in your hand.
You have B4, R4, Y2, R1, Y3, which you can’t see of course.
It’s the other player’s turn, but they don’t know anything about their cards just yet. They can count and infer which cards they have, but they don’t know which is which. They proceed to give you a hint and tell you which card is the blue one. You reason that although your blue card could be either a 4 or a 5, the fact that they hinted the color may indicate that it’s a playable card (perhaps reinforced by previous successful communications). You could play the card now, though an alternative could be to actually discard that card, with the hope that when you do so, the other player will deduce that you did it because you could see the other B4 in their hand.
This example shows that situations requiring nested theory of mind arise naturally in Hanabi. Specifically, it illustrates third-order ToM: (1) you model the other player’s reasoning about their own hand; (2) you model how they model your likely actions given what you can see; and (3) you choose an action (discarding) precisely because you can predict how they will interpret it — reasoning about their reasoning about your reasoning.
Cooperation
One last aspect about Hanabi is the fact that it’s a fully-cooperative / common-payoff game. That is, there’s no player-specific attribution of scores – it’s about getting the highest score together. This matters theoretically because most games are adversarial and cooperative settings are arguably under-explored, and practically, because the shared reward makes it a lot easier to implement as a training environment, given we don’t care about credit assignment across agents.
Children's Games, Pieter Bruegel the Elder (1560), depicted on the cover of Shoham and Leyton-Brown's book on multi-agent systems
The cooperative aspect of Hanabi has another interesting technical implication: learning to cooperate in tandem is different from learning to cooperate irrespective of who you’re playing with. Playing with the same group of friends makes it easier to always exploit the same conventions, but a good player may also learn to adapt to different group signals, or to more or less experienced players. Self-play allows for the former, and it’s the main focus of this work, but cross-play is an interesting direction for further work.
Experiments
There are nooks and crannies to all the experiments I did, but I will try to focus on the most salient things from here on out. One important thing to note, though, is that most of the experiments were in 2-player settings, mostly for the practical reason that adding more players increases the number of trajectories generated per sample, making it harder to inject sample diversity into the training steps, without a clear benefit for the objectives of this work.
Preliminary results
Before outlining the details of the environment, I would like to show some of the preliminary experiments I ran with Hanabi, which illustrate the baseline behavior of models in this environment.
Baseline
One of the very first things I did after constructing a basic version of the environment was to benchmark some models. To my surprise, models like gpt-4.1-mini already did fairly well in the game, getting mean scores of ~3.46, though with quite high variance in the score distribution: some rollouts reached as high as 18 points, which, by the standards of the game, is considered to be within the “excellent, crowd pleasing” range:
Specifically, the traces showed sound reasoning steps. An example from a rollout:
The game just started. All fireworks are at 0, no cards have been played, and I have no knowledge about my own hand (all unknowns "??"). Player 1 has some cards visible: Y5, R3, B1, Y1, G2.
Since my hand is totally unknown, a good first move is to give a hint to player 1 to help them identify playable cards or card value. The earliest playable cards are all 1s. Player 1 has Y1 at position 3. Revealing ranks ("1") or colors ("Y") will help them know that the Y1 is playable now.
I'll give a hint about rank '1' to player 1. This hint will highlight Y1 and B1 (they have B1 as well). Among those, only Y1 is currently playable (the 1's because fireworks are at 0). Giving a rank hint '1' lets player 1 know they have some 1s to consider playing.
Using 1 info token (we have max 8), but that’s fine.
So: action is "1H1" (hint to player 1 about which cards are '1').
Of course, the thought-process here was not perfect. For instance, both Y1 and B1 here were playable, not just Y1, but it did, in the end, reach the most obvious action to start the game with.
RL
This was enough to start with. A first effort was directed towards finding out how good a Qwen3-0.6B model could get with no previous SFT-warmup at all. Interestingly, RL pushed it at least towards non-zero scores:
The model started with overly extensive and unsound reasoning, failing to output a valid action:
Okay, let's see. I need to figure out the next move to build the fireworks. The game is about playing cards from 1 to 5, so I need to play each card one after another, maybe with some hints.
First, looking at the current state. The player has hands hand <player_0_hand>, which is unknown, but there are 5 positions. So I need to figure out which set of cards I have. The player_0_hand's info is not fully given, but maybe looking at the cards in player_1's hand. Player_1 has G1, Y5, Y3, Y1, G4. Wait, but the firework has R0, Y0, G0, W0, B0. So each color has a firework start.
The first action I should take is to play a card. The first position is 0, which is colored R. But the player knows that the firework starts here, so I need to check if I have that color. However, since I don't know my own cards, but the player_0_hand is unknown. Wait, maybe I can deduce which color is at position 0. The player's own hand is unknown, but the firework starts with those colors. So if I play R0, that's correct.
But I have to think about possible cards. The order of play is from 1 to 5, but maybe the first step is possible. Also, if I play the first card, it might affect the game. Let me try to play R0. Wait, but the player's hand is unknown. Alternatively, maybe I need to choose based on the available info.
Wait, the player_0_hand's format is ?? ?? ?? ?? ??, which are unknown. But perhaps in this game, we have to deduce based on the other player's information. Player_1's hand has G1, Y5, Y3, Y1, G4. Those positions are probably represented as known colors. So if I play R0, that's correct.
Alternatively, maybe I can check if there's a card that's not among the firework colors yet. The current firework has R0, Y0, etc. So the next action would be to play R0. But how? Because the hand is unknown. Maybe I need to look at player_1's hand again. Wait, the firework starts at their
And by the end of a short training run, the model managed to reach an average of ~1.25 points. However, these scores were attained with suboptimal behavior, like the following:
Okay, so the user provided a game state and I need to figure out what the next action is. Let me start by understanding the current state.
The game state shows that the user's hand is unknown, but the player_0_hand is "?? ?? ?? ?? ??". The player 1's hand is "Y5 R2 B1 G1 B4". The firework colors are R0, Y0, G0, W0, B0. The score is 0, and the deck has 40 cards. The life tokens are 3, which means we have three lives left. The info tokens are 8, so there are 8 hints available.
The user (player 0) hasn't played any cards yet. The next step is to play a card. The available actions are P+position, D+position, or player + H + color/rank. Since the next action is the first action, which is playing a card. The position is 0 because it's the first card in the hand. The card isn't specified, so the action would be <action>P0</action>. But I need to make sure that the position is correct. The user input shows that the first card is index 0. So the answer is to play position 0
These results show that the model learnt to climb up to a reward of at least 1 on average by simply learning to always “play” a card more or less at random. This was obviously not a strategy that would get very far.
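As a rough sanity check on that number, here is a small Monte Carlo sketch (my own illustration, not part of the environment) estimating what uniformly random play scores, with a slightly simplified endgame:

```python
import random

COLORS = "RYGWB"
RANK_COUNTS = {1: 3, 2: 2, 3: 2, 4: 2, 5: 1}  # standard Hanabi deck composition

def random_play_score(num_players=2, hand_size=5):
    deck = [(c, r) for c in COLORS for r, n in RANK_COUNTS.items() for _ in range(n)]
    random.shuffle(deck)
    hands = [[deck.pop() for _ in range(hand_size)] for _ in range(num_players)]
    fireworks = {c: 0 for c in COLORS}
    lives, turn = 3, 0
    while lives > 0 and hands[turn % num_players]:
        hand = hands[turn % num_players]
        color, rank = hand.pop(random.randrange(len(hand)))
        if fireworks[color] == rank - 1:
            fireworks[color] = rank  # successful play extends the firework
        else:
            lives -= 1               # misplay burns a life
        if deck:
            hand.append(deck.pop())
        turn += 1
    return sum(fireworks.values())

scores = [random_play_score() for _ in range(10_000)]
print(sum(scores) / len(scores))  # tends to land around a single point
```

A mean in that neighborhood is consistent with the ~1.25 the run converged to, reinforcing that the policy had not learnt much beyond “always play”.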
SFT
To induce a more adequate behavior prior to the RL run, I created a synthetic dataset, using the gpt-4.1-mini rollouts. This dataset was created with a setting of 50 examples and 2 rollouts per example in 2-player scenarios, resulting in 200 (50*2*2) training traces. This time I tried a slightly bigger Qwen3-1.7B model.
The results showed a noticeable improvement over the previous run, probably due to both the change in the base model and the SFT. More importantly, the reasoning and game-strategy were qualitatively better. For example, this is taken from a rollout at the beginning of the training run which already obtained a score of 2:
At the start of the game, no information tokens have been spent, and our hand is completely unknown (all ??). The other player's hand is visible: W4 W3 G2 B1 B1. None of their cards are playable immediately, as none are 1s. We can't give a hint on color or rank yet, so maybe playing a card that we know is playable now is the best option.
Looking at our own hand, since we have full information tokens (8), we should give a hint to help Player 1 identify cards they can play before the deck runs out. The best hint to give might be rank 1, as all the 1s in Player 1's hand are very valuable and need to be played early.
So our hint could be Player 1 H1, telling them which cards are 1s. This will help them quickly identify and play their 1s earlier in the game, which is beneficial as the deck is full and the first cards to play are those with rank 1.
No immediate risk here as playing a card with rank 1 is the best starting move once the deck is full, and giving a hint will help the other player, which is smart play.
Although it’s clear that the reasoning is not impeccable, it already showed a better game-strategy by opting for an initial hint.
By step 50, we have rollouts with scores as high as 5 and using similar reasoning/strategies at the first turns:
We are at the very start of the game with full knowledge of the other player's hand (player 1) and all info tokens (8). My own cards are completely unknown: "?? ?? ?? ?? ??".
The firework piles are all at zero, so the next cards to play will be the 1s of each color. On player 1's hand, we see Y1, Y5, B4, Y1, W1. There are 3 yellow cards with 1, which means these are the immediate playable cards. Since info tokens are maxed, it is best to give a hint to Player 1 to help them identify their 1s.
Player 1 has Y1, Y1, W1, and Y5, B4. They have two distinct 1s with unknown positions (positions 0, 1, 2 might be part of the 1s, but not sure yet). So, the best initial hint is about the 1s, as they are the next cards to play.
Player 1 has no 1s at 3 or higher with a known rank yet. Since 1s are the only immediate playable cards, it makes sense to give a 1 hint to help them focus on their 1s. This means to Player 1 telling which of their cards are 1s.
We can give a hint like "H1" to Player 1, indicating which cards are 1s. Since info is maxed, no more hint tokens needed.
Therefore, our best initial action is to give Player 1 a hint about 1s to help them play their 1s early.
Standard Hanabi
With the promise shown by the preliminary results, a first effort to scale up the training runs was made with the following considerations:
- Full Hanabi games can have 30+ turns. With thinking traces, the sequence length of trajectories can get quite large, at least for small-scale experiments. This was particularly costly during training because the setup had to use branching rollouts, which is less efficient than the default interleaving strategy. To make longer runs more feasible, the thinking traces were dropped and only actions were required.
- The preliminary experiments used XML tags to extract the chosen action from the model’s response. However, models like Qwen3 0.6B and 1.7B are non-instruct models and often fail to output a well-formatted response. Instead, Qwen3-4B-Instruct-2507 was used with tool-calling (a sketch of what this can look like follows this list), which allowed me to bypass format reward signals during training.
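For illustration, the play action exposed via tool-calling could look like the OpenAI-style schema below. This is a hedged sketch: the tool names and fields are my own guesses, not necessarily the environment’s actual definitions.

```python
# Hypothetical sketch of one of the action tools exposed to the model.
PLAY_CARD_TOOL = {
    "type": "function",
    "function": {
        "name": "play_card",
        "description": "Play the card at the given position in your hand.",
        "parameters": {
            "type": "object",
            "properties": {
                "position": {"type": "integer", "minimum": 0, "maximum": 4},
            },
            "required": ["position"],
        },
    },
}
# Analogous tools would cover discarding and hinting a color/rank to a player.
```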
Environment
The environment can be found here, in Prime Intellect’s Environment Hub. It has the following structure:
```
hanabi/
├── config.py    # GameConfig dataclass with game constants
├── prompt.py    # System prompt template
├── utils.py     # Card utilities and game state helpers
├── player.py    # Player class with action methods and API calls
└── hanabi.py    # HanabiEnv environment, observation generation, and reward
```
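To give a flavor of config.py, here is a sketch of what GameConfig could hold. The field names are my own guesses; the constants themselves are the standard Hanabi rules described earlier.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GameConfig:
    # Hypothetical field names; values are the standard Hanabi constants.
    num_colors: int = 5       # R, Y, G, W, B
    num_ranks: int = 5        # cards run 1-5 per color
    hand_size: int = 5
    max_info_tokens: int = 8  # hint budget shared by the team
    max_life_tokens: int = 3  # three mistakes end the game
```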
I won’t go into detail in all of it, but I’d like to point at least a few things:
The Player Class
The Player class represents a player in the game. Its main method is take_turn, which contains the logic to generate the player’s action and execute it. The key logic is contained in the following lines:
```python
# Get response using the verifiers' method (handles logprobs, token extraction)
response = await self.env.get_model_response(
    state=state,
    prompt=player_messages,
    oai_tools=self.env.oai_tools,
)
```
and:
```python
# Record trajectory with proper token data for training
await self.env.add_model_response(state, prompt_messages, response)
```
These are responsible for getting a response from the model and recording it as a trajectory, which will be used for training.
The rest of the logic in the class is in charge of actually executing the action returned by the model (i.e. playing a card, discarding, or providing a hint to another player).
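A method-level sketch of what that dispatch can look like, based on the action grammar visible in the traces (“P0”, “D3”, “1H1”); the helper methods are hypothetical:

```python
def execute_action(self, action: str, state: dict) -> None:
    # _play/_discard/_hint are hypothetical helpers; the grammar is from the traces.
    if action[0] == "P":                        # "P0": play card at position 0
        self._play(int(action[1:]), state)
    elif action[0] == "D":                      # "D3": discard card at position 3
        self._discard(int(action[1:]), state)
    else:                                       # "1H1": hint rank/color '1' to player 1
        target, hint = int(action[0]), action[2:]
        self._hint(target, hint, state)
```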
The HanabiEnv Class
The actual environment is defined in the HanabiEnv class. It extends verifiers’ StatefulToolEnv and contains the logic for Hanabi at the environment level.
The most important method here is env_response. In this method, the environment iterates over each player and calls the take_turn method mentioned above.
One thing to note is that, by default, the verifiers library already handles the logic to get and add the model’s response for the default model. In this implementation, that model plays the role of the first player. We then iterate over the rest of the players ourselves, since verifiers does not yet have native logic to handle multi-agent orchestration. What this entails is that the code ends up being slightly awkward, with a blurry line separating the environment from the agent.
Finally, since Hanabi is an environment where players have asymmetric information, the class has a get_observation method, which generates state observations from each player’s perspective. That is, it shows each player the hands of all the other players, but not their own.
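A minimal sketch of what that masking amounts to, assuming hands and tokens live in the env state under these (hypothetical) keys; the “??” placeholders match the traces shown earlier:

```python
def get_observation(self, state: dict, player_id: int) -> str:
    # Own hand is masked with "??"; every other hand is fully visible.
    lines = []
    for pid, hand in enumerate(state["hands"]):
        if pid == player_id:
            shown = " ".join("??" for _ in hand)
        else:
            shown = " ".join(f"{color}{rank}" for color, rank in hand)
        lines.append(f"player_{pid}_hand: {shown}")
    lines.append(f"fireworks: {state['fireworks']}")
    lines.append(f"info_tokens: {state['info_tokens']} | lives: {state['lives']}")
    return "\n".join(lines)
```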
Baseline
With this implementation, I went for a much more extensive evaluation of proprietary models on the environment. The following table shows the top-5 models from the leaderboard, constructed with Prime Intellect’s super useful eval tool using the default of 5 examples x 3 rollouts:
Other models I evaluated included:
- prime-intellect/intellect-3 at 6.733
- google/gemini-3-flash-preview at 5.667
- openai/gpt-4.1-mini at 3.467 (which I had already mentioned above)
Importantly, the base score from Qwen/Qwen3-4B-Instruct-2507 was a mere 1.067. But now onto the good bits…
Synthetic data and SFT warm-up
Using the information from above, I generated a synthetic dataset of rollouts using Grok4-fast, which had the best cost-to-performance ratio of the models I tried. The dataset can be found here – it contains quite diverse rollouts, ranging from scores in the lower 2-5 range to some examples with scores as high as 21 (amazing, by Hanabi’s own standards). It contains rollouts from both players’ perspectives.
Although the Qwen3-4B-Instruct model already did fairly well in aligning with the required tool-calling format, I decided to run a short SFT warm-up to expose the model to examples and have a better baseline from which to jump up.
After the SFT run, the model achieved a score of ~2.9, enough to start RL runs on top of that.
RL
With the SFT’d baseline, I ran an RL experiment split in two. The first run, for 100 steps, reached a mean score of ~5.1, already above GPT-4.1-mini.
The second run added 120 more steps, with which the model reached a mean score of ~8.4, above models like Gemini3-Flash and Prime Intellect’s Intellect-3.
After the runs, the leaderboard looked like this:
The final leaderboard can be seen in the environment’s leaderboard tab.
Overall, the trained models showed fairly good performance for relatively short runs, but one main challenge surfaced in the experiments: exploding context length. At the start of the first training run the sequence length was roughly 5k, whereas by the end of the second run it was reaching levels around 13k. Similarly with the number of turns per rollout: an increase from a mean of ~30 turns at the start to more than double, at ~70 turns, by the end of training.
This challenge was all the more pressing considering that, as I mentioned above, the runs used the less efficient strategy of branching trajectories, and that, naturally, having 2 players generating such trajectories meant that the effective sample size seen per training step was a lot smaller than in recommended configs, leading to a noisier reward signal during training. It was clear that scaling up the RL runs would lead to higher performance, but at the cost of the shorter, snappier experiments in smaller, simpler setups that this work favors.
Before that, however, another set of experiments was made with Prime Intellect’s new Hosted Training Platform.
Hosted RL
Prime Intellect’s new hosted training platform facilitates training setups by abstracting away most of the infra work you need. It manages multi-tenant LoRA deployments for inference and training, which allows much more efficient and cheaper training runs by leveraging hardware shared across runs.
Following the same logic as the training runs above but with much more leeway for larger batch-sizes and longer runs, a Qwen/Qwen3-235B-A22B-Instruct-2507 model reached a mean score of nearly 10 points with no signs of stopping at ~350 steps.
Although this is not much more than the mean score reached in the previous run, it already places the model at the level of Gemini 3 Pro on the leaderboard, with a much more stable run and a much lower budget.
Leveraging the platform’s checkpointing functionality, I extended the run until it reached 500 steps, and a mean score of ~12.2, placing it near the top of the leaderboard.
Tiny Hanabi
Though it’s clear that scaling up the RL training runs would saturate the environment, I wanted to test the hypothesis much faster and more cheaply, given that this work is oriented towards informing better abstractions for multi-agent designs, and not so much towards saturating the Hanabi env itself.
To do this, I implemented a tiny Hanabi; or, rather, I modified the environment to accept different game configurations in terms of colors, ranks and hand sizes, and finally settled on a specific config with 2 colors, 3 ranks, and a hand size of 2 for each player. Such a config meant shrinking the game’s state-space considerably1, allowing for smaller training runs.
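In terms of the hypothetical GameConfig sketched earlier, the tiny variant boils down to something like:

```python
# Tiny Hanabi: 2 colors x 3 ranks with hand size 2 gives 6 card types,
# a 12-card deck, and a maximum score of 2 * 3 = 6.
tiny_config = GameConfig(num_colors=2, num_ranks=3, hand_size=2)
```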
Also, to go one step further in the simplification, I went back to the XML version of the game (no tool-calling), which made it easier to work with the smaller 0.6B and 1.7B non-instruct versions of Qwen3 again. Such environment can be found here.
Synthetic data and SFT warm-up
This time the synthetic data generation was done in two steps. First, as with the previous effort, an initial dataset was generated using Grok4-fast. Unlike the previous dataset, this one is composed only of perfect or near-perfect scores (for tiny Hanabi).
However, it contains only ~600 rollouts (i.e. 300 games, with 2 rollouts per game, one per player). In order to get better coverage and a higher-quality dataset, a second dataset was generated in an expert-iteration loop. That is, I first ran a Qwen3-4B through SFT+RL, and then used the trained model to generate the second, more comprehensive set of rollouts.
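Schematically, the loop looks like the sketch below; every callable is a hypothetical placeholder for the actual generation and training scripts.

```python
def expert_iteration(generate_rollouts, sft_then_rl, teacher, env, threshold):
    # Step 1: bootstrap from a strong teacher (Grok4-fast in this work),
    # keeping only (near-)perfect games.
    seed = [r for r in generate_rollouts(teacher, env) if r.score >= threshold]
    # Step 2: run a small student (Qwen3-4B here) through SFT + RL on the seed data.
    student = sft_then_rl(seed)
    # Step 3: the trained student generates the broader, higher-coverage dataset.
    return generate_rollouts(student, env)
```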
To keep things short, I will show the results of the training runs that used this second dataset. To give perspective, a Qwen3 1.7B model SFT’d on this data for 20 steps already managed to achieve a score of ~3.1, about 50% of the total tiny Hanabi score, unlike the case for standard Hanabi, where SFT only bumped the score to about 12% of the total.
RL
An RL run on top of this SFT’d baseline quickly raises the model to the top, reaching mean scores of 5.5 (out of 6); that is, 91% of the total score.
Demo
I vibe-coded a simple demo space where it’s possible to play tiny Hanabi with the trained model:
You will notice two things:
- It is still relatively hard to get a perfect score, even on tiny Hanabi.
- The trained model has its own quirks. As expected, it developed its own communication conventions! For instance, I noticed it will generally respond to color hints as indications that a card is playable. This is both good and bad: as mentioned above, it’s probably a result of self-play. It learnt to play certain strategies and get high scores but does not adapt to other types of play.
Multi-Agent Designs
Multi-agent setups are obviously not limited to the kind of case that is covered in the previous experiments. Among many other considerations, a comprehensive taxonomy would have to include things like:
- Reward structure - whether agents cooperate fully, like in Hanabi, or interact in a partially or fully competitive environment.
- Information - do agents observe the state of the game fully, or is it only partially observable, as in Hanabi?
- Training paradigm - are agents trained jointly, with the same policy (i.e. self-play) like in Hanabi, or are there multiple-policies (potentially one per-agent)?
- Communication - can agents communicate, and if so, how?
In this work, the scope is limited to a fairly small space of potential interactions, but even with this in mind, some extensions over verifiers’ abstractions can already be put in place. Importantly, the kind of training that was used in the previous experiments was pure joint-policy self-play. This avoids a lot of the complications arising from training multiple-policies. Furthermore, Hanabi’s fully cooperative nature implies one does not need to care about per-agent rewards, again, simplifying the logic to a great extent.
These considerations imply most of the changes needed to have a more ergonomic implementation of environments like Hanabi can be done on the verifiers side. A (draft) PR with a minimal set of changes can be found here. I will briefly explain the most important features here.
Agent class
One of the most evidently awkward pieces of code in the original Hanabi implementation was the Player class. This class was needed to abstract away the logic of what each agent does, but should ideally be a shared artifact in the verifiers library.
In the changes proposed, the Agent class is a dataclass that representationally isolates each agent in a multi-agent env. It contains only essential attributes, like an identifier for the agent and a system prompt, potentially specific to the role it takes in the environment.
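Concretely, the proposed class reduces to something like the sketch below. Only id, system_prompt, and is_trainable are confirmed by the refactored Hanabi snippet further down; any other attribute would be speculative.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    id: str                    # e.g. "player_0"
    system_prompt: str         # potentially specific to the agent's role
    is_trainable: bool = True  # whether this agent's trajectories are trained on
```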
Protocol
The other essential piece is the Protocol class, which defines how the multiple agents interact in the environment. It specifies turn order and agent interaction patterns, separate from the task/environment logic. This allows the same protocol (e.g., round-robin) to be reused across different tasks.
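As a sketch, a round-robin protocol needs little more than a cyclic index over agent ids; the actual interface in the draft PR may differ.

```python
class RoundRobinProtocol:
    """Fixed cyclic turn order, independent of the task's own logic."""

    def __init__(self, agent_ids: list[str]):
        self.agent_ids = agent_ids

    def next_agent(self, turn: int) -> str:
        return self.agent_ids[turn % len(self.agent_ids)]
```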
MultiAgentEnv
Finally, the MultiAgentEnv extends MultiTurnEnv to integrate the Agent and the Protocol into the class that manages the turn-order logic, environment logic, and handling of each agent’s trajectories.
Refactored Hanabi
With the proposed changes above, the implementation of the Hanabi environment can be found here; it extends the MultiAgentEnv with just some key lines, like the following:
```python
for i in range(num_players):
    agent = vf.Agent(
        id=f"player_{i}",
        system_prompt=generate_system_prompt(
            self.config, player_id=i, num_players=num_players, thinking=thinking
        ),
        is_trainable=True,
    )
    self.register_agent(agent)
```
where Agent instances are defined, and:
```python
super().__init__(
    dataset=train_dataset,
    eval_dataset=eval_dataset,
    max_turns=max_turns * num_players,  # scale by number of players
    protocol=vf.RoundRobinProtocol(agent_ids),
    **kwargs,
)
```
where the MultiAgentEnv is instantiated with a RoundRobinProtocol.
The rest of the environment logic is defined in the on_turn_complete method, which lives in MultiAgentEnv and updates the state of the environment after each agent’s action.
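A hedged skeleton of what Hanabi’s on_turn_complete override amounts to; the signature and helpers are my own illustration, not the PR’s exact API.

```python
def on_turn_complete(self, state: dict, agent_id: str, action: str) -> None:
    # Apply the agent's parsed action ("P0", "D3", "1H1", ...) to the shared state.
    self.apply_action(state, agent_id, action)
    # The game ends on perfect fireworks, an exhausted deck, or a third mistake.
    state["done"] = self.is_game_over(state)
```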
Signaling and Coordination
To put the new abstractions to the test, I decided to implement two more environments, tightly related to Hanabi’s core dynamics: a Lewis Signaling Game and a pure Coordination Game.
Lewis signaling games
A Lewis signaling game is a type of signaling game that features perfect common interest between players, where success is determined by their capacity to form communication conventions. Both of these features – emergent communication and common payoffs – are shared with Hanabi.
From Wikipedia:
The underlying game has two players, the sender and the receiver. The world can be in any of a number of states and the sender is aware of that state. The sender has at its disposal a fixed set of signals that it can send to the receiver. The receiver can observe the signal sent, but not the state of the world, and must take some action. For each state, there is a unique correct action and both the sender and receiver prefer that the receiver take the correct action in every state.
The extensive-form representation of the game is the following, illustrating the common-payoff structure:
In my implementation, the “state” of the world is a (real) word (e.g. “apple”, “horse”). The sender sees this and a set of “distractors” from the same set of real words. The sender can pick a message from a vocabulary of alien words, meaningless strings like “snorf” or “blun”. The receiver then sees the real words (target and distractors) and must pick one based on the message. If it picks the target then both get a reward of 1, otherwise they get 0.
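A compact sketch of one round as described above; the word lists and function parameters are illustrative, not the environment’s actual ones.

```python
import random

REAL_WORDS = ["apple", "horse", "table", "cloud", "river"]
ALIEN_WORDS = ["snorf", "blun", "plax", "zorp", "frag"]

def play_round(sender, receiver, n_distractors=2):
    target = random.choice(REAL_WORDS)
    distractors = random.sample([w for w in REAL_WORDS if w != target], n_distractors)
    # The sender sees the target and distractors, and emits one alien word.
    message = sender(target, distractors, ALIEN_WORDS)
    # The receiver sees only the message and the shuffled candidates.
    options = random.sample([target] + distractors, n_distractors + 1)
    guess = receiver(message, options)
    return 1 if guess == target else 0  # common payoff for both agents
```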
What is interesting about this environment is that no model – no matter how good – can get a high reward on it out of the box. There’s no semantic relation at all between the vocabulary of real words and the vocabulary of alien words. Success depends entirely on the models’ ability to learn a way to communicate about the state of the world through iterative experience; that is, on forming a mapping between real and fake words.
It is also very easy for models to get quite good at this task through training. In the default setting of 32 words, a quick RL run yields the following reward curve:
The following is the mapping2 learnt by the model during such training:
```
============================================================
LEARNED MAPPING SUMMARY
============================================================
Target       Alien Word   Count   Confidence
------------------------------------------------------------
apple        plax         6       60%
banana       frag         6       60%
cherry       frag         6       60%
grape        frag         10      100%
lemon        frag         10      100%
mango        munt         10      100%
orange       zorp         10      100%
peach        frag         10      100%
book         hask         5       50%
chair        blun         8       80%
table        plax         10      100%
lamp         plax         10      100%
clock        hask         10      100%
mirror       glont        10      100%
window       glont        7       70%
door         blun         8       80%
bird         frin         6       60%
cat          blun         9       90%
dog          blun         10      100%
fish         fump         8       80%
horse        werg         10      100%
mouse        blun         10      100%
rabbit       blun         7       70%
tiger        jund         10      100%
cloud        glont        6       60%
river        gwim         9       90%
mountain     munt         10      100%
forest       glont        10      100%
ocean        glont        4       40%
desert       dask         10      100%
island       yolt         10      100%
valley       yolt         10      100%
============================================================
COLLISION ANALYSIS
============================================================
Alien words mapping to multiple targets:
  blun: chair, door, cat, dog, mouse, rabbit
  frag: banana, cherry, grape, lemon, peach
  glont: mirror, window, cloud, forest, ocean
  hask: book, clock
  munt: mango, mountain
  plax: apple, table, lamp
  yolt: island, valley

Unused alien words (18): blick, blorf, criv, drix, frub, grix, holt, jask, kreb, kwist, quib, snorf, spax, threp, trel, vosh, yump, zang
```
Pure-coordination games
Coordination is also a core feature of Hanabi. In game theory, a coordination game is one where players get higher payoffs when they select the same action as other players. They are not generally required to be common payoff, however. For example, the Stag Hunt game has the following payoff matrix:
|  | Stag | Hare |
|---|---|---|
| Stag | a, a | c, b |
| Hare | b, c | d, d |

where \(a \gt b \geq d \gt c\)
Nature and Appearance of Deer - from the Wikipedia entry, just because it's cool.
The specific kinds of coordination games that have common payoffs are called pure coordination games, and have the following shape:

|  | B1 | B2 |
|---|---|---|
| A1 | a, a | b, b |
| A2 | b, b | a, a |
Again, in this kind of environment, there’s nothing that can make models particularly good at coordinating if there’s absolutely no correlation between their actions. In the implemented environment, the actions are random 4-character strings with no inherent meaning, so the key challenge is that actions have no prior focal points. The agents are faced with these actions:
afqo, nzjo, usie, fpva, kafn, vqwp, wpus, yicc, qahf, trxc
Models very quickly attain perfect rewards.
Out of those, the model converged, for no particular reason, on fpva. Of course, this is not surprising: the first action that, by chance, coordinates the players will be reinforced until it becomes the default option for both.
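That dynamic is easy to reproduce with a toy urn-style simulation; this is purely illustrative and not the actual training code.

```python
import random

ACTIONS = ["afqo", "nzjo", "usie", "fpva", "kafn"]
weights = {a: 1.0 for a in ACTIONS}  # one shared policy, as in self-play

def sample():
    return random.choices(ACTIONS, weights=[weights[a] for a in ACTIONS])[0]

for _ in range(200):
    a, b = sample(), sample()
    if a == b:            # coordination succeeded: reinforce that action
        weights[a] *= 1.5

print(max(weights, key=weights.get))  # some arbitrary action has taken over
```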
Although these two simple environments are not particularly interesting per se, they serve two purposes:
- they disentangle some of the main elements of Hanabi: coordination and emergent communication
- they serve as proofs/tests for the newly proposed changes in verifiers
Directions
In this work, I have shown that it is possible to implement simple, yet interesting, multi-agent environments using Prime Intellect’s stack, and that even small-scale RL runs can yield meaningful improvements in cooperative play. There’s a long path towards general support for the wide range of multi-agent interactions, but in the spirit of cooperation and open research for which Prime Intellect is setting a clear precedent, here are what I consider the most promising directions, which I plan to work on.
Multi-policy training
All the experiments in this work used self-play, meaning the same policy controls all agents. This is the simplest setup and it naturally leads to the emergence of conventions, but it also means those conventions can be arbitrary and brittle. This is actually pretty evident after interacting with the Tiny Hanabi Demo for a few games! A natural next step would be training with multiple distinct policies rather than a single shared one. Most likely, this could be done in a memory-efficient way with separate LoRA adapters for each policy. This opens up a much richer design space (asymmetric roles and heterogeneous capabilities) but also introduces challenges around credit assignment and non-stationarity that the current verifiers+prime-rl abstractions don’t yet address.
Scaling up
The context-length explosion observed in the standard Hanabi runs is a practical bottleneck but not a fundamental one. Longer runs with larger batch sizes and more efficient trajectory handling (i.e. leveraging the recent best-effort interleaving) should push scores further. The hosted training results already hint at this: the 235B MoE model reached ~12 points – placing it almost at the top of the leaderboard – with a stable run, a modest budget, and no signs of plateauing.
Competitive setups
The environments explored here are all common-payoff, which sidesteps the thorny issues of credit assignment and incentive alignment. Mixed-motive games are arguably more representative of real-world interactions and would stress-test both the training infrastructure and the proposed abstractions.
Final word
I believe multi-agent training for LLMs is one of the more under-explored and high-potential areas in open RL research. I think the experiments I showed here suggest there is a lot of low-hanging fruit still to be picked. I hope this work — the environments, the datasets, and the proposed abstractions — provides useful building blocks for others exploring this space, and I would like to thank the Prime Intellect team for giving me the opportunity to do so.
1. I haven’t done the math here myself. Opus 4.6 seems to claim that it’s a state-space reduction of roughly 17+ orders of magnitude, with the following comparison table:

| Metric | Standard (5c/5r/h5) | Tiny (2c/3r/h2) | Reduction |
|---|---|---|---|
| Card types | 25 | 6 | ~4x |
| Deck size | 50 | 12 | ~4x |
| Action space | 20 | 9 | ~2x |
| State space | ~10³⁰⁺ | ~10¹³ | ~10¹⁷x smaller |

In any case, I found this cool website that explains the complexity of games with a citation to this paper, claiming:
2. These are the alien words chosen by the sender when presented with a specific target over 10 trials. ↩