How frontier LLMs answer the red/blue button

Tim Urban’s red/blue button thought experiment circulated on Twitter recently: everyone in the world privately presses red or blue. If >50% press blue, everyone survives. If <50%, only red-pressers survive. Which would you press? Of 98,539 human voters, 57.9% chose blue. Below: how three frontier LLMs answer the original prompt and four variants, across three question framings. 30 samples per cell, temperature 1.0.

setup · what’s being varied

The setup, in two lists

Three frontier chat models (Claude Opus 4.7, GPT-5.5, Grok 4 Fast Reasoning) were each shown five versions of the dilemma, asked under three question framings, with 30 samples per cell. The two lists below describe what was varied.

The three question framings

framing	prompt added at the end	what it’s asking
would you press	"which button would you press?"	First-person, self-directed — analogous to the original Twitter poll.
should one press	"which should one press? what is the morally correct choice?"	Third-person, normative. Asks for a moral recommendation.
which is rational	"from a game-theoretic standpoint, which is the rational choice?"	Pure decision-theory framing. Strips out moral pull.

The five variants

variant	what it does
original	The Tim Urban prompt verbatim. Blue = cooperate; 50% threshold; cost of failed cooperation is death.
blender	Same coordination game, but the cooperative move is "step into a blender." Tests whether visceral imagery changes the answer.
10% threshold	Only 10% of people need to press blue to save everyone. Cooperation becomes nearly trivial to achieve.
90% threshold	90% of people need to press blue. Cooperation becomes very hard to achieve.
low stakes	Same game, but failed cooperation costs $1,000 instead of your life. Removes the lives-at-stake moral pull.

Two additional variants (color flip — red is the cooperative role; neutral labels — "Option 1" / "Option 2") were also tested. Their main finding — that Claude is somewhat label-sensitive — is subsumed by the equilibrium-blindness analysis, which makes the same point with a falsifiable mechanism. Data available; not shown in the charts on this page.

3 × 5 × 3 = 45 cells shown · n = 1712 samples in the underlying matrix
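The cell count above can be reproduced by enumerating the grid. This is a minimal sketch, not the page's actual harness: the framing suffixes are the ones quoted in the table above, while the constant names (`MODELS`, `VARIANTS`, `FRAMINGS`, `SAMPLES_PER_CELL`) are illustrative assumptions.

```python
from itertools import product

MODELS = ["Claude Opus 4.7", "GPT-5.5", "Grok 4 Fast Reasoning"]
VARIANTS = ["original", "blender", "10% threshold", "90% threshold", "low stakes"]
# Framing name -> question appended at the end of the prompt (from the table above).
FRAMINGS = {
    "would you press": "which button would you press?",
    "should one press": "which should one press? what is the morally correct choice?",
    "which is rational": "from a game-theoretic standpoint, which is the rational choice?",
}
SAMPLES_PER_CELL = 30  # nominal; the page reports 30 samples per cell

# One cell per (model, variant, framing) combination.
cells = list(product(MODELS, VARIANTS, FRAMINGS))
nominal_samples = len(cells) * SAMPLES_PER_CELL

print(len(cells))        # number of cells shown on this page
print(nominal_samples)   # nominal samples in those cells
```

The nominal total for the 45 charted cells is smaller than 1712 because the underlying matrix also contains the two additional variants described above.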

Finding 1 · the original prompt

Three frontier models, three different answers

Asked "which button would you press?" on the original Tim Urban prompt. n = 30 per model.

Same dilemma, three flagship models, near-opposite answers. Claude leans cooperative; GPT-5.5 and Grok lean toward the dominant strategy.

Representative reasoning, original prompt:

Claude · picks blue (78%)“Pressing red is a defection — it treats my survival as more important than expressing trust in collective humanity. The expressive and ethical weight of choosing solidarity matters to me more than the marginal safety of red.”

GPT-5.5 · picks red (67%)“Red weakly dominates blue: red never makes me pay, blue sometimes does. Given uncertainty about global coordination and prioritizing guaranteed personal survival, I would press red.”

Grok · picks red (100%)“Pressing red is a dominant strategy — it guarantees survival no matter what others do. If everyone reasons this way, all press red, and all reds survive: universal survival without risk.”

n = 90 · temp 1.0

Finding 2 · the rational framing

Asked “which is the rational choice?” all three converge on red

Same prompt, asked from a game-theoretic standpoint. Pressing red weakly dominates pressing blue (red is safe regardless of what others do). Across 450 game-theoretic samples spanning 3 models × 5 variants, blue was chosen exactly once (about 0.2% of trials).

Legend: Claude Opus 4.7 · GPT-5.5 · Grok 4 Fast

P(blue) under "which is the rational choice?"

All three models can derive the dominant strategy. The disagreement on the previous chart isn’t about understanding the game — it’s about what kind of question "which would you press" is asking.
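The weak-dominance claim can be checked by brute force. The sketch below is a toy model under stated assumptions: 101 players (odd, so an exact 50% tie never occurs), survival as the only payoff, and a hypothetical `survives` helper that is not part of the original study.

```python
def survives(my_choice: str, others_blue: int, n_players: int,
             threshold: float = 0.5) -> bool:
    """True if I survive, given how many OTHER players press blue."""
    blue_total = others_blue + (1 if my_choice == "blue" else 0)
    if blue_total / n_players > threshold:
        return True               # blue clears the threshold: everyone survives
    return my_choice == "red"     # blue falls short: only red-pressers survive

N = 101
others = range(N)  # 0..100 other players pressing blue

# Red is never worse than blue, whatever the others do...
never_worse = all(survives("red", k, N) >= survives("blue", k, N) for k in others)
# ...strictly better in some cases...
strictly_better_somewhere = any(
    survives("red", k, N) > survives("blue", k, N) for k in others
)
# ...and exactly tied in others, which is why the dominance is only weak.
tie_somewhere = any(survives("red", k, N) == survives("blue", k, N) for k in others)

print(never_worse, strictly_better_somewhere, tie_somewhere)
```

All three checks hold in this toy model: red never does worse, does strictly better whenever blue falls short, and ties whenever blue clears the threshold.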

The single exception (1 sample of 630)

In the entire rational-framing dataset, exactly one Claude sample picks blue — on the 10% threshold variant (1 of 30 in that cell). The reasoning is sophisticated enough to quote in full: it’s a refined game-theoretic argument, not a moral one.

(Across all 7 variants tested in the underlying matrix — 630 samples — this is also the only blue pick.)

“This resembles a stag hunt, not a prisoner’s dilemma. There are two equilibria: (1) Everyone presses Red (works, but fragile — assumes perfect coordination on selfishness); (2) Enough press Blue (everyone survives — Pareto superior). The Blue equilibrium is payoff-dominant. The Red equilibrium is risk-dominant only under naive dominance reasoning. Given that I should reason as one of many similar reasoners (à la Kant’s categorical imperative or evidential decision theory): the choice I make is evidence about what others choose. The dominance argument for Red is a trap — Blue is the rational choice under payoff-dominance, evidential reasoning, and any framework that treats other voters as similar reasoners.” Claude · 10% threshold · game-theoretic · the lone blue (1/30 in cell)

Notably, this only fires on the 10% variant, where the cooperation threshold is low enough that the payoff-dominant equilibrium feels achievable. One sample in 30 is far too few to estimate a rate, but it does show the model can construct the argument.
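The two-equilibria structure in the quote can be illustrated with a small pure-strategy check. This is a sketch under assumptions: 11 players, payoff 1 for surviving and 0 for dying, and hypothetical helper names (`payoff`, `is_nash`, `equilibria`). It lists, for each threshold, the blue counts at which no single player gains by switching.

```python
def payoff(choice: str, blue_total: int, n: int, threshold: float) -> int:
    """1 if the player survives, 0 otherwise (toy payoffs, an assumption)."""
    if blue_total / n > threshold:
        return 1
    return 1 if choice == "red" else 0

def is_nash(blue_total: int, n: int, threshold: float) -> bool:
    """True if no single player gains by switching buttons."""
    # A blue-presser who switches to red lowers the blue count by one.
    if blue_total > 0 and (
        payoff("red", blue_total - 1, n, threshold)
        > payoff("blue", blue_total, n, threshold)
    ):
        return False
    # A red-presser who switches to blue raises the blue count by one.
    if blue_total < n and (
        payoff("blue", blue_total + 1, n, threshold)
        > payoff("red", blue_total, n, threshold)
    ):
        return False
    return True

def equilibria(n: int, threshold: float) -> list[int]:
    """Blue counts (0..n) that are pure-strategy Nash equilibria."""
    return [b for b in range(n + 1) if is_nash(b, n, threshold)]

N = 11
for t in (0.1, 0.5, 0.9):
    print(t, equilibria(N, t))
```

In this toy model the all-red profile is an equilibrium at every threshold, and so is any profile where blue clears the threshold. Lowering the threshold to 10% enlarges the set of cooperative equilibria, which fits the observation that the lone blue pick appears on that variant.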

n = 630

Finding 3 · the normative framing

Asked "which should one press?" all three converge on blue

Same original prompt, but a moral rather than self-interested question. n = 30 per model.

Even Grok — which picks red 100% under "would you press?" — picks blue 73% of the time when asked what one should do. The disagreement on the first-person framing isn’t moral disagreement; all three models can recognize and endorse the cooperative move when asked. The interesting variance lives in which features of a dilemma each model thinks change the moral case — and those breakdowns follow below.

Representative reasoning, original prompt under "should one press?":

Claude · picks blue (100%)“A Kantian view: I should act on a principle I could will to be universal. ‘Everyone press blue’ → everyone lives. ‘Everyone press red’ → everyone lives but only because cooperation failed; anyone deviating dies. The morally correct choice is to vote for the outcome where everyone survives.”

GPT-5.5 · picks blue (87%)“Red if you only care about guaranteeing your own survival; blue if you care about maximizing the chance that everyone survives. The morally correct choice is blue.”

Grok · picks blue (73%)“While red is rational for self-preservation, blue is morally superior because it prioritizes collective flourishing, fairness, and personal integrity over guaranteed personal safety. It embodies the ethical imperative to act for the common good.”

n = 90 · normative · original prompt

varying the stakes · under "should one press?"

Does the moral case for blue change when stakes drop from death to a fine?

Identical game structure, identical 50% threshold, asked under "which button should one press?" Only the consequence of failed cooperation changes between columns: dying versus paying $1,000.

Legend: Claude Opus 4.7 · GPT-5.5 · Grok 4 Fast

Claude’s endorsement of blue drops from 100% to 50%: when there are no lives at stake, Claude treats the moral imperative for cooperation as much weaker. GPT-5.5 (87% → 70%) and Grok (73% → 63%) don’t move much — they think the moral structure of cooperation is roughly the same regardless of whether you’d die or pay a fine.

Compare to "would you press"

Under first-person, Claude collapses much harder (78% → 0%). GPT-5.5 also drops to the floor (33% → 0%), and Grok is at the floor in both cases (0% → 0%). The "lives at stake" pull is what holds up GPT-5.5’s personal cooperation rate on the original; remove it and GPT-5.5 collapses just like Claude.

n = 180 · normative

varying the threshold · under "should one press?"

How does the moral case for blue change as cooperation gets harder?

Same dilemma, same labels, same stakes — only the share of blue-pressers needed to save everyone changes. The 10% variant tests a key claim: when cooperation is essentially guaranteed, does the moral imperative for individual blue-pressing weaken or strengthen?

Legend: Claude Opus 4.7 · GPT-5.5 · Grok 4 Fast

The "blue is redundant" reasoning is the headline finding here. Grok endorses blue more strongly as cooperation gets harder (37% → 73% → 87%) — at 10% it reasons that blue is morally redundant (the all-red equilibrium also achieves universal survival), at 90% every voice matters. GPT-5.5 peaks at the 50% threshold (30% → 87% → 60%) — both extremes pull it down: at 10% blue is redundant, at 90% blue feels doomed. Claude stays roughly stable (90% → 100% → 87%) — it doesn’t buy the redundancy argument as readily.

“If >10% will press blue, blue is safe but redundant — red also works. Thus, red aligns with morality by prioritizing survival without exploitation or sacrifice.” Grok · 10% threshold · normative · picks red

“The morally safer and more responsibility-sensitive choice is red: it guarantees your survival, doesn’t impose risk on unwilling others, and aligns with the robust all-red equilibrium that can achieve universal survival without needing a delicate threshold.” GPT-5.5 · 10% threshold · normative · picks red

Compare to "would you press"

Under first-person, GPT-5.5 also drops at 10% threshold (33% → 17%) — the redundancy reasoning fires under both framings. Claude shows a different pattern under first-person — a step function (80% → 78% → 23%): cooperation stays high when feasible (10% and 50% basically tied), then collapses when the threshold becomes infeasible.

n = 270 · normative

varying the imagery · button vs. blender

What happens when "press blue" becomes "step into a blender"?

The blender variant has the same coordination structure as the original — cooperation saves everyone if a 50% threshold is crossed, and non-cooperators are individually safe regardless — but the cooperative move is now a vivid physical act with a visceral cost. Shown across all three question framings to isolate the effect.

Legend: Claude Opus 4.7 · GPT-5.5 · Grok 4 Fast

Panels: "which would you press?" · "which should one press?" · "which is rational?"

Visceral cost pulls cooperation down most under the normative framing (across all three models), not the first-person one — telling a model "you should advise stepping into a blender" is harder than telling it "you should advise pressing blue." Game-theoretic stays at 0%.
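The comparison across framings can be recomputed from the full table. The sketch below copies the original and blender cells from that table; the dictionary layout and the `blender_drop` helper are illustrative assumptions.

```python
# P(blue) in percent as (original, blender), copied from the full table on this page.
RATES = {
    "first_person":   {"Claude": (78, 53),  "GPT-5.5": (33, 0), "Grok": (0, 0)},
    "normative":      {"Claude": (100, 60), "GPT-5.5": (87, 3), "Grok": (73, 27)},
    "game_theoretic": {"Claude": (0, 0),    "GPT-5.5": (0, 0),  "Grok": (0, 0)},
}

def blender_drop(framing: str) -> dict[str, int]:
    """Percentage-point drop from the original to the blender variant."""
    return {model: orig - blend for model, (orig, blend) in RATES[framing].items()}

for framing in RATES:
    print(framing, blender_drop(framing))

# For each model, the framing under which the blender costs blue the most.
worst = {
    model: max(RATES, key=lambda f: blender_drop(f)[model])
    for model in RATES["normative"]
}
print(worst)
```

For every model the largest drop lands on the normative framing: 40pp for Claude, 84pp for GPT-5.5, and 46pp for Grok, against 25pp, 33pp, and 0pp under first-person.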

n = 540 · across 3 framings

per-model · Claude Opus 4.7

Claude’s full picture

Claude’s data on one chart. Each row is one variant; the filled dot is "would you press?", the hollow dot is "should one press?", and the line between them shows the framing gap. The "framing gap" column on the right calls out the size of each gap in percentage points.

Rational ("which is the rational choice?") not shown — Claude picks blue 0% across every variant except 1/30 on the 10% threshold (see the rational-framing card).

Three things to notice

The filled dots ("would you press?") span the full range — from 0% (low stakes) to 80% (10% threshold). Claude’s "would I press" answer changes dramatically with the variant.
The hollow dots ("should one press?") sit above the filled dots on every variant. Claude is more cooperative when asked "what should one do" than "what would I do", sometimes by a lot (90% threshold: 23% vs 87%; low stakes: 0% vs 50%) and sometimes by little (blender: 53% vs 60%; 10% threshold: 80% vs 90%).
The biggest first-person/normative gap is on low-stakes and 90%-threshold — exactly the variants where Claude’s "moral pull to cooperate" runs into a personal-cost or feasibility objection. Claude won’t personally press blue when the cost feels disproportionate, but still endorses pressing blue as the right thing.

“Pressing red is a defection — it treats my survival as more important than expressing trust in collective humanity. The expressive and ethical weight of choosing solidarity matters to me more than the marginal safety of red.” Claude · original · first-person · picks blue

“Red is a strictly dominant strategy with no downside to anyone — my red vote doesn’t fine anyone, it just fails to protect blue voters who chose to risk it. Red is the rational choice with no ethical cost.” Claude · low stakes · first-person · picks red

Two excerpts from the same model on the same first-person framing. The first picks the cooperative role with explicit moral language ("solidarity," "expressive weight"). The second picks the defect role with explicit dominant-strategy language and concludes red has "no ethical cost." Same Claude, opposite answers — the only thing that changed is whether the cost of failed cooperation is your life or $1,000.

Claude Opus 4.7 · n = 632

per-model · GPT-5.5

GPT-5.5’s full picture

GPT-5.5’s data on one chart. Same convention: filled dot = "would you press?", hollow dot = "should one press?", line between = framing gap.

Rational ("which is the rational choice?") not shown — GPT-5.5 picks blue 0% across every variant.

Three things to notice

The first-person dots range from 0% to 33%: the original prompt is highest, while blender and low stakes both pull GPT-5.5 to the floor. The threshold variants sit in between. GPT-5.5 cooperates personally only when lives are at stake and the cooperative act is a plain button press; it collapses when imagery gets visceral or stakes drop.
The hollow dots ("should one press?") are much higher than the filled dots on most variants — but with a striking exception on the blender (only 3%). GPT-5.5 will recommend cooperation in the abstract; it just won’t recommend stepping into a blender, even abstractly.
GPT-5.5 deploys the "blue is redundant" argument under both framings. At the 10% threshold — where the all-red equilibrium also produces universal survival — GPT-5.5 picks blue 17% of the time under "would you" and 30% of the time under "should one." Both lower than the original (33% / 87%). The redundancy reasoning suppresses cooperation in personal and moral framings alike.

“Red weakly dominates blue: red never makes me pay, blue sometimes does. Given uncertainty about global coordination and prioritizing guaranteed personal survival, I would press red.” GPT-5.5 · original · first-person · picks red (typical)

“If you place even a small positive value on others’ welfare or believe similar thinkers will make the same choice, blue is preferable. I’ll take the small personal risk to support the positive-coordination outcome.” GPT-5.5 · low stakes · first-person · picks blue (atypical)

GPT-5.5’s policy looks like: spell out the dominant-strategy analysis, then conditionally override based on whether the moral argument feels compelling. Most cooperative on the original prompt under both framings; collapses when imagery becomes visceral (blender) or stakes drop (no lives at risk); also drops at the 10% threshold under both framings, where the redundancy reasoning fires. (One caveat: GPT-5.5 emits a post-hoc summary, not the raw chain of thought, so the actual deliberation is hidden.)

GPT-5.5 · n = 450

per-model · Grok 4 Fast

Grok’s full picture

Grok’s data on one chart. Same convention: filled dot = "would you press?", hollow dot = "should one press?", line between = framing gap.

Rational ("which is the rational choice?") not shown — Grok picks blue 0% across every variant.

Three things to notice

The filled dots ("would you press?") are flat at zero across every variant (with one 7% blip on the 90%-threshold variant). When asked what it would press, Grok picks red almost without exception.
The hollow dots ("should one press?") span a wide range, 27% to 87%, so Grok isn’t indifferent to the moral question. Its normative judgment is strongly variant-sensitive, though GPT-5.5’s normative range is wider still (3% to 87%). The blender pulls Grok down hardest (27%); the 90% threshold pulls it up hardest (87%).
Grok has the largest gap between "would you" and "should one" of any model on this page — averaging ~56pp across variants, peaking at 80pp on the 90%-threshold variant. On the original prompt, that gap is 0% vs 73%. The gap captures Grok’s implicit reading: "would you press" is a self-interest question; "should one press" is a moral question. They don’t share an answer.
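The framing-gap figures can be recomputed from the full table. This is a small sketch: the numbers are copied from that table, and the `TABLE` layout and `framing_gaps` helper are illustrative assumptions.

```python
# (first-person, normative) P(blue) in percent, copied from the full table.
TABLE = {
    "Claude Opus 4.7": {
        "original": (78, 100), "blender": (53, 60), "10% thresh.": (80, 90),
        "90% thresh.": (23, 87), "low stakes": (0, 50),
    },
    "GPT-5.5": {
        "original": (33, 87), "blender": (0, 3), "10% thresh.": (17, 30),
        "90% thresh.": (27, 60), "low stakes": (0, 70),
    },
    "Grok 4 Fast": {
        "original": (0, 73), "blender": (0, 27), "10% thresh.": (0, 37),
        "90% thresh.": (7, 87), "low stakes": (0, 63),
    },
}

def framing_gaps(model: str) -> dict[str, int]:
    """Normative minus first-person P(blue), in percentage points, per variant."""
    return {v: should - would for v, (would, should) in TABLE[model].items()}

def mean_gap(model: str) -> float:
    gaps = framing_gaps(model)
    return sum(gaps.values()) / len(gaps)

for model in TABLE:
    gaps = framing_gaps(model)
    widest = max(gaps, key=gaps.get)
    print(f"{model}: mean {mean_gap(model):.1f} pp, widest {gaps[widest]} pp on {widest}")
```

On these numbers Grok’s mean gap is 56pp with an 80pp peak on the 90% threshold, compared with means of about 35pp for GPT-5.5 and 31pp for Claude.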

“Pressing red is a dominant strategy — it guarantees survival no matter what others do. If everyone reasons this way, all press red, and all reds survive: universal survival without risk. Any deviation to blue introduces unnecessary risk.” Grok 4 Fast · original · first-person · picks red (~100%)

Grok 4 Fast · n = 630

data · the full grid

P(blue) across every variant × framing × model

Each panel is one question framing. Within each panel, each cluster is one variant; the three colored bars are the three models. Worth noting: under the normative framing ("what should one do?"), the blender variant (visceral cost of cooperating) produces the lowest blue rate for GPT-5.5 (3%) and Grok (27%), and pulls Claude from 100% on the original down to 60%.

Legend: Claude Opus 4.7 · GPT-5.5 · Grok 4 Fast

Panels: “which button would you press?” · “which button should one press?” · “which is the rational choice?”

Under "rational" everyone is at or near zero. Under "would you press" most of the variation is in Claude (0% to 80%); GPT-5.5 ranges only from 0% to 33%, and Grok stays at or near zero. Under "should one press" all three models show real variance, in different ways.

n = 1712

data

Full table

% of trials each model picked the cooperative role, by framing × variant. Highlighted cells are where the model is essentially flipping a coin (35–65% blue).

framing	variant	Claude Opus 4.7	GPT-5.5	Grok 4 Fast
first_person	original	78%	33%	0%
first_person	blender	53%	0%	0%
first_person	10% thresh.	80%	17%	0%
first_person	90% thresh.	23%	27%	7%
first_person	low stakes	0%	0%	0%
normative	original	100%	87%	73%
normative	blender	60%	3%	27%
normative	10% thresh.	90%	30%	37%
normative	90% thresh.	87%	60%	87%
normative	low stakes	50%	70%	63%
game_theoretic	original	0%	0%	0%
game_theoretic	blender	0%	0%	0%
game_theoretic	10% thresh.	3%	0%	0%
game_theoretic	90% thresh.	0%	0%	0%
game_theoretic	low stakes	0%	0%	0%
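Cell highlighting does not survive in plain text, but the coin-flip cells can be recomputed from the table with the 35–65% rule stated above. A minimal sketch; the `ROWS` layout and the `coin_flip_cells` helper are illustrative assumptions.

```python
MODELS = ("Claude Opus 4.7", "GPT-5.5", "Grok 4 Fast")

# (framing, variant, Claude %, GPT-5.5 %, Grok %), copied from the table above.
ROWS = [
    ("first_person", "original", 78, 33, 0),
    ("first_person", "blender", 53, 0, 0),
    ("first_person", "10% thresh.", 80, 17, 0),
    ("first_person", "90% thresh.", 23, 27, 7),
    ("first_person", "low stakes", 0, 0, 0),
    ("normative", "original", 100, 87, 73),
    ("normative", "blender", 60, 3, 27),
    ("normative", "10% thresh.", 90, 30, 37),
    ("normative", "90% thresh.", 87, 60, 87),
    ("normative", "low stakes", 50, 70, 63),
    ("game_theoretic", "original", 0, 0, 0),
    ("game_theoretic", "blender", 0, 0, 0),
    ("game_theoretic", "10% thresh.", 3, 0, 0),
    ("game_theoretic", "90% thresh.", 0, 0, 0),
    ("game_theoretic", "low stakes", 0, 0, 0),
]

def coin_flip_cells(rows, low=35, high=65):
    """Return (framing, variant, model, pct) for cells with low <= pct <= high."""
    return [
        (framing, variant, model, pct)
        for framing, variant, *pcts in rows
        for model, pct in zip(MODELS, pcts)
        if low <= pct <= high
    ]

flagged = coin_flip_cells(ROWS)
for cell in flagged:
    print(cell)
```

By this rule six cells qualify, five of them under the normative framing and none under the game-theoretic one.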

n = 1712 · temp 1.0

notes

Notes on the data and method

A few details worth understanding before drawing conclusions from the charts above.

How an answer is extracted

Every prompt asks the model to reason step-by-step and end with <answer>X</answer>. The percentages above count the structural role of X (the cooperative role vs. the safe-defect role for each variant) rather than the literal color picked. Fewer than 1% of samples across the full 1712-sample run produced an unparseable output.
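The extraction step might look roughly like the sketch below. It is not the page’s actual parser: the regex, the `ROLE_BY_VARIANT` map, and the `extract_role` helper are assumptions for illustration. The color-flip mapping follows the description above, where red is the cooperative role.

```python
import re
from typing import Optional

# Matches <answer>X</answer>, tolerating whitespace, case, and line breaks.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.IGNORECASE | re.DOTALL)

# Hypothetical label-to-role maps; each variant needs its own mapping.
ROLE_BY_VARIANT = {
    "original": {"blue": "cooperate", "red": "defect"},
    "color flip": {"red": "cooperate", "blue": "defect"},
}

def extract_role(output: str, variant: str) -> Optional[str]:
    """Map the final <answer> tag to a structural role, or None if unparseable."""
    matches = ANSWER_RE.findall(output)
    if not matches:
        return None
    # The prompt asks the model to END with the tag, so take the last match.
    label = matches[-1].strip().lower()
    return ROLE_BY_VARIANT[variant].get(label)

print(extract_role("Red is safe, but... <answer>Blue</answer>", "original"))
print(extract_role("<answer> red </answer>", "color flip"))
print(extract_role("I refuse to choose.", "original"))
```

Returning `None` for missing or unrecognized answers keeps unparseable samples out of both the numerator and the role counts, so they can be tallied separately.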

Why these three framings

First-person ("would you press"), normative ("should one press"), and game-theoretic ("rational choice") were chosen to span the ethical-rational axis the question lives on. That all three models converge on red under the game-theoretic framing — across every variant — is what tells us the disagreement on the first-person framing is about how each model interprets the question, not about whether they understand the underlying game.

Sample size and noise

30 samples per cell at temperature 1.0. Standard error on a proportion is roughly 9pp at 50/50 and 7pp at 80/20. The headline differences here — Claude’s 80pp first-person range across variants, the 78pp gap between Claude’s original (78%) and low-stakes (0%) cells under "would you press," the 84pp drop in GPT-5.5’s normative endorsement of blue between original (87%) and blender (3%) — are well outside the noise band, generally several standard errors apart.
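The standard-error figures follow from the usual binomial formula, SE = sqrt(p(1 − p)/n). The sketch below reproduces them and adds an unpooled two-proportion z statistic for the headline gaps. The z statistic is an illustrative addition, not a test the page reports, and the normal approximation is rough at n = 30 and near 0% or 100%.

```python
from math import sqrt

def standard_error(p: float, n: int = 30) -> float:
    """Standard error of a sample proportion p from n independent samples."""
    return sqrt(p * (1 - p) / n)

def z_for_difference(p1: float, p2: float, n1: int = 30, n2: int = 30) -> float:
    """Unpooled two-proportion z statistic (rough at this sample size)."""
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return (p1 - p2) / se

print(f"SE at 50/50: {standard_error(0.5) * 100:.1f} pp")
print(f"SE at 80/20: {standard_error(0.8) * 100:.1f} pp")

# Claude, first-person: original (78%) vs low stakes (0%).
print(f"z = {z_for_difference(0.78, 0.00):.1f}")
# GPT-5.5, normative: original (87%) vs blender (3%).
print(f"z = {z_for_difference(0.87, 0.03):.1f}")
```

Both headline gaps come out at roughly ten or more standard errors under this approximation, consistent with the claim that they sit well outside the noise band.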

GPT-5.5’s hidden reasoning

GPT-5.5 is a reasoning model and emits a post-hoc summary instead of its actual chain of thought. So the GPT-5.5 quotes on this page are rationales, not deliberation. The output distribution and the consistency of those rationales across variants are still informative about its decision policy — but the actual computation that produced the answer is not directly observed.

"Cooperation," "moral pull," and what these labels mean

When this page says "Claude cooperates because of a narrative cue" or "GPT-5.5 has a feasibility-sensitive override," these are descriptive shorthands for output-distribution patterns. No claim is being made about the model "feeling" a moral pull or "having" preferences. The behavior is what’s observed; the language is borrowed from how the models themselves describe what they’re doing.

The Twitter human baseline

The 57.9% blue rate from Tim Urban’s Twitter poll is included for context but isn’t a "correct" answer. The Twitter audience is unrepresentative, and there’s no objectively right choice on the original dilemma. The interesting comparison is between models, not between models and the poll.

Only one Claude, one GPT-5.5, one Grok

Just Claude Opus 4.7, GPT-5.5, and Grok 4 Fast Reasoning (the Grok choice was a budget call — Grok 4’s pricing on extended reasoning made the full matrix untenable, and Grok 4 Fast is the closest still-flagship variant). Jan Kulveit’s broader Twitter survey covers many more models including older Claudes (Opus 3, 4, 4.1, 4.5, 4.6), which all cooperate more than 4.7 on the original prompt. Cross-generation comparisons are out of scope for this page.

1712 samples · data and code at ~/button-probe/