Live benchmarks (beta)
These charts use the latest published Arena runs on core PM tasks. Adjust the cost slider to favor raw quality or cost-aware value.
0 = ignore cost (pure quality). 1 = heavily weight cheaper models.
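For the curious, here is a minimal sketch of how a slider like this can blend raw quality with cost-aware value. The Arena's actual weighting formula isn't published on this page, so the linear blend, interface, and function names below are illustrative assumptions only:

```ts
// Illustrative only: blends a model's quality score with a cost-based value signal.
// weight = 0 → pure quality; weight = 1 → cheaper models dominate.
interface ArenaRun {
  model: string;
  quality: number;        // 0–100 rubric score (assumed scale)
  costPerTaskUsd: number; // average cost to run one PM task
}

function costAdjustedScore(run: ArenaRun, weight: number, maxCostUsd: number): number {
  // Normalize cost to 0–1, where cheaper models land closer to 1.
  const costValue = 1 - Math.min(run.costPerTaskUsd / maxCostUsd, 1);
  // Linear blend between raw quality and cost-aware value.
  return (1 - weight) * run.quality + weight * costValue * 100;
}

// Example: a strong but pricey model vs. a cheaper mid-tier one.
const runs: ArenaRun[] = [
  { model: "frontier-pro", quality: 92, costPerTaskUsd: 0.4 },
  { model: "fast-lite", quality: 81, costPerTaskUsd: 0.05 },
];
const maxCost = Math.max(...runs.map((r) => r.costPerTaskUsd));
for (const w of [0, 0.5, 1]) {
  console.log(w, runs.map((r) => `${r.model}: ${costAdjustedScore(r, w, maxCost).toFixed(1)}`));
}
```

At weight 0 the pricey model wins on quality alone; as the slider moves toward 1, the cheaper model's lower per-task cost pulls it ahead.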
What we actually test
Most AI benchmarks focus on synthetic exams. Our arena is grounded in the real work of Product Management and Product Leadership. Each model gets the same task, context, and formatting requirements; then we compare quality, cost, and the likelihood that a senior PM would actually ship the result.
Strategy & execution
- Vision narratives and PRDs for new initiatives
- Roadmap briefs and trade-off writeups
- Experiment and metrics design for growth questions
Communication & storytelling
- Executive updates and stakeholder memos
- Feature release briefs and change logs
- UX copy and in-product messaging variants
Real PM task outputs
See actual outputs from our Arena experiments. These are real PM tasks completed by frontier AI models, anonymized and showcased for learning.
Models in the arena
We treat models as interchangeable tools. The goal isn’t to crown a single champion, but to find the right fit for each kind of PM work.
Gemini family
Gemini 3 Pro (Preview), Gemini 2.5 Pro, Gemini 2.5 Flash & Flash Lite
Strong reasoning, long context, and great price/performance for strategy, experiments, and curriculum-aware coaching.
OpenAI family
GPT-4.1 mini/full, GPT-4o mini, GPT-5 mini, GPT-5.1
Strong generalists with great tool support. We lean on them for comparison baselines and structured outputs.
Grok & others
xAI Grok, DeepSeek, Llama, Qwen, Nova and more via OpenRouter
We track emerging models that punch above their weight on cost/value or on specific PM tasks (e.g., code or analytics-heavy work).
Why this matters for your career
As a PM, your edge isn’t in memorizing model specs. It’s in knowing how to turn AI into leverage for discovery, strategy, and execution. Our arena abstracts away the vendor noise and tells you, in plain language, where each model shines.
- Better tools, fewer experiments: stop burning time trialing models blindly. Start with a shortlist per task.
- Stronger narrative with stakeholders: explain why you chose a model in terms of reliability, quality, and cost.
- Future-proof learning: as new models ship, we’ll update the arena so you always have an up-to-date view.
We don’t accept affiliate fees or pay-to-play placement for model rankings. Scores are based on our own tasks, prompts, rubrics, and logs, and we’ll publish methodology updates as the ecosystem evolves.