new · Live HTML Preview Rendering Evals active

The World's #1
Vibe Coding Benchmark

We built the premier evaluation platform for AI models. 15+ frontier models, head-to-head scoring, and an AI judge that grades it all. No vendor hype. Just raw, real results.

3,897 prompts judged · 15+ frontier models benchmarked · 10 vibe coding categories
~/projects/acme · World of AI Bench
v0.9.3
$ bench run --model claude-opus-4-5 --suite full
→ resolved 14 benchmarks · 18,540 graded samples
→ concurrency=64 · est. cost $42.18 · est. time 14m
✓ authenticated · org acme · cluster us-east-2
spawning 64 workers...
✓ mmlu-pro · 12,032 / 12,032 · 88.4
✓ humaneval+ · 164 / 164 · 92.7
✓ swe-bench · 500 / 500 · 71.2
! aime-2025 · 30 / 30 · retried 2 timeouts
✓ aime-2025 · 30 / 30 · 76.1
✓ gpqa · 198 / 198 · 64.8
✓ tau-bench · 500 / 500 · 60.5
→ 3 disagreements flagged for human review
✓ run complete · 14/14 benchmarks · composite 79.4
→ report saved to runs/run_a7f2c.json
$_
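A minimal sketch of how the per-benchmark result lines in the run output above could be turned into structured records. The line layout is inferred only from this sample output; `BenchResult`, `LINE_RE`, and `parse_result` are hypothetical names and not part of the `bench` CLI.

```python
import re
from typing import NamedTuple, Optional


class BenchResult(NamedTuple):
    name: str
    done: int
    total: int
    score: float


# Matches lines shaped like "✓ swe-bench · 500 / 500 · 71.2".
# Assumption: this shape is inferred from the sample log, not from a documented format.
LINE_RE = re.compile(
    r"^✓\s+(?P<name>\S+)\s+·\s+(?P<done>[\d,]+)\s*/\s*(?P<total>[\d,]+)"
    r"\s+·\s+(?P<score>\d+(?:\.\d+)?)$"
)


def parse_result(line: str) -> Optional[BenchResult]:
    """Return a BenchResult for a result line, or None for any other line."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return BenchResult(
        name=m["name"],
        done=int(m["done"].replace(",", "")),
        total=int(m["total"].replace(",", "")),
        score=float(m["score"]),
    )


LOG = """\
✓ authenticated · org acme · cluster us-east-2
✓ mmlu-pro · 12,032 / 12,032 · 88.4
✓ humaneval+ · 164 / 164 · 92.7
✓ swe-bench · 500 / 500 · 71.2
! aime-2025 · 30 / 30 · retried 2 timeouts
✓ aime-2025 · 30 / 30 · 76.1
✓ gpqa · 198 / 198 · 64.8
✓ tau-bench · 500 / 500 · 60.5
✓ run complete · 14/14 benchmarks · composite 79.4
"""

# Status lines, retry notices, and the summary line are skipped.
results = [r for r in map(parse_result, LOG.splitlines()) if r is not None]
```

Only the six benchmarks visible in the sample are captured here; the remaining benchmarks of the 14-benchmark run are not shown on the page, so they are not reconstructed.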
15+ · FRONTIER MODELS BENCHMARKED
77.4 · TOP COMPOSITE SCORE
3.9K · PROMPTS AI-JUDGED
10 · VIBE CODING CATEGORIES
GPT-5.5 · 77.4
Claude Opus 4.8 · 77.3
Claude Opus 4.7 · 77.1
Gemini 3.5 Flash · 76.6
Gemini 3.1 Pro · 76.1
GPT-5.4 · 70.5
Claude Sonnet 4-6 · 69.8
Grok 4.20 Reasoning · 69.6
Qwen 3.6 Max · 68.5
DeepSeek V4 Pro · 68.0
Kimi K2.6 · 67.2
DeepSeek V4 Flash · 65.2
01 · the leaderboard

The whole frontier, ranked by what you actually ship.

Sort by composite score, narrow to coding or reasoning, filter by price or open-weight. Every cell links to the exact test cases - no black boxes, no marketing math.
world-of-ai · leaderboard
live · 3,897 prompts scored
#   Model             Vendor     Composite  Frontend  Creative  Game Dev  Agentic  SVG Art
01  GPT-5.5           OpenAI     77.4       76.8      76.9      77.0      78.2     75.8
02  Claude Opus 4.8   Anthropic  77.3       78.1      78.0      76.2      76.4     77.4
03  Claude Opus 4.7   Anthropic  77.1       77.4      77.5      75.8      75.9     76.8
04  Gemini 3.5 Flash  Google     76.6       75.9      76.8      76.7      74.8     78.2
05  Gemini 3.1 Pro    Google     76.1       75.6      76.1      78.3      76.5     76.4
06  GPT-5.4           OpenAI     70.5       70.2      69.8      70.0      71.1     69.6
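As a rough illustration of the sorting and narrowing described above, the sketch below ranks the six leaderboard rows by composite score or by a single category. The scores are copied from the leaderboard; `Row`, `ROWS`, and `rank_by` are hypothetical names, and the platform's real data model is not documented here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Row:
    model: str
    vendor: str
    composite: float
    scores: Dict[str, float] = field(default_factory=dict)


CATEGORIES = ["Frontend", "Creative", "Game Dev", "Agentic", "SVG Art"]


def make_row(model: str, vendor: str, composite: float, *values: float) -> Row:
    """Pair the five per-category values with their column names."""
    return Row(model, vendor, composite, dict(zip(CATEGORIES, values)))


ROWS = [
    make_row("GPT-5.5", "OpenAI", 77.4, 76.8, 76.9, 77.0, 78.2, 75.8),
    make_row("Claude Opus 4.8", "Anthropic", 77.3, 78.1, 78.0, 76.2, 76.4, 77.4),
    make_row("Claude Opus 4.7", "Anthropic", 77.1, 77.4, 77.5, 75.8, 75.9, 76.8),
    make_row("Gemini 3.5 Flash", "Google", 76.6, 75.9, 76.8, 76.7, 74.8, 78.2),
    make_row("Gemini 3.1 Pro", "Google", 76.1, 75.6, 76.1, 78.3, 76.5, 76.4),
    make_row("GPT-5.4", "OpenAI", 70.5, 70.2, 69.8, 70.0, 71.1, 69.6),
]


def rank_by(rows: List[Row], category: Optional[str] = None,
            vendor: Optional[str] = None) -> List[Row]:
    """Sort descending by one category (or composite), optionally filtering by vendor."""
    selected = [r for r in rows if vendor is None or r.vendor == vendor]
    if category is None:
        return sorted(selected, key=lambda r: r.composite, reverse=True)
    return sorted(selected, key=lambda r: r.scores[category], reverse=True)


top_overall = rank_by(ROWS)[0].model
top_game_dev = rank_by(ROWS, category="Game Dev")[0].model
top_google_svg = rank_by(ROWS, category="SVG Art", vendor="Google")[0].model
```

The ranking changes with the column chosen: the composite leader is not the leader in every category, which is the reason to sort per category before picking a model for a specific kind of work.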
Pricing

Professional benchmarking. Simple pricing.

Get access to hosted runs, private workspaces, and head-to-head arena modes. Choose the plan that best fits your scale.

Monthly · Annually (save 17%)
Hobby / OSS
$0/forever

For individual developers, hobbyists, and open source contributors.

  • Public leaderboard access
  • "Can I Run It?" hardware calculator
  • Community Discord support
Get started
Enterprise
Custom

For large teams and enterprises that need custom scale, VPC deployment, and dedicated grading.

  • Secure VPC or On-prem deployment
  • Custom LLM evaluation rubrics
  • Private custom model integrations
  • Team dashboard & workspace sharing
  • Dedicated capacity & custom SLAs
Talk to sales
why World of AI Bench

Numbers you can defend in a design review.

Built by a team that got tired of inconsistent evals, vendor-curated benchmarks, and "trust us, it's better." Three principles, no exceptions.
01 · CAN I RUN

Instant model compatibility checks.

Paste any model ID and we'll tell you which benchmarks it can run, what it'll cost per prompt, and where it might fail. No guesswork - just data.

$ bench can-i-run claude-opus-4
provider: anthropic
categories: 10/10 supported
est. cost: $0.84 / prompt
✓ ready to benchmark
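To put the per-prompt estimate above in budget terms, here is a back-of-the-envelope sketch that multiplies it by a prompt count. It assumes a flat cost per prompt, which is a simplification: real cost likely varies with prompt and response length. `estimate_cost_cents` and `format_usd` are hypothetical helpers, not `bench` commands.

```python
def estimate_cost_cents(prompts: int, cents_per_prompt: int) -> int:
    """Flat-rate estimate in whole cents (integers avoid float rounding noise)."""
    if prompts < 0 or cents_per_prompt < 0:
        raise ValueError("prompts and cents_per_prompt must be non-negative")
    return prompts * cents_per_prompt


def format_usd(cents: int) -> str:
    """Render whole cents as a dollar string with thousands separators."""
    return f"${cents // 100:,}.{cents % 100:02d}"


# $0.84 per prompt, taken from the sample output above.
CENTS_PER_PROMPT = 84

# One 520-prompt category versus all 3,897 judged prompts.
one_category = format_usd(estimate_cost_cents(520, CENTS_PER_PROMPT))
all_prompts = format_usd(estimate_cost_cents(3897, CENTS_PER_PROMPT))
```

Under that flat-rate assumption, one 520-prompt category comes to $436.80 and all 3,897 prompts to $3,273.48.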
02 · PROMPT LIBRARY

3,900+ vibe-coding prompts.

The largest curated prompt set for evaluating frontier models. 10 categories - frontend UI, games, SVG art, creative, agentic, and more. Battle-tested across 15+ models.

Categories · 3,900+ prompts
Frontend UI
Game Dev
SVG Art
Creative
Agentic
3D Graphics
Data Viz
Animation
Full-Stack
Code Golf
03 · AI JUDGE

Multi-agent scoring that matches human consensus.

Our AI judge panel uses weighted rubrics to grade every response on 5 dimensions, including functionality, design, code quality, and creativity. No cherry-picked evals. Every score is auditable.

judge-panel gemini-3.1-pro
rubric: v3.2 · 5 dimensions
✓ functionality 94/100
✓ design 91/100
✓ code quality 88/100
✓ creativity 92/100
composite: 91.3
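One way a weighted rubric can fold dimension scores into a composite is a weighted mean, sketched below. This is an assumption for illustration: the page does not publish rubric v3.2's weights or its fifth dimension, so the example uses only the four scores shown and hypothetical weights. `weighted_composite` is not a platform API.

```python
from typing import Dict


def weighted_composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of dimension scores; weights need not sum to 1."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    return sum(scores[d] * weights[d] for d in scores) / total_weight


# The four dimension scores shown in the sample judge output.
scores = {"functionality": 94, "design": 91, "code quality": 88, "creativity": 92}

# Assumption: equal weights, since the real rubric weights are not published.
equal = {d: 1.0 for d in scores}
equal_composite = weighted_composite(scores, equal)

# Hypothetical alternative: functionality counts double.
functionality_heavy = dict(equal, functionality=2.0)
heavy_composite = weighted_composite(scores, functionality_heavy)
```

With equal weights the four visible scores average to 91.25, close to the 91.3 composite shown; doubling the weight on functionality moves the result to 91.8, which shows how much the published weights matter when comparing composites.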
THE PLATFORM

Every model. Every category. One dashboard.

Run any frontier model against our 3,900+ prompt library. Get AI-judged scores in 10 vibe-coding categories, compare results head-to-head, and share your benchmarks in the community showcase.

  • Run benchmarks across 15+ frontier models - OpenAI, Anthropic, Google, xAI, DeepSeek
  • AI-powered judge panel scores every response on 5 weighted dimensions
  • Head-to-head arena mode for direct model comparisons
  • Community showcase - share & browse the best AI-generated creations
bench.config.yaml
# Production model gate - run on every PR
suite: code-suite-v2
# Thresholds are quoted so that a leading ">" is not read as a YAML folded-scalar indicator
fail_if:
  composite: "< 82.0"
  swe-bench: "< 68.0"
  latency_p50: "> 2.0s"

models:
  - id: claude-sonnet-4-5
  - id: gpt-5-mini
  - id: deepseek-v3-2

benchmarks:
  - humaneval+
  - swe-bench
  - livecodebench
  - custom: ./internal/refactor-suite

grader: llm-judge
budget: $25
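The `fail_if` thresholds in the config above could be evaluated along these lines. This is a sketch under stated assumptions, not the tool's actual gate logic: the rule grammar (a comparison operator, a number, an optional unit suffix) is inferred from the three rules shown, the measured values are hypothetical, and `parse_rule` and `failed_checks` are made-up names.

```python
import operator
import re
from typing import Callable, Dict, List, Tuple

OPS: Dict[str, Callable[[float, float], bool]] = {
    "<": operator.lt,
    ">": operator.gt,
    "<=": operator.le,
    ">=": operator.ge,
}

# Operator, number, optional unit suffix such as "s" (assumed grammar).
RULE_RE = re.compile(r"^\s*(<=|>=|<|>)\s*([0-9]+(?:\.[0-9]+)?)\s*([a-z%]*)\s*$")


def parse_rule(rule: str) -> Tuple[Callable[[float, float], bool], float, str]:
    """Split a rule like '> 2.0s' into (comparison, threshold, unit)."""
    m = RULE_RE.match(rule)
    if m is None:
        raise ValueError(f"unrecognized rule: {rule!r}")
    op, number, unit = m.groups()
    return OPS[op], float(number), unit


def failed_checks(fail_if: Dict[str, str], measured: Dict[str, float]) -> List[str]:
    """Return the metrics whose fail condition is true for the measured values."""
    failures = []
    for metric, rule in fail_if.items():
        compare, threshold, _unit = parse_rule(rule)
        if compare(measured[metric], threshold):
            failures.append(metric)
    return failures


# Thresholds copied from the config above.
fail_if = {"composite": "< 82.0", "swe-bench": "< 68.0", "latency_p50": "> 2.0s"}

# Hypothetical results for one model on one PR (latency in seconds).
measured = {"composite": 83.1, "swe-bench": 66.4, "latency_p50": 1.7}

failures = failed_checks(fail_if, measured)
gate_passed = not failures
```

In this made-up run the composite and latency clear their thresholds but `swe-bench` falls below 68.0, so the gate fails on that single metric. A CI job would typically exit non-zero whenever `failures` is non-empty.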
THE CATEGORIES

10 categories, one vibe score.

Industry-standard suites for coding, reasoning, math, agents, and long-context. All graded with identical samplers and contamination checks - so a 71 on SWE-bench means the same thing across vendors.
Frontend UI · INTERACTIVE

Full web interfaces - dashboards, forms, landing pages, and component libraries.

520 prompts · top: 93.2
Creative Writing · GENERATIVE

Story generators, poetry engines, interactive fiction, and narrative tools.

380 prompts · top: 90.8
Game Dev · INTERACTIVE

Browser games - platformers, puzzles, card games, and physics simulations.

450 prompts · top: 88.1
Agentic Tasks · AUTONOMOUS

Multi-step workflows - data pipelines, API orchestration, and planning agents.

340 prompts · top: 92.5
SVG Art · VISUAL

Vector illustrations, icons, animated graphics, and generative art pieces.

410 prompts · top: 91.9
3D Graphics · RENDERING

Three.js scenes, WebGL shaders, 3D visualizations, and procedural geometry.

290 prompts · top: 84.2
Data Viz · ANALYTICAL

Charts, graphs, interactive dashboards, and real-time data visualizations.

380 prompts · top: 89.5
Animation · MOTION

CSS animations, canvas effects, particle systems, and micro-interactions.

320 prompts · top: 87.4
Full-Stack · END-TO-END

Complete applications - auth flows, CRUD operations, and API integrations.

260 prompts · top: 82.7
Code Golf · OPTIMIZATION

Minimal-code solutions, algorithmic challenges, and elegant one-liners.

450 prompts · top: 86.8

Run your first eval in under 60 seconds.

No credit card. No "request access." Just a binary and a leaderboard waiting for you.