new · Live HTML Preview Rendering Evals active

The World's #1
Vibe Coding Benchmark

We built the premier evaluation platform for AI models. 55+ frontier models, head-to-head scoring, and an AI judge that grades it all. No vendor hype. Just raw, real results.

Get Started Free

4,400+ prompts judged · 55+ frontier models benchmarked · 10 vibe coding categories

~/projects/acme · World of AI Bench

v0.9.3

$ bench run --model claude-opus-4-5 --suite full

→ resolved 14 benchmarks · 18,540 graded samples

→ concurrency=64 · est. cost $42.18 · est. time 14m

✓ authenticated · org acme · cluster us-east-2

spawning 64 workers...

✓ mmlu-pro · 12,032 / 12,032 · 88.4

✓ humaneval+ · 164 / 164 · 92.7

✓ swe-bench · 500 / 500 · 71.2

! aime-2025 · 30 / 30 · retried 2 timeouts

✓ aime-2025 · 30 / 30 · 76.1

✓ gpqa · 198 / 198 · 64.8

✓ tau-bench · 500 / 500 · 60.5

→ 3 disagreements flagged for human review

✓ run complete · 14/14 benchmarks · composite 79.4

→ report saved to runs/run_a7f2c.json

55+

FRONTIER MODELS BENCHMARKED

85.2

TOP COMPOSITE SCORE

4.4K

PROMPTS AI-JUDGED

10

VIBE CODING CATEGORIES

Claude Fable 5·85.2

GPT-5.6 Sol·82.4

Kimi K3·80.9

GPT-5.6 Terra·78.4

GPT-5.5·77.7

GPT-5.6 Luna·77.7

Claude Opus 4.8·77.6

Grok 4.5·77.4

Claude Opus 4.7·77.2

GPT-5.4·72.0

Gemini 3.5 Flash·71.5

Kimi K2.7 Code·72.7

MiniMax M3·74.8

Qwen 3.7 Max·74.4

Gemini 3.1 Pro·70.7

GPT-5·70.0

Tencent Hunyuan Hy3·69.9

Ornith 1.0 397B·69.9

Claude Sonnet 4.6·69.7

Nex N2 Pro·69.7

DeepSeek V4 Pro·69.5

01 · the leaderboard

The whole frontier, ranked by what you actually ship.

Sort by composite score, narrow to coding or reasoning, filter by price or open-weight. Every cell links to the exact test cases - no black boxes, no marketing math.

world-of-ai · leaderboard

live · 4,400+ prompts scored

#	Model	Vendor	Composite	Frontend	Creative	Game Dev	Agentic	SVG Art
01	Claude Fable 5	Anthropic	85.2	85.1	83.2	84.2	87.2	83.7
02	GPT-5.6 Sol	OpenAI	82.4	82.8	82.4	82.0	83.0	82.3
03	Kimi K3	Moonshot	80.9	82.1	81.0	80.4	81.2	80.8
04	GPT-5.6 Terra	OpenAI	78.4	78.3	78.0	78.1	78.8	78.0
05	GPT-5.5	OpenAI	77.7	77.9	76.9	78.2	77.9	76.8
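
For readers who want to reproduce the sort-and-filter behaviour offline, here is a minimal sketch in Python. The `Row` type and the `rank` helper are illustrative names, not part of any World of AI Bench API; the scores are copied from the five rows shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """One leaderboard row; field names are hypothetical, values come from the table."""
    model: str
    vendor: str
    composite: float
    frontend: float
    creative: float
    game_dev: float
    agentic: float
    svg_art: float


ROWS = [
    Row("Claude Fable 5", "Anthropic", 85.2, 85.1, 83.2, 84.2, 87.2, 83.7),
    Row("GPT-5.6 Sol", "OpenAI", 82.4, 82.8, 82.4, 82.0, 83.0, 82.3),
    Row("Kimi K3", "Moonshot", 80.9, 82.1, 81.0, 80.4, 81.2, 80.8),
    Row("GPT-5.6 Terra", "OpenAI", 78.4, 78.3, 78.0, 78.1, 78.8, 78.0),
    Row("GPT-5.5", "OpenAI", 77.7, 77.9, 76.9, 78.2, 77.9, 76.8),
]


def rank(rows, by="composite", vendor=None):
    """Sort rows from highest to lowest on column `by`, optionally keeping one vendor."""
    kept = [r for r in rows if vendor is None or r.vendor == vendor]
    return sorted(kept, key=lambda r: getattr(r, by), reverse=True)


if __name__ == "__main__":
    # Re-rank the same five models on the Game Dev column only.
    for place, row in enumerate(rank(ROWS, by="game_dev"), start=1):
        print(f"{place:02d}  {row.model:<16} {row.game_dev}")
```

Ranking on a single category can reorder the table: on Game Dev, GPT-5.5 (78.2) edges ahead of GPT-5.6 Terra (78.1) even though Terra has the higher composite.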

Pricing

Professional benchmarking. Simple pricing.

Get access to hosted runs, private workspaces, and head-to-head arena modes. Choose the plan that best fits your scale.

Monthly · Annually (save 17%)

Hobby / OSS

$0/forever

For individual developers, hobbyists, and open source contributors.

Public leaderboard access
"Can I Run It?" hardware calculator
Community Discord support

Get started

Popular

PRO

$12/mo

For professional developers, prompt engineers, and growing AI startups.

All Hobby features included
Hosted runs & private workspaces
AI-as-a-Judge auto-scoring engine
Live head-to-head model arena
Smart model use-case matches
Priority grading capacity support

Upgrade to Pro

Enterprise

Custom

For large teams and enterprises requiring custom scales, VPC, and dedicated grading.

Secure VPC or On-prem deployment
Custom LLM evaluation rubrics
Private custom model integrations
Team dashboard & workspace sharing
Dedicated capacity & custom SLAs

Talk to sales

why World of AI Bench

Numbers you can defend in a design review.

Built by a team that got tired of inconsistent evals, vendor-curated benchmarks, and "trust us, it's better." Three principles, no exceptions.

01 · CAN I RUN

Instant model compatibility checks.

Paste any model ID and we'll tell you which benchmarks it can run, what it'll cost per prompt, and where it might fail. No guesswork - just data.

$ bench can-i-run claude-opus-4

provider: anthropic

categories: 10/10 supported

est. cost: $0.84 / prompt

✓ ready to benchmark

02 · PROMPT LIBRARY

3,900+ vibe-coding prompts.

The largest curated prompt set for evaluating frontier models. 10 categories - frontend UI, games, SVG art, creative, agentic, and more. Battle-tested across 15+ models.

Categories · 3,900+ prompts

Frontend UI

Game Dev

SVG Art

Creative

Agentic

3D Graphics

Data Viz

Animation

Full-Stack

Code Golf

03 · AI JUDGE

Multi-agent scoring that matches human consensus.

Our AI judge panel uses weighted rubrics to grade functionality, design, code quality, and creativity. 5 dimensions. No cherry-picked evals. Every score is auditable.

judge-panel gemini-3.1-pro

rubric: v3.2 · 5 dimensions

✓ functionality 94/100

✓ design 91/100

✓ code quality 88/100

✓ creativity 92/100

composite: 91.3
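
The page does not publish the rubric weights, and the sample panel shows only four of the five dimensions. As a rough sketch, assuming equal weights and half-up rounding to one decimal (both assumptions, not the documented rubric), the four scores above do average to the displayed composite. The `composite` function and `SCORES` name are illustrative only.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the sample judge panel above.
SCORES = {"functionality": 94, "design": 91, "code quality": 88, "creativity": 92}


def composite(scores, weights=None):
    """Weighted mean of rubric scores, rounded half-up to one decimal place.

    `weights` maps dimension name to a relative weight. Equal weights are an
    assumption made here because the real rubric weights are not published.
    """
    if weights is None:
        weights = {name: 1 for name in scores}
    total_weight = sum(Decimal(str(weights[name])) for name in scores)
    weighted_sum = sum(
        Decimal(str(scores[name])) * Decimal(str(weights[name])) for name in scores
    )
    mean = weighted_sum / total_weight
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    print(composite(SCORES))  # 365 / 4 = 91.25, which rounds half-up to 91.3
```

`Decimal` is used instead of `round()` because Python's built-in rounding sends 91.25 to 91.2 (round-half-to-even), which would not match the figure shown.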

THE PLATFORM

Every model. Every category. One dashboard.

Run any frontier model against our 3,900+ prompt library. Get AI-judged scores in 10 vibe-coding categories, compare results head-to-head, and share your benchmarks in the community showcase.

Run benchmarks across 16+ frontier models - OpenAI, Anthropic, Google, xAI, DeepSeek
AI-powered judge panel scores every response on 5 weighted dimensions
Head-to-head arena mode for direct model comparisons
Community showcase - share & browse the best AI-generated creations

bench.config.yaml · 14 lines

# Production model gate - run on every PR
suite: code-suite-v2
fail_if:
  # Thresholds are quoted: a bare leading ">" starts a folded block scalar in YAML.
  composite: "< 82.0"
  swe-bench: "< 68.0"
  latency_p50: "> 2.0s"
models:
  - id: claude-sonnet-4-5
  - id: gpt-5-mini
  - id: deepseek-v3-2
benchmarks:
  - humaneval+
  - swe-bench
  - livecodebench
  - custom: ./internal/refactor-suite
grader: llm-judge
budget: $25
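
How the `fail_if` rules above might be evaluated can be sketched in a few lines of Python. This is an assumption about the semantics (a gate fails when the measured value satisfies the comparison), not the tool's actual implementation; `parse_rule`, `failed_gates`, and the `latency_p50` result value are hypothetical. The composite and swe-bench results reuse the figures from the sample run earlier on the page.

```python
import operator
import re

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}
# Comparison operator, a number, then an optional unit suffix such as "s".
_RULE = re.compile(r"^\s*(<=|>=|<|>)\s*([0-9]+(?:\.[0-9]+)?)\s*([a-z%]*)\s*$")


def parse_rule(text):
    """Turn a rule such as '< 82.0' or '> 2.0s' into (comparison, threshold)."""
    match = _RULE.match(text)
    if match is None:
        raise ValueError(f"unrecognised rule: {text!r}")
    op, number, _unit = match.groups()  # the unit is accepted but ignored here
    return _OPS[op], float(number)


def failed_gates(fail_if, results):
    """Return the metrics whose measured value trips its fail_if rule."""
    failures = []
    for metric, rule in fail_if.items():
        compare, threshold = parse_rule(rule)
        if compare(results[metric], threshold):
            failures.append(metric)
    return failures


FAIL_IF = {"composite": "< 82.0", "swe-bench": "< 68.0", "latency_p50": "> 2.0s"}
# composite and swe-bench come from the sample run; latency_p50 is made up.
RESULTS = {"composite": 79.4, "swe-bench": 71.2, "latency_p50": 1.4}

if __name__ == "__main__":
    tripped = failed_gates(FAIL_IF, RESULTS)
    print("FAIL" if tripped else "PASS", tripped)
```

Under these assumptions the sample run would block the PR on the composite gate alone, since 79.4 is below 82.0 while the swe-bench and latency checks pass.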

THE CATEGORIES

10 categories, one vibe score.

Industry-standard suites for coding, reasoning, math, agents, and long-context. All graded with identical samplers and contamination checks - so a 71 on SWE-bench means the same thing across vendors.

Frontend UI · INTERACTIVE

Full web interfaces - dashboards, forms, landing pages, and component libraries.

520 prompts · top: 93.2

Creative Writing · GENERATIVE

Story generators, poetry engines, interactive fiction, and narrative tools.

380 prompts · top: 90.8

Game Dev · INTERACTIVE

Browser games - platformers, puzzles, card games, and physics simulations.

450 prompts · top: 88.1

Agentic Tasks · AUTONOMOUS

Multi-step workflows - data pipelines, API orchestration, and planning agents.

340 prompts · top: 92.5

SVG Art · VISUAL

Vector illustrations, icons, animated graphics, and generative art pieces.

410 prompts · top: 91.9

3D Graphics · RENDERING

Three.js scenes, WebGL shaders, 3D visualizations, and procedural geometry.

290 prompts · top: 84.2

Data Viz · ANALYTICAL

Charts, graphs, interactive dashboards, and real-time data visualizations.

380 prompts · top: 89.5

Animation · MOTION

CSS animations, canvas effects, particle systems, and micro-interactions.

320 prompts · top: 87.4

Full-Stack · END-TO-END

Complete applications - auth flows, CRUD operations, and API integrations.

260 prompts · top: 82.7

Code Golf · OPTIMIZATION

Minimal-code solutions, algorithmic challenges, and elegant one-liners.

450 prompts · top: 86.8

SHOWCASE

Top generations, ranked by AI judges.

GPT-5.5 · 100/100 · Bold Color-Blocked Landing
GPT-5.5 · 100/100 · Interactive Data Dashboard
GPT-5.5 · 100/100 · 360° Product Viewer
GPT-5.5 · 100/100 · NVIDIA GPU Product Page
GPT-5.5 · 100/100 · SaaS Dashboard
Claude Opus 4.7 · 99/100 · Bold Color-Blocked Landing
Claude Opus 4.7 · 99/100 · NVIDIA GPU Product Page
GPT-5.5 · 99/100 · Minecraft Clone
Claude Opus 4.7 · 100/100 · Pseudo-3D Racing Game
Claude Opus 4.7 · 100/100 · Minecraft Clone
Claude Opus 4.7 · 99/100 · Responsive E-Commerce
Claude Opus 4.7 · 99/100 · SaaS Landing Page (Editorial)

Real model outputs - scored by our AI judge panel across 5 dimensions

We spent two months building our own eval harness. Switched to World of AI Bench on a Friday afternoon - the numbers came out within 0.3 points, and we shipped to prod the next Tuesday.

Jamie Müller · Staff ML Engineer, Stripe Risk Platform

Run your first eval in under 60 seconds.

No credit card. No "request access." Just a binary and a leaderboard waiting for you.

Get Started Free

View Leaderboard

CONTACT

Get in Touch

Have questions about enterprise plans, custom integrations, or partnerships? We'd love to hear from you.
