Methodology

How Arena compares visual coding agents

Arena is optimized for watchable, reproducible visual tasks. It is intentionally narrower than a universal benchmark.

Run protocol

Same task, same prompt, same starter

Every agent-model pair receives the same task package, prompt, time limit, and repair budget. Final results are judged on visible browser output, automated checks, and a human-readable rubric.

Prompt

One initial prompt, followed by a fixed number of repair turns.

Time limit

V1 uses a 20-minute target per run.

Final checks

Desktop and mobile rendering, console errors, and key interactions.

Evidence

Screenshots, clips, logs, and result notes are attached to each run.
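
As an illustration of the protocol above, the shared settings could be held in one immutable object that every run references. This is a minimal sketch, not Arena's actual implementation: the names `RunConfig`, `Run`, and `plan_runs` are hypothetical, and the number of repair turns is left as a parameter because the text only says it is fixed. The 20-minute default comes from the V1 time limit.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RunConfig:
    """Settings shared by every run of one task (hypothetical structure)."""
    task_id: str
    prompt: str
    repair_turns: int             # fixed per task; the value is not stated in the text
    time_limit_minutes: int = 20  # V1 target per run


@dataclass(frozen=True)
class Run:
    """One agent-model pair bound to the shared configuration."""
    agent: str
    model: str
    config: RunConfig


def plan_runs(config: RunConfig, pairs: List[Tuple[str, str]]) -> List[Run]:
    """Attach the same config object to every agent-model pair."""
    return [Run(agent=agent, model=model, config=config) for agent, model in pairs]
```

Because `RunConfig` is frozen, a run cannot quietly change its own prompt or time limit, which is the property the "same task, same prompt, same starter" rule depends on.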

Scoring

Readable first, precise enough to compare

20

Runs without fatal errors

Starts, renders, and remains interactable.

30

Core functionality

Completes the actual requested behavior.

20

Visual completion

Looks coherent and communicates state.

20

Interaction quality

Feels usable across expected inputs.

10

Code sanity

Avoids obvious brittle shortcuts.
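
The five numbers above add up to 100, so one plausible reading is that they are point weights. The sketch below shows how a total could be computed under that reading. It assumes each criterion is rated as a fraction between 0 and 1, which the text does not specify, and the names `RUBRIC_WEIGHTS` and `total_score` are hypothetical.

```python
from typing import Mapping

# Weights taken from the rubric above; they sum to 100.
RUBRIC_WEIGHTS = {
    "runs_without_fatal_errors": 20,
    "core_functionality": 30,
    "visual_completion": 20,
    "interaction_quality": 20,
    "code_sanity": 10,
}


def total_score(ratings: Mapping[str, float]) -> float:
    """Combine per-criterion ratings (assumed to lie in [0, 1]) into a 0-100 total."""
    if set(ratings) != set(RUBRIC_WEIGHTS):
        raise ValueError("ratings must cover exactly the rubric criteria")
    for name, rating in ratings.items():
        if not 0.0 <= rating <= 1.0:
            raise ValueError(f"rating for {name} must be between 0 and 1")
    return sum(RUBRIC_WEIGHTS[name] * ratings[name] for name in RUBRIC_WEIGHTS)
```

For example, a run rated 1.0 on every criterion except `core_functionality` at 0.5 would total 85 under these assumptions, since half of the 30 core points are lost.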

Limitations

Not a universal coding benchmark

Arena results are strongest for visual coding tasks such as games, Canvas tools, and Three.js scenes. They should be read as public evidence of how an agent behaved on specific cases, not as a global model ranking.