AI Agent Evaluations

Performance results of AI coding agents on Nuxt code generation tasks, measuring success rate and execution time.
Last run date: July 3, 2026

Agent Performance Results

Model                   Agent        Avg Duration  Success Rate  First-Try Rate
----------------------  -----------  ------------  ------------  --------------
Claude Sonnet 5         Claude Code  292.09s       100%          100%
Cursor Composer 2.0     Cursor       273.91s       100%          97%
Claude Fable 5          Claude Code  298.81s       100%          97%
GPT 5.3 Codex (xhigh)   Codex        276.52s       100%          90%
Gemini 3 Pro Preview    OpenCode     291.65s       100%          90%
Claude Opus 4.8         Claude Code  261.37s       100%          90%
Kimi K2.7 Code          OpenCode     327.00s       100%          83%
GPT 5.5 Pro             Codex        666.02s       97%           93%
Cursor Composer 2.5     Cursor       263.31s       97%           90%
Claude Opus 4.7         Claude Code  215.73s       97%           90%
MiniMax M3              OpenCode     224.89s       97%           86%
Gemini 3.1 Pro Preview  OpenCode     289.46s       97%           83%
Claude Opus 4.6         Claude Code  244.85s       97%           83%
Claude Sonnet 4.6       Claude Code  254.11s       93%           83%
Kimi K2.6               OpenCode     285.47s       93%           79%
GPT 5.4 (xhigh)         Codex        302.57s       90%           76%
Claude Sonnet 4.5       Claude Code  240.76s       59%           48%
MiniMax M2.7            OpenCode     204.88s       48%           31%

Each evaluation is attempted up to 4 times. Success Rate is the percentage of evals that passed on at least one attempt. First-Try Rate is the percentage that passed on the first attempt; it is used to break ties between models with the same Success Rate. Avg Duration is the mean time an agent took per eval. In the per-eval results for each row, a 1/3 badge means the eval failed twice before passing on the third attempt.
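
To make the metric definitions above concrete, here is a minimal Python sketch of how they could be computed from per-eval attempt logs. It is not the benchmark's actual implementation: the `EvalResult` record, its field names, and the helper functions are hypothetical, and it assumes each eval stores one duration value and an ordered list of pass/fail outcomes. How the real harness measures duration across retries is not stated in the source.

```python
from dataclasses import dataclass
from statistics import mean

MAX_ATTEMPTS = 4  # from the text: each eval is attempted up to 4 times


@dataclass
class EvalResult:
    # Hypothetical record shape, for illustration only.
    name: str
    attempts: list[bool]  # outcome of each attempt, in order
    duration_s: float     # assumed: one duration value per eval


def first_pass_attempt(result):
    """Return the 1-based index of the first passing attempt, or None."""
    for i, ok in enumerate(result.attempts[:MAX_ATTEMPTS], start=1):
        if ok:
            return i
    return None


def summarize(results):
    """Compute Success Rate, First-Try Rate, and Avg Duration for one agent."""
    passes = [first_pass_attempt(r) for r in results]
    n = len(results)
    return {
        "success_rate": 100 * sum(p is not None for p in passes) / n,
        "first_try_rate": 100 * sum(p == 1 for p in passes) / n,
        "avg_duration_s": mean(r.duration_s for r in results),
    }


def badge(result):
    """Badge for a passing eval, e.g. '1/3' = passed on the third attempt."""
    p = first_pass_attempt(result)
    return f"1/{p}" if p is not None else None


def rank(rows):
    """Sort by Success Rate, breaking ties with First-Try Rate (both descending)."""
    return sorted(rows, key=lambda r: (-r["success_rate"], -r["first_try_rate"]))
```

With this shape, an eval whose attempts are `[False, False, True]` counts toward Success Rate but not First-Try Rate, and its badge is `1/3`.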