AI Agent Evaluations
Performance results of AI coding agents on Nuxt code generation tasks, measuring success rate and execution time.
Agent Performance Results
| Model | Agent | Avg Duration | Success Rate | First-Try Rate |
|---|---|---|---|---|
| Claude Sonnet 5 | Claude Code | 292.09s | 100% | 100% |
| Cursor Composer 2.0 | Cursor | 273.91s | 100% | 97% |
| Claude Fable 5 | Claude Code | 298.81s | 100% | 97% |
| GPT 5.3 Codex (xhigh) | Codex | 276.52s | 100% | 90% |
| Gemini 3 Pro Preview | OpenCode | 291.65s | 100% | 90% |
| Claude Opus 4.8 | Claude Code | 261.37s | 100% | 90% |
| Kimi K2.7 Code | OpenCode | 327.00s | 100% | 83% |
| GPT 5.5 Pro | Codex | 666.02s | 97% | 93% |
| Cursor Composer 2.5 | Cursor | 263.31s | 97% | 90% |
| Claude Opus 4.7 | Claude Code | 215.73s | 97% | 90% |
| MiniMax M3 | OpenCode | 224.89s | 97% | 86% |
| Gemini 3.1 Pro Preview | OpenCode | 289.46s | 97% | 83% |
| Claude Opus 4.6 | Claude Code | 244.85s | 97% | 83% |
| Claude Sonnet 4.6 | Claude Code | 254.11s | 93% | 83% |
| Kimi K2.6 | OpenCode | 285.47s | 93% | 79% |
| GPT 5.4 (xhigh) | Codex | 302.57s | 90% | 76% |
| Claude Sonnet 4.5 | Claude Code | 240.76s | 59% | 48% |
| MiniMax M2.7 | OpenCode | 204.88s | 48% | 31% |
Each evaluation is attempted up to 4 times. Success Rate is the percentage of evals that passed on at least one attempt. First-Try Rate is the percentage that passed on the first attempt; it breaks ties between models with the same Success Rate. Avg Duration is the mean time an agent took per eval. In the per-eval results, a 1/3 badge means the eval failed twice before passing on its third attempt.
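As a rough illustration of how these metrics relate, here is a minimal sketch that computes them from per-eval attempt logs. The `EvalResult` record, its field names, and the helper functions are hypothetical and not taken from the actual evaluation harness. The sketch also assumes that attempts stop at the first pass, that the badge reads as passes over attempts made, and that a single duration is recorded per eval.

```python
from dataclasses import dataclass
from statistics import mean

MAX_ATTEMPTS = 4  # from the text: each eval is attempted up to 4 times


@dataclass
class EvalResult:
    """Hypothetical per-eval record; the real harness may store this differently."""
    name: str
    attempts: list[bool]  # outcome of each attempt in order (assumed to stop at first pass)
    duration_s: float     # assumed: one recorded duration per eval, in seconds


def summarize(results: list[EvalResult]) -> dict[str, float]:
    """Compute the three table metrics for one model/agent pair."""
    total = len(results)
    passed = sum(any(r.attempts) for r in results)        # passed on at least one attempt
    first_try = sum(r.attempts[0] for r in results if r.attempts)  # passed on attempt 1
    return {
        "success_rate": 100 * passed / total,
        "first_try_rate": 100 * first_try / total,
        "avg_duration_s": mean(r.duration_s for r in results),
    }


def badge(result: EvalResult) -> str:
    """Format a 'passes/attempts' badge (assumed reading of the 1/3 badge)."""
    return f"{sum(result.attempts)}/{len(result.attempts)}"


def rank(summaries: dict[str, dict[str, float]]) -> list[str]:
    """Order models by Success Rate, breaking ties with First-Try Rate."""
    return sorted(
        summaries,
        key=lambda m: (-summaries[m]["success_rate"], -summaries[m]["first_try_rate"]),
    )
```

Under these assumptions, an eval with attempts `[False, False, True]` counts toward Success Rate but not First-Try Rate, and its badge is `1/3`.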