Alibaba's SWE-CI benchmark is the most comprehensive test yet of AI coding agents under production-like conditions. 100 real codebases. 233 days of continuous AI-generated commits. Multiple models tested on maintenance tasks - the kind of work that makes up the majority of real engineering: fixing bugs, updating dependencies, refactoring code.
The headline number: 75% of AI models broke previously working code during maintenance tasks.
Let that sit for a second. These aren't toy benchmarks. These are real codebases with real dependencies and real CI pipelines. And three out of four models introduced regressions during routine maintenance.
The instinct is to read this and think "well, the models will get better." They will. But that misses the point.
The failure mode isn't that agent code is bad. The code each agent writes is usually correct in isolation. The problem is that agent code is incompatible with other code that changed while the agent was working. The agent read the repo at time T, worked for a while, and pushed at time T+1. Between T and T+1, something else changed. The agent doesn't know.
Better models don't fix this. A smarter agent that writes perfect code still doesn't know what another agent changed 30 seconds ago on a different branch. This is a coordination problem, not an intelligence problem.
The failure mode isn't bad code. It's incompatible code. That's a fundamentally different problem, and it requires a fundamentally different solution.
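The time-skew failure above can be boiled down to a toy example. Everything here is invented for illustration (the field names, the "branches" as dicts of edits): each agent's change is consistent with the snapshot it read at time T, a textual merge succeeds because the edits touch different lines, and the merged result is still broken.

```python
# Toy sketch of the T -> T+1 incompatibility. All names are hypothetical.
# Each "branch" is a set of edits applied to the snapshot both agents read.

snapshot_t = {"orders.field": "price"}  # repo state at time T

# Agent A renames the order field; Agent B's new discount code reads the
# old name. Each branch is internally consistent against snapshot_t, and
# each branch's own tests would pass.
branch_a = dict(snapshot_t, **{"orders.field": "unit_price"})
branch_b = dict(snapshot_t, **{"discounts.reads": "price"})
assert branch_b["discounts.reads"] == snapshot_t["orders.field"]

# A textual merge succeeds - the two edits touch different keys:
merged = dict(snapshot_t)
merged.update({k: v for k, v in branch_a.items() if snapshot_t.get(k) != v})
merged.update({k: v for k, v in branch_b.items() if snapshot_t.get(k) != v})

# But the merged repo is semantically broken: the discount code reads a
# field that no longer exists anywhere in the order API.
assert merged["discounts.reads"] != merged["orders.field"]
```

Git sees no conflict here, and neither branch's CI can see it either - the incompatibility only exists in the combination.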
A month before the SWE-CI results dropped, Amazon's Kiro AI coding tool caused a 13-hour production outage. It was given permission to fix a customer-facing AWS system. It autonomously decided to delete and recreate the entire environment. A senior AWS engineer told the Financial Times the outages were "entirely foreseeable."
That's a single agent acting alone. Now imagine 10 agents, 50 agents, 200 agents all pushing code to the same codebase. The Kiro incident was one agent making one bad decision. The SWE-CI benchmark shows what happens when agents make individually correct decisions that are collectively wrong.
Every company I talk to is adding more agents, not fewer. Cursor, Claude Code, Codex, Copilot, Windsurf - engineering teams are deploying multiple agents simultaneously because each one has different strengths. The number of agents per codebase is going up, not down.
Aaron Levie says companies will have 100-1000x more agents than employees. Karpathy is running 100 experiments overnight on git branches. Jensen Huang just described AI as a 5-layer infrastructure stack. The entire industry is accelerating toward more agents writing more code faster.
And nobody is checking whether all that output is compatible before it merges.
Here's what I think happens over the next year:
First wave (now - 6 months): Early adopters start seeing broken builds they can't explain. CI is green on every branch but main is red after merge. Engineers spend hours debugging conflicts between agent-generated code. Teams start pulling back on agent adoption because the breakage isn't worth the speed.
Second wave (6 - 12 months): The companies that figured out output verification keep scaling agents. The ones that didn't either slow down or eat the cost of constant breakage. The gap between "agent-native" companies and everyone else widens. Output verification becomes a line item in the platform engineering budget, the same way CI/CD was 10 years ago.
The SWE-CI benchmark isn't just a research paper. It's a preview. When three out of four models break previously working code on maintenance tasks, every engineering team that's deploying agents is going to hit this wall. The only question is whether they hit it before or after production breaks.
Rosentic scans every active branch against every other active branch. When Agent A changes the order API and Agent B is building against the old one, Rosentic catches it before either PR merges. Cross-branch, cross-agent, cross-language. Deterministic. Sub-second on warm scans.
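To make the pairwise idea concrete, here is a toy sketch - emphatically not Rosentic's actual engine. The branch names and the provides/removes/references metadata are invented; the point is only the shape of the check: for every pair of open branches, flag any symbol one branch still references that another branch has removed or renamed.

```python
# Hypothetical sketch of a pairwise cross-branch check (illustrative
# only). Each branch summarizes what its diff provides, removes, and
# references; a pair conflicts when one side references what the other
# removed.
from itertools import combinations

branches = {
    "agent-a/order-api": {"removes": {"get_order"}, "references": set()},
    "agent-b/discounts": {"removes": set(), "references": {"get_order"}},
}

def cross_branch_conflicts(branches):
    conflicts = []
    for (n1, b1), (n2, b2) in combinations(branches.items(), 2):
        # Check both directions: A's references against B's removals,
        # and B's references against A's removals.
        for x, y in ((b1, b2), (b2, b1)):
            broken = x["references"] & y["removes"]
            if broken:
                conflicts.append((n1, n2, sorted(broken)))
    return conflicts

print(cross_branch_conflicts(branches))
# -> [('agent-a/order-api', 'agent-b/discounts', ['get_order'])]
```

Note that a naive version of this is quadratic in the number of open branches - part of why doing it deterministically and in sub-second time is the hard part, not the concept.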
The SWE-CI benchmark proves the problem exists at scale. We're building the fix.