What Happens When a Team Goes All-In on AI for Two Sprints
TL;DR: A few teams are about to run a two-sprint experiment using Claude Code for as much implementation as possible. I've done the solo version of this for six months, so here's what I think changes about sprint rituals, what I'd watch for, and how I'd define success: maintain constant flow and the same level of code maintainability. Speed gains, if any, are what you measure after.
I’ve spent six months building projects with AI without writing code myself. It worked as a solo experiment, but “one senior engineer with deep context moves fast with AI” doesn’t answer the harder question: can an entire team work this way?
A few teams at Zendesk are about to find out. They’ll run two sprints where every engineer picks something already defined on their roadmap and uses Claude Code for as much of the planning and implementation as possible. Engineers steer, review, and make decisions, but the code comes from AI as much as they can manage.
I’m not running the experiment, but I’ve been through the solo version of it twice. This article is what I’d tell those teams before they start: what I think changes, what I’d watch for, and how I’d define success.
The two things worth measuring
The temptation is to frame this as a speed test. Can a team ship more story points with AI? That’s the wrong question, or at least a premature one.
The right question is whether the team can maintain two things: a constant flow of work without excessive stop-and-start cycles, and the same level of code maintainability at the end. If those hold, you can look at the data and ask whether there was a time gain, a capacity multiplication, or something else entirely. If flow breaks down or the codebase turns into something nobody wants to touch, it doesn’t matter how fast you went.
Flow is the process measure, maintainability is the quality measure, and everything else is a bonus.
Planning barely changes, but socialization becomes mandatory
Sprint planning itself doesn’t look that different. You’re still asking “what are we building and why” and breaking work into milestones, each one scoped to produce a reviewable PR. (I use the get-shit-done framework where these are called “phases,” but the principle is the same regardless of your tooling: break the work into meaningful checkpoints so you have natural review opportunities built into the workflow.)
What changes is that the plan becomes the most consequential artifact the team produces. When AI executes your plan faithfully (including faithfully executing your bad decisions), the quality of the plan directly determines the quality of the output. In my solo projects, I learned this the hard way: the implementation plan I spent the most time on produced the best results by far (I wrote about this in the first article in this series).
For a team, this means plans can’t live in one person’s head. Pair planning, where two engineers walk through a spec together before AI touches it, becomes a quality control measure. Catching a wrong assumption in a spec costs minutes; catching it in generated code costs hours.
Standups track milestones and deliverables, not effort
The daily check-in shifts in a subtle but meaningful way. You still need to explain what you’re working on (not everyone knows what your “milestone 3” means), but the unit of progress becomes a completed, reviewable deliverable rather than a status update about partially written code. “The PR for the webhook integration is up, starting on the retry logic today” instead of “I’m about 70% through the service layer.” Each deliverable will likely be larger than a typical hand-coded increment, but it’s concrete and reviewable, which makes it a cleaner measure of progress than a gut feeling about percentage complete.
Review becomes the team’s primary shared activity
This is where the biggest behavioral shift happens. In a normal sprint, code review is a gate: someone finishes work, puts up a PR, and waits for a reviewer to context-switch away from their own code. The context switch is expensive, reviews get delayed, PRs pile up.
When everyone is running Claude Code sessions, that rhythm inverts. While your AI is working on your current milestone, you’re reviewing someone else’s completed deliverable. There’s no flow state to interrupt because you’re not the one writing code, so the dead time between “I told Claude what to build” and “Claude finished building it” naturally becomes review time.
Review stops being a bottleneck and becomes the primary mechanism for knowledge sharing across the team. In a traditional sprint, you learn about what others are building through osmosis. In this model, reviewing each other’s AI-generated code is how the team stays connected to what’s being built. It’s a core activity, not an afterthought.
The discipline from my solo projects applies here too: PRs need to stay scoped to one milestone. AI makes scope creep dangerously easy, and the moment a PR balloons, review collapses and you’re back to the bottleneck.
Retros measure input quality and tool fluency, not output volume
Output and velocity continue to matter, but when talking about what went well and what didn’t, input quality and tool fluency should be top of the list. After two sprints, the most important retro question isn’t “how much did we ship” but “how good were our inputs, and how well did we learn to steer the output.” The team should be looking at the ratio of planning time to steering time, because that tells you where value is actually created. It matters whether rework happened because the spec was undercooked or because the AI made a bad call that needed correcting, since those are two very different problems with different fixes. And anything that differed from the existing workflow, felt harder, or can’t scale in its current form is worth naming explicitly.
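The planning-to-steering ratio mentioned above is easy to track with nothing more than a per-session activity log. This is a minimal sketch of what that could look like; the `session_log` data, the activity labels, and the `planning_to_steering_ratio` helper are all hypothetical, not part of any real tooling from the experiment.

```python
# Hypothetical per-session activity log: (activity, minutes spent).
# The entries and numbers here are illustrative, not real data.
session_log = [
    ("planning", 90),   # writing and refining the spec before the AI runs
    ("steering", 25),   # correcting a wrong assumption mid-run
    ("review",   40),   # reviewing a teammate's completed deliverable
    ("planning", 30),
    ("steering", 55),
]

def planning_to_steering_ratio(log):
    """Total planning minutes divided by total steering minutes."""
    planning = sum(m for activity, m in log if activity == "planning")
    steering = sum(m for activity, m in log if activity == "steering")
    return planning / steering if steering else float("inf")

ratio = planning_to_steering_ratio(session_log)
print(f"planning:steering = {ratio:.2f}")  # 120 planning min / 80 steering min
```

A ratio drifting low over a sprint would suggest the team is paying in correction time for specs it didn't invest in up front; a very high ratio with little shipped might signal the over-preparation failure mode described later.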
The tool fluency piece matters more than it might sound. Every model has quirks: patterns it defaults to, mistakes it makes consistently, areas where it’s overconfident. Learning to work with and through those tendencies is a real skill, and two sprints should surface a lot of that knowledge. Which prompting approaches produced clean output? Where did the team have to fight the model? What conventions did they land on for steering it back on track? That knowledge compounds across every future sprint.
This maps directly to the writer-to-editor shift I’ve been writing about in this series. The quality of what you wrote (the spec) determines the starting point, but the quality of your editing (steering, correcting, knowing when to push back on the model) determines the final result.
What I’d watch for
I have two fears for teams trying this, and they pull in opposite directions.
The first is that teams spend so much time setting up, configuring their environment, writing the perfect spec, and debating approach that two sprints pass with nothing meaningful delivered.
The second is the opposite: teams jump in too fast, skip planning, and end up going in all directions with AI generating code against undercooked requirements.
The calibration between these two failure modes is something I learned through trial and error: jumping in too fast produced brutal PRs, while investing in planning paid for itself. My advice: pick something and move. If the first attempt is bad, throw it away and refine the plan. The cost of a failed AI implementation attempt is dramatically lower than a failed hand-coded attempt (as long as you take neither to production), so the penalty for starting before you’re perfectly ready is lower too.
What comes after
I’m writing this before the experiment runs, on purpose. I want a record of what I thought would matter before reality corrects me. When the team running the program publishes their results, the follow-up will be the honest accounting: what matched, what surprised everyone, and whether flow and maintainability held up or just sounded good on paper.
The deeper question underneath all of this is whether a team that stops writing code together can still function as a team. My hunch is yes, that the collaboration just moves upstream to specs and reviews. I also suspect two sprints won’t be enough to really know. It took me weeks of solo work to calibrate the right balance between planning and execution, and asking a team to find that collectively in a month is ambitious. But a short experiment that surfaces the right questions is more useful than a long one that never happens.