Claude Max vs Codex Max: One Week, Eight Agents, Two Very Different Usage Curves

This week I did something slightly irresponsible.

I ran eight AI coding agents in parallel against a real codebase, 8-10 hours per day, every day. First with Claude Code CLI on the Max plan. Then with OpenAI’s Codex CLI on their $200 plan.

This wasn’t a benchmark. It was a real workload – building and iterating on four different projects simultaneously. And it produced a pretty clear data point about how these two tools hold up under sustained heavy use.

The Setup

| | Claude Max | Codex Max |
|---|---|---|
| Cost | $200/mo | $200/mo |
| Model | Opus 4.6 | GPT-5.3 (high context) |
| Interface | Claude Code CLI | Codex CLI |
| Agents | ~8 concurrent | ~8 concurrent |
| Hours/day | 8-10 | 8-10 |

Both tools running against the same set of projects. Same types of tasks: level design, UI work, bug fixes, refactors, debug tooling, greenfield builds.

Phase 1: Claude Max Burns Hot

I started Sunday.

By Tuesday, I hit the Claude Max weekly usage limit.

That’s roughly two and a half days of aggressive, nearly continuous multi-agent use. To be fair, I was pushing it hard – eight concurrent agent instances, heavy code generation, refactors, system-level changes, debug loops, build/test cycles.

The output quality was strong. Claude tends to “just do it” – you describe what you want, and it moves. Less clarification, more action. For rapid iteration, that directness is valuable.

But at this intensity, the $200/month allocation ran out fast.

Claude Code usage at 100% after 2 days

Phase 2: Switching to Codex

When Claude credits ran out Tuesday night, I moved to Codex.

Same workload. Same eight concurrent instances. Same types of changes.

By Friday – roughly the same amount of usage time Claude had before it hit the wall – Codex was at about 50% weekly usage.

OpenAI Codex usage at 50% midweek

I kept going. Claude’s weekly allocation didn’t reset until Saturday, so Codex carried the full workload through the rest of the week. By the time the weekly limit reset, Codex had burned through 75% of its allocation – covering roughly 5 full days of aggressive multi-agent use compared to Claude’s 2.5.

OpenAI Codex at 75% usage (25% remaining) after carrying the full week

That’s a meaningful difference at the same $200/month price point. Not scientific. Just direct experience from the same developer, same projects, same work habits.

Behavioral Differences

Beyond raw usage capacity, the tools feel different to work with.

Claude tends to act. You give it a task, it executes. Less back-and-forth, more output. This is great when you know exactly what you want and just need it done.

Codex tends to clarify. It asks more questions upfront. There’s slightly more scaffolding required before it gets moving.

For example, I wanted a persistent build-test loop agent that starts an HTML5 dev server on port 9000, rebuilds the project, verifies the load, and repeats. With Claude, this was pretty much describe-and-go. With Codex, it took a few iterations to get the workflow structured correctly.

But once it worked, I turned it into a reusable .md workflow file. Now I can just say “use the build-test loop agent in that file” and it picks right up. That initial scaffolding investment created durable infrastructure.
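The workflow file itself can be as simple as a plain-language checklist the agent follows. A hypothetical sketch of what such a file might contain (the name, port, and rules here are illustrative, not the actual file from this project):

```markdown
# Build-Test Loop Agent

Repeat until told to stop:

1. Start (or reuse) the HTML5 dev server on port 9000.
2. Rebuild the project from a clean state.
3. Load http://localhost:9000 and verify the page boots without console errors.
4. Report build and load status, then wait for the next change and repeat.

Rules:
- This agent only builds and verifies; it never edits source files.
- If the build fails twice in a row, stop and report the error.
```

Because the instructions live in the repo rather than in chat history, any agent session can pick the loop up cold.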

Output Quality

I don’t have a strict one-to-one diff comparison yet. But practically:

  • Everything I needed Claude to do, Codex also handled
  • Multi-agent orchestration worked well on both
  • Large-scale edits across files were stable on both
  • Context depth felt strong on both
  • Parallel sessions stayed stable on both

Both are very capable tools. At this point in the market, the quality gap between top-tier models is narrow. The differences show up more in usage economics and workflow ergonomics than in raw capability.

The Numbers

The interesting question isn’t “which model is better.” It’s: what does each tool actually produce per dollar?

I went back through the git history across all four projects and split every commit by date – Claude handled Feb 9-11, Codex carried Feb 12-17.

Raw Output

| | Claude (2.5 days) | Codex (5 days) | Combined |
|---|---|---|---|
| Commits | ~73 | ~76 | ~149 |
| Lines added | ~36,000 | ~97,000 | ~133,000 |
| Lines deleted | ~8,000 | ~35,000 | ~43,000 |
| Total lines changed | ~44,000 | ~132,000 | ~176,000 |
| Files touched | ~833 | ~1,167 | ~2,000 |
| Allocation used | 100% | 75% | – |
| Weekly cost (pro-rated) | ~$46 | ~$46 | ~$92 |

Cost Per Unit

| Metric | Claude | Codex | Combined |
|---|---|---|---|
| Cost per commit | $0.63 | $0.61 | $0.62 |
| Cost per 1,000 lines changed | $1.05 | $0.35 | $0.52 |
| Cost per 1,000 net lines added | $1.63 | $0.74 | $1.02 |
| Cost per file touched | $0.055 | $0.039 | $0.046 |
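The per-unit figures are straight division of the pro-rated weekly cost by the raw output numbers – a quick sketch using the values from the tables above:

```python
# Recompute the cost-per-unit metrics from the raw output table.
WEEKLY_COST = 46  # $200/mo pro-rated to roughly $46/week

def cost_per_unit(commits, lines_changed, files, cost=WEEKLY_COST):
    """Return (cost per commit, per 1,000 lines changed, per file touched)."""
    return (
        round(cost / commits, 2),
        round(cost / (lines_changed / 1_000), 2),
        round(cost / files, 3),
    )

print(cost_per_unit(commits=73, lines_changed=44_000, files=833))
# (0.63, 1.05, 0.055)
print(cost_per_unit(commits=76, lines_changed=132_000, files=1_167))
# (0.61, 0.35, 0.039)
```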

Throughput

| Metric | Claude | Codex |
|---|---|---|
| Commits per day | 29.2 | 15.2 |
| Lines changed per day | 17,600 | 26,400 |

This reveals a real behavioral difference in the data. Claude produced more commits per day – faster cycles, more action-oriented. Codex produced more lines per day – bigger, more considered changes per commit.

Usage Efficiency

| Metric | Claude | Codex |
|---|---|---|
| Commits per 1% of allocation | 0.73 | 1.01 |
| Lines changed per 1% of allocation | 440 | 1,760 |
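The efficiency gap is the same arithmetic applied to allocation instead of time – lines changed divided by the percentage of the weekly allocation consumed:

```python
# Lines changed per 1% of weekly allocation, from the tables above.
def lines_per_pct(lines_changed, pct_used):
    return lines_changed / pct_used

claude = lines_per_pct(44_000, 100)  # 440.0
codex = lines_per_pct(132_000, 75)   # 1760.0
print(codex / claude)                # 4.0
```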

This is the most striking number. Codex gets roughly 4x more lines changed per unit of weekly allocation at the same price point. Whether that translates to 4x more value depends entirely on what those lines do – but as a raw efficiency metric, the gap is significant.

The “Junior Dev Equivalency” Number

A mid-level developer costs roughly $150K/year, or about $2,900/week. This week’s combined AI output – ~149 commits, ~176,000 lines changed across 4 projects – cost $92. That’s a raw cost ratio of about 31:1, and even if half the output needs human review and rework, the effective ratio is still around 16:1.

That number will compress as usage scales and as these tools mature. But right now, for this type of work, the economics are hard to argue with.

One caveat: velocity without containment is expensive. Earlier this week, a one-line null check caused hours of debugging because large cross-cutting commits from multiple agents obscured the root cause. The tool didn’t matter. Process did.

What $400 of AI Actually Built

The usage numbers are interesting, but the real question is: what did this week of two-tool, eight-agent development actually produce?

| Project | Engine | Commits | What Was Built |
|---|---|---|---|
| All Is Vanity | Haxe/HTML5 | 111 | Boss combat rework (10 worlds), bullet-hell patterns, 2D overworld map, cinematics, tutorial system, hero select, full UI polish. Version 1.0 to 1.1.82. Dormant since 2020 – brought back to life. |
| Beat Them Up | Godot 4.6 | 11 | Built from scratch. 6 enemy types, hub-based district progression, XP system, co-op, equipment, cutscenes, 268 passing tests. |
| Taxi in Space | Godot | 18 | Built from scratch in 2 days. Story mode + gauntlet mode (16 levels), 4 voiced characters, controller/touch support, web export. Playable. |
| MindSpasm | React/Node/TS | 9 | Flash comic builder modernized from scratch. Character editor with IK posing, comic editor, SVG asset library, Docker, PostgreSQL. |

Four projects. ~150 commits. One week. $400.

That’s the kind of throughput that makes traditional estimation models break down. But it only works if your process can keep up – as I wrote about in my multi-agent containment post, the same velocity that produced all of this also produced a multi-hour debugging session over a one-line null check.

There’s a discussion on Hacker News right now about how to meaningfully evaluate AI tool capabilities. The community keeps landing on the same point: impressive demos and benchmarks don’t tell you much. What matters is what you ship under real conditions. I agree completely. The data above isn’t a benchmark. It’s a week of actual work.

Where This Lands

Both tools are strong. Here’s my honest read after one week:

Claude Max ($200/mo):

  • Fast, direct, high throughput
  • Less hand-holding, more action
  • 100% weekly usage burned in ~2 days under heavy multi-agent load
  • At the same $200/mo price point, you need to pace yourself more carefully

Codex Max ($200/mo):

  • More scaffolding upfront, but creates reusable workflows
  • Asks more clarifying questions (pro or con depending on your style)
  • 75% weekly usage after ~5 days of the same intensity – meaningfully more headroom
  • Better value per dollar if you’re running sustained heavy workloads

If you’re running one agent casually for a few hours a day, this comparison won’t matter much. Both will feel generous.

If you’re running eight agents simultaneously for 8-10 hours a day, the usage curve difference is real and it matters for planning your week.

What’s Next

The cost-per-line and cost-per-commit numbers above are a start, but they’re not the whole picture. Lines of code aren’t value. Commits aren’t features.

What I want to track in week 2:

  • Regressions introduced per tool
  • Debug hours attributable to AI-generated code
  • Features that shipped to a playable state vs features that needed significant rework
  • Cost per validated, working feature

Not vibes. Not preferences. Not marketing.

Output per dollar per validated build. That’s the metric that actually matters.

This post is licensed under CC BY 4.0 by the author.