Claude Max vs Codex Max: One Week, Eight Agents, Two Very Different Usage Curves
This week I did something slightly irresponsible.
I ran eight AI coding agents in parallel against a real codebase, 8-10 hours per day, every day. First with Claude Code CLI on the Max plan. Then with OpenAI’s Codex CLI on their $200 plan.
This wasn’t a benchmark. It was a real workload – building and iterating on four different projects simultaneously. And it produced a pretty clear data point about how these two tools hold up under sustained heavy use.
The Setup
| | Claude Max | Codex Max |
|---|---|---|
| Cost | $200/mo | $200/mo |
| Model | Opus 4.6 | GPT-5.3 (high context) |
| Interface | Claude Code CLI | Codex CLI |
| Agents | ~8 concurrent | ~8 concurrent |
| Hours/day | 8-10 | 8-10 |
Both tools running against the same set of projects. Same types of tasks: level design, UI work, bug fixes, refactors, debug tooling, greenfield builds.
Phase 1: Claude Max Burns Hot
I started Sunday.
By Tuesday, I hit the Claude Max weekly usage limit.
That’s roughly two and a half days of aggressive, nearly continuous multi-agent use. To be fair, I was pushing it hard – eight concurrent agent instances, heavy code generation, refactors, system-level changes, debug loops, build/test cycles.
The output quality was strong. Claude tends to “just do it” – you describe what you want, and it moves. Less clarification, more action. For rapid iteration, that directness is valuable.
But at this intensity, the $200/month allocation ran out fast.
Phase 2: Switching to Codex
When Claude credits ran out Tuesday night, I moved to Codex.
Same workload. Same eight concurrent instances. Same types of changes.
By Friday – roughly the same amount of usage time Claude had before it hit the wall – Codex was at about 50% weekly usage.
I kept going. Claude’s weekly allocation didn’t reset until Saturday, so Codex carried the full workload through the rest of the week. By the time the weekly limit reset, Codex had burned through 75% of its allocation – covering roughly 5 full days of aggressive multi-agent use compared to Claude’s 2.5.
That’s a meaningful difference at the same $200/month price point. Not scientific. Just direct experience from the same developer, same projects, same work habits.
Behavioral Differences
Beyond raw usage capacity, the tools feel different to work with.
Claude tends to act. You give it a task, it executes. Less back-and-forth, more output. This is great when you know exactly what you want and just need it done.
Codex tends to clarify. It asks more questions upfront. There’s slightly more scaffolding required before it gets moving.
For example, I wanted a persistent build-test loop agent that starts an HTML5 dev server on port 9000, rebuilds the project, verifies the load, and repeats. With Claude, this was pretty much describe-and-go. With Codex, it took a few iterations to get the workflow structured correctly.
But once it worked, I turned it into a reusable .md workflow file. Now I can just say “use the build-test loop agent in that file” and it picks right up. That initial scaffolding investment created durable infrastructure.
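For concreteness, the loop that agent runs can be sketched in a few lines of Python. This is a minimal sketch, not the actual workflow file: `BUILD_CMD` (a hypothetical `make html5` target) and the assumption that the dev server is already listening on port 9000 are placeholders you would swap for your project's real build and serve commands.

```python
import subprocess
import time
import urllib.request

BUILD_CMD = ["make", "html5"]       # placeholder: your project's real build command
DEV_URL = "http://localhost:9000/"  # dev server assumed already running on port 9000


def rebuild() -> bool:
    """Run the project build; True if it exited cleanly."""
    return subprocess.run(BUILD_CMD).returncode == 0


def page_loads(url: str = DEV_URL, timeout: float = 5.0) -> bool:
    """True if the dev server answers the page request with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def build_test_loop(build=rebuild, check=page_loads,
                    delay: float = 2.0, max_iters: int = 0) -> list[bool]:
    """Rebuild, verify the load, repeat.

    max_iters=0 runs forever (the persistent agent mode); a positive
    value runs that many cycles and returns each cycle's pass/fail.
    """
    results: list[bool] = []
    i = 0
    while max_iters == 0 or i < max_iters:
        ok = build() and check()
        print(f"cycle {i}: {'OK' if ok else 'FAILED'}")
        results.append(ok)
        if delay:
            time.sleep(delay)
        i += 1
    return results
```

Injecting `build` and `check` as callables is what makes the loop easy to hand to an agent: the workflow file only has to describe the two commands, not the loop itself.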
Output Quality
I don’t have a strict one-to-one diff comparison yet. But practically:
- Everything I needed Claude to do, Codex also handled
- Multi-agent orchestration worked well on both
- Large-scale edits across files were stable on both
- Context depth felt strong on both
- Parallel sessions stayed stable on both
Both are very capable tools. At this point in the market, the quality gap between top-tier models is narrow. The differences show up more in usage economics and workflow ergonomics than in raw capability.
The Numbers
The interesting question isn’t “which model is better.” It’s: what does each tool actually produce per dollar?
I went back through the git history across all four projects and split every commit by date – Claude handled Feb 9-11, Codex carried Feb 12-17.
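The split itself is mechanical: `git log --numstat` over a date range, then sum the per-file added/deleted counts. A rough sketch of how those totals could be pulled (the exact date boundaries and any merge-commit handling are judgment calls, so treat this as an approximation of the method, not the exact script used):

```python
import subprocess
from dataclasses import dataclass


@dataclass
class RangeStats:
    commits: int
    added: int
    deleted: int


def summarize_numstat(log_text: str) -> RangeStats:
    """Total commits and line counts from `git log --numstat` output.

    Each commit starts with a line like 'commit <sha>'; numstat lines
    look like '<added>\t<deleted>\t<path>'. Binary files report '-'
    in both count columns and are skipped.
    """
    commits = added = deleted = 0
    for line in log_text.splitlines():
        if line.startswith("commit "):
            commits += 1
            continue
        parts = line.split("\t")
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            added += int(parts[0])
            deleted += int(parts[1])
    return RangeStats(commits, added, deleted)


def git_range_stats(repo: str, since: str, until: str) -> RangeStats:
    """Summarize one repo's history between two dates."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--numstat",
         f"--since={since}", f"--until={until}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return summarize_numstat(out)
```

Run `git_range_stats(repo, "2026-02-09", "2026-02-12")` per project and sum across the four repos to get the per-tool totals.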
Raw Output
| | Claude (2.5 days) | Codex (5 days) | Combined |
|---|---|---|---|
| Commits | ~73 | ~76 | ~149 |
| Lines added | ~36,000 | ~97,000 | ~133,000 |
| Lines deleted | ~8,000 | ~35,000 | ~43,000 |
| Total lines changed | ~44,000 | ~132,000 | ~176,000 |
| Files touched | ~833 | ~1,167 | ~2,000 |
| Allocation used | 100% | 75% | – |
| Weekly cost (pro-rated) | ~$46 | ~$46 | ~$92 |
Cost Per Unit
| Metric | Claude | Codex | Combined |
|---|---|---|---|
| Cost per commit | $0.63 | $0.61 | $0.62 |
| Cost per 1,000 lines changed | $1.05 | $0.35 | $0.52 |
| Cost per 1,000 net lines added | $1.63 | $0.74 | $1.02 |
| Cost per file touched | $0.055 | $0.039 | $0.046 |
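Every number in this table is simple division against the pro-rated weekly cost. A worked sketch, using the approximate figures from the raw-output table (net lines added = lines added minus lines deleted):

```python
# $200/mo pro-rated to a week: 200 * 12 / 52 ≈ $46.15 (the "~$46" above).
WEEKLY_COST = 200 * 12 / 52


def per_unit(cost: float, units: float, scale: float = 1.0) -> float:
    """Cost per `scale` units, e.g. scale=1000 for per-1,000-lines."""
    return cost / (units / scale)


# Approximate figures from the raw-output table.
claude = {"commits": 73, "changed": 44_000, "net_added": 28_000, "files": 833}
codex = {"commits": 76, "changed": 132_000, "net_added": 62_000, "files": 1_167}

for name, d in (("Claude", claude), ("Codex", codex)):
    print(name,
          round(per_unit(WEEKLY_COST, d["commits"]), 2),        # per commit
          round(per_unit(WEEKLY_COST, d["changed"], 1000), 2),  # per 1k lines changed
          round(per_unit(WEEKLY_COST, d["files"]), 3))          # per file touched
```

The arithmetic is trivial, but writing it out makes the inputs auditable: every per-unit figure traces back to one row of the raw-output table.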
Throughput
| Metric | Claude | Codex |
|---|---|---|
| Commits per day | 29.2 | 15.2 |
| Lines changed per day | 17,600 | 26,400 |
This reveals a real behavioral difference in the data. Claude produced more commits per day – faster cycles, more action-oriented. Codex produced more lines per day – bigger, more considered changes per commit.
Usage Efficiency
| Metric | Claude | Codex |
|---|---|---|
| Commits per 1% of allocation | 0.73 | 1.01 |
| Lines changed per 1% of allocation | 440 | 1,760 |
This is the most striking number. Codex gets roughly 4x more lines changed per unit of weekly allocation at the same price point. Whether that translates to 4x more value depends entirely on what those lines do – but as a raw efficiency metric, the gap is significant.
The “Junior Dev Equivalency” Number
A mid-level developer costs roughly $150K/year, or about $2,900/week. This week’s combined AI output – ~149 commits, ~176,000 lines changed across 4 projects – cost $92. Even if half the output needs human review and rework, that’s a 31:1 cost ratio.
That number will compress as usage scales and as these tools mature. But right now, for this type of work, the economics are hard to argue with.
But velocity without containment is expensive. Earlier this week, a one-line null check caused hours of debugging because large cross-cutting commits from multiple agents obscured the root cause. The tool didn’t matter. Process did.
What $400 of AI Actually Built
The usage numbers are interesting, but the real question is: what did this week of two-tool, eight-agent development actually produce?
| Project | Engine | Commits | What Was Built |
|---|---|---|---|
| All Is Vanity | Haxe/HTML5 | 111 | Boss combat rework (10 worlds), bullet-hell patterns, 2D overworld map, cinematics, tutorial system, hero select, full UI polish. Version 1.0 to 1.1.82. Dormant since 2020 – brought back to life. |
| Beat Them Up | Godot 4.6 | 11 | Built from scratch. 6 enemy types, hub-based district progression, XP system, co-op, equipment, cutscenes, 268 passing tests. |
| Taxi in Space | Godot | 18 | Built from scratch in 2 days. Story mode + gauntlet mode (16 levels), 4 voiced characters, controller/touch support, web export. Playable. |
| MindSpasm | React/Node/TS | 9 | Flash comic builder modernized from scratch. Character editor with IK posing, comic editor, SVG asset library, Docker, PostgreSQL. |
Four projects. ~150 commits. One week. $400.
That’s the kind of throughput that makes traditional estimation models break down. But it only works if your process can keep up – as I wrote about in my multi-agent containment post, the same velocity that produced all of this also produced a multi-hour debugging session over a one-line null check.
There’s a discussion on Hacker News right now about how to meaningfully evaluate AI tool capabilities. The community keeps landing on the same point: impressive demos and benchmarks don’t tell you much. What matters is what you ship under real conditions. I agree completely. The data above isn’t a benchmark. It’s a week of actual work.
Where This Lands
Both tools are strong. Here’s my honest read after one week:
Claude Max ($200/mo):
- Fast, direct, high throughput
- Less hand-holding, more action
- 100% weekly usage burned in ~2 days under heavy multi-agent load
- At the same $200/mo price point, you need to pace yourself more carefully
Codex Max ($200/mo):
- More scaffolding upfront, but creates reusable workflows
- Asks more clarifying questions (pro or con depending on your style)
- 75% weekly usage after ~5 days of the same intensity – meaningfully more headroom
- Better value per dollar if you’re running sustained heavy workloads
If you’re running one agent casually for a few hours a day, this comparison won’t matter much. Both will feel generous.
If you’re running eight agents simultaneously for 8-10 hours a day, the usage curve difference is real and it matters for planning your week.
What’s Next
The cost-per-line and cost-per-commit numbers above are a start, but they’re not the whole picture. Lines of code aren’t value. Commits aren’t features.
What I want to track in week 2:
- Regressions introduced per tool
- Debug hours attributable to AI-generated code
- Features that shipped to a playable state vs features that needed significant rework
- Cost per validated, working feature
Not vibes. Not preferences. Not marketing.
Output per dollar per validated build. That’s the metric that actually matters.


