← Back to Blog

Richard Batt

Codex-Spark at 1,000 Tokens Per Second: Why Latency Matters More Than Benchmarks for Developer Adoption

Tags: Development, AI Tools

The Moment I Understood Why Latency Matters More Than IQ

Last month, I watched a developer use an older version of Copilot, one with typical inference latency. She'd type a function stub, hit Tab to autocomplete, and wait 3–4 seconds. While waiting, she'd switch to Slack, check her email, or scroll GitHub. By the time the suggestion arrived, she'd context-switched. She'd glance at the suggestion, reject it half-read because she was already thinking about something else, and just type the code herself. The tool existed, but she wouldn't use it.

Key Takeaways

  • Latency, not raw intelligence, is the main driver of developer adoption of AI tools.
  • Codex-Spark is a smaller code model optimized for Cerebras hardware at 1,000 tokens per second.
  • Speed beats intelligence, up to a point: a slightly dumber model that responds instantly gets used far more.
  • Flow state research explains why even small delays collapse developer productivity.
  • Codex-Spark's real advantage is iteration speed, not benchmark scores.

Two weeks later, I watched her use an early version of something approaching instant completion. Type, Tab, instant suggestion. No wait. No context switch. She accepted the suggestion in 1.2 seconds of reading, confirmed it was right, and moved on. Same developer. Same IDE. Different latency. Completely different adoption behavior.

OpenAI just released Codex-Spark, a smaller code model designed to run on Cerebras hardware at 1,000 tokens per second. On paper, it's a downgrade from GPT-5.3-Codex: smaller model, lower on most benchmarks. But I think it's one of the smartest product decisions OpenAI has made in months. And it points to something fundamental about how developers will actually use AI tools.

What Codex-Spark Is (and What It Signals)

Codex-Spark is OpenAI's answer to a constraint problem. GPT-5.3-Codex is powerful, but generating code at 100–150 tokens per second feels slow to developers. By the time a function autocomplete arrives, the developer's attention has drifted. So OpenAI built a smaller model, optimized it for Cerebras hardware, and achieved 1,000 tokens per second: roughly 7x faster.

This is a significant engineering achievement. Cerebras is specialized silicon for large-scale computation, usually deployed on massive transformer training runs, not inference. Getting an AI inference system to run at 1,000 tokens per second requires co-designing the hardware, the model, and the serving infrastructure. Most companies don't have access to Cerebras hardware. OpenAI does, and they're using it strategically.

The partnership is notable: OpenAI has historically run inference on its own infrastructure, but this collaboration with Cerebras signals something important. Inference is becoming specialized. Different use cases need different hardware. Codex-Spark needed Cerebras. Other inference workloads will need other optimizations. This is the future: specialized silicon for specialized tasks.

On benchmarks, Codex-Spark scores lower than GPT-5.3-Codex. It still passes most standard code generation tests, but not all. It might struggle with complex multi-file refactoring or deeply nested reasoning about code architecture. But for the most common coding tasks (completing a function, generating a test, writing a utility), it's sufficient. It's optimized for breadth, not depth.

The Latency Thesis: Speed Beats Intelligence (Up to a Point)

Here's the hypothesis I'm testing: developer adoption of AI tools is driven less by raw intelligence than by latency. A model that's 5% dumber but responds instantly gets used 10x more than a model that's 20% smarter but requires a 3-second wait. And if you're using a tool 10x more, you learn its strengths and limitations faster, you integrate it into your workflow more deeply, and you get more value even if the tool is individually "worse."

This isn't new psychology. It's the foundation of why interfaces matter. A slow tool is an unusable tool, regardless of capability. The flow state matters: developers get into a rhythm of reading code, spotting a gap, asking the AI to fill it, reviewing the suggestion in under a second, and moving on. Break that rhythm with a 3-second latency and the flow collapses. Context switching is expensive. Research on task interruption suggests a context switch can cost 10–15 minutes of cognitive recovery time, even if the interruption itself lasts only a few seconds. Modern productivity research confirms it: every interruption drains focus, reduces code quality, and slows throughput.

For an AI coding tool, that means latency isn't a feature: it's the feature. A code completion that takes 3 seconds is, from a flow-state perspective, useless. The developer has already typed the code or given up on the suggestion. A code completion that takes 0.3 seconds is transformative: it slots into the developer's existing rhythm without breaking it. It becomes an extension of their thinking, not an interruption.

Codex-Spark at 1,000 tokens per second achieves that. A typical function completion is 20–40 tokens, arriving in 20–40 milliseconds. By the time a developer finishes typing the function name, the completion is ready. No wait. No context switch. Pure flow state.
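As a back-of-envelope check, the arithmetic looks like this. This is a sketch that assumes a steady 1,000 tokens per second and ignores time-to-first-token, queueing, and network overhead, which can dominate real round trips:

```python
def generation_time_ms(tokens: int, tokens_per_second: float) -> float:
    """Pure generation time in milliseconds, ignoring network and queueing."""
    return tokens / tokens_per_second * 1000

# A typical 30-token function completion at each model's claimed throughput:
spark = generation_time_ms(30, 1000)  # Codex-Spark-class speed
codex = generation_time_ms(30, 125)   # GPT-5.3-Codex-class speed (100-150 tok/s)
print(f"fast model: {spark:.0f} ms, slow model: {codex:.0f} ms")
```

The gap looks small in raw generation time; the multi-second waits developers actually experience come from the overheads this sketch deliberately leaves out, which is why the measurement advice later in this post matters.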

The Research: Why Flow State Matters More Than You Think

There's solid research pointing this way. Cal Newport's work on deep work argues that sustained, uninterrupted focus is the foundation of high-output thinking, and interruption studies consistently find that people in flow produce substantially more than people who are constantly interrupted. Every time a developer breaks from the code to wait for a suggestion, they exit flow. Re-entry can take 15 minutes or more. This isn't trivial: it's the difference between shipping 3 features a week and shipping 2.

Human-computer interaction research on response times shows that small delays compound. Adding 100 milliseconds to an interface doesn't feel like much, but it measurably changes behavior: users wait less, make fewer requests, and abandon tools sooner. The relationship is non-linear. A 1-second delay doesn't feel 10x worse than a 100ms delay; it feels categorically worse, because it crosses the line between "instant" and "delayed." Instant feels like a tool. Delayed feels like waiting.

For code generation specifically, there's an additional layer: context. When a developer asks for a code suggestion, they're holding the problem in working memory. Working memory is finite: humans can hold about 7 discrete pieces of information. While waiting for a suggestion, developers lose some of that context. By the time the suggestion arrives 3 seconds later, they've partially forgotten what they were asking for, and the suggestion feels less relevant even if it's objectively good. They've lost the thread.

Instant suggestions maintain context. The developer hasn't lost track of the problem, so they can evaluate suggestions against their intent faster and more accurately. The cognitive load drops from "hold problem + evaluate suggestion" to just "evaluate suggestion." That mental reduction matters.

Codex-Spark's Real Advantage: Iteration Speed

Here's where the practical advantage becomes obvious. Suppose a developer is writing a test suite for a function. With a slow model (3-second latency per suggestion), they might generate one test, wait, evaluate, refine the prompt, wait again. They might generate 3 tests in an hour, each one requiring careful evaluation because they're context-switching.

With Codex-Spark (instant latency), the developer can request 3 tests, get 3 suggestions in rapid-fire sequence, scan all of them in flow state, modify the 2 best ones, and move on. Same developer, same task. But with faster latency, they can iterate 5x faster because they never exit the coding flow.

Over a week, that compounds. A developer who can iterate 5x faster on code generation tasks will ship more code, write better tests, and refactor more aggressively. The AI tool becomes embedded in their workflow, not an occasional help. It becomes as natural as using autocomplete.

This is why OpenAI's choice to optimize for latency over raw intelligence is smart. They're not just building a faster model: they're building for adoption, integration, and habit formation. Those things matter more than benchmark scores. Tools that people actually use are more valuable than tools that sit unused.

The Benchmarks Don't Capture This

This is worth emphasizing because most AI tools are compared on benchmarks. We look at SWE-bench scores, HumanEval pass rates, problem-solving capabilities. Codex-Spark probably scores lower on all of these than GPT-5.3-Codex. If I were evaluating purely on benchmarks, I'd recommend GPT-5.3-Codex every time.

But benchmarks measure task completion, not tool adoption. They're the equivalent of judging a sports car on 0–60 time and a city car on fuel efficiency, without considering that the city car gets used 10x more because it fits people's actual needs. The sports car sits in the garage. The city car runs every day.

The right benchmark for a code completion tool is "how many times per day does a developer actually use it?" By that measure, Codex-Spark probably dominates GPT-5.3-Codex. And if developers use it more, they'll find more value in it, discover its edge cases, learn how to prompt it better. The initial intelligence gap shrinks because the usage gap compounds.

Over a year, a developer using Codex-Spark 100 times a day will write better code, iterate faster, and ship more features than a developer using GPT-5.3-Codex 20 times a day. The 5% intelligence gap is irrelevant compared to the 5x usage gap.

What Codex-Spark Signals About the Inference Future

This release isn't just about code completion. It signals something bigger about where inference is heading: specialization and optimization. The era of "one model for everything" is ending. Instead, we're moving toward smaller models optimized for specific latency requirements and deployed on specialized hardware.

For real-time code completion, you want fast. For strategic thinking or architectural analysis, you want capable, and you'll accept longer latency. For customer-facing applications, you want both fast and capable, and you're willing to pay for it. For internal tools, you might want cheap, and you'll trade speed and capability for cost.

This diversification benefits everyone. It means faster tools for latency-sensitive work. It means developers can choose the right tool for the right job instead of using one expensive, slow model for everything. It means the "good enough" model at a lower price point becomes viable instead of a compromise.

Codex-Spark is the first major signal of this trend. Expect more: specialized models for search, retrieval, classification, and reasoning, each optimized for its specific use case. Within 18 months, the idea of using a single general-purpose model for everything will seem quaint.

The Hardware Play: Why Cerebras Matters

The Cerebras partnership is the second important signal. Traditionally, inference happens on general-purpose GPUs or CPUs. Cerebras is specialized silicon designed for transformer computation. Using it for inference is unusual.

What this signals: inference is becoming a dedicated, hardware-optimized problem, separate from training. OpenAI is willing to partner with specialized hardware companies to achieve the latency targets their products need. This opens the door for other startups: specialized silicon for vision inference, for retrieval, for reasoning. The compute industry is fragmenting toward specialization.

It also signals something about cost and deployment. Cerebras hardware is expensive and specialized. OpenAI is investing in it, which means they believe the latency gains (and the subsequent adoption gains) justify the hardware cost. For developers, this might translate to lower per-token costs for Codex-Spark compared to GPT-5.3-Codex, since they're optimized for throughput, not peak performance. Throughput is cheaper to scale than latency.

Practical Advice: How to Choose Your Inference Model

If you're building a tool that needs AI capabilities, here's how I'm advising teams to think about tool selection:

First, measure actual latency, not claimed latency. OpenAI says Codex-Spark achieves 1,000 tokens per second. That's throughput. The actual latency for a single request depends on load, queueing, request size, and network conditions. Test your actual use case: time the full round trip from prompt to response in your deployment environment. Measure p50, p95, and p99 latency, not just average. Don't trust vendor claims; measure.
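A minimal sketch of that measurement in Python. Here `call` is a hypothetical stand-in for whatever issues one real completion request in your deployment environment; the percentile bookkeeping is what matters:

```python
import statistics
import time

def measure_latency(call, n=100):
    """Time n full round trips of `call` and report p50/p95/p99 in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()  # one real prompt-to-response round trip in your environment
        samples.append((time.perf_counter() - start) * 1000)
    # statistics.quantiles with n=100 yields 99 cut points; qs[i] is the (i+1)th percentile
    qs = statistics.quantiles(samples, n=100)
    return {"p50": qs[49], "p95": qs[94], "p99": qs[98]}
```

Run it against your real endpoint, under realistic load and request sizes; the spread between p50 and p99 tells you how often developers will actually hit the flow-breaking waits.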

Second, measure adoption and flow state. Run an experiment: give some developers the fast model, others the capable model. Don't measure quality: measure usage frequency. Count how many times developers invoke the tool. Track how much they trust the suggestions. Measure time-to-completion on standard tasks. Survey developers about their experience. The tool that gets used more will likely deliver more value, even if it's individually less capable.
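One way to sketch the usage-frequency side of that experiment. The class and field names here are illustrative, not a real telemetry API; the point is to count invocations per developer in each arm, not to grade outputs:

```python
from collections import Counter

class UsageLog:
    """Minimal adoption tracker for an A/B trial of two model arms."""

    def __init__(self):
        self.counts = Counter()

    def record(self, developer: str, model: str) -> None:
        """Log one tool invocation by a developer in a given arm."""
        self.counts[(developer, model)] += 1

    def mean_invocations(self, model: str) -> float:
        """Average invocation count per developer for one arm."""
        hits = [n for (dev, m), n in self.counts.items() if m == model]
        return sum(hits) / len(hits) if hits else 0.0
```

Pair the counts with the qualitative signals the paragraph above mentions (trust, time-to-completion, surveys); the raw invocation ratio between arms is usually the first number worth looking at.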

Third, consider your iteration speed requirement. If your task involves rapid iteration (brainstorming, exploration, testing multiple approaches), faster latency is more valuable. If your task is one-off analysis (complex architectural decision, novel problem-solving), capability matters more than latency. Mix them in your evaluation.

Fourth, don't just look at benchmarks. Benchmarks tell you what a model can do. They don't tell you what a developer will actually do with it. A 90% capable model that's instant will outperform a 95% capable model that requires a wait. Benchmark the behavior, not just the output. Ask: "If I deployed this, would developers use it?"

Where Codex-Spark Loses (and It Matters)

I don't want to oversell this. Codex-Spark isn't suitable for everything. For complex multi-file refactoring, for architectural decision-making, for novel problem domains where the right answer isn't obvious, you probably want GPT-5.3-Codex's greater capability, even if it's slower. Some tasks genuinely need a smarter model.

The sweet spot for Codex-Spark is: (1) real-time code completion, (2) unit test generation, (3) documentation and comment generation, (4) routine refactoring and optimization, (5) bug finding and fixing in familiar codebases. Tasks where the developer knows roughly what they want and the AI is accelerating execution, not doing novel reasoning.

For tasks outside that scope, like "design a system for real-time pricing" or "refactor this legacy system into microservices," you still want the smartest model available, even if it's slow. Latency doesn't matter if the answer is wrong.

The future isn't "Codex-Spark replaces GPT-5.3-Codex." It's "use Codex-Spark for fast tasks, GPT-5.3-Codex for hard tasks, and optimize based on what you're actually doing." Multiple models for multiple contexts.
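That routing policy fits in a few lines. The task labels and model identifiers below are hypothetical placeholders, not real API names; the shape is what a multi-model setup tends to look like:

```python
# Latency-sensitive task types from the "sweet spot" list above (labels are illustrative).
FAST_TASKS = {"completion", "unit_test", "docstring", "routine_refactor", "bugfix"}

def pick_model(task: str) -> str:
    """Route latency-sensitive tasks to the fast model, hard reasoning to the capable one."""
    return "codex-spark" if task in FAST_TASKS else "gpt-5.3-codex"
```

A real router would usually key off request context (file count, prompt length, explicit user choice) rather than a hand-labeled task string, but the principle is the same: the default path is fast, and escalation to the capable model is the exception.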

The Interrupt Cost: Why 300 Milliseconds Changes Everything

Let me make this concrete with numbers. Suppose a developer writes 20 functions in a day, and each function takes 5 minutes of focused work. With instant AI assistance (300ms latency), they invoke the tool 5 times per function, get instant suggestions, and maintain flow state. Net time per function: 4.5 minutes. Their throughput is higher, quality is comparable, and they're fresher at the end of the day because they never lost focus.

With slow AI assistance (3-second latency), they invoke the tool 2 times per function (the interruption cost makes them reluctant to ask), stall for 3 seconds per invocation while the tool processes, and exit flow state. They have to re-read their function context when the suggestion arrives. They context-switch to check email or Slack while waiting. Net time per function: 6 minutes. Their throughput is lower, and they're more cognitively tired because of constant context switching.
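Multiplying out the two scenarios (using this post's illustrative figures, not measured data):

```python
FUNCTIONS_PER_DAY = 20

def daily_minutes(minutes_per_function: float) -> float:
    """Total minutes spent on function-writing in a day."""
    return FUNCTIONS_PER_DAY * minutes_per_function

fast = daily_minutes(4.5)  # instant assistance: 90 minutes
slow = daily_minutes(6.0)  # 3-second-latency assistance: 120 minutes
print(f"daily saving with the fast model: {slow - fast:.0f} minutes")
```

Half an hour of focused time per day, before counting the harder-to-measure costs of broken flow, is the compounding effect the rest of this section describes.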

Same tool, different latency. But the output quality of the slow version is higher, right? Maybe slightly. But that slight quality improvement doesn't matter if it causes developers to stop using the tool. The best output in the world is worthless if it's not used.

Latency is adoption. Adoption is value. That's why Codex-Spark's focus on 1,000 tokens per second is smarter than it looks. It's not trying to be the smartest model. It's trying to be the most-used model. And that's the right optimization for developer tools.

What This Means for Your Tool Selection

If you're evaluating code AI tools, stop optimizing for benchmark scores. Start timing the actual response cycle. Run a developer experiment: give your team the tools for a week, measure how often they use each one, ask which one they'd pay for. The tool that wins adoption will deliver more value than the tool that wins benchmarks.

Codex-Spark is intentionally optimized for that adoption metric. It might not be the smartest model, but it's built to be the most-used model. For most development work, that's the right optimization target. You can always fall back to a smarter model for hard problems. But if the everyday tool is too slow, you'll never use it for everyday problems, which is where most of the value is.

Building for Real-Time

If you're building AI-powered tools, the lesson here is: latency is a feature, not a constraint. Invest in making your model response instant, even if it means using a smaller or less capable model. The adoption gains will likely outweigh the capability loss.

This applies beyond code. For document summarization, research assistance, copywriting, and any other task where a developer or knowledge worker is iterating rapidly, latency is the true determinant of adoption. Build for instant. Sacrifice capability if you have to. The workflow benefit will compound.

The developer who uses a 90% capable model 100 times a day will ship better work than the developer who uses a 95% capable model 20 times a day. The difference seems small until you multiply it out over weeks and months.

The Bottom Line

Codex-Spark at 1,000 tokens per second represents a philosophical choice: optimize for adoption and workflow integration, not for raw capability. It's the right choice for code completion. It's probably the right choice for many other tools.

The next time you're evaluating an AI tool, don't just look at what it can do. Look at what you'll actually do with it. Measure flow state, adoption frequency, real-world iteration speed. A tool that's 5% less capable but 10x more integrated into your workflow will deliver 10x more value. That's not a compromise: that's the optimization that matters.

OpenAI understands this. That's why they built Codex-Spark. And that's why it's going to be ubiquitous.

Need Help Choosing the Right Model?

Richard Batt has delivered 120+ AI and automation projects across 15+ industries. He helps businesses deploy AI that actually works, with battle-tested tools, templates, and implementation roadmaps. Featured in InfoWorld and WSJ.

Put This Into Practice

I use versions of these approaches with my clients every week. The full templates, prompts, and implementation guides, covering the edge cases and variations you will hit in practice, are available inside the AI Ops Vault. It is your AI department for $97/month.

Want a personalised implementation plan first? Book your AI Roadmap session and I will map the fastest path from where you are now to working AI automation.
