Headroom: Cut Your AI API Bill 60-95% for Free

Headroom is a free, Apache 2.0 open-source context compression layer that intercepts everything your AI agents send to the LLM API before it becomes a billable token, cutting usage by 60 to 95 percent. At published Claude and OpenAI pricing rates, that reduction applies directly to the dollar bill, including the $2,000-per-engineer-per-month token spend that just made Microsoft cancel Claude Code licenses for thousands of engineers.

headroom is a free, Apache 2.0 open-source context compression layer that intercepts everything your AI agents send to the LLM API before it becomes a billable token, and compresses it to 5 to 40 percent of its original size while preserving the information the model needs to answer correctly. At published June 2026 rates for Claude Sonnet 4.6 at $3 per million input tokens, the token reduction maps directly to a dollar reduction of the same magnitude. That math gets sharper with a reference point: Microsoft's Experiences and Devices division canceled Claude Code licenses for thousands of engineers in June 2026 because token billing had reached $2,000 per engineer per month. Headroom is the kind of tool that looks different when that number is in the news.

The Problem It Solves

Token costs in agentic workflows are not a pricing anomaly. They are a structural feature of how AI agents work.

Agents do not send clean, minimal messages to the LLM. They send context: the full conversation history from the beginning of the session, every tool call result they have received, every file they opened to check, every log they scanned, every chunk retrieved from a vector database. Each of those items lands in the input token count. And because every API call resends the entire accumulated conversation, a session that runs for two hours can burn through millions of tokens to produce a few thousand lines of usable output.

The ratio of input to output is typically ten to one or worse. Input tokens are where the bill accumulates. And the content being sent is rarely as compact as it could be: tool outputs include verbose formatting and metadata, code search results return full file contents, RAG retrievals return entire passages when a paragraph would suffice.

The gap between what gets sent and what the model actually needs is where headroom operates.

What It Does in Business Terms

Headroom places a compression layer between your application and the LLM provider. Before anything reaches the API, it routes each piece of content through a specialized compressor matched to the content type. Structured JSON goes through one compressor, source code through a second that understands the syntax tree and can strip comments and dead branches, prose and logs through a third trained on real agentic traces. A fourth module stabilizes prompt prefixes to improve how often provider-side cache hits land, which multiplies the savings from Anthropic's and OpenAI's own published cache pricing.

The benchmarks on real agent workloads are specific enough to use in a decision:

Code search results: 92 percent reduction (17,765 tokens to 1,408)
SRE incident debugging: 92 percent reduction (65,694 tokens to 5,118)
GitHub issue triage: 73 percent reduction (54,174 tokens to 14,761)
Codebase exploration: 47 percent reduction (78,502 tokens to 41,254)

The originals are not discarded. Headroom stores them locally, and the model can retrieve the full version of anything when it needs more detail. On accuracy benchmarks covering math reasoning, factual QA, reading comprehension, and tool use, scores hold steady or improve slightly after compression. Stripping noise appears to help the model focus on the signal.

Setup

For a team using coding agents such as Claude Code, Codex, or Cursor, the entry point is:

pip install "headroom-ai[all]"
headroom wrap claude

That wraps the agent's outbound API calls automatically, with no changes to existing code. The alternative integration paths are a local drop-in proxy (point any SDK at localhost:8787 instead of the provider URL, which works across any language or framework), a Python or TypeScript library for inline use, and an MCP server mode that exposes compression as callable tools inside any Claude-compatible environment.

There is no database to configure and no cloud account to create. Compression runs locally. Data never leaves the machine.

Honest Limitations

Headroom is open-source software with no SLA and no support contract. The team behind it is small, and there is no guarantee the compression models will stay current as LLM behaviors and tool output formats evolve.

The gains vary considerably by content type. The 47 percent reduction on codebase exploration is real but leaves more than half the tokens on the table. For teams whose primary cost driver is deep codebase analysis, the savings are meaningful but not dramatic.

Each call adds 15 to 200 milliseconds of latency, which is negligible for interactive workflows but can accumulate in high-frequency automated pipelines. The prose compression model requires a local process to run, which is an additional dependency to audit and maintain.

Headroom is also not a substitute for architectural choices like model routing, prompt caching, or switching to cheaper models for simpler tasks. It works best as one layer in a broader cost-management approach rather than as a standalone solution.

The Business Case

The math scales with API spend. If ten engineers are using Claude Code at the Anthropic-reported average of $150 to $250 per developer per month, the current annual spend is $18,000 to $30,000. A 70 percent reduction brings that to $5,400 to $9,000. At the $2,000 per engineer levels that triggered the Microsoft cancellation, ten engineers are running $240,000 per year. The same 70 percent compression produces a $168,000 annual saving, with the token bill dropping to $72,000.

These figures do not require assuming the best-case 92 percent result. The 47 percent floor on codebase exploration, applied uniformly, still produces a line-item reduction with no changes to the agents or the work they do.

The target user is not a solo developer on a $20 monthly subscription. It is a team running agents in production against the API, paying token rates, watching the bill grow in proportion to how useful the agents are, and looking for a lever that does not require switching providers or reducing scope.

That last constraint is worth sitting with. LLM API costs scale with value. A more capable agent doing more work costs more, which is structurally different from every other category of software a finance team has ever approved. Headroom does not change that model. It just closes the gap between what the model gets billed for and what it actually needs to see.

The Microsoft story is not really about Microsoft. It is about what happens when the usefulness of a tool and the cost of a tool move in the same direction. The cancellation answer is to make the agents less capable or the billing less usage-based. Headroom is the third option.

The Token Bill That's Making Companies Cancel AI Tools Has a Free Fix

The Problem It Solves

What It Does in Business Terms

Setup

Honest Limitations

The Business Case