Building a Zero-Bloat MCP Stack for Reducing AI Token Usage
A practical blueprint for reducing AI coding context bloat through tool routing, output compression, architecture mapping, and disciplined workflow constraints.
Published Apr 28, 2026
- ai-agents
- mcp
- developer-tooling
- workflow
- context-optimization
As I leaned harder on AI coding tools, one pattern kept repeating: better tooling often produced messier sessions.
That sounds backwards, but heavy agent use exposes it fast. Every useful tool tries to hand the model more data: more files, more logs, more terminal output, more MCP tools, more project context, more instructions, and more workflow rules.
Each piece helps on its own. Together, they become context bloat.
The model starts carrying too much. A test command dumps pages of output. Tool lists get large before the task starts. Codebase explanations get rebuilt from scratch. The assistant reads raw files when it really needs a map. Conversation compacts, drops detail, and you end up re-explaining work that already happened.
I built my stack around one idea:
AI coding agents should have a consistent workflow, but they should only see context they actually need.
That became my Zero-Bloat MCP Agent Stack: a dotfiles-style setup for Codex/Claude-style coding agents that combines routing, compression, codebase mapping, and structured workflow commands.
Project repository: github.com/JacobThree/zero-bloat-mcp-stack
Problem: Helpful Tools Become Context Debt
A lot of AI coding advice says some version of "give the model more context." That is true, but incomplete.
Context is an attention budget, not only memory.
When an agent reads a file, sees long terminal output, loads a giant system prompt, or registers dozens of tools, that information does not disappear after the step. It sticks around. Future turns keep carrying it. Information that was useful once may no longer be relevant.
The hidden cost is degraded focus and wasted spend.
Bloat often comes from doing the right things: running tests, reading files, exposing useful MCP servers, and writing workflow instructions. Without a system for deciding what gets shown, when it gets shown, and how much detail passes through, the workflow gets noisy.
I wanted the benefits of a powerful tool stack without turning every session into a junk drawer.
Philosophy: Route, Compress, Map, Constrain, Repeat
The stack is built around five ideas:
- Route tools instead of exposing everything at once.
- Compress terminal and tool output before it enters context.
- Map the codebase structurally instead of making the agent rediscover it repeatedly.
- Use workflow skills so the agent follows a predictable development process.
- Keep responses and actions intentionally minimal unless detail is needed.
This is what I mean by zero-bloat. Not zero context. Not minimalism for its own sake. The goal is curated context.
The agent should still be powerful. It should still inspect code, run tests, query tools, and reason about architecture. But it should not drag the entire workshop into every conversation.
Architecture
flowchart TD
A["Developer Task"] --> B["Workflow Layer<br/>spec -> plan -> build -> test -> review -> ship"]
B --> C["Tool Routing Layer<br/>n2-QLN"]
C --> D["Execution Layer<br/>RTK + Shell"]
C --> E["Context Processing Layer<br/>Context-Mode"]
C --> F["Codebase Map Layer<br/>Graphify"]
D --> G["Filtered Command Output"]
E --> H["Reduced MCP/API Payloads"]
F --> I["Architecture-Level Query Results"]
G --> J["Agent Working Context"]
H --> J
I --> J
J --> K["Focused Decisions and Edits"]
K --> L["Lower Token Load + Higher Session Stability"]
M["Caveman Response Constraint"] --> J
N["Skill Playbooks"] --> B
The stack works as a pipeline, not a bag of tools. Routing decides what to call. Filters reduce noise before it enters context. Graph data prevents repeated rediscovery. Workflow commands keep the session structure stable.
Tool 1: RTK (Rust Token Killer)
RTK solves one obvious problem: terminal output is wasteful.
Commands like npm test, pytest, git diff, git status, tree, and docker ps can produce huge output, but the agent usually needs only the important parts.
Instead of asking the model to summarize massive output after it has already entered conversation, RTK filters and compresses output before the model processes it.
That distinction matters.
Once raw output enters context, the cost is already paid. Worse, the model now has to sift through noise. RTK changes the default shape of terminal feedback to: show the failure, the relevant lines, and the next useful clue.
Terminal output is not just data. It is an interface between agent and development environment. If that interface is noisy, the agent behaves noisily. If terminal output is concise, the agent stays focused.
RTK is the first-pass filter.
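RTK itself is a Rust binary, and I have not reproduced its actual logic here. The snippet below is a hypothetical Python stand-in (`compress_test_output` is an invented name) that only illustrates the idea: keep failure-relevant lines from pytest-style output and drop everything else before the text ever reaches the model.

```python
import re

def compress_test_output(raw: str, max_lines: int = 12) -> str:
    """Keep only failure summaries and assertion lines from pytest-style
    output. Progress dots, headers, and passing tests are dropped before
    the text enters the model's context."""
    keep = re.compile(r"^(FAILED|ERROR|E\s|.*AssertionError|={3,} .*failed)")
    lines = [ln for ln in raw.splitlines() if keep.match(ln)]
    return "\n".join(lines[:max_lines]) or "all checks passed"

raw = """\
============================= test session starts ==============================
collected 48 items
............................F..................
E       AssertionError: expected 200, got 500
FAILED tests/test_auth.py::test_login - AssertionError
========================= 1 failed, 47 passed in 2.31s =========================
"""
# Only the assertion, the FAILED line, and the summary survive.
print(compress_test_output(raw))
```

The important part is where the filter sits: before the model, not after. A post-hoc summary still pays the full context cost.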
Tool 2: Context-Mode
RTK helps with shell output, but agents also generate bloat through tool calls: MCP responses, browser snapshots, API payloads, logs, and file-heavy operations.
Context-Mode treats the model as a reasoning layer, not a bulk data-processing layer.
If the agent needs to search, count, filter, or inspect large output, it should not pull the entire pile into chat and think over it manually. It should run code or a sandboxed process to extract the small answer it needs.
Workflow changes from:
Read huge output -> paste into context -> ask model to find useful part
to:
Store/index huge output locally -> query or summarize -> send only useful part to model
Context-Mode is a sandbox and summarization layer. It gives access to large outputs without forcing the conversation to become a landfill.
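A minimal sketch of the store-then-query pattern, assuming nothing about Context-Mode's real sandbox API; `stash` and `query` are hypothetical names invented for this example:

```python
import tempfile
from pathlib import Path

def stash(big_output: str) -> Path:
    """Write bulky tool output to disk instead of pasting it into the chat."""
    path = Path(tempfile.mkdtemp()) / "tool_output.log"
    path.write_text(big_output)
    return path

def query(path: Path, needle: str, context: int = 1) -> str:
    """Return only the matching lines (plus a little surrounding context)."""
    lines = path.read_text().splitlines()
    hits = [i for i, ln in enumerate(lines) if needle in ln]
    picked = sorted({j for i in hits
                       for j in range(max(0, i - context),
                                      min(len(lines), i + context + 1))})
    return "\n".join(lines[j] for j in picked)

# 500+ lines of log live on disk; only three lines reach the model.
log = "\n".join(f"request {i} ok" for i in range(500)) \
      + "\nrequest 500 ERROR timeout\nretrying"
path = stash(log)
print(query(path, "ERROR"))
```

The conversation only ever sees the answer to the question, not the pile the answer came from.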
Tool 3: n2-QLN
MCP is powerful because agents can call external tools. The downside: tool lists themselves can become bloat.
If every tool is exposed all the time, the model carries descriptions, schemas, and decision overhead before the task starts. With large toolsets, this gets messy quickly.
n2-QLN treats tools more like a searchable index than a giant static menu.
Instead of exposing every tool all the time, the agent routes through one semantic layer: search for the needed tool, then call the selected tool.
This avoids two problems:
- Context cost: every tool definition takes space.
- Decision cost: model compares too many options and can pick wrong one.
Too many tools can make an agent less reliable, not more capable. A curated router often makes a tool-heavy environment feel simpler because the agent reasons over a relevant subset.
n2-QLN is the tool gatekeeper.
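n2-QLN presumably does this with proper semantic search; the sketch below substitutes simple word overlap so it stays self-contained, and the tool index is invented for illustration. The shape of the interaction is the point: search first, then expose only the winner.

```python
# Hypothetical tool index; real MCP tool schemas carry full JSON descriptions.
TOOLS = {
    "git_diff":      "show uncommitted changes in the working tree",
    "run_tests":     "execute the project test suite and report failures",
    "browser_fetch": "download a web page and return its text content",
    "db_query":      "run a read-only sql query against the dev database",
}

def route(task: str, top_k: int = 1) -> list[str]:
    """Score each tool description by word overlap with the task and surface
    only the best matches -- the model never sees the full menu."""
    words = set(task.lower().split())
    scored = sorted(
        TOOLS,
        key=lambda name: len(words & set(TOOLS[name].split())),
        reverse=True,
    )
    return scored[:top_k]

print(route("run the test suite and show me the failures"))  # ['run_tests']
```

Instead of four (or forty) schemas in every prompt, the model carries one search call plus one tool definition per task.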
Tool 4: Graphify
A frustrating AI coding pattern: an agent tries to understand a project by repeatedly grepping files and rebuilding architecture from fragments.
Graphify approaches this differently. Instead of treating a codebase as a pile of files, it builds a structural knowledge graph of the project: important nodes, relationships, communities, dependencies, and architectural connections.
The value is not just token reduction. It changes the questions the agent can ask.
Without the graph, questions are file-level:
Which files mention auth?
Where is this function defined?
What imports what?
With the graph, questions become structure-level:
What are core components?
Which modules are central?
What parts are unexpectedly connected?
Where should I look first?
Raw files are often the wrong format for architecture reasoning. A source file is great when editing it, but less useful when understanding a full system.
Graphify is the map layer.
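Here is a toy version of one structure-level query, with a hand-written import graph standing in for what Graphify would extract automatically (the module names and edges are invented):

```python
from collections import defaultdict

# Hypothetical import edges for a small service (src imports dst).
EDGES = [
    ("api", "auth"), ("api", "orders"), ("orders", "db"),
    ("auth", "db"), ("auth", "cache"), ("jobs", "db"), ("jobs", "orders"),
]

def central_modules(edges, top_k=3):
    """Answer 'which modules are central?' by counting how many edges touch
    each module -- no file is read to answer the question."""
    degree = defaultdict(int)
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return sorted(degree, key=degree.get, reverse=True)[:top_k]

print(central_modules(EDGES))  # ['auth', 'orders', 'db'] -- the hub modules
```

A real graph layer would rank with richer signals (communities, path centrality), but even degree counting answers "where should I look first?" without grepping a single file.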
Tool 5: Caveman + Agent Skills
The last layer is about consistency.
I wanted a workflow I could reuse across projects, so the lifecycle maps onto one command chain:
/spec -> /plan -> /build -> /test -> /review -> /ship
Agent Skills turn recurring engineering behaviors into modular playbooks. Instead of improvising each session, the agent follows a predictable path: define work, plan it, build incrementally, test, review, then prepare to ship.
Caveman adds a style constraint: less filler, fewer long explanations, and more direct action.
This layer is easy to underestimate. Workflow bloat is real too. If the agent changes style and process every session, you spend energy steering it back. Skills provide repeatability; Caveman keeps that repeatability from becoming verbose ceremony.
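The ordering constraint behind the command chain can be sketched as a tiny state machine. The real Agent Skills are playbooks the agent follows, not code; this just models why jumping from /spec straight to /build should be refused:

```python
# Phases of the (hypothetical) command chain, in required order.
PHASES = ["spec", "plan", "build", "test", "review", "ship"]

class Workflow:
    """Refuse any phase invoked out of order, so the session keeps shape."""

    def __init__(self):
        self.done = []

    def run(self, phase: str) -> str:
        expected = PHASES[len(self.done)]
        if phase != expected:
            return f"refused: /{phase} before /{expected}"
        self.done.append(phase)
        return f"/{phase} complete"

wf = Workflow()
print(wf.run("spec"))   # /spec complete
print(wf.run("build"))  # refused: /build before /plan
print(wf.run("plan"))   # /plan complete
```

A trivial constraint, but it is the same discipline the skills enforce in prose: no building before planning, no shipping before review.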
Stack as System
Each tool attacks a different bloat source:
| Layer | Tool | Problem Reduced |
|---|---|---|
| Terminal output | RTK | Noisy command output entering context |
| Tool output | Context-Mode | Large tool results flooding conversation |
| Tool discovery | n2-QLN | Too many tools exposed at once |
| Codebase understanding | Graphify | Repeated architecture rediscovery via raw files |
| Workflow behavior | Caveman + Agent Skills | Inconsistent process and verbose responses |
None of these tools alone fully solves the problem.
RTK does not solve tool-list bloat. Graphify does not solve noisy terminal logs. Agent Skills do not solve giant tool outputs. n2-QLN does not force sane engineering process.
The value is in the combination.
What I Learned
The biggest lesson: more context is not always better. Better context is better.
A good AI development setup should be selective. It should preserve important decisions, summarize noisy output, expose the right tools at the right time, and give the model a structured path through work.
Token efficiency is not only cost optimization. It is quality control. When context fills with stale logs, irrelevant files, duplicated explanations, and old tool output, the agent has a harder time staying on task.
Reducing bloat is reliability improvement.
What I Want to Improve Next
- Install profiles: Minimal, standard, and full variants for different users.
- Validation depth: End-to-end checks proving route/query/compress/workflow behavior.
- Benchmarks: Before-and-after metrics for token usage and unnecessary reads.
- Portability: Safer cross-agent config for Codex, Claude Code, Cursor, and others.
Final Thought
The more I use AI coding tools, the more I think workflow design matters more than prompting alone.
A raw model can help you code. A well-designed agent environment helps you code consistently.
That environment needs rules, maps, filters, routing, and repeatable phases. It should make the efficient path the default path.
That is the goal of the Zero-Bloat MCP Agent Stack: use powerful tools without drowning the agent in them, keep workflow moving without constant resets, and make the assistant see less but understand more.