The Future of Agentic Tooling: MCP Servers vs. CLI - A Data-Driven Comparison
A data-driven breakdown of token costs across CLI, skill files, Native MCP, and Gateway patterns, and a decision framework for choosing the right tool architecture.

As Large Language Models (LLMs) evolve into autonomous coding agents, one of the most consequential architectural decisions is deceptively simple: how should an AI agent talk to external services?
Traditionally, we gave LLMs terminal access and let them invoke Command Line Interfaces (CLIs). But in late 2024, Anthropic introduced the Model Context Protocol (MCP), marketed as the "USB-C of AI", a structured alternative that lets agents interact with services via typed JSON schemas rather than shell commands and plain text output. The hype was immediate and enormous. Thousands of MCP servers were published within weeks, and every AI assistant rushed to add support.
The MCP Backlash: Why Developers Are Questioning the Hype
But in 2025, a quiet counter-current began to emerge. Developers building real agentic systems started noticing something uncomfortable: the more MCP servers they connected, the slower, dumber, and more expensive their agents became.
The core complaint is what engineers have started calling "context window bloat". Unlike CLI tools, which an LLM can explore lazily via --help, MCP requires all registered tool schemas to be injected into the system prompt upfront. A single GitHub MCP server with ~35 endpoints contributes roughly 3,000 tokens of tool definitions to every single request, before the agent writes a single line. Connect five MCP servers (GitHub, Slack, Kubernetes, Linear, Postgres) and you're burning 15,000+ tokens per request just describing tools the agent may never call. At scale, this can consume 25-50% of the entire context window before the agent begins reasoning.
Researchers at lunar.dev documented another failure mode: tool-space interference. As tool counts rise, agents struggle to distinguish between similarly named tools (e.g., get_status, fetch_status, query_status), causing poor tool selection and cascading failures. Meanwhile, discussions on Reddit and in publications like The New Stack increasingly question whether MCP's architectural overhead is justified for local or single-service workflows.
The developer community has also noted that frontier models are already heavily trained on common CLI tools (git, gh, kubectl, curl) and often know the right flags without any schema description at all. As chrlschn.dev observed: "Progressive disclosure via --help might actually be more token-efficient than loading a 3,000-token schema you only use once."
So: is the MCP backlash justified? Or are developers throwing the baby out with the bathwater? We ran a real experiment to find out, testing identical GitHub operations across four distinct approaches with measured token data.
The Experiment
We tested four distinct modalities for completing identical GitHub operations:
| ID | Approach | Description |
|---|---|---|
| A | gh CLI (raw) | Shell commands, plain-text output |
| A2 | gh CLI + Skill | Shell commands guided by a skill.md file |
| B | Native GitHub MCP | Directly injected JSON tool schemas |
| C | Nexus-Dev Gateway | Single routing tool, schemas loaded lazily |
The Workflow
Each modality performed the same four operations:
1. Create a new public repository
2. Create an issue: "Test Issue for Evaluation"
3. Post a comment on the issue
4. List and retrieve the open issues
All four phases succeeded without errors. But looking at the token consumption tells a very different story.
Token Consumption: The Real Numbers
Measurement method: characters divided by 4, a rough approximation of the cl100k_base tokenization used by many frontier LLMs.
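For reference, this approximation is trivial to reproduce; a minimal sketch (the function name is ours, not part of the experiment's tooling):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic used throughout this post: ~4 characters per token (cl100k_base).
    return len(text) // 4

# e.g. a ~3,000-character block of injected tool schemas estimates to ~750 tokens
print(estimate_tokens("x" * 3000))  # -> 750
```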
Per-Interaction Tokens (the 4 operations above)
| Modality | Input Tokens | Output Tokens | Total |
|---|---|---|---|
| CLI (raw) | 74 | 150 | 224 |
| CLI + Skill | 95 | 149 | 244 |
| Native MCP | 86 | 121 | 207 |
| Nexus Gateway | 135 | 111 | 246 |
Fixed Context Overhead (loaded once per session)
| Modality | Schema Overhead |
|---|---|
| CLI (raw) | 0 tokens (no upfront schema) |
| CLI + Skill | 480 tokens (skill file loaded once) |
| Native MCP | ~3,062 tokens (all 35 tool schemas always present) |
| Nexus Gateway | ~20 tokens (single router schema) |
Total Cost Formula
For a session with N operations:
| Modality | Formula | N=10 | N=50 | N=200 |
|---|---|---|---|---|
| CLI (raw) | 224N | 2,240 | 11,200 | 44,800 |
| CLI + Skill | 480 + 244N | 2,920 | 12,680 | 49,280 |
| Native MCP | 3,062 + 207N | 5,132 | 13,412 | 44,462 |
| Nexus Gateway | 20 + 246N | 2,480 | 12,320 | 49,220 |
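The totals above fall straight out of these formulas; a minimal Python sketch of the same arithmetic (constants taken from the measured tables, names are ours):

```python
# Fixed overhead (once per session) and per-operation cost, from the tables above.
MODALITIES = {
    "CLI (raw)":     {"overhead": 0,    "per_op": 224},
    "CLI + Skill":   {"overhead": 480,  "per_op": 244},
    "Native MCP":    {"overhead": 3062, "per_op": 207},
    "Nexus Gateway": {"overhead": 20,   "per_op": 246},
}

def session_cost(modality: str, n_ops: int) -> int:
    """Total tokens for a session of n_ops operations in a single modality."""
    m = MODALITIES[modality]
    return m["overhead"] + m["per_op"] * n_ops

for n in (10, 50, 200):
    print({name: session_cost(name, n) for name in MODALITIES})
```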
The Skill File: A Middle Ground
Before MCP was widely adopted, teams developed skill files: structured Markdown documents injected into the LLM's context that spell out exact commands, flags, and output formats. Think of it as a mini-manual the agent reads before acting.
The full skill file used in this experiment is available as a public Gist: github-cli.skill.md
What the GitHub CLI Skill Provides
# Skill: GitHub CLI (`gh`) Operations
## Issue Operations
# Always prefer --json for structured output:
gh issue list -R <owner>/<repo> --json number,title,body,state,comments
gh issue create -R <owner>/<repo> --title "<title>" --body "<body>"
gh issue comment <issue-number> -R <owner>/<repo> --body "<comment>"
Impact on Output Quality
Without the skill (gh issue list), the LLM receives:
Showing 1 of 1 open issue in mmornati/mcp-cli-test-repo
ID TITLE LABELS UPDATED
#1 Test Issue for Evaluation less than a minute ago
The result is an ASCII table padded with alignment whitespace: no machine-readable structure, relative dates ("less than a minute ago") instead of parseable timestamps, plus upgrade-notice noise.
With the skill (gh issue list --json number,title,body,state,comments):
[{"body":"This is a test issue created via CLI with Skill","comments":[{"id":"IC_kwD...","body":"This is a comment via CLI with Skill","createdAt":"2026-04-27T19:04:12Z"}],"number":1,"state":"OPEN","title":"Test Issue for Evaluation"}]
The result is structured, parseable, and noise-free: absolute timestamps and machine-readable IDs.
Break-Even Analysis
The skill file costs 480 tokens upfront. In exchange, per-operation output quality improves dramatically.
CLI vs. CLI+Skill cross-over: The skill overhead is recovered after ~24 operations within a session, after which output token reduction compounds.
The real gain from the skill isn't just tokens: it eliminates the discovery loop where the LLM has to run gh --help or gh issue --help to find flags. Each help invocation typically costs an additional 400-800 tokens of output to parse.
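To put rough numbers on that: assuming each avoided --help lookup would otherwise cost the 400-800 tokens estimated above, the 480-token skill file pays for itself after the first one or two avoided lookups. A quick sketch:

```python
SKILL_OVERHEAD = 480           # tokens, loaded once per session
HELP_LOOKUP_COST = (400, 800)  # estimated tokens per --help invocation (see above)

for cost in HELP_LOOKUP_COST:
    # Ceiling division: how many avoided lookups recoup the skill file overhead.
    lookups = -(-SKILL_OVERHEAD // cost)
    print(f"At {cost} tokens per lookup, break-even after {lookups} avoided lookup(s)")
```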
MCP: Structured by Design
Native MCP (Direct Schema Injection)
With the GitHub MCP server, the LLM doesn't need to discover anything. Every tool is pre-described in the system prompt:
// Input - compact and typed:
{"body": "This is a test issue", "method": "create", "owner": "mmornati", "repo": "mcp-native-test-repo", "title": "Test Issue for Evaluation"}
// Output - clean JSON, no table formatting:
{"id": "4338179062", "url": "https://github.com/mmornati/mcp-native-test-repo/issues/1"}
The per-interaction token cost is the lowest of all four modalities (207 tokens). However, the 3,062-token fixed overhead is significant, and it's always there, even when the agent isn't using GitHub at all.
If your AI assistant has 5 MCP servers active (GitHub, Slack, Kubernetes, Linear, Postgres), you're paying 15,000+ tokens per request just to describe tools the agent may never call. At scale with long-running sessions, this quickly becomes the dominant cost.
Nexus-Dev Gateway (Lazy Schema Loading)
The Nexus-Dev gateway approach is architecturally elegant: inject a single routing tool (invoke_tool) with ~20 tokens of schema overhead, and let the agent request tool schemas on demand.
// The agent dispatches to any server with one tool:
{"server": "github", "tool": "issue_write", "arguments": {"owner": "mmornati", ...}}
The per-operation tokens are slightly higher (246) because each call includes the routing envelope, but the fixed overhead is essentially zero regardless of how many backend servers are configured.
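Conceptually, the gateway is just a dispatcher that loads a backend tool's schema the first time it is needed and caches it afterwards. A minimal sketch of the idea (names, structure, and the in-memory registry are illustrative, not the actual nexus-dev implementation):

```python
# Hypothetical registry standing in for live MCP connections to backend servers.
BACKENDS = {
    "github": {"issue_write": lambda args: {"id": "1", "url": "https://github.com/..."}},
    "slack": {"post_message": lambda args: {"ok": True}},
}

_schema_cache: dict = {}

def load_schema(server: str, tool: str) -> dict:
    """Fetch and cache a tool's schema only when the agent first asks for it."""
    key = f"{server}/{tool}"
    if key not in _schema_cache:
        # In a real gateway: call the backend's tools/list and store the schema.
        _schema_cache[key] = {"server": server, "name": tool}
    return _schema_cache[key]

def invoke_tool(server: str, tool: str, arguments: dict) -> dict:
    """The single ~20-token routing tool exposed to the agent."""
    load_schema(server, tool)  # lazy: only this tool's schema enters context
    return BACKENDS[server][tool](arguments)

print(invoke_tool("github", "issue_write", {"owner": "mmornati", "title": "Test Issue for Evaluation"}))
```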
The Real Dev Session: A Completely Different Picture
All the numbers above measure the cost of calling GitHub. But that's not how real coding sessions work.
When a developer opens Claude Code, Cursor, or Antigravity and starts a feature, the actual pattern looks like this:
Prompt 1: "Fetch issue #42 and help me plan the implementation" → 1 GitHub op
Prompts 2-19: Coding, debugging, refactoring, tests, code review questions → 0 GitHub ops
Prompt 20: "Create a PR with my changes and link it to the issue" → 1 GitHub op
In this 20-prompt session, GitHub was called twice. But some of our modalities charge you for GitHub on every single prompt, whether you call it or not.
This is the critical question: when is the overhead paid?
| Modality | When is overhead charged? |
|---|---|
| CLI (raw) | Only when a gh command is actually run |
| CLI + Skill (on-demand) | Once, when the developer explicitly invokes it |
| CLI + Skill (always-on, e.g. in .cursorrules) | Every prompt |
| Native GitHub MCP | Every prompt (schemas always in system prompt) |
| Nexus Gateway | Every prompt, but only ~20 tokens |
Token Cost Across a Full Dev Session (G=2 GitHub ops)
The following table shows the real total token cost for a dev session where GitHub is called exactly twice (fetch issue + create PR), and the rest of the session is pure coding:
| Session length | CLI (raw) | CLI+Skill (on-demand) | CLI+Skill (always-on) | Native GitHub MCP | Nexus Gateway |
|---|---|---|---|---|---|
| N=5 prompts | 448 | 968 | 2,888 | 15,724 | 592 |
| N=10 prompts | 448 | 968 | 5,288 | 31,034 | 692 |
| N=20 prompts | 448 | 968 | 10,088 | 61,654 | 892 |
| N=50 prompts | 448 | 968 | 24,488 | 153,514 | 1,492 |
| N=100 prompts | 448 | 968 | 48,488 | 306,614 | 2,492 |
Data generated by session_token_model.py. Token approximation: 1 token ≈ 4 chars.
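The script itself isn't reproduced in this post, but the model it implements is simple arithmetic; a minimal sketch of the same session math (our reconstruction, not the actual session_token_model.py):

```python
# Per-prompt fixed overhead, per-GitHub-op cost, and one-time session cost (tokens).
PROFILES = {
    "CLI (raw)":             {"per_prompt": 0,    "per_op": 224, "once": 0},
    "CLI+Skill (on-demand)": {"per_prompt": 0,    "per_op": 244, "once": 480},
    "CLI+Skill (always-on)": {"per_prompt": 480,  "per_op": 244, "once": 0},
    "Native GitHub MCP":     {"per_prompt": 3062, "per_op": 207, "once": 0},
    "Nexus Gateway":         {"per_prompt": 20,   "per_op": 246, "once": 0},
}

def dev_session_cost(profile: str, n_prompts: int, g_ops: int) -> int:
    """Total GitHub-related tokens for a session of n_prompts with g_ops GitHub calls."""
    p = PROFILES[profile]
    return p["once"] + p["per_prompt"] * n_prompts + p["per_op"] * g_ops

for n in (5, 10, 20, 50, 100):
    print(n, {name: dev_session_cost(name, n, g_ops=2) for name in PROFILES})
```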
The numbers are staggering. In a 50-prompt dev session with 2 GitHub operations:
CLI (raw) costs 448 tokens total for GitHub, exactly what those 2 calls require.
Nexus Gateway costs 1,492 tokens, still negligible.
Native GitHub MCP costs 153,514 tokens, of which 99.7% is wasted on schema descriptions the agent never needed for 48 of the 50 prompts.
The Schema Tax: How Bad Is It Really?
For a 20-prompt session with 2 GitHub operations, the Native MCP breakdown is:
| | Tokens | % of total |
|---|---|---|
| Schema overhead (3,062 × 20 prompts) | 61,240 | 99.3% |
| Actual GitHub work (2 ops × 207 tokens) | 414 | 0.7% |
| Total | 61,654 | 100% |
For every 1 token of real GitHub work done, Native MCP charges you 148 tokens in schema tax.
What If GitHub Is Used More Heavily?
For completeness, here is the same session (N=20 prompts) with different GitHub call frequencies:
| GitHub ops (G) | CLI (raw) | CLI+Skill (on-demand) | Native GitHub MCP | Nexus Gateway |
|---|---|---|---|---|
| G=1 (fetch only) | 224 | 724 | 61,447 | 646 |
| G=2 (fetch + PR) | 448 | 968 | 61,654 | 892 |
| G=5 (active use) | 1,120 | 1,700 | 62,275 | 1,630 |
| G=10 (heavy use) | 2,240 | 2,920 | 63,310 | 2,860 |
| G=20 (every prompt) | 4,480 | 5,360 | 65,380 | 5,320 |
Notice that increasing GitHub calls from G=1 to G=20 barely moves the Native MCP number (61,447 → 65,380) because the schema overhead dominates completely. The Nexus Gateway, by contrast, scales almost linearly with actual usage.
Important Note on Skill Files
When a skill file is stored in a project-level configuration (like Cursor's .cursorrules, Claude Code's CLAUDE.md, or Antigravity's skill registry), it is injected into every prompt automatically, meaning it behaves like always-on and costs 480 tokens × N prompts. However, if the developer explicitly references the skill file only when performing GitHub operations (on-demand), it costs just 480 tokens once per session. The on-demand model is far more efficient but requires developer discipline.
Synthesis: Which Approach to Use?
Decision Matrix
| Criterion | CLI | CLI + Skill | Native MCP | Gateway MCP |
|---|---|---|---|---|
| Setup complexity | ✅ None | ✅ Minimal | ⚠️ Schema authoring | ⚠️ Gateway config |
| Per-op tokens | ✅ Low | ✅ Low | ✅ Lowest | ⚠️ Moderate |
| Fixed overhead per prompt | ✅ Zero | ✅ Zero (on-demand) | ❌ ~3,062 tokens | ✅ ~20 tokens |
| Session cost (N=20, G=2) | ✅ 448 | ✅ 968 | ❌ 61,654 | ✅ 892 |
| Output reliability | ❌ Brittle text | ⚠️ Better w/ --json | ✅ Typed JSON | ✅ Typed JSON |
| Multi-service scale | ✅ Fine | ✅ Fine | ❌ Explodes context | ✅ Scales linearly |
| Discovery overhead | ❌ High (--help loops) | ✅ Eliminated | ✅ Eliminated | ✅ On-demand |
| Best for G/N ratio | < 5% | 5-15% | > 40% | 5-40% |
Recommendations
Use raw CLI when:
The service has G/N < 5% (called rarely in a session).
Running one-off scripts in constrained environments.
Use CLI + Skill (on-demand) when:
The service has G/N < 15% but you need reliable structured output.
You want zero overhead except when the service is actually invoked.
⚠️ Do not put the skill file in .cursorrules or CLAUDE.md; that makes it always-on and costs 480 tokens × N prompts.
Use Native MCP when:
The service has G/N > 40% (called on nearly every prompt).
You have fewer than 2β3 MCP servers loaded simultaneously.
Examples: file system tools, memory/context stores, local databases in data-heavy sessions.
Use a Gateway MCP when:
The agent uses many different services at varying frequencies.
You want MCP-quality structured outputs for medium-frequency services (G/N 5β40%).
This is the recommended default architecture for general-purpose coding agents.
Configuring a Token-Efficient Dev Environment
Given everything we've measured, here is a practical framework for setting up your AI coding environment, whether you're using Claude Code, Cursor, Antigravity, or any similar agent.
The Core Decision: G/N Ratio
For any external service, ask: "In a typical session of N prompts, how many prompts (G) will actually call this service?"
| G/N Ratio | Interpretation | Recommended Approach |
|---|---|---|
| > 40% | Core to almost every prompt | Native MCP (overhead amortizes quickly) |
| 15-40% | Used regularly but not constantly | Gateway MCP |
| 5-15% | Occasional use | Gateway MCP or CLI+Skill (on-demand) |
| < 5% | Rare, session-bookend use | CLI or on-demand skill |
For context: GitHub in a standard feature implementation session has G/N ≈ 10% (2 ops in 20 prompts). That puts it firmly in the "occasional use" zone, which is why Native MCP is such a poor fit despite its per-op efficiency.
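The thresholds are easy to encode as a quick helper; a minimal sketch (the function is ours, thresholds come from the table above):

```python
def recommend(g_ops: int, n_prompts: int) -> str:
    """Map a service's G/N ratio to the recommended tool architecture."""
    ratio = g_ops / n_prompts
    if ratio > 0.40:
        return "Native MCP (overhead amortizes quickly)"
    if ratio >= 0.15:
        return "Gateway MCP"
    if ratio >= 0.05:
        return "Gateway MCP or CLI + Skill (on-demand)"
    return "CLI or on-demand skill"

print(recommend(2, 20))   # GitHub in a typical feature session (G/N = 10%)
print(recommend(15, 20))  # filesystem / memory-style usage (G/N = 75%)
```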
Services by Usage Frequency
Not all services are equal. Here's how common MCP-compatible services break down by typical G/N ratio:
🟢 High Frequency (G/N > 40%): Native MCP is justified
| Service | Why it's high-frequency | CLI Alternative |
|---|---|---|
| File system / code search | Called on nearly every coding prompt | find, grep, cat |
| Memory / context stores (e.g. Nexus) | Constantly queried for project context | (none) |
| Browser / web rendering | Frequent in front-end sessions | curl (limited) |
| Local database | Core when session is data-focused | psql, sqlite3 |
| Code index / embeddings | Queried for every "find similar" request | (none) |
For these, the schema overhead amortizes quickly and Native MCP's structured outputs provide real value on every prompt.
🟡 Medium Frequency (G/N 5-40%): Gateway MCP
| Service | Typical use in a session | Typical ops/session |
|---|---|---|
| Linear / Jira | Fetch board + update tickets | ~5-10 ops |
| Slack / Teams | Check thread, post update | ~2-5 ops |
| Notion / Confluence | Look up docs, update notes | ~2-5 ops |
| Sentry / Datadog | Investigate errors during debug | ~3-8 ops |
| npm / PyPI registry | Check versions when adding deps | ~2-6 ops |
For these, Native MCP schema overhead is hard to justify. A Gateway MCP gives you structured outputs without the fixed cost.
🔴 Low Frequency (G/N < 5%): CLI or on-demand skill
| Service | Typical use in a session | Better approach |
|---|---|---|
| GitHub | Fetch issue at start, create PR at end | CLI + on-demand skill |
| Kubernetes / Helm | Deploy once at the end of a feature | kubectl + skill |
| AWS / GCP / Azure | Infra provisioning, rarely mid-session | aws/gcloud CLI |
| Stripe / payment APIs | Verify test payments occasionally | curl to API |
| DNS / domain tools | One-off lookups | dig, nslookup |
For these, the Native MCP schema tax is almost entirely wasted. CLI with a skill file loaded on-demand costs zero overhead when idle and delivers clean structured JSON when explicitly invoked.
Practical Configuration Guide
Step 1: Audit your MCP config. For each server, estimate G/N. A typical config that naively loads 5 servers wastes ~15,000 tokens per prompt:
// BEFORE: 5 servers = ~15,000 tokens/prompt in schema overhead
{
"mcpServers": {
"github": { "command": "npx", "args": ["@github/mcp-server"] },
"kubernetes": { "command": "npx", "args": ["mcp-server-kubernetes"] },
"slack": { "command": "npx", "args": ["@slack/mcp-server"] },
"aws": { "command": "npx", "args": ["awslabs.aws-mcp-servers"] },
"linear": { "command": "npx", "args": ["linear-mcp-server"] }
}
}
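To put a number on Step 1, one rough approach is to dump each server's advertised tool list and apply the same 4-characters-per-token approximation used throughout this post. A sketch, assuming you have saved each server's tools/list response to a local JSON file (the file names are illustrative):

```python
import json
from pathlib import Path

def schema_overhead_tokens(tools_dump: str) -> int:
    """Estimate the per-prompt token cost of one server's injected tool schemas."""
    tools = json.loads(Path(tools_dump).read_text())
    # Serialize the definitions roughly as they would appear in the system prompt.
    return len(json.dumps(tools)) // 4  # ~4 characters per token

for dump in ("github-tools.json", "slack-tools.json", "kubernetes-tools.json"):
    print(dump, "~", schema_overhead_tokens(dump), "tokens per prompt")
```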
Step 2: Apply the G/N framework:
// AFTER: token-efficient configuration
{
  "mcpServers": {
    // ✅ Native MCP: G/N > 40%, core to every prompt
    "memory": { "command": "npx", "args": ["@modelcontextprotocol/server-memory"] },
    "filesystem": { "command": "npx", "args": ["@modelcontextprotocol/server-filesystem"] },
    // ✅ Gateway: routes to slack, linear, sentry on-demand; single ~20-token schema
    "nexus-gateway": { "command": "npx", "args": ["nexus-dev-gateway"] }
    // ❌ github, kubernetes, aws → moved to CLI with on-demand skill files
  }
}
Step 3: Store skill files per-service, invoke on-demand:
project/
├── .skills/
│   ├── github-cli.skill.md    ← invoke when doing GitHub ops
│   ├── kubernetes.skill.md    ← invoke when deploying
│   └── aws-cli.skill.md       ← invoke when touching infra
└── CLAUDE.md                  ← only truly global rules here (keep minimal)
⚠️ Never put skill files in .cursorrules or CLAUDE.md unless you want them loaded on every prompt. Always-on skill files behave like miniature MCP schemas, costing 480 tokens × N prompts across your session.
Other Services Worth Testing
The session-frequency framework applies consistently across the ecosystem:
| Service | CLI | MCP Server | Typical G/N | Recommendation |
|---|---|---|---|---|
| GitHub | gh | github-mcp-server | ~5-10% | CLI + on-demand skill |
| Kubernetes | kubectl | kubernetes-mcp-server | ~2-5% | CLI + on-demand skill |
| PostgreSQL | psql | postgres-mcp-server | ~40-80% (data sessions) | Native MCP if data-heavy |
| Slack | curl + API | slack-mcp-server | ~10-20% | Gateway MCP |
| Linear / Jira | curl + API | linear-mcp-server | ~15-25% | Gateway MCP |
| AWS | aws CLI | awslabs aws-mcp | ~2-5% | CLI + on-demand skill |
| Sentry | curl + API | sentry-mcp-server | ~10-30% | Gateway MCP |
| Filesystem / Memory | find, cat | mcp-server-memory | ~60-90% | Native MCP |
The right answer depends on your session type. A PostgreSQL-heavy data analysis session has a completely different G/N profile than a feature implementation session where you only touch the database at the end for a schema migration.
The Happy Path Assumption: What These Numbers Don't Include
Everything measured so far assumes that the LLM always picks the right tool, uses the correct flags, and generates valid parameters on the first attempt. In reality, it does not.
This is an important caveat. Our token numbers are lower bounds: they represent the ideal case. Real-world agentic sessions are messier.
What Research Says About LLM Tool Accuracy
The Berkeley Function Calling Leaderboard (BFCL) is the most widely cited benchmark for tool-use accuracy. Key findings from 2024-2025:
Simple, single-turn function calls: Top models (GPT-4o, Claude 3.5/4, Gemini) achieve >90% accuracy in well-defined, isolated scenarios.
Complex, multi-turn, agentic tasks: Accuracy drops significantly. BFCL v4 (2025-2026) shows an average of ~58% across all models, with top models scoring ~73% on the comprehensive test suite.
Relevance detection (knowing when not to call a tool) remains a consistent weak point.
These numbers don't directly map to our scenarios, but they establish an important baseline: even top models fail to pick or call the right tool correctly 10-40% of the time in complex, multi-tool environments.
What Failure Costs in Tokens
When an agent picks the wrong tool or uses wrong parameters, the typical failure cycle looks like:
1. Agent generates wrong tool call / flag → ~100 tokens of output
2. Tool returns an error message → ~50-200 tokens of input
3. Agent reasons about the error and retries → ~150 tokens of output
4. (Repeat 1-3 until correct or max retries)
A single recoverable error costs approximately 300-600 extra tokens. Research on ReAct-style agents shows that in some pipelines, over 90% of retry attempts target errors that are structurally impossible to fix (hallucinated tool names, invalid parameters), meaning the agent burns tokens on loops that can only end in human intervention or a hard reset.
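Folding that cycle into a per-session estimate is simple arithmetic; a minimal sketch using the step costs listed above (the error rate is an assumption, not a measurement):

```python
# Token cost of one recoverable retry cycle, from the breakdown above.
WRONG_CALL = 100           # agent emits a bad tool call / flag
ERROR_MESSAGE = (50, 200)  # tool returns an error
RETRY_REASONING = 150      # agent reasons about the error and retries

def retry_cycle_cost() -> tuple:
    # ~300-450 tokens for a single pass; repeated cycles push toward the ~600 figure above.
    return (WRONG_CALL + ERROR_MESSAGE[0] + RETRY_REASONING,
            WRONG_CALL + ERROR_MESSAGE[1] + RETRY_REASONING)

def expected_error_overhead(ops: int, error_rate: float = 0.20) -> tuple:
    """Expected extra tokens for `ops` tool calls at an assumed per-op error rate."""
    low, high = retry_cycle_cost()
    return ops * error_rate * low, ops * error_rate * high

print(retry_cycle_cost())           # (300, 450)
print(expected_error_overhead(2))   # expected overhead for G=2 GitHub ops
```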
How This Affects Each Modality Differently
| Modality | Primary failure mode | Retry overhead | Why |
|---|---|---|---|
| CLI (raw) | Hallucinated flags, wrong subcommand | High | Text-only interface; agent infers syntax from training data alone |
| CLI + Skill | Wrong flag despite skill guidance | Medium | Skill pre-empts most common mistakes; some edge cases remain |
| Native MCP | Wrong tool selection from large schema | Medium-Low | Typed schema prevents parameterization errors; tool confusion risk grows with schema size |
| Nexus Gateway | Misrouted request | Low | Single router with clear semantic labels; schema enforced downstream |
The counterintuitive finding from research: Native MCP actually has lower parameterization error rates than raw CLI for the same service, because the LLM receives a precise, typed function signature instead of inferring flags from documentation. The schema overhead is costly, but it does reduce one class of errors.
Estimating Real-World Overhead
Without precise empirical data for our specific workflows, we can estimate the error-adjusted token cost using a conservative assumption: in a real coding session, the agent makes ~1 recoverable error per 5 GitHub-related operations (a 20% error rate, consistent with mid-range BFCL performance on agentic tasks).
With G=2 GitHub operations per session, this rounds to approximately one extra error/retry cycle per 2-3 sessions, so the per-session error overhead is small but non-zero:
| Modality | Ideal session cost (N=20, G=2) | Error-adjusted estimate (~1 retry cycle per session, 200-400 extra tokens depending on modality) |
|---|---|---|
| CLI (raw) | 448 | 848 (+89%) |
| CLI + Skill (on-demand) | 968 | 1,168 (+21%) |
| Native MCP | 61,654 | 61,954 (+0.5%) |
| Nexus Gateway | 892 | 1,092 (+22%) |
Key observation: Error overhead hits CLI (raw) hardest in relative terms (+89%), because there is no schema to catch bad parameters before execution. For Native MCP, the error adjustment is statistically invisible (0.5%), because the schema tax already dominates by orders of magnitude.
⚠️ Methodology note: All token numbers in this post represent the happy-path, single-attempt scenario. The error-adjusted estimates above are extrapolated from BFCL benchmark data and general research on ReAct-style agents. They are approximations, not measured values. Real error rates vary significantly based on model, prompt quality, schema clarity, and task ambiguity. Your mileage will vary.
Conclusion
Our experiment with real GitHub operations (creating repos, opening issues, posting comments, and querying results) confirms that the MCP backlash is partially right, but misses the real solution.
The Numbers That Matter Most
The isolated per-operation cost tells one story:
| Modality | Per-op tokens | Session overhead |
|---|---|---|
| CLI (raw) | 224 | 0 |
| CLI + Skill (on-demand) | 244 | 480 (once) |
| Native GitHub MCP | 207 | 3,062 per prompt |
| Nexus Gateway | 246 | 20 per prompt |
But the real dev session cost (N=20 prompts, G=2 GitHub ops, fetch issue + create PR) tells the story that actually matters:
| Modality | Real session cost | vs. CLI baseline |
|---|---|---|
| CLI (raw) | 448 tokens | 1.0× (baseline) |
| CLI + Skill (on-demand) | 968 tokens | 2.2× |
| Nexus Gateway | 892 tokens | 2.0× |
| CLI + Skill (always-on) | 10,088 tokens | 22× |
| Native GitHub MCP | 61,654 tokens | 137× |
At N=50 prompts, Native MCP reaches 153,514 tokens for the same 2 GitHub calls, with 99.7% wasted on schema descriptions that were never needed.
Three Conclusions
1. The MCP backlash is real, but the target is wrong. The problem isn't the protocol; it's the native injection pattern of loading every schema into every prompt. Returning to raw CLI trades one set of problems (schema bloat) for another (brittle text output, --help discovery loops).
2. Service frequency (G/N ratio) is the missing variable in every MCP vs. CLI debate. GitHub has G/N ≈ 5-10% in a typical dev session; it belongs in the CLI+skill zone. File system tools and memory stores have G/N > 60%; they belong in the Native MCP zone. Design your toolchain around your actual usage patterns, not the hype cycle.
3. The gateway pattern is the right default architecture. Near-zero fixed overhead (~20 tokens/prompt), MCP-quality structured outputs, and linear scaling regardless of how many backend services you configure. Pair it with on-demand skill files for low-frequency CLI services, and keep Native MCP only for the tools your agent calls on nearly every prompt.
The question every team building AI agents should be asking is not "should we use MCP?" but: "What is the G/N ratio of this service in our sessions, and are we paying the schema tax unnecessarily?"
All tests were run using the GitHub CLI v2.89.0, the github-mcp-server, and the nexus-dev gateway on macOS. The CLI skill file is published as a public Gist. Token estimates use the 1 token ≈ 4 characters approximation (cl100k_base).





