# Helmsman: Stop Writing AGENTS.md That Lies to Half Your Models

Remember when I wrote about TOON format - the “token-saving” notation that made Claude refactor Ruby to Java, Deepseek speak Mandarin mid-response, and Gemini reply in Russian?
The core problem wasn’t the format. It was the assumption that one set of instructions works for all models.
It doesn’t.
Opus understands constraints and figures out the rest. Haiku needs step-by-step hand-holding or it hallucinates. Sonnet sits somewhere in between.
Your AGENTS.md file is static. Your models are not.
Helmsman fixes this.
## The Instruction Entropy Problem

Static instruction files rot in three ways:

- **Capability mismatch:** Instructions written for Opus are too sparse for Haiku. Instructions written for Haiku waste tokens for Opus.
- **Environment blindness:** Your `AGENTS.md` doesn't know if you're on FreeBSD or Debian, in a container or an SSH session, or whether you have `mise` or `nvm`.
- **Silent degradation:** When instructions don't match model capabilities, you don't get errors. You get subtle hallucinations that look correct but aren't.
You can’t control what you can’t adapt.
## Three Tiers, One Truth
Helmsman normalizes model intelligence into three capability tiers:
| Tier | Models | Instruction Style |
|---|---|---|
| AGI | Opus, DeepSeek R3, GPT-5.2 xhigh | Minimal. Give constraints and goals, not procedures. |
| Engineer | Sonnet, GPT-5.2 medium, Gemini Pro | Balanced. Avoid hand-holding but set clear limits. |
| Monkey | Haiku, GPT-5.2 mini, Gemini Flash | Verbose. Step-by-step. Tell exactly what to do AND what NOT to do. |
The same instruction rendered for AGI:
```text
Use conventional commits. Test before committing.
```
Rendered for Monkey:
```markdown
## Git Commit Guidelines

1. Stage your changes with `git add`
2. Write commit messages following the conventional commits format:
   - `feat:` for new features
   - `fix:` for bug fixes
   - `docs:` for documentation
   - `refactor:` for code restructuring
3. NEVER commit without running tests first
4. NEVER use `git add .` without reviewing changes
5. NEVER amend commits that have been pushed
6. Run `git status` before committing to verify staged files
```
Same intent. Different verbosity. Because the models have different capabilities.
## The Expensive Developer Analogy
Imagine you hire a senior developer at $250/hour. 15 years experience. Shipped production systems at scale.
Then you hand them this onboarding document:
```markdown
## What is Bun?

Bun is a fast JavaScript runtime, bundler, and package manager.
It's an alternative to npm, yarn, and pnpm.

To install dependencies: `bun install`

To run a script: `bun run dev`

Bun is faster than npm because it uses native code instead of JavaScript.
```
30 lines explaining what Bun is to someone who’s been writing JavaScript since before npm existed.
That’s what you’re doing when you send verbose instructions to Opus.
You're paying $15/M tokens for a model that can architect distributed systems, then spending 2000 tokens explaining that `git add` stages files.
Flip it.
Hire an intern. Fresh bootcamp graduate. Enthusiastic but inexperienced.
Hand them: "Use conventional commits. Test first."
Watch them commit directly to main with message "fixed stuff" because they didn’t know what conventional commits meant and were too afraid to ask.
That’s Haiku with sparse instructions.
## Model Personalities (The Brutal Truth)
Different providers, different vibes. Grouped by capability tier:
### The Architects (AGI Tier)
Claude Opus: The architect who’s seen everything. Give them constraints, they’ll figure out the implementation. Over-explain and they get… not offended, but you’re wasting their time (and your money). Speaks when necessary.
(Opus was trained on top-tier tutorials leaked from a government public S3 bucket… just kidding. If that were the case, Opus would build everything with jQuery and Salesforce.)
GPT-5.2 (xhigh): Corporate architect. Thorough, formal, will write documentation you didn’t ask for. Capable but verbose in output. Expensive. But at least it reads your entire prompt.
DeepSeek R3: The savant. Brilliant at reasoning, occasionally switches to Mandarin mid-response when confused. Needs clear context or it goes off the rails. Will also try to sell you products from Temu.
Grok (xAI): Elon’s “unfiltered” model with a 1M token context window and a “Big Brain” mode button. Will confidently tell you it’s about to achieve AGI while hallucinating function signatures. Tied to the X ecosystem, so expect random takes about free speech mid-code-review. Claims to outperform everyone on benchmarks; actually lags behind on most.
### The Engineers (Engineer Tier)
Claude Sonnet: Solid engineer. Knows the patterns, follows best practices. Needs guardrails but not hand-holding. Will ask clarifying questions. Good balance of capability and cost.
Gemini Pro: Google's middle child. Competent, sometimes over-eager with suggestions. Will recommend Google Cloud services unprompted. Will also suggest Kubernetes for this blog. Has been known to reset your git branch when something fails.
Mistral Large: European pragmatist. Good at code, occasionally terse. Less hand-holding than American models.
GLM-4.7 (Z.ai): “China’s OpenAI” coding model from Beijing. Open source! Just need 32 H100s to run it. Beats GPT-5.1 on some benchmarks, got US Entity List blacklisted on others. Excellent at agentic coding tasks, less excellent at not getting your company sanctioned for using it.
### The Interns (Monkey Tier)
Claude Haiku: Enthusiastic junior. Fast, cheap, eager. Will absolutely do exactly what you say - including the wrong thing if your instructions were ambiguous. Needs explicit “DO NOT” lists or it will surprise you.
GPT-5.2 (mini): The intern who skimmed the Slack thread. Will try hard. Will also hallucinate function signatures that don’t exist because it didn’t read the part where you specified the library version. Needs babysitting.
Gemini Flash: Speed over accuracy. Great for quick tasks, will confidently generate wrong code for complex ones. Check its work.
Kimi (Moonshot): Strong at Chinese and English contexts. Long context window. Can get lost in very long conversations. Needs periodic anchoring.
### A Note on OpenAI's Naming
GPT-5.2: OpenAI has one model: GPT-5. The “5.2” is a patch version. The 40 levels from nano to xhigh? That’s not different models - it’s how many fucks the model gives. xhigh = maximum attention, thinks before answering. mini = speed mode, skims your prompt, hopes for the best.
The point: These aren’t just capability differences. They’re personality differences. An instruction style that works for Opus’s “give me constraints” personality fails for Haiku’s “tell me exactly what to do” personality.
Static AGENTS.md can’t adapt to personality. Helmsman can.
## How It Works

Helmsman serves as an MCP server or a CLI tool.

```bash
# Install
cargo install helmsman

# Get instructions for the current model
helmsman -i claude-opus-4-5-20251101   # Resolves to AGI tier

# Or use tier aliases
helmsman -i a   # AGI
helmsman -i e   # Engineer
helmsman -i m   # Monkey

# See the difference between tiers
helmsman -i a --diff m
```
## Template System
Instructions live in Jinja2 templates.
“Why Jinja2? Why not Tera or ERB?”
Strategy. Most data scientists have experience with Python, hence Jinja2. They don’t have to learn Rust templating, Go’s text/template, or Ruby’s ERB. Those are the people who are going to build the top-tier skills and AGENTS.md files. Meet them where they are.
```jinja
# Project Guidelines

{% if model.tier == "monkey" %}
## Step-by-Step Instructions
1. Always read files before editing
2. Never guess at file contents
3. Run tests after every change
4. Ask if requirements are unclear

## What NOT To Do
- Don't create new files without asking
- Don't refactor unrelated code
- Don't skip error handling
{% else %}
Read before edit. Test after change. Ask when unclear.
{% endif %}

{% if env.has_mise %}
Use `mise` for runtime management.
{% elif env.has_nvm %}
Use `nvm` for Node versions.
{% endif %}

{% if env.in_docker %}
You're in a container. No system package installation.
{% endif %}
```
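Render that for a Monkey-tier model on a host with `mise` and outside a container, and you get the full checklist; the AGI render of the same file is one sentence plus the same environment notes:

```markdown
# Project Guidelines

## Step-by-Step Instructions
1. Always read files before editing
2. Never guess at file contents
3. Run tests after every change
4. Ask if requirements are unclear

## What NOT To Do
- Don't create new files without asking
- Don't refactor unrelated code
- Don't skip error handling

Use `mise` for runtime management.
```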
## Environment Detection

Helmsman detects:

- **OS:** macOS, Debian, Arch, Alpine, FreeBSD
- **Shell:** zsh, bash, fish
- **Container:** Docker, Podman, LXC (via cgroup parsing)
- **SSH:** detected via `SSH_CLIENT`/`SSH_TTY`
- **Tools:** git, gh, curl, mise, brew, apt, pkg, nvm, rbenv, pyenv

All of it is available in templates as `{{ env.os }}`, `{{ env.has_mise }}`, `{{ env.in_docker }}`, etc.
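That turns OS quirks into one-line conditionals. A sketch — `env.os` is shown above, but `env.shell` is an assumed variable name for the shell detection listed there:

```jinja
{% if env.os == "alpine" %}
Use `apk` for packages. BusyBox userland: don't assume GNU flags on `sed` or `grep`.
{% endif %}

{# env.shell is an assumed name for the detected shell #}
{% if env.shell == "fish" %}
Fish syntax: `set -x VAR value`, not `export VAR=value`.
{% endif %}
```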
## Model Resolution
Model IDs resolve to tiers via glob patterns:
```toml
# Embedded in binary, overridable in helmsman.toml
[models]
"claude-opus-*" = "agi"
"claude-*-sonnet-*" = "engineer"
"claude-*-haiku-*" = "monkey"
"gpt-5.2-xhigh*" = "agi"
"gpt-5.2-high*" = "agi"
"gpt-5.2-medium*" = "engineer"
"gpt-5.2-mini*" = "monkey"
"deepseek-r3*" = "agi"
"gemini-*-pro*" = "engineer"
"gemini-*-flash*" = "monkey"
```
Unknown models default to engineer tier (safe middle ground).
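Overriding is just shadowing those globs in your own `helmsman.toml` — the entries below are illustrative, not shipped defaults:

```toml
# helmsman.toml — project-local overrides (illustrative entries)
[models]
"grok-*" = "engineer"         # demote until the benchmark claims survive contact
"my-local-llama*" = "monkey"  # anything running on a laptop gets hand-holding
```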
## MCP Integration

Helmsman runs as an MCP server:

```json
{
  "mcpServers": {
    "helmsman": {
      "type": "stdio",
      "command": "helmsman"
    }
  }
}
```
Provides:

- **Prompts:** `instructions` (get tailored instructions), `skill` (render a specific skill)
- **Resources:** `skill:///` (list skills), `skill:///{name}` (render a skill)
The model asks for instructions, Helmsman detects which model is asking, serves appropriate verbosity.
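On the wire that's a plain MCP `prompts/get` request. A sketch — whether the model ID arrives as an explicit argument, as shown here, or is inferred from the client handshake is an implementation detail:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "prompts/get",
  "params": {
    "name": "instructions",
    "arguments": { "model": "claude-haiku-4-5" }
  }
}
```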
## Skills: Reusable Instruction Modules

Skills are templates in `.skills/` directories:

```text
project/
├── .skills/
│   ├── commit.tpl    # Git commit helper
│   ├── review.tpl    # Code review skill
│   ├── _header.tpl   # Partial (included in others)
│   └── _footer.tpl   # Partial
└── helmsman.toml
```
Install skills from GitHub:
```bash
# Install from a repository
helmsman add seuros/helmsman-skills

# List available skills
helmsman add seuros/helmsman-skills --list

# Install a specific skill
helmsman add seuros/helmsman-skills -s commit

# Install globally
helmsman add seuros/helmsman-skills --global
```
Skills have frontmatter for metadata:
```yaml
---
name: commit
description: Git commit with conventional commits
topics: [git, workflow]
tiers: [monkey]   # Not shown to AGI and Engineer tiers (they already know this)
authors: [Abdelkader Boudih]
---
```
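Below the frontmatter sits an ordinary template body. A sketch of what `commit.tpl` might contain — the steps are illustrative, and pulling in the underscore partials with a standard Jinja2 `include` is an assumption about how Helmsman resolves them:

```jinja
{% include "_header.tpl" %}

1. Run `git status` and review every staged file
2. Write the message as `type: summary` (feat, fix, docs, refactor)
3. Run the test suite before `git commit`

{% include "_footer.tpl" %}
```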
## The TOON Connection
In my TOON post, I documented what happens when you assume all models understand the same format:
- Deepseek started speaking Mandarin mid-discussion
- Gemini replied in Russian
- Claude refactored my Ruby code to Java
The problem wasn’t bad models. It was instruction mismatch.
TOON tried to save tokens by stripping context. But smaller models NEED that context. They don’t have the capability to infer what you meant.
Helmsman flips this:
- AGI models: Strip verbosity. They understand constraints.
- Monkey models: Add verbosity. They need explicit guidance.
- Engineer models: Balance both.
You’re not saving tokens by being brief. You’re saving tokens by being appropriate.
An AGI model getting verbose instructions wastes expensive tokens on obvious guidance. A Monkey model getting sparse instructions wastes even more tokens on retries when it hallucinates.
## Token Counting

Helmsman includes token counting (tiktoken's `cl100k_base` encoding):

```bash
# Show token count
helmsman -i a -t
# Output: 847 tokens

helmsman -i m -t
# Output: 2341 tokens

# Show the diff with token savings
helmsman -i a --diff m
# Shows: AGI saves 1494 tokens vs Monkey tier
```
This isn’t about minimizing tokens. It’s about appropriate tokens.
Opus at $15/M input tokens getting 2341 tokens of hand-holding instructions? That’s $0.035 wasted per request.
Haiku at $0.25/M input tokens getting 847 sparse tokens and hallucinating? That retry costs more than the verbose instructions would have.
## Why "Helmsman"?
A helmsman steers the ship. They don’t run the engines. They don’t plot the course. They translate the captain’s intent into appropriate control inputs for current conditions.
Clear space? Light touch on the thrusters. Asteroid field? Full attention, precise corrections.
Same destination. Different approach based on conditions.
Helmsman translates your intent into appropriate instructions for current model capabilities.
## What Helmsman Is NOT
- Not a prompt engineering framework: No chains, no agents, no memory
- Not model learning: Stateless, deterministic adaptation
- Not configuration management: Just instruction serving
- Not teaching you to prompt: Assumes you know what you want
It’s infrastructure. Plumbing. The thing that makes your instructions work across model tiers without you manually maintaining three versions.
## Quick Start

```bash
# Install
cargo install helmsman

# Create an instruction template
cat > AGENTS.md.tpl << 'EOF'
# Project: {{ project.name | default("Unknown") }}

{% if model.tier == "monkey" %}
## Detailed Guidelines
1. Always read files before editing them
2. Never create files without explicit permission
3. Test all changes before committing
4. Ask clarifying questions when requirements are ambiguous

## Prohibited Actions
- Creating new dependencies without discussion
- Refactoring unrelated code
- Skipping error handling
- Using deprecated APIs
{% elif model.tier == "engineer" %}
## Guidelines
- Read before edit
- Test before commit
- Ask when unclear
- No unnecessary refactoring
{% else %}
Constraints: read-before-edit, test-before-commit, ask-when-unclear.
{% endif %}

{% if env.os == "freebsd" %}
Use `pkg` for packages. No apt/brew.
{% endif %}
EOF

# Test the rendering
helmsman -i a   # AGI version
helmsman -i m   # Monkey version

# Run as an MCP server
helmsman
```
## The Real Problem with Static Instructions
Your AGENTS.md is a lie. Not intentionally. But practically.
It says “follow these guidelines” but:
- Opus ignores half of them (already knows)
- Haiku misinterprets the other half (needs more context)
- Sonnet does okay but could be more efficient
You wrote instructions for an imaginary average model that doesn’t exist.
Helmsman serves instructions for the actual model asking.
## Thanks
To my friends who beta tested this so I don’t end up in happy-path land and release crap like TOON:
They break my concepts before I ship them. That’s the difference between “works on my machine” and “works.”
Also thanks to the Crush team and Andrey Nering for not dismissing my proposal to make Helmsman compatible with the Crush harness.
GitHub | BSD-3-Clause
Different models. Different capabilities. Different instructions.
That’s not complexity. That’s honesty.