
TOON Format: I Already Built This Bullshit in 2024 (And Wiped It After Thousands in Failed API Calls)


TL;DR: I built the exact same thing TOON is trying to sell in 2024. Called it LRDL (LLM Requirements Definition Language). Spent thousands in API calls testing it across models. Only frontier models understood it, at extra thinking cost. Small models choked. DeepSeek started speaking Mandarin mid-conversation. Gemini replied in Russian. Claude refactored my Ruby code to Java. I wiped the guide from GitHub because any big project using it will produce bad results. You save cents per request but replay each request 2-3x before the model understands it. Now TOON is getting the same 10-Medium-article hype cycle, and we’re heading toward dangerous SLOP disguised as correct documentation.


I Already Built This in 2024

In 2024, I released a mini guide for LRDL (LLM Requirements Definition Language).

The concept: Compact format for LLM instructions. Strip unnecessary syntax. Make it token-efficient. Human-readable but optimized for models.

Sound familiar?

That’s because TOON (Token-Oriented Object Notation) is the exact same idea, just launched in November 2025 with a coordinated Medium article flood.

The difference?

I actually tested LRDL with thousands of dollars in API calls before recommending it to anyone.

And here’s what I found: It’s bullshit.


Before You Think “AI Generated This”

You’ll see “2024” in this post and think: “AI loves to use 2024 as a placeholder date. This is generated.”

Check my receipts:

Published projects that reference LRDL:

  • rails_lens - README explicitly states: “Part of the LRDL (LLM Requirements Definition Language) ecosystem” with “LRDL-optimized output”
  • Rails Lens Journey (July 2025) - Article documents: “Rails Lens was built in LRDL style with minimal token usage achieved through thousands of dollars burned on benchmarking”
  • minitest-reporters-llm (Released 2 months ago) - Compact format optimized for LLMs, claims “70% fewer tokens than traditional reporters”

Important clarification about LRDL and these gems:

LRDL was a guideline for communicating with LLMs. It had two parts:

  1. TOON-like compact annotation (the part that failed)
  2. Smart context reduction (the part that worked)

After thousands in testing, I removed the TOON-like syntax. It caused hallucinations. What remained became these gems:

minitest-reporters-llm: The LLM doesn’t need the same failure sentence repeated 30 times. It doesn’t need half the backtrace and full stack logs when it can fetch the source code directly. That’s optimization—removing redundant information the LLM already has access to.

rails_lens: Allows LLMs to grep specific code portions. Evaluates at runtime, so monkey patches and dynamic modifications are properly annotated. The LLM gets exactly what’s running, not what’s in static files.

What I kept from LRDL: Remove redundancy when LLM has codebase access. What I removed from LRDL: TOON-style compact syntax that stripped necessary context.

The gems are what survived testing. The TOON-like notation didn’t.

Other verifiable sources:

Unless I built a time machine, those timestamps are real.

The fuckups are documented too. Those DeepSeek Mandarin comments, Gemini Russian translations, Claude Java hallucinations? I posted about them on X when they happened. Real-time frustration. You can’t post things in the past on Twitter unless you’re a sysadmin at X. Those timestamps exist.

I’m not here to gatekeep you. If LRDL had substance, I would have published it first. I would have written guides, given talks, built the ecosystem.

But I didn’t.

Because I don’t want my name connected to AI SLOP.

If I had promoted LRDL without testing, I would have caused tens of thousands of hours lost to hallucinations. Developers would have believed it works because someone with Rails contributor status said so.

That’s the responsibility that comes with authority: test before you promote.

I tested. It failed. I wiped it.

TOON developers haven’t done that testing. And content farmers are promoting it anyway.


The Timeline Nobody’s Talking About

Late 2024:

  • I author LRDL for compact LLM instructions
  • Release mini guide on GitHub
  • Community picks it up, newsletters mention it
  • Looks promising on paper

Early 2025:

  • Start real-world testing across models
  • Spend thousands in API calls
  • Test on small projects (works fine)
  • Test on large projects (everything breaks)

Mid 2025:

  • Document failure patterns
  • Realize only frontier models understand it, at extra thinking cost
  • Small models need actual structure (JSON, CSV, XML, TOML)
  • Wipe the guide from GitHub

Why wipe it?

Because I know what happens when developers adopt this in real projects. And I’m not sending people into a fake narrative when I have data proving it fails at scale.

November 2025:

  • TOON launches with identical concept
  • 10+ Medium articles in one month
  • Same claims: “30-60% token savings”, “better for LLMs”
  • Zero production testing
  • Zero discussion of failure modes

You can still find traces of LRDL in newsletters and Google cache. I didn’t scrub the internet. I just stopped promoting something that leads to bad outcomes.

TOON is speedrunning the same path, but without the testing phase.


What Happens When You Actually Test This

I didn’t just build LRDL and declare victory. I tested it.

Small projects (< 100 lines of code generation):

  • Works fine on GPT-4, Claude 3.5, Gemini 1.5 Pro
  • Token savings: ~20-30% (not 60%, but noticeable)
  • Output quality: Comparable to JSON prompts

Medium projects (500-1000 lines):

  • Frontier models start struggling
  • Extra “thinking” visible in responses
  • Token savings offset by needing longer system prompts explaining the format
  • Error rate increases

Large projects (2000+ lines, complex requirements):

  • DeepSeek started speaking Mandarin mid-discussion
  • Gemini replied in Russian
  • Claude refactored my Ruby code to Java

Not “sometimes.” Not “edge cases.” Consistently broke at scale.


The Technical Breakdown Nobody Shows You

Problem 1: Only Frontier Models Parse It (At Extra Cost)

When you use LRDL/TOON-style formats:

Frontier models (GPT-5, Claude Opus, Gemini Pro):

  • Can parse it
  • But burn extra tokens “thinking” about the format
  • Slower response times
  • Higher thinking token costs (for models that charge for reasoning)

Small/mid-tier models (GPT-3.5, Claude Haiku, Llama-70B):

  • Choke on the format
  • Hallucinate structure
  • Mix languages
  • Refactor to different tech stacks

The savings math:

  • You save 100 tokens on input (compact format)
  • Model spends 200 tokens figuring out what you meant (thinking)
  • You replay the request 2-3 times to get correct output (3x cost)

Net result: 2-6x MORE expensive than just using JSON.
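The bullets above can be sketched in a few lines of Ruby. All numbers are the hypothetical figures from the text, not measurements:

```ruby
# Hypothetical token counts from the bullets above (not benchmarks).
base_input = 1000                  # tokens for a plain JSON request
compact    = base_input - 100      # compact format saves 100 input tokens
thinking   = 200                   # extra tokens spent decoding the format
attempts   = 3                     # replays before the output is correct

json_total    = base_input                     # one attempt, no extra thinking
compact_total = (compact + thinking) * attempts

puts compact_total / json_total.to_f           # => 3.3
```

The “savings” disappear as soon as the format costs thinking tokens and retries.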

Problem 2: Models Default to Training Data Patterns

LLMs are trained on:

  • JSON (everywhere in training data)
  • XML (documentation, configs)
  • CSV (data files, examples)
  • YAML/TOML (configs, frontmatter)

They are NOT trained on:

  • LRDL
  • TOON
  • Your custom invented format

What happens:

  1. Model encounters unfamiliar format
  2. Tries to pattern-match to something it knows
  3. Finds closest match in training data
  4. Hallucinates based on that pattern

Real examples from my testing:

Prompt in LRDL:

users
  name str
  age int
  email str

validate_email
  regex @.+
  required true

GPT-4 Turbo output (worked correctly):

class User
  validates :email, presence: true, format: { with: /@.+/ }
end

Claude Haiku output (hallucinated Java):

public class User {
    @Email(regexp = "@.+")
    @NotNull
    private String email;
}

I didn’t ask for Java. The prompt was for Ruby. Claude Haiku saw the compact format, pattern-matched to Java annotations, and hallucinated.

Deepseek output (switched languages mid-response):

class User
  # 验证电子邮件格式
  validates :email, presence: true, format: { with: /@.+/ }
  # 用户年龄必须是整数
  validates :age, numericality: { only_integer: true }
end

Translation: the comments read “Validate the email format” and “The user’s age must be an integer.” Not code-switching for technical terms. A full language switch mid-generation.

Gemini 1.5 Flash output (Russian):

class User
  # Проверка формата электронной почты
  validates :email, presence: true, format: { with: /@.+/ }
end

Translation: the comment reads “Checking the email format.” Same pattern.

These aren’t cherry-picked failures. This happened consistently on 30-40% of large requests.

Problem 3: Dangerous SLOP Disguised as Correct

Here’s the nightmare scenario:

What you write in TOON:

auth
  method jwt
  expiry 24h
  refresh true

What the documentation looks like (English, human-readable):

# Authentication
- Method: JWT
- Token expiry: 24 hours
- Refresh tokens enabled

What the LLM generates (broken implementation):

# JWT authentication with 24 hour expiry
class AuthController
  def create
    token = JWT.encode(
      { user_id: user.id, exp: 24.hours.from_now },
      Rails.application.secrets.secret_key_base
    )
    render json: { token: token }
  end
end

What’s wrong:

  • No refresh token logic (you said refresh true)
  • exp should be a Unix timestamp, not a Rails Time object
  • No token invalidation
  • No security headers

The documentation LOOKS correct. The English makes sense. But the implementation is hallucinated bullshit.

And because you used a compact format, the LLM didn’t have enough context to understand what you actually meant.
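For contrast, here is a minimal sketch of what that three-line auth spec actually implied. The helper name issue_tokens and the seven-day refresh window are assumptions for illustration, not from any spec:

```ruby
require 'securerandom'

# Hypothetical helper showing what "expiry 24h" and "refresh true" imply.
def issue_tokens(user_id, now: Time.now)
  {
    access: {
      user_id: user_id,
      exp: now.to_i + 24 * 3600        # exp as a Unix timestamp, not a Time object
    },
    refresh: {
      token: SecureRandom.hex(32),     # opaque token, stored server-side so it can be revoked
      exp: now.to_i + 7 * 24 * 3600    # assumed 7-day refresh window
    }
  }
end

tokens = issue_tokens(42)
puts tokens[:access][:exp] - Time.now.to_i   # ~86400 seconds
```

Every line here is something the compact spec forced the model to guess at.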

JSON forces you to be explicit:

{
  "auth": {
    "method": "jwt",
    "token_expiry_seconds": 86400,
    "refresh_token": {
      "enabled": true,
      "expiry_seconds": 604800
    }
  }
}

That’s more tokens. But it’s also unambiguous. The model knows exactly what you want.


The Pattern: Content Farms Hiding Real Problems

Here’s what’s happening with TOON:

  1. Developers build TOON format, publish GitHub repo
  2. A few early adopters try it on toy examples (works fine)
  3. Content farms spot the topic
  4. 10+ Medium articles published in November 2025
  5. All claim “30-60% token savings” without testing at scale
  6. Developers adopt it based on hype
  7. Six months later, production systems are generating SLOP
  8. Nobody connects the dots because the format isn’t blamed

Same pattern I saw with:

  • GraphQL over-adoption (complexity explosion)
  • Microservices hype (distributed monoliths)
  • Serverless everywhere (vendor lock-in)

The difference?

Those technologies HAD legitimate use cases. They just got overhyped and misapplied.

TOON/LRDL don’t have legitimate use cases for real projects.

They have one use case: toy examples and benchmarks that get turned into Medium articles.


What Actually Saves Tokens (Based on Real Testing)

I spent thousands in API calls. Here’s what ACTUALLY works:

1. Shorter Names (4x Reduction)

Before:

Jeremy L. Terry, Senior Software Engineer

Tokens: ~8

After:

Sam, dev

Tokens: ~2

Savings: 6 tokens per person mention

The LLM doesn’t need full real names for context. It needs identifiers.

2. Break Grammar Where Meaning Is Clear

Before:

The application crashed unexpectedly when the user clicked the submit button.

Tokens: ~15

After:

App crashed on submit click.

Tokens: ~5

Savings: 10 tokens

You can break grammar if meaning is understood. Models are trained on internet text—they’ve seen broken English, SMS language, Twitter threads.

3. Replace Complex Phrases with Simple Equivalents

Before:

The system encountered an unexpected error during the authentication process.

Tokens: ~12

After:

Auth failed.

Tokens: ~2

Or even:

Auth 💥

Tokens: ~1

Savings: 11 tokens

Common emojis tokenize to just a token or two. Models understand them. Use them.

4. Remove Unnecessary Context

Before:

{
  "user": {
    "id": 12345,
    "first_name": "John",
    "last_name": "Doe",
    "email": "john.doe@example.com",
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2025-11-18T14:22:00Z",
    "status": "active"
  }
}

After (only include what matters for the task):

{
  "user": {
    "id": 12345,
    "name": "John",
    "email": "john.doe@example.com"
  }
}

Savings: ~40% by removing unused fields

Don’t send created_at, updated_at, last_login if the LLM doesn’t need them.
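That trimming step is one line of Ruby. The NEEDED whitelist is just an example; pick whatever fields your task actually uses:

```ruby
require 'json'

user = {
  "id" => 12345,
  "first_name" => "John",
  "last_name" => "Doe",
  "email" => "john.doe@example.com",
  "created_at" => "2024-01-15T10:30:00Z",
  "updated_at" => "2025-11-18T14:22:00Z",
  "status" => "active"
}

# Keep only the fields the model needs for this task (example whitelist).
NEEDED = %w[id first_name email].freeze
slim = user.slice(*NEEDED)

puts JSON.generate(slim)
# {"id":12345,"first_name":"John","email":"john.doe@example.com"}
```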

Real-World Example: Requirements Document

Original (TOON-style compact format):

users
  name str req
  email str req regex @.+
  age int opt min 18

endpoints
  POST /users create
  GET /users list
  GET /users/:id show

Tokens: ~45

Better (structured JSON with shortcuts):

{
  "users": {
    "name": "str",
    "email": "str @.+",
    "age": "int 18+"
  },
  "endpoints": {
    "POST /users": "create",
    "GET /users": "list",
    "GET /users/:id": "show"
  }
}

Tokens: ~42

Savings: 3 tokens

But more importantly: JSON is unambiguous. Models won’t hallucinate structure.

Even better (human language, shorter):

User has name, email (@.+ regex), age 18+.
Endpoints: POST /users (create), GET /users (list), GET /users/:id (show).

Tokens: ~28

Savings: 17 tokens (38% reduction from TOON)

And this is ACTUAL human language. No invented format. Models trained on billions of similar sentences.


The Cost Misconception Nobody Talks About

The biggest cost in LLM usage is NOT input tokens.

Look at real API pricing:

  • Input tokens: $1 per million (some models even less)
  • Output tokens: $15-$60 per million (15-60x more expensive)

TOON optimizes for the wrong metric.

If you save $0.20 on input tokens but regenerate the output 40 times because the model hallucinated, you just spent:

  • Saved: $0.20 on input
  • Lost: 40x the output cost = 40x the entire request cost

You better reevaluate your life.

Input tokens are cheap. Output tokens are expensive. Retries are catastrophic.

This is exactly why the industry needed layers of translators pre-AI:

The autistic dev says: “I will fix it.”

Then 3 layers of “engineers” translate that one sentence into:

  • Story points
  • Sprint planning
  • Requirements documents
  • Timeline estimates
  • Stakeholder updates
  • Status reports

Because compressed communication breaks at scale.

When you strip context to save tokens, you’re not optimizing - you’re creating the exact communication failure that spawned entire departments of translators.

TOON is speedrunning 40 years of organizational dysfunction, but for LLMs.

TOON saves pennies on input while costing dollars on output.

Real-World Example: ChatGPT o1-pro

When ChatGPT o1-pro launched, the pricing was:

  • Input tokens: $15 per million
  • Output tokens: $600 per million

Try using TOON format with o1-pro. Watch what happens:

Standard JSON prompt:

  • Input: 2000 tokens × $15/M = $0.03
  • Output: 3000 tokens × $600/M = $1.80
  • Total: $1.83
  • Success rate: 90% (one retry every 10 requests)
  • Effective cost: $1.83 × 1.1 = $2.01 per successful generation

TOON format prompt:

  • Input: 1400 tokens × $15/M = $0.021
  • Output: 3000 tokens × $600/M = $1.80
  • Thinking tokens: 500 × $600/M = $0.30
  • Subtotal: $2.12
  • Success rate: 60% (retry 40% of the time)
  • Effective cost: $2.12 × 1.67 = $3.54 per successful generation

You just saved $0.009 on input and spent an extra $1.53 on retries.

Now scale that to 1000 requests:

  • JSON: $2,010
  • TOON: $3,540
  • Extra cost: $1,530

Congratulations. You optimized your way into bankruptcy.
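The o1-pro comparison above is easy to reproduce. Prices and retry multipliers are the ones stated in the text:

```ruby
# o1-pro launch prices from the text: $15/M input, $600/M output.
IN_PRICE  = 15.0 / 1_000_000
OUT_PRICE = 600.0 / 1_000_000

# Effective cost = single-attempt cost x expected attempts per success.
def effective_cost(input_tok, output_tok, thinking_tok, attempts)
  one_try = input_tok * IN_PRICE + (output_tok + thinking_tok) * OUT_PRICE
  (one_try * attempts).round(2)
end

json = effective_cost(2000, 3000,   0, 1.1)   # => 2.01
toon = effective_cost(1400, 3000, 500, 1.67)  # => 3.54
puts ((toon - json) * 1000).round              # extra cost per 1000 requests
```

Swap in your own model’s prices and measured retry rate; the shape of the result doesn’t change.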

And that’s assuming only a 40% retry rate. In my testing with LRDL on complex projects, I saw 60-70% failure rates on models that didn’t understand the format.

At o1-pro pricing with a 70% retry rate:

  • Effective cost: $2.12 × 3.33 = $7.06 per successful generation
  • 1000 requests: $7,060
  • vs JSON: $2,010
  • You just spent $5,050 to “save tokens”

Direct bankruptcy.

And that’s just the API costs.

While models might be getting cheaper, your developers won’t. When your senior dev ($150/hour) spends 3 hours debugging why the LLM keeps generating Russian comments, that’s $450 in labor cost.

When they retry 40 times to get working authentication code, that’s an entire day burned. $1,200 in developer time to save $0.20 on tokens.

When they have to manually review every generated file because the output is unreliable, that’s every day, forever.

The real cost isn’t the API. It’s the human hours spent fighting hallucinations.

TOON optimizes for the cheapest part of the equation while destroying the most expensive part: developer productivity.


The Real Token Savings Math

Let’s do actual math with real API pricing:

Scenario: Generate a Rails CRUD API with 5 models, authentication, authorization.

Prompt size: ~2000 tokens (detailed requirements)

TOON Format

Input tokens: 1400 (30% savings). Cost: $0.002 (GPT-5 at $1.25/M input)

Output tokens: 3000. Cost: $0.030 ($10/M output)

Thinking tokens: +500 (model figuring out the format). Cost: +$0.005

Retry rate: 40% (model hallucinates, you regenerate). Effective cost: $0.037 × 1.4 = $0.052

JSON Format

Input tokens: 2000. Cost: $0.003

Output tokens: 3000. Cost: $0.030

Thinking tokens: 0 (standard format)

Retry rate: 10% (normal error rate). Effective cost: $0.033 × 1.1 = $0.036

Human Language (Optimized)

Input tokens: 1200 (better prompt engineering). Cost: $0.002

Output tokens: 3000 Cost: 0.030$

Thinking tokens: 0

Retry rate: 5% (clearer instructions). Effective cost: $0.032 × 1.05 = $0.034

TOON is 53% MORE expensive than human language.

Note: GPT-4 is legacy. These prices are GPT-5 (August 2025 launch). Even with cheaper models, TOON still loses because retry rates destroy the savings.


Why This Matters: Dangerous SLOP at Scale

Here’s the nightmare:

Year 1 (2025):

  • TOON gets adopted by junior developers
  • Content farms publish tutorials
  • Boilerplate generators use TOON
  • “Best practices” guides recommend it

Year 2 (2026):

  • Production applications using TOON-generated code
  • Documentation looks correct (English makes sense)
  • Implementation is hallucinated (models guessed)
  • Security vulnerabilities everywhere

Year 3 (2027):

  • Major breaches traced to hallucinated auth logic
  • “How did this pass code review?” (docs looked fine)
  • TOON quietly deprecated
  • Damage already done

This isn’t theoretical. I saw it happen with:

  • Copy-pasted Stack Overflow code (security vulnerabilities)
  • Auto-generated SQL (injection vulnerabilities)
  • LLM-generated crypto (catastrophically broken)

The difference?

Those were obvious copy-paste jobs. Everyone knew to review them carefully.

TOON-generated code will look intentional. The documentation will match. The structure will seem correct.

And the implementation will be subtly, dangerously wrong.


What You Should Do Instead

If you want to save tokens:

  1. Write better prompts in proper human language

    • Models are trained on human language
    • Clear, concise English is token-efficient
    • You can break grammar where meaning is clear
  2. Use shorter identifiers

    • “Sam” instead of “Jeremy L. Terry, Senior Engineer”
    • “dev” instead of “developer”
    • “auth” instead of “authentication”
  3. Remove unnecessary context

    • Only send data the model needs
    • Strip created_at, updated_at, metadata fields
    • Don’t include full user objects when you only need IDs
  4. Use standard formats (JSON, CSV, YAML)

    • Models understand them perfectly
    • No extra thinking cost
    • Unambiguous structure
    • Tooling ecosystem exists
  5. Test at scale before recommending

    • Toy examples don’t count
    • Test on 2000+ line codebases
    • Measure retry rates
    • Calculate effective cost (including failures)

If you want compact data for LLMs:

Use CSV or TSV. It already exists. Models understand it. No invented format needed.

name,email,age
Sam,sam@example.com,25
Alex,alex@example.com,30

Tokens: ~20

Same as TOON. But CSV is a 40-year-old standard with parsers in every language.
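Ruby’s standard library parses it out of the box; no custom format, no custom parser:

```ruby
require 'csv'

data = <<~CSV
  name,email,age
  Sam,sam@example.com,25
  Alex,alex@example.com,30
CSV

# headers: true gives you named access to each field.
rows = CSV.parse(data, headers: true)
puts rows.map { |r| r["name"] }.join(", ")  # Sam, Alex
```

Every mainstream language ships an equivalent. That’s the whole point of using a standard.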


The Haskell Subreddit Joke

If I wanted to write stuff that nobody understands, I’d be in the Haskell subreddit.

We don’t need MORE obscure formats that only work on toy examples.

We need developers who can write clear, concise prompts in actual human language.

The best token optimization is clarity.

Not invented syntax. Not compressed notation. Not “token-oriented” formats.

Just clear fucking English (or whatever language your model supports best).


To the TOON Followers

I know you built this in good faith. You saw JSON’s verbosity and thought “we can do better.”

I had the same thought in 2024. That’s why I built LRDL.

But here’s what I learned after spending thousands testing it:

Compact formats optimize for the wrong thing.

You optimize for input token cost. But the real cost is:

  • Thinking tokens (model figuring out your format)
  • Retry tokens (regenerating hallucinated output)
  • Developer time (debugging subtle bugs)
  • Security vulnerabilities (hallucinated implementations)

Input tokens are the cheapest part of the equation.

I wiped LRDL from GitHub because I didn’t want to send developers down a path I knew would fail at scale.

You should do the same with TOON.

Or at minimum:

  1. Test it on 2000+ line real-world projects
  2. Measure retry rates across model tiers
  3. Document failure modes
  4. Show effective cost including retries

If you still think it works after that, show me the data.

I’ll run the same tests. If I’m wrong, I’ll publish a correction.

But I already ran those tests. And I know what happens.


To the Content Farmers

Stop publishing “TOON vs JSON” articles when you haven’t tested TOON on anything larger than a toy example.

You’re not helping developers. You’re creating technical debt that will take years to unwind.

Pattern recognition check:

  • Did you test TOON on a real project?
  • Did you measure retry rates?
  • Did you calculate effective cost?
  • Did you check output quality at scale?

If the answer to all four is “no,” you’re part of the problem.


To Developers Considering TOON

Don’t.

Not because I say so. Because I already spent thousands testing the same concept.

If you don’t believe me:

  1. Build a 2000+ line codebase generator
  2. Test it with TOON format vs JSON format vs human language
  3. Measure retry rates, thinking token cost, output quality
  4. Calculate effective cost per successful generation

Then decide.

But I already ran that experiment. And I wiped the results because they were bad.

You can speedrun the same expensive lesson, or you can learn from someone who already paid that cost.


Lessons I Refuse to Ignore

  1. Toy examples ≠ production reality - Works on 100 lines ≠ works on 2000 lines
  2. Input token cost is not total cost - Thinking tokens + retries matter more
  3. Models need structure - Small models choke on invented formats
  4. Content farms optimize for clicks, not truth - 10 Medium articles in one month is a red flag
  5. Compact ≠ clear - Token savings mean nothing if output is hallucinated
  6. Testing costs money, but prevents disasters - I spent thousands so you don’t have to
  7. Human language > invented syntax - Models are trained on human language
  8. Standard formats exist for a reason - CSV, JSON, YAML already work

The Question You Should Be Asking

Not “Does TOON save tokens?”

But “What’s the effective cost per successful output?”

Because saving 30% on input tokens doesn’t matter if you:

  • Spend 50% more on thinking tokens
  • Retry 2-3x because of hallucinations
  • Deploy buggy code because docs looked correct

Effective cost is what matters.

And I already calculated it. TOON loses.


Traces of LRDL (LLM Requirements Definition Language) can still be found in newsletters and Google cache. I didn’t scrub the internet. I just stopped promoting something that leads to bad outcomes.

TOON is speedrunning the same path, but without the testing phase.

Wait 6 months. If production systems are using TOON successfully, I’ll publish a correction with data.

But I already know what happens. I ran this experiment in 2024.

And I wiped the results because they were bad.


Captain’s Log, Stardate 2025.323 - Pattern Already Recognized

Captain Seuros, Pattern Recognition Division “I already built this bullshit and tested it. Spoiler: it fails at scale.”

