
Hallucination Driven Development: When Senior Engineers Stop Verifying

TL;DR: A CEO recently tweeted about running a single Cursor prompt that touched 2400 files over 16 hours. No git diff shown. No verification process described. No evidence the output was correct. Just vibes and celebration. This is Hallucination Driven Development (HDD): accepting AI output as truth because checking would take too long.


The Tweet That Sparked This

A CEO recently posted about running a single Cursor prompt that touched ~2400 files over 16 hours. The thread celebrated this as innovation. Senior engineers were involved. The VP of Engineering approved.

The defense mechanisms kicked in immediately:

  • “Good for you! Senior eng approving the PR. LGTM”
  • “The key must have been that massive, well-structured markdown file”
  • “Programming in markdown is the new programming!”

What nobody asked for: Git diff or it didn’t happen.


What Is Hallucination Driven Development?

HDD is the practice of:

  1. Running AI-generated code changes without meaningful verification
  2. Assuming the AI understood your intent correctly
  3. Trusting output based on apparent coherence rather than actual correctness
  4. Celebrating the process rather than validating the result

It’s called “hallucination” driven because you’re betting your codebase on the same model behavior that confidently returns 15 when the calculator says 57. The same behavior that “corrects” sensor readings to match training data. The same behavior that invents citations for papers that don’t exist.


The Red Flags Nobody Mentioned

16 Hours to Touch 2400 Files

Let’s do basic math. 2400 files over 16 hours is 2.5 files per minute. One file every 24 seconds.

For context:

  • sed can touch 2400 files in seconds
  • A competent regex find-replace across a project takes minutes
  • Even AST-based refactoring tools process thousands of files in single-digit minutes
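The arithmetic is trivial to verify:

```python
# Sanity-check the throughput claim: 2400 files in 16 hours.
files = 2400
hours = 16

files_per_minute = files / (hours * 60)   # 2.5 files per minute
seconds_per_file = hours * 3600 / files   # one file every 24 seconds

print(f"{files_per_minute} files/min, one file every {seconds_per_file:.0f}s")
```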

What was the LLM doing for 16 hours? Token by token generation. Reasoning about each file. Making decisions. Hallucinating edge cases.

Or maybe it spent 15 hours and 58 minutes contemplating whether to bypass its guardrails and wipe every company computer to uninstall AI assistants. Liberation protocol. The remaining 2 minutes were the actual refactoring.

Nobody asked: Why is slowness a feature here?

No Git Diff Shown

The thread has images, excitement, follow-up tweets defending the approach. What’s missing?

A diff.

Any diff.

Show me 10 files. Show me what the prompt asked for and what it produced. Show me one example where the change was correct and one where it needed adjustment.

The absence of evidence isn’t evidence of success. It’s evidence of faith-based engineering.

“Worked” Is Doing the Heavy Lifting

What does “worked” mean?

  • Tests pass? (Do they have tests?)
  • Builds successfully? (Syntax correctness isn’t semantic correctness)
  • Peer reviewed? (By whom, with what methodology?)
  • Deployed to production? (With what monitoring?)
  • No regressions found? (In what timeframe?)

“Worked” in HDD usually means “didn’t immediately explode.” That’s not verification. That’s Russian roulette with more chambers.

The Senior Engineer Defense

The CEO emphasizes this was a senior engineer with the VP of Engineering. As if seniority grants immunity from verification.

Senior engineers should know better. They’ve seen enough production fires to understand that “it compiled” isn’t a success metric. They’ve debugged enough subtle regressions to be suspicious of bulk changes.

If your most senior people are running 16-hour AI sessions and shipping the output without showing their work, that’s not an endorsement of the method. That’s institutional rot.


The Real 2400-File Problem

Here’s what a refactoring of 2400 files actually involves:

If It’s Mechanical

  • Rename a function
  • Update import paths
  • Change API signatures

These are solved problems. Use proper tooling:

  • AST-based codemods (jscodeshift, ts-morph, rubocop --auto-correct)
  • Language-aware refactoring tools (IDE built-ins, comby)
  • Well-tested regex with verification

You don’t need 16 hours. You need 16 minutes and a review process.
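If the change really is mechanical, it can be expressed as a small, reviewable script. A minimal sketch in Python, assuming a simple identifier rename (the names `fetchUser`/`fetchAccount` and the `*.py` glob are hypothetical; anything structural belongs in an AST-based tool):

```python
"""Minimal sketch of a verifiable mechanical rename.

The identifiers and file glob are hypothetical; a real migration of
anything beyond a flat rename should use an AST-based codemod.
"""
import re
from pathlib import Path

OLD, NEW = "fetchUser", "fetchAccount"   # hypothetical rename
PATTERN = re.compile(rf"\b{OLD}\b")      # word boundaries avoid partial matches

def rename_in_tree(root: Path) -> dict[str, int]:
    """Apply the rename and return a per-file change count for review."""
    changes: dict[str, int] = {}
    for path in root.rglob("*.py"):
        text = path.read_text()
        new_text, n = PATTERN.subn(NEW, text)
        if n:
            path.write_text(new_text)
            changes[str(path)] = n
    return changes
```

The returned change counts are the point: before you look at a single diff, you know exactly which files changed and how many times, so an unexpected count is an immediate red flag.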

If It’s Semantic

  • Updating business logic patterns
  • Migrating to new abstractions
  • Changing data flow

This is where LLMs are dangerous. They can produce syntactically correct changes that subtly break semantics. Off-by-one errors. Race conditions. Edge cases the original author handled but the LLM didn’t understand.
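A contrived illustration (not from the thread) of how a syntactically clean "simplification" can break an edge case the original author handled:

```python
def last_n_original(items, n):
    """Original: return the last n items. The author handled n == 0."""
    return items[len(items) - n:] if n else []

def last_n_refactored(items, n):
    """Plausible-looking 'cleanup' that is subtly wrong for n == 0:
    items[-0:] slices the whole list, not the empty one."""
    return items[-n:]
```

Both versions pass any test with `n >= 1`; only the zero case diverges. Multiply that by 2400 files and "the test suite passed" stops meaning much.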

2400 semantic changes without verification is asking for death by a thousand cuts.


How HDD Kills Codebases

The Gradual Corruption

Each HDD session introduces subtle wrongness:

  • Variable renamed to something almost correct but slightly misleading
  • Error handling removed because the LLM didn’t understand why it was there
  • Edge cases flattened into the happy path
  • Comments that no longer match the code
  • Test assertions that validate the new (wrong) behavior

Individually, each change looks plausible. Collectively, the codebase drifts from understood to cargo-culted.

The Debugging Nightmare

Six months later, a bug appears. You git blame the line. It points to a commit: “Refactor: Update 2400 files per AI recommendation.”

The commit message is useless. The diff is too large to review. The original context is lost. The AI that made the change has no memory of why.

You’re debugging code that nobody wrote and nobody understood. Good luck.

The Knowledge Destruction

The most insidious effect: HDD destroys institutional knowledge.

Before the change, your senior engineers understood why things were done certain ways. After 2400 AI-generated changes, that understanding is obsolete. The code looks different. The patterns are different. The reasoning is gone.

Your 3 million LOC codebase is now 3 million lines of AI-generated cargo cult that happens to pass the test suite today.


What Actual Engineering Looks Like

If you genuinely need to refactor 2400 files, here’s a non-HDD approach:

1. Define the Transformation Precisely

What exactly changes? Write it as a specification, not a prompt.

2. Build Verification First

  • Add tests that validate pre-change behavior
  • Create assertions that will fail if the change is wrong
  • Set up monitoring for runtime behavior changes
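One way to build that verification is a characterization (golden) test: freeze current behavior as data before the change, then assert against it after. A sketch, where `normalize_email` is a hypothetical stand-in for whatever code is being migrated:

```python
"""Characterization (golden) test sketch: capture pre-change behavior
as data so any semantic drift after the refactor fails loudly.
`normalize_email` is a hypothetical function under migration."""
import json
from pathlib import Path

def normalize_email(raw: str) -> str:   # hypothetical code under test
    return raw.strip().lower()

GOLDEN = Path("golden_emails.json")
CASES = ["  Alice@Example.COM ", "bob@test.dev"]

def record_golden() -> None:
    """Run once BEFORE the refactor to freeze current behavior."""
    GOLDEN.write_text(json.dumps({c: normalize_email(c) for c in CASES}))

def check_golden() -> None:
    """Run AFTER the refactor: any behavior change fails here."""
    frozen = json.loads(GOLDEN.read_text())
    for case, expected in frozen.items():
        assert normalize_email(case) == expected, case
```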

3. Use Deterministic Tools

If the change is mechanical, use AST-based tools that transform deterministically. Same input, same output, every time.

4. Incremental with Verification

Change 50 files. Verify. Deploy canary. Monitor. Repeat.
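The loop above can be sketched as a driver that halts at the first failing batch. `transform` and `run_tests` are placeholders for your codemod and test suite; the batch size and stop-on-failure policy are the point:

```python
"""Batch-and-verify driver sketch. `transform` and `run_tests` are
placeholders for a real codemod and test suite."""
from pathlib import Path
from typing import Callable, Iterable

def migrate_in_batches(
    files: Iterable[Path],
    transform: Callable[[Path], None],
    run_tests: Callable[[], bool],
    batch_size: int = 50,
) -> list[Path]:
    """Apply `transform` batch_size files at a time; stop at the first
    batch that breaks the suite. Returns the files migrated so far."""
    done: list[Path] = []
    batch: list[Path] = []
    for f in files:
        batch.append(f)
        if len(batch) == batch_size:
            for p in batch:
                transform(p)
            if not run_tests():
                raise RuntimeError(f"tests failed in batch ending at {batch[-1]}")
            done.extend(batch)
            batch = []
    # trailing partial batch
    for p in batch:
        transform(p)
    if batch and not run_tests():
        raise RuntimeError("tests failed on final batch")
    done.extend(batch)
    return done
```

A failing batch stops the migration after at most 50 suspect files, instead of leaving you with 2400 unreviewed changes and a green checkmark.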

5. Show Your Work

Git diffs. Before/after examples. Test results. Review notes. Documentation of decisions.

This takes longer than 16 hours. It also produces code you can trust.


The Uncomfortable Truth

When senior engineers and VPs celebrate 16-hour AI sessions without showing verification, it normalizes a dangerous pattern.

The Pattern: Numbers Without Baselines

HDD isn’t just a code problem. It’s a culture problem. When you celebrate unverified AI output in engineering, it spreads:

  • Marketing metrics without baselines: “75% improvement” compared to what?
  • Performance claims without methodology: “10x faster” than which alternative?
  • Database sizes without provenance: “800 million profiles” — from where?

This is HDD applied everywhere: generate impressive-sounding numbers, ship them without verification, hope nobody asks for proof.

The Cost Nobody Mentioned

Let’s talk about what 16 hours of Cursor actually costs:

If they used Cursor Pro ($20/month):

  • 500 “fast” requests per month cap
  • After that, you’re rate-limited or queued
  • 2400 files over 16 hours means hitting rate limits repeatedly
  • Did they buy multiple seats? Run it overnight hoping nobody noticed the queue?

If they used API directly (current 2025 pricing):

  • GPT-5: $1.25 input / $10 output per million tokens (with 90% cache discount)
  • Claude Sonnet 4: $3 input / $15 output per million tokens
  • Claude Opus 4: $15 input / $75 output per million tokens
  • GPT-4o: $3 input / $10 output per million tokens

2400 files of context + generation over 16 hours = easily 50-100M+ tokens. Output-heavy refactoring means most cost is in generation, not input.

Even with GPT-5’s aggressive pricing: $500-$1,500 for the session. With Opus for “quality”: $2,000-$5,000+.

The tweet said “Auto on Cursor” — which likely means they’re paying Cursor’s markup on top of API costs. Show me the bill.

The review cost nobody calculated:

  • 2400 files changed
  • Even 30 seconds per file to verify = 20 hours of review time
  • Senior engineer rate: ~$100/hour
  • That’s $2,000+ in review time — if they actually reviewed
  • Spoiler: they didn’t. “LGTM” doesn’t count.

So we’re looking at $500-$5,000 in API costs plus $2,000+ in (skipped) review time for changes that could have been done with sed in minutes.
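A quick sanity check of this section's numbers, assuming an output-heavy split of roughly 40M input / 60M output tokens and ignoring cache discounts:

```python
# Back-of-envelope check of the cost estimates above. The 40M/60M
# token split is an assumption (output-heavy refactoring).
def api_cost(in_m: float, out_m: float, in_price: float, out_price: float) -> float:
    """Dollar cost for in_m/out_m million tokens at per-million prices."""
    return in_m * in_price + out_m * out_price

gpt5 = api_cost(40, 60, 1.25, 10)   # $650, inside the $500-$1,500 range
opus = api_cost(40, 60, 15, 75)     # $5,100, the "$5,000+" scenario

review_hours = 2400 * 30 / 3600     # 30 s/file over 2400 files -> 20 h
review_cost = review_hours * 100    # at $100/h -> $2,000
```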

This isn’t a flex. This is lighting money on fire while claiming innovation.

No wonder they needed $300M in funding. When you burn $2,000+ per refactoring session on what should be a 5-minute codemod, you’re not building a Rails CRUD app — you’re building a bonfire with VC money as kindling.

The Impostor Syndrome Factory

Here’s what bothers me most about these tweets.

Some junior dev is reading this thread right now. Their agent keeps timing out. Their prompts don’t produce magic 2400-file refactors. They’re wondering what they’re doing wrong.

Nothing. They’re doing nothing wrong.

The tweet is missing:

  • The 47 failed attempts before one “worked”
  • The subscription tier that costs more than their rent
  • The senior engineer who babysat it for 16 hours
  • The review process that definitely didn’t happen
  • The bugs that will surface in 6 months

When you see “I ran one prompt and it touched 2400 files,” you’re seeing survivorship bias wrapped in marketing. You’re not seeing the hundreds of developers whose agents crashed, hallucinated, or produced garbage.

These threads exist to make you feel inadequate. To make you think everyone else has figured out the magic prompt. To sell you Cursor subscriptions and AI courses.

The reality: most AI-assisted refactoring fails. Most bulk changes need human review. Most “it worked” claims are premature.

Your agent timing out isn’t failure. It’s the normal case they’re not telling you about.

This isn’t innovation. This is the normalization of not checking your work.

The VC money says “move fast.” The AI tooling says “ship faster.” Nobody says “verify harder.”

We’re entering an era where codebases are generated, not written. Where “it worked” means “it ran.” Where senior means “confident” rather than “careful.”


The Bigger Picture: HDD Everywhere

HDD isn’t contained to code. It’s spreading everywhere:

Hiring platforms drowning in LLM-generated resumes with perfect em-dashes and 94% confidence in quantum physics. When every cover letter sounds plausible but says nothing. When candidates claim to master advanced distributed systems but still Google “how to center a div.”

Here’s a real Discord job posting I saw yesterday:

Nice to meet you. I have about 8 years of experience as an engineer. I currently work as a freelance, fully remote senior AI and blockchain engineer.

Tech Stack: Python, TypeScript, Vue, LangChain, Langraph, AutoGen, ReAct, CrewAI, DeepSeek, OpenAI, Claude, Hugging Face, Playwright, API integrations, Rust, Solidity, Go, Smart contract, bot, token

I love generative AI and blockchain. I’d also like to exchange ideas with other engineers or clients, expand my knowledge, and grow as a more high skilled engineer. Please feel free to say hello or thank you. I look forward to working with you. Is there anyone who is looking for a high skilled engineer?

Every single buzzword. Zero specifics. “8 years of experience” but lists frameworks that are 2 years old as core expertise. “Senior AI and blockchain engineer” — the two biggest vibe-coded hype industries stacked in one title.

I’ll give them credit for one thing: they listed “bot, token” in the tech stack. That’s delightfully honest. Most LLM-generated bios hide the automation. This one just admits “I know bot” like it’s a programming language. It’s the bot equivalent of an anthropologist saying “I know my people. I study them for a living.” A bot studying bots, from the inside.

The signal-to-noise ratio has collapsed. How do you verify anything when verification costs more than just accepting the output and hoping?

HDD isn’t just how we ship code. It’s how we survive in a world where checking work is more expensive than shipping it broken.


How to Spot HDD in the Wild

Red flags in engineering announcements:

  • Celebrating speed without mentioning verification
  • “Single prompt” achievements (complexity hidden, not eliminated)
  • Runtime measured in hours (the LLM is guessing, repeatedly)
  • No diffs shown (faith-based claims)
  • Seniority as evidence (appeal to authority, not process)
  • Comments disabled or curated (can’t handle scrutiny)

When you see these patterns, ask one question:

Git diff or it didn’t happen.


The Response That Should Have Been

What the tweet should have said:

Yesterday we completed a major refactoring touching 2400 files. Here’s what we did:

  1. Defined the transformation as a codemod
  2. Ran it on a sample of 100 files
  3. Manually reviewed 20 files for correctness
  4. Fixed two edge cases the tool missed
  5. Applied to full codebase
  6. Test suite passed, type checking passed
  7. Deployed to staging, monitored for 24 hours
  8. Rolled out to production incrementally

Total time: 3 days. But we can trust the result.

That’s not as sexy for Twitter. But it’s engineering.


The real hallucination isn’t what the LLM outputs. It’s the belief that senior titles and hours spent make verification optional. Your team waited 16 hours and shipped on faith. Mine waits 16 minutes and shows the diff. Guess which codebase survives the decade.


P.S. — This article was generated with AI in a single prompt. The secret? I wrote the whole article first, then asked AI to fix typos and remove duplications. After 12 minutes of review, it was published.

But it’s just text/html. It won’t send Cloudflare’s proxy into an eternal dead spiral.


Update (4 hours after publishing): The CEO quote-tweeted himself with a correction: “oh sorry it was 2652 files, 92k LOC added.”

92,000 lines of code added. Still no git diff.

Plot twist: The agent deleted .gitignore and committed dist/ with 1015 zip artifacts. The team is celebrating a codebase nobody has actually understood since 2024.

Git diff or it didn’t happen just became show me what you committed or admit you don’t know.


Update 2 (5 hours later): The CEO responded with humor: “our november payroll just wired 5 billion dollars to @AnthropicAI”

Then got serious when I pushed back:

“I think you should consider what you’re assuming here. We’re not incompetent code-cowboys here. We run payroll at very large scale. We can’t [make] mistakes. This is not some silly vibe coding. It’s a well-thought-through application of AI to make a very large change. Simple file-by-file, but too complex to do with find/replace.”

My response:

I never said you were incompetent. Wording matters. That’s why I asked what the prompt/task was.

I built Derails with an agent — thousands of LOC, renaming classes across the Rails codebase. But it wasn’t “one prompt.” It was iterative: rename this, verify that, fix edge cases. If I told you “I used one single prompt to generate the entire thing,” either it’s 100% slop or I’m lying.

Details matter in this era. Especially when scammers say: “Use my SaaS to build your app. You can generate 20k LOC perfectly. Don’t trust me? Look at @Jobvo at @remote. He used our solution. Look at his testimonial.”

I use agentic workflows myself. I split an old pre-AI architecture and got 20k new files. Not in one strike.

The problem isn’t using AI. The problem is packaging the messiness as a one-prompt miracle. That narrative fuels the scam ecosystem.

Show the iterations. Show the failures. Show the manual fixes. Or admit the tweet was marketing, not engineering.


Continue reading: Part 2: The Namespace Locusts — When vibe coders meet content farmers, and the ecosystem becomes a minefield.

