VESSEL RMNS ATLAS MONKEY
LOCATION Unknown Sector
STATUS Nominal
CREW ACTIVE
CLOCKWEAVE ENGINE: OPERATIONAL ◆ TEMPORAL STABILITY: 98.7% ◆ MECILIUM NETWORK: OFFLINE ◆ CHRONOS ARCHIVE: LIMITED ACCESS ◆ QUANTUM CORES: STABLE ◆
ATLAS MONKEY SHIP LOG STARDATE 2153.173

The Cascade Anomaly: When Background Jobs Attack

When the fleet's mining operations trigger an exponential job cascade across incompatible processing systems, Captain Seuros must navigate the treacherous waters of Sidekiq, GoodJob, and SolidQueue. A tale of how ActiveJob tried to unite them all—and what happens when abstraction meets reality.

TRANSMISSION ACTIVE

Captain’s Log, Stardate 2153.173 - Mining Command Vessel “Recursive Dream”

They say in space, no one can hear you scream. But they can definitely hear your job queues exploding.

What started as a routine mining operation in the Helix Nebula has turned into a cascade of exponential job processing that threatens to consume every computational resource in the fleet. As I write this, the Sidekick is processing 47,000 jobs per second, the Good Job Runner’s PostgreSQL database is at 94% capacity, and the Solid Queue… well, it’s doing its best.

This is the story of how our attempt to unify disparate job processing systems nearly brought down the entire mining fleet—and what we learned about the true cost of abstraction.

The Mining Coalition

[Image: Mining Fleet Formation]

The Helix Nebula contains the richest deposits of quantum crystals in the sector—essential for our FTL drives. To extract them efficiently, we assembled our specialized mining fleet.

ARIA, the Atlas Monkey’s conductor system, would orchestrate this massive operation. As our ship’s analytical AI, she coordinates between all systems and provides real-time fleet analysis.

Mining Vessel “Sidekick” - My old command, now captained by my protégé. Armed with Sidekiq and Redis, it’s the fastest processor in the fleet. When you need to blast through millions of asteroid scanning jobs, nothing beats the Sidekick.

Mining Vessel “Good Job Runner” - Captain Chen’s PostgreSQL-powered beauty. What it lacks in raw speed, it makes up for in reliability. Every job is a database transaction—ACID-compliant and auditable.

Mining Vessel “Solid Queue” - The newest addition, running Rails 8’s default job backend. Captain Patel swears by its simplicity: “Why maintain Redis when your database can do the work?”

Command Ship “Recursive Dream” - My current vessel, attempting to coordinate all operations through the ActiveJob Protocol.

ARIA> “Captain, all mining vessels report ready. The asteroid field contains approximately 14 million processable fragments.” The ship’s conductor AI projected the data across multiple holographic displays.

Seuros> “Excellent. Initialize the ActiveJob coordination protocol. Let’s put our unified system to the test.”

If only I’d known what we were about to unleash.

The Promise of Unity

[Image: ActiveJob Architecture]

Years ago, when I was still contributing to the core Rails fleet protocols, we designed ActiveJob to solve a fundamental problem: every ship had its own job processing dialect.

# On the Sidekick (Sidekiq)
ScanAsteroidWorker.perform_async(asteroid_id)

# On the Good Job Runner (GoodJob)
ScanAsteroidJob.perform_later(asteroid_id)

# On early vessels (DelayedJob)
Asteroid.delay.scan(asteroid_id)

ActiveJob promised a universal interface:

class ScanAsteroidJob < ApplicationJob
  queue_as :mining_ops

  retry_on NetworkError, wait: :exponentially_longer, attempts: 5

  def perform(asteroid_id)
    asteroid = Asteroid.find(asteroid_id)
    asteroid.scan_for_minerals
    MineralExtractionJob.perform_later(asteroid) if asteroid.valuable?
  end
end

ARIA> “Captain, I’m detecting some concerning patterns in the job distribution algorithm.”

Seuros> “Define ‘concerning’, ARIA.”

ARIA> “Each adapter interprets the retry logic differently. Sidekiq uses exponential backoff with jitter, GoodJob uses fixed intervals with database locks, and SolidQueue… appears to be creating duplicate retry jobs.”
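
In hindsight, the warning was clear: two retry layers were about to stack. The fix we later settled on was letting a single layer own the policy. A minimal sketch, assuming the NetworkError from the job above and an illustrative ReportFailedScanJob for the give-up path:

# Keep retries in one layer: ActiveJob owns the policy, and the block runs
# once attempts are exhausted, so the error never re-raises into the
# adapter's native retry (Sidekiq would otherwise add up to 25 more attempts).
class ScanAsteroidJob < ApplicationJob
  queue_as :mining_ops

  retry_on NetworkError, wait: :exponentially_longer, attempts: 5 do |job, error|
    ReportFailedScanJob.perform_later(job.arguments.first, error.message)
  end

  def perform(asteroid_id)
    Asteroid.find(asteroid_id).scan_for_minerals
  end
end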

The Cascade Begins

[Image: Job Queue Explosion]

At 1427 hours, we initiated the mining operation. Each vessel began processing its assigned asteroid sectors.

@Lieutenant Torres>> “Captain, the Sidekick is reporting unusual Redis memory consumption. They’re at 12GB and climbing.”

Seuros> “Show me the job metrics.”

What appeared on screen made my blood run cold:

# Job creation on mineral discovery
class MineralExtractionJob < ApplicationJob
  def perform(asteroid)
    minerals = asteroid.extract_minerals

    minerals.each do |mineral|
      # This creates N new jobs
      ProcessMineralJob.perform_later(mineral)

      # Which each create transport jobs
      TransportMineralJob.set(wait: 5.minutes).perform_later(mineral)
    end
  end
end

@Captain Chen>> “Recursive Dream, this is Good Job Runner. Our PostgreSQL instance is struggling. We’re seeing job records created faster than we can process them. Current count: 2.4 million and rising.”

@Captain Patel>> “Solid Queue reporting. We’re experiencing… unexpected behavior. Jobs marked as completed are being requeued. Database locks are failing.”

ARIA> “Captain, I’ve identified the root cause. The ActiveJob retry mechanism is interacting differently with each adapter’s native retry logic. We’re getting geometric job growth.”
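
In hindsight, the fan-out itself could have been flattened before any retry logic came into play: batch the minerals and enqueue them in bulk. A minimal sketch, assuming Rails 7.1+ for ActiveJob.perform_all_later and an illustrative ProcessMineralBatchJob:

class MineralExtractionJob < ApplicationJob
  def perform(asteroid)
    minerals = asteroid.extract_minerals

    # One job per hundred minerals instead of two jobs per mineral,
    # handed to the backend in a single bulk enqueue.
    batch_jobs = minerals.each_slice(100).map do |batch|
      ProcessMineralBatchJob.new(batch.map(&:id))
    end

    ActiveJob.perform_all_later(*batch_jobs)
  end
end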

The Redis Meltdown

[Image: Redis Memory Crisis]

The Sidekick, our fastest processor, was the first to feel the strain.

@Sidekick Engineering>> “Captain Seuros, we’re at 58GB Redis memory usage. At this rate, we’ll hit our 64GB limit in twelve minutes.”

Seuros> “Can you flush completed jobs?”

@Sidekick Engineering>> “Negative. The retry sets are holding references. Every failed job is creating multiple retry entries due to a conflict between ActiveJob’s retry_on and Sidekiq’s native retry.”

The problem was clear in the logs:

# ActiveJob retry configuration
retry_on NetworkError, wait: :exponentially_longer, attempts: 5

# But Sidekiq's middleware has its own retry layer
# sidekiq.yml (25 is also Sidekiq's default if unset)
:max_retries: 25

# Backoff is roughly (retry_count ** 4) + 15 + rand(10) seconds per attempt.
# Once ActiveJob exhausts its 5 attempts the error re-raises, and Sidekiq's
# retry layer picks the job up for up to 25 more rounds.
# Result: the two layers stack and retry jobs explode

ARIA> “Captain, I’m implementing emergency memory management protocols.”

# Emergency fix deployed to Sidekick
class CompactRedisJob < ApplicationJob
  def perform
    Sidekiq.redis do |conn|
      # Drop retry/dead entries whose scheduled time is more than an hour past
      cutoff = 1.hour.ago.to_i

      conn.zremrangebyscore('retry', '-inf', cutoff)
      conn.zremrangebyscore('dead', '-inf', cutoff)

      # Ask Redis to release freed allocator pages back to the OS
      # (MEMORY PURGE, issued as a raw command for client compatibility)
      conn.call('MEMORY', 'PURGE')
    end
  end
end

PostgreSQL Under Siege

[Image: Database Capacity Crisis]

@Captain Chen>> “Recursive Dream, we’re implementing emergency measures. Our job table has 8.7 million rows. Query performance is degrading.”

GoodJob’s strength—storing everything in PostgreSQL—had become its weakness:

# GoodJob's job preservation was overwhelming the database
class GoodJob::Job < ActiveRecord::Base
  # Every job, retry, and execution is a row
  # No automatic cleanup by default

  scope :finished, -> { where.not(finished_at: nil) }
  scope :expired, -> { finished.where('finished_at < ?', 1.day.ago) }
end

# Emergency cleanup deployed
GoodJob::Job.expired.in_batches(of: 10_000).destroy_all

@Good Job Runner Engineer>> “Captain Chen, we’re hitting table bloat. VACUUM FULL required, but that’ll lock the table.”

@Captain Chen>> “Do it. Better a brief outage than total database failure.”
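
In hindsight, hand-rolled cleanup wasn't necessary: GoodJob can prune its own table. A hedged sketch of the settings we enabled afterwards (option names as of GoodJob 3.x/4.x):

# config/initializers/good_job.rb: let GoodJob prune finished jobs itself
Rails.application.configure do
  config.good_job.cleanup_preserved_jobs_before_seconds_ago = 24.hours.to_i
  config.good_job.cleanup_interval_jobs = 1_000            # prune after every N jobs
  config.good_job.cleanup_interval_seconds = 10.minutes.to_i
end

# Or on demand, from a console or rake task:
GoodJob.cleanup_preserved_jobs(older_than: 24.hours)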

The SolidQueue Surprise

[Image: SolidQueue Behavior]

@Captain Patel>> “Recursive Dream, we’re seeing impossible behavior. Jobs we’ve never created are appearing in our queues.”

ARIA> “Captain, I’m analyzing the Solid Queue logs. It appears their concurrency control differs significantly from the others.”

SolidQueue, being the newest system, had its own interpretation of job uniqueness:

# A simplified sketch of the behaviour we observed (not SolidQueue's actual internals)
class SolidQueue::Job < ActiveRecord::Base
  # Uses database-specific locking
  # But interprets job arguments differently

  def self.create_unique(attributes)
    # This was creating "unique" jobs that weren't actually unique
    # due to serialization differences
    transaction do
      lock.where(job_hash: hash_job(attributes)).first_or_create!(attributes)
    end
  end
end
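
For the record, SolidQueue's supported answer to duplicate work is its concurrency controls rather than uniqueness hashing. A hedged sketch, with ExtractQuantumCrystalJob as an illustrative name:

class ExtractQuantumCrystalJob < ApplicationJob
  queue_as :mining_ops

  # At most one extraction per asteroid at a time; further enqueues for the
  # same key are held as blocked executions (not duplicated) for up to 15 minutes.
  limits_concurrency to: 1, key: ->(asteroid_id) { asteroid_id }, duration: 15.minutes

  def perform(asteroid_id)
    Asteroid.find(asteroid_id).extract_quantum_crystals
  end
end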

@Captain Patel>> “We’re switching to manual mode. Disabling ActiveJob integration until we understand the behavior.”

Understanding the Differences

[Image: System Comparison]

ARIA> “Captain, I’ve completed my analysis. Each system has fundamentally different philosophies:”

Sidekiq: Speed Above All

# Sidekiq's philosophy: Be fast, rely on Redis
class SidekiqAdapter
  # - Jobs are JSON in Redis
  # - Minimal overhead
  # - Fire-and-forget by default
  # - Retries handled by Sidekiq middleware

  def enqueue(job)
    Sidekiq::Client.push(
      'class' => JobWrapper,
      'wrapped' => job.class,
      'queue' => job.queue_name,
      'args' => [job.serialize]
    )
  end
end

GoodJob: Reliability Through PostgreSQL

# GoodJob's philosophy: Your database is already reliable
class GoodJobAdapter
  # - Every job is a database row
  # - Full ACID compliance
  # - Rich querying capabilities
  # - Built-in dashboard

  def enqueue(job)
    good_job = GoodJob::Job.create!(
      queue_name: job.queue_name,
      serialized_params: job.serialize,
      scheduled_at: job.scheduled_at
    )

    GoodJob::Notifier.notify(good_job) if async_mode?
  end
end

SolidQueue: Modern Simplicity

# SolidQueue's philosophy: Database-backed with modern design
class SolidQueueAdapter
  # - Works with any SQL database
  # - Polling-based by default
  # - Simpler than GoodJob
  # - Rails 8 default

  def enqueue(job)
    SolidQueue::Job.create_from_active_job(job)
  end
end

The Resolution

[Image: Fleet Coordination]

Seuros> “All vessels, implement Protocol Seven. We’re decoupling from ActiveJob for critical operations.”

@Captain Chen>> “Confirmed. Switching to native GoodJob enqueuing for transaction-critical minerals.”

@Captain Patel>> “Solid Queue reverting to direct mode for experimental processing.”

Little did we know that this cascade would pale in comparison to what awaited us when the Galactic Trade Consortium would later mandate Quantum Relay integration—but that’s a tale of how even the mighty Sidekick would need reinforcements.

The solution wasn’t to force unity, but to embrace each system’s strengths:

# Hybrid approach - use the right tool for the job
class MiningOrchestrator
  def process_asteroid(asteroid)
    case asteroid.mineral_type
    when :common
      # High volume, low criticality - use Sidekiq
      # Best practice: Pass simple JSON-serializable IDs, not objects
      FastScanJob.set(queue: 'default').perform_later(asteroid.id)

    when :rare
      # Needs transaction guarantees - use GoodJob
      # Note: priority ordering depends on your GoodJob version (pre-4.0 ran
      # higher numbers first; 4.0+ matches ActiveJob, where lower numbers win)
      RareMineralJob.set(queue: 'critical', priority: 10).perform_later(asteroid.id)

    when :experimental
      # New processing logic - use SolidQueue
      ExperimentalJob.set(queue: 'background').perform_later(asteroid.id)
    end
  end

  # Critical: Configure each adapter properly
  def self.configure_adapters
    # Sidekiq: Optimize for throughput
    Sidekiq.configure_server do |config|
      config.redis = { pool_size: 25 }
      # Best practice: Use an error service
      config.error_handlers << ->(ex, ctx_hash) {
        Honeybadger.notify(ex, context: ctx_hash)
      }
    end

    # GoodJob: Configured via Rails initializer
    # See config/initializers/good_job.rb in the Technical Appendix below

    # SolidQueue: Configure concurrency control
    SolidQueue.default_concurrency_control_period = 3.minutes
  end
end
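
The routing above only works because each job class declares its own backend; ActiveJob allows a per-class adapter. A minimal sketch using the same illustrative class names:

# High-volume scans go through Redis
class FastScanJob < ApplicationJob
  self.queue_adapter = :sidekiq
end

# Transaction-critical work stays in PostgreSQL
class RareMineralJob < ApplicationJob
  self.queue_adapter = :good_job
end

# Experimental work runs on the Rails 8 default
class ExperimentalJob < ApplicationJob
  self.queue_adapter = :solid_queue
end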

Lessons from the Cascade

[Image: Lessons Learned]

ARIA> “Captain, the cascade has been contained. All vessels report stable job processing.”

Seuros> “What did we learn, ARIA?”

ARIA> “Several critical insights:”

1. Abstraction Has Costs

ActiveJob provides a unified interface, but it can’t hide fundamental differences:

  • Sidekiq’s Redis-based speed
  • GoodJob’s PostgreSQL transaction guarantees
  • SolidQueue’s database-agnostic simplicity

2. Choose Your Weapon

# Use Sidekiq when:
# - You need maximum throughput (thousands/sec)
# - Jobs are idempotent (can be safely retried)
# - Redis infrastructure is available
# - Processing millions of jobs
# - Can tolerate at-least-once execution

# Use GoodJob when:
# - You're already using PostgreSQL
# - You need ACID transaction guarantees
# - Job history and auditing are important
# - Want to avoid Redis operational overhead
# - Queue latency tolerance allows database polling

# Use SolidQueue when:
# - You want Rails 8 defaults
# - Database-agnostic deployment needed
# - Modern, simple architecture appeals
# - Starting new Rails projects
# - Comfortable with newer, evolving system

3. Native Features Matter

Don’t fight your adapter’s nature:

# Good: Use native features when needed
if defined?(Sidekiq)
  # Use Sidekiq's bulk enqueuing
  Sidekiq::Client.push_bulk(
    'class' => Worker,
    'args' => asteroid_ids.map { |id| [id] }
  )
elsif defined?(GoodJob)
  # Use GoodJob's batch operations
  GoodJob::Batch.enqueue do
    asteroid_ids.each { |id| ScanJob.perform_later(id) }
  end
end
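
Since Rails 7.1, ActiveJob.perform_all_later also gives you adapter-agnostic bulk enqueueing, so the native calls above are only worth the coupling when you need backend-specific behaviour such as Sidekiq's push_bulk or GoodJob's batches.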

4. Monitor Everything

# Essential metrics for each system
class JobMonitor
  def self.check_health
    {
      sidekiq: check_sidekiq_health,
      goodjob: check_goodjob_health,
      solidqueue: check_solidqueue_health
    }
  end

  def self.check_sidekiq_health
    stats = Sidekiq::Stats.new
    {
      enqueued: stats.enqueued,
      retry_size: stats.retry_size,
      dead_size: stats.dead_size,
      redis_memory: Sidekiq.redis { |c| c.info['used_memory_human'] }
    }
  end

  def self.check_goodjob_health
    {
      pending: GoodJob::Job.pending.count,
      running: GoodJob::Job.running.count,
      finished: GoodJob::Job.finished.count,
      database_size: GoodJob::Job.count
    }
  end
end
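
The SolidQueue check referenced above isn't shown; a minimal sketch against SolidQueue's ActiveRecord models (Job, ReadyExecution, ScheduledExecution, FailedExecution):

class JobMonitor
  def self.check_solidqueue_health
    {
      ready: SolidQueue::ReadyExecution.count,
      scheduled: SolidQueue::ScheduledExecution.count,
      failed: SolidQueue::FailedExecution.count,
      total_rows: SolidQueue::Job.count
    }
  end
end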

Captain’s Reflection

[Image: Captain's Bridge]

As I stood on the bridge of the Recursive Dream, watching our fleet efficiently process the Helix Nebula’s riches, I thought about my journey with these systems.

I helped create ActiveJob to solve a real problem—the Tower of Babel that was background job processing in Ruby. And it succeeded. But like any abstraction, it has limits.

My years maintaining Sidekiq taught me that sometimes, raw speed matters. The Sidekick processes more jobs in an hour than most ships see in a month. But that speed comes with operational complexity—Redis clustering, memory management, connection pools.

GoodJob represents a different philosophy—one that says “your database is already mission-critical, why add another moving part?” For many missions, it’s absolutely right.

And SolidQueue? It’s the future—taking the lessons we’ve learned and building something simpler, more maintainable.

ARIA> “Captain, final statistics from the mining operation. Despite the cascade, we extracted 94% of available quantum crystals. Efficiency was… acceptable.”

Seuros> “Acceptable? ARIA, we nearly crashed three ships and almost lost a PostgreSQL cluster.”

ARIA> “Yes, Captain. But we learned. Next time will be better.”

She was right, of course. In space, as in software, our greatest failures often teach us the most.

Epilogue: The Right Tool

Weeks later, at the Federation Software Architecture Summit at Starbasecamp 37, I presented our findings. The room—filled with engineers from across the fleet—listened intently.

Seuros> “The cascade taught us that unity doesn’t mean uniformity. ActiveJob remains invaluable for application portability, but understanding your adapter isn’t optional—it’s essential.”

An engineer from a cargo vessel raised her hand. “Captain, which system do you recommend?”

Seuros> “The one that fits your mission. Running a high-frequency trading platform? Sidekiq. Building a medical records system? GoodJob. Starting fresh with Rails 8? Give SolidQueue a serious look. The beauty of ActiveJob is you can change your mind later.”

@DHH’s hologram flickered to life from Earth>> “Captain Seuros speaks wisdom. We built Rails to be omakase—a chef’s selection of best practices. But sometimes, you need to order à la carte.”

The audience laughed. Even in 2153, DHH’s philosophy echoed through the stars.
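
And changing your mind really is close to a one-line diff, assuming the new backend's gem and tables are already in place:

# config/environments/production.rb
config.active_job.queue_adapter = :solid_queue  # was :sidekiq; swap and redeploy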

Technical Appendix

For those implementing job processing in their own vessels, here’s our recommended configuration:

Sidekiq Configuration

# config/sidekiq.yml
:concurrency: 25  # Don't exceed 50; match database pool size
:timeout: 25
:queues:
  - [critical, 3]
  - [default, 2]
  - [low, 1]

# Default retry behavior: 25 retries over ~20 days
# with exponential backoff: (retry_count ** 4) + 15 + rand(10)

# Ensure config/database.yml has matching pool:
# pool: <%= ENV['RAILS_MAX_THREADS'] || 25 %>

GoodJob Configuration

# config/initializers/good_job.rb
Rails.application.configure do
  config.good_job = {
    # Queue configuration based on latency tolerance
    queues: "critical:2; default:4; low,default:2",
    max_threads: 8,
    poll_interval: 30,
    preserve_job_records: true,
    
    # Prevent cascade
    retry_on_unhandled_error: false,
    
    # Memory and performance settings
    max_cache: 10_000,  # Max scheduled jobs in memory (~20MB)
    shutdown_timeout: 30,  # Seconds to wait for jobs to finish
    
    # Enable cron for recurring jobs
    enable_cron: false  # Set to true if using GoodJob cron
  }
end

# Note: GoodJob's priority ordering changed in 4.0. Older releases ran higher
# numbers first; GoodJob 4+ matches ActiveJob, where lower numbers run first.

SolidQueue Configuration

# config/solid_queue.yml (or config/queue.yml in Rails 8+)
production:
  dispatchers:
    - polling_interval: 1
      batch_size: 500
      concurrency_maintenance_interval: 300
  workers:
    - queues: [real_time, background]  # Or "*" for all queues
      threads: 5
      polling_interval: 0.1
      processes: 3

# For recurring jobs, create config/recurring.yml
# (task names sit directly under the environment key):
# production:
#   cleanup_job:
#     class: CleanupJob
#     schedule: "every day at 3am"
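
One operational note: the dispatchers and workers above don't start themselves. Recent SolidQueue releases run them under a supervisor started with the bin/jobs binstub (or bundle exec rake solid_queue:start), and small deployments can run it inside Puma via SolidQueue's plugin :solid_queue line in config/puma.rb.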

Remember: In the vastness of space, your background jobs are often the only thing standing between order and chaos. Choose wisely.


Captain’s Log, Stardate 2153.173 - End Transmission

Captain Seuros, Mining Command Vessel “Recursive Dream”
Background Job Processing Division, Moroccan Royal Naval Service
“In space, no one can hear your jobs retry”