title: "Making Background Jobs Idempotent"
date: "2026-05-20"
excerpt: "Most job-queue bugs aren't caused by the job failing; they're caused by it running twice. A few design patterns that make retries safe by default."
tags: ["backend", "distributed-systems", "reliability"]
At-least-once delivery is the default for every job queue I've used in production. SQS, Sidekiq, BullMQ, Cloud Tasks — they all promise your job will run, but not that it'll run exactly once. The result: every worker you write should assume it might be called a second time before the first call finishes.
Most of the time it isn't. Then a deployment lands at the wrong moment, a network partition delays an acknowledgement past the visibility timeout, or an instance is preempted, and suddenly you have duplicate charges, double emails, or two database rows where there should be one.
The core idea
A function is idempotent when calling it multiple times with the same input produces the same result as calling it once. For a job that sends a welcome email, "same result" means the user gets exactly one welcome email regardless of how many times the job runs.
There are three cheap ways to get there.
1. Natural idempotency
Some operations are idempotent by nature. Updating a row to a specific state
(SET status = 'verified') is safe to replay. So is an upsert on a unique key.
If you can reformulate the work as a set operation instead of an append
operation, you get idempotency for free.
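To make the set-versus-append distinction concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `users` and `audit_log` tables and both function names are made up for illustration; the loop simulates the same job being delivered three times.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # Hypothetical schema, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute("CREATE TABLE audit_log (user_id INTEGER, event TEXT)")
    conn.execute("INSERT INTO users (id, status) VALUES (1, 'pending')")
    return conn


def mark_verified(conn: sqlite3.Connection, user_id: int) -> None:
    """Set operation: replaying it leaves the row in the same state."""
    conn.execute("UPDATE users SET status = 'verified' WHERE id = ?", (user_id,))


def append_verified_event(conn: sqlite3.Connection, user_id: int) -> None:
    """Append operation: every replay adds another row."""
    conn.execute(
        "INSERT INTO audit_log (user_id, event) VALUES (?, 'verified')", (user_id,)
    )


conn = make_db()
for _ in range(3):  # simulate three deliveries of the same job
    mark_verified(conn, 1)
    append_verified_event(conn, 1)

status = conn.execute("SELECT status FROM users WHERE id = 1").fetchone()[0]
log_rows = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
```

After three deliveries the user row is in the same state it would be in after one, while the append-style table has grown by one row per delivery.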
The trap: side effects outside the database. Sending an HTTP request, writing to S3, charging a card — these are not naturally idempotent and need explicit handling.
2. Idempotency keys
Before performing an irreversible side effect, record your intent in a table keyed on a stable identifier — the job ID, a transaction ID, whatever uniquely names this unit of work.
CREATE TABLE job_completions (
idempotency_key TEXT PRIMARY KEY,
completed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
In the worker:
BEGIN;
INSERT INTO job_completions (idempotency_key) VALUES ($1)
ON CONFLICT DO NOTHING
RETURNING idempotency_key;
-- if no row was returned, another run already claimed this key (it may still be in progress); exit early
COMMIT;
-- now do the side effect
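The ON CONFLICT DO NOTHING / RETURNING pattern lets you claim the work
atomically. Only one concurrent worker wins the insert. The trade-off is that
the claim commits before the side effect runs, so a worker that crashes in
between leaves the key claimed and the work undone. If that outcome is
unacceptable, the claim needs a way to expire or to be marked finished separately.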
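The claim step might look like this from application code. This is a sketch rather than a drop-in implementation: it uses SQLite through Python's sqlite3 module so it runs anywhere, which means `INSERT OR IGNORE` plus `cursor.rowcount` stands in for the Postgres `ON CONFLICT DO NOTHING ... RETURNING` shown above. The `claim` and `run_once` helper names are hypothetical.

```python
import sqlite3
from typing import Callable


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE job_completions ("
        " idempotency_key TEXT PRIMARY KEY,"
        " completed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def claim(conn: sqlite3.Connection, key: str) -> bool:
    """Try to claim a key. Returns True only for the run that inserted the row."""
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "INSERT OR IGNORE INTO job_completions (idempotency_key) VALUES (?)",
            (key,),
        )
        # rowcount is 1 when the row was inserted, 0 when the key already existed
        return cur.rowcount == 1


def run_once(
    conn: sqlite3.Connection, key: str, side_effect: Callable[[], None]
) -> bool:
    """Run side_effect only if this call wins the claim for key."""
    if not claim(conn, key):
        return False  # duplicate delivery: skip the side effect
    side_effect()
    return True
```

Calling `run_once` twice with the same key runs the side effect on the first call only; the second call returns `False` without touching it.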
3. Conditional side effects
For external APIs that support it, pass the idempotency key through to the
downstream call. Stripe's API accepts an Idempotency-Key header; most modern
payment processors and messaging APIs have an equivalent. The provider deduplicates
on their side, so even if your worker retries after a network timeout, the charge
doesn't go through twice.
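What matters on the worker side is that the key is derived from the job itself, not generated fresh on each attempt, so every retry presents the same key. The sketch below uses a hypothetical in-memory `FakePaymentProvider` to imitate provider-side deduplication; it is not any real provider's client, and the `charge:{job_id}` key format is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class FakePaymentProvider:
    """Hypothetical stand-in for a provider that deduplicates on an idempotency key."""

    charges: dict[str, int] = field(default_factory=dict)

    def charge(self, amount_cents: int, idempotency_key: str) -> str:
        # A repeated key returns the existing charge instead of creating a new one.
        if idempotency_key not in self.charges:
            self.charges[idempotency_key] = amount_cents
        return idempotency_key


def charge_for_job(provider: FakePaymentProvider, job_id: str, amount_cents: int) -> str:
    # Derive the key from the job ID so every retry of this job sends the same key.
    key = f"charge:{job_id}"
    return provider.charge(amount_cents, idempotency_key=key)


provider = FakePaymentProvider()
first = charge_for_job(provider, "job-42", 1999)
retry = charge_for_job(provider, "job-42", 1999)  # e.g. a retry after a timeout
```

Both calls resolve to the same charge, and the provider holds a single record for the job.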
When the external API doesn't support idempotency keys natively, check for the side effect before taking it:
if not email_already_sent(user_id, template="welcome"):
    send_email(...)
    record_email_sent(user_id, template="welcome")
The check-then-act window is a race condition: two workers can both pass the check before either records the send. For most business operations the probability of that happening within milliseconds is low enough to accept, and holding a database-level lock on the email record around the check and the send closes the window for concurrent workers.
What to log
When a job exits early because it detects a duplicate, log it as a distinct event — not a failure, not a success, but a deduplication. This makes it easy to measure how often duplicates actually occur and catch regressions if a deployment causes a spike.
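One way to keep deduplications distinct is to give every job run an explicit outcome and count each one separately. The sketch below uses the standard logging module and an in-process `Counter`; the outcome names, the `record_outcome` and `handle` helpers, and the `seen` set (standing in for a real idempotency check) are all assumptions for illustration.

```python
import logging
from collections import Counter

logger = logging.getLogger("jobs")

# Hypothetical outcome labels; "deduplicated" is neither a success nor a failure.
OUTCOMES = {"success", "failure", "deduplicated"}
outcome_counts: Counter = Counter()


def record_outcome(job_id: str, outcome: str) -> None:
    """Log and count a job outcome so duplicates can be measured on their own."""
    if outcome not in OUTCOMES:
        raise ValueError(f"unknown outcome: {outcome}")
    outcome_counts[outcome] += 1
    logger.info("job finished", extra={"job_id": job_id, "outcome": outcome})


def handle(job_id: str, seen: set) -> str:
    """Toy worker: 'seen' stands in for a real idempotency-key lookup."""
    if job_id in seen:
        record_outcome(job_id, "deduplicated")
        return "deduplicated"
    seen.add(job_id)
    record_outcome(job_id, "success")
    return "success"
```

In a real system the counter would more likely be a metric emitted to whatever monitoring stack is in use, so a spike in the deduplicated count after a deployment shows up on a dashboard.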
The rule I try to follow: if the job can't be made idempotent cheaply, that's a signal the work should be broken into smaller, naturally idempotent steps. A complex job that does ten things is harder to make safe than ten jobs that each do one thing.