
5 Kafka Design Patterns Every Backend Engineer Should Know

March 13, 2026 · By Amit Yadav · 4 min read


I am staring at a laggy consumer group that has been falling behind for three hours. The ops channel is quiet, but the lag metrics are screaming.

This is usually the moment you realize your "simple" pub-sub architecture is actually a ticking time bomb of unhandled edge cases.

I have spent the last few years bouncing between Go and Node services, wrestling with high-throughput pipelines. If there is one thing I have learned, it is that Kafka is not a database, and it is definitely not just a "fast queue." It is a distributed commitment to complexity.

Here are the five patterns that actually saved my sleep schedule.

1. The Outbox Pattern: Consistency Without Distributed Transactions

We have all been there: you update a user's balance in Postgres, but the network blips before you can emit the payment_processed event to Kafka. Now your database and your downstream analytics are out of sync. There is no distributed transaction to fix this for you.

The Outbox pattern is the only way I have found to stay sane. Instead of hitting Kafka directly from your service, you write the event to a dedicated outbox table in your local DB within the same transaction as your business logic. A separate worker (a sidecar, a CDC stream, or a background job) tails the outbox and publishes to Kafka.

BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 42;
  INSERT INTO outbox (topic, payload, created_at)
    VALUES ('payment_processed', '{"id":42,"amount":100}', now());
COMMIT;

Either both rows land or neither does. Publish-to-Kafka becomes an at-least-once retry problem instead of a consistency problem.

2. Idempotent Consumers: Survive At-Least-Once Delivery

With the standard commit-after-processing consumer, Kafka gives you at-least-once delivery. That means your consumer will see the same message twice, eventually. If processing it twice mutates state (decrement inventory, charge a card, send an email), you are in trouble.

The fix is to make the consumer idempotent. Two common approaches:

  • Dedupe by message id — keep a Redis set of processed event IDs with a TTL longer than your expected retry window. Skip if seen.
  • Conditional writes — encode the operation as INSERT ... ON CONFLICT DO NOTHING or use a unique constraint on (aggregate_id, event_version) so the second write is a no-op.

You do not need both. Pick the one that fits the storage you already have.
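The first approach fits in a few lines. A minimal Go sketch, with a plain map standing in for the Redis set (no TTL, single-threaded; processor and handle are illustrative names):

```go
package main

import "fmt"

// processor applies each event at most once, deduping by event ID.
// seen stands in for a Redis set with a TTL; state is whatever the
// handler mutates, e.g. an inventory count.
type processor struct {
	seen  map[string]bool
	state int
}

// handle returns true if the event was applied, false if it was a duplicate.
func (p *processor) handle(eventID string, delta int) bool {
	if p.seen[eventID] {
		return false // redelivery: skip, do not mutate state again
	}
	p.state += delta
	p.seen[eventID] = true // mark seen only after the mutation succeeds
	return true
}

func main() {
	p := &processor{seen: map[string]bool{}, state: 100}
	p.handle("evt-1", -10) // first delivery: applied
	p.handle("evt-1", -10) // redelivery: skipped
	fmt.Println(p.state)   // 90, not 80
}
```

Note the ordering: mark the ID as seen only after the state change lands, so a crash between the two leaves you with a retry instead of a lost event.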

3. Dead Letter Queue: Stop the Poison Pill

A single malformed message can block an entire partition. The consumer retries, fails, retries, fails — and lag balloons because nothing else gets through.

The Dead Letter Queue pattern routes any message that still fails after N attempts to a dedicated dead-letter topic (conventionally <topic>.dlq) for human inspection. The main consumer commits the offset and keeps moving.

main-topic → consumer → success → commit
                     ↓ fail x N
                  dlq-topic → on-call alert

A few rules I follow:

  • Always include the original headers, original topic, and the exception in the DLQ payload — future-you will need them.
  • Build a small replay tool that can ship messages back to the main topic once the bug is fixed. Without it, the DLQ becomes a graveyard.
  • Alert on DLQ growth, not just consumer lag. A flatlining lag with a growing DLQ is worse than a spike.

4. Schema Registry: Evolve Without Breaking Consumers

If two services agree on a JSON shape over Slack, they will disagree on it within a quarter. Use Avro or Protobuf with a schema registry (Confluent, Apicurio, or your own) and enforce a compatibility mode — BACKWARD is the default I reach for.

What changes safely:

  • Adding an optional field with a default
  • Removing a field that has a default
  • Renaming via aliases

What breaks downstream:

  • Removing a required field
  • Changing a field's type
  • Renaming without an alias

The registry rejects bad changes at the producer side, before any consumer sees them. The cost is one extra HTTP call at startup. The benefit is never being woken up because someone shipped a "tiny" field rename.
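To make the safe case concrete: adding an optional currency field to a hypothetical PaymentProcessed Avro schema is backward compatible, because the default lets the new schema still decode old records that never had the field:

```json
{
  "type": "record",
  "name": "PaymentProcessed",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "amount", "type": "long"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}
```

Ship the same change without the "default" and the registry rejects it under BACKWARD, which is exactly the point.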

5. Log Compaction: Kafka as a State Store

Most Kafka topics are time-ordered streams — you keep the last N days and let everything older expire. Compacted topics are different: Kafka keeps only the latest message per key, forever (or until you delete the key with a tombstone).

This turns a topic into a durable, replayable state store. Real uses:

  • A users.profile topic where the latest value for each user id is the current profile.
  • A feature.flags topic that any service can subscribe to and rebuild local state from on startup.
  • A cache.invalidation topic that nodes consume to rebuild a hot cache.

Pair compaction with a low min.compaction.lag.ms so old values disappear quickly, and remember that ordering is only guaranteed within a single key — design your keys deliberately.
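On startup, rebuilding local state from a compacted topic is just a replay into a map. A Go sketch with an in-memory slice standing in for the topic (record and rebuild are illustrative; a nil value models a tombstone):

```go
package main

import "fmt"

// record is one message from a compacted topic; a nil Value is a tombstone.
type record struct {
	Key   string
	Value *string
}

// rebuild replays a compacted topic from the beginning into a local map,
// which is what a service does on startup with a feature.flags style topic.
// Later records for a key overwrite earlier ones; tombstones delete the key.
func rebuild(log []record) map[string]string {
	state := make(map[string]string)
	for _, r := range log {
		if r.Value == nil {
			delete(state, r.Key) // tombstone: the key is gone
			continue
		}
		state[r.Key] = *r.Value
	}
	return state
}

func main() {
	on, off := "on", "off"
	msgs := []record{
		{Key: "dark-mode", Value: &off},
		{Key: "new-checkout", Value: &on},
		{Key: "dark-mode", Value: &on},    // latest value wins
		{Key: "new-checkout", Value: nil}, // tombstone: flag removed
	}
	fmt.Println(rebuild(msgs)) // map[dark-mode:on]
}
```

Kafka's log cleaner does the overwrite-and-delete for you on the broker side; the consumer's replay logic stays this simple because only the latest record per key survives compaction.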

What I tell my team

These five patterns will not make Kafka simple. Kafka is not simple. They will, however, push the inevitable failure modes into shapes you have already thought about. That is the whole job: turn surprises into known problems.

If you are starting a new pipeline today, build the Outbox first, idempotency second, DLQ third. Schema discipline and compaction can wait until you have something that actually moves messages — but do not wait too long.