Lessons From Building Large Systems

What actually holds up as systems grow: clear boundaries, boring technology, and designing for the day things go wrong.

Most systems don’t fail because of a single bad decision. They fail because dozens of reasonable decisions accumulate, and no one steps back to notice the shape they’ve formed. After years of building and operating large systems, the lessons that have held up are not clever — they’re disciplined.

Boundaries matter more than components

Early in a system’s life, teams obsess over which database, which framework, which queue. Those choices matter, but far less than where you draw the lines between parts of the system.

A good boundary hides a decision. It lets one team change how something works without coordinating with five others. A bad boundary leaks implementation details everywhere, so a small change ripples across the codebase and the org chart at the same time.

When I review an architecture, the first question I ask is not “what is this built with?” It’s “what can change behind this interface without anyone else noticing?” If the answer is “not much,” the boundary is decorative.
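
One way to make that question concrete is a minimal Python sketch. The names here (`UserStore`, `InMemoryUserStore`, `welcome_message`) are hypothetical and exist only for illustration. The caller depends on a narrow interface, so the storage choice behind it can change without the caller noticing.

```python
from typing import Optional, Protocol


class UserStore(Protocol):
    """The boundary: callers see this contract and nothing else."""

    def get_email(self, user_id: str) -> Optional[str]: ...


class InMemoryUserStore:
    """One possible implementation; the dict is a hidden decision."""

    def __init__(self) -> None:
        self._emails = {}  # user_id -> email

    def add(self, user_id: str, email: str) -> None:
        self._emails[user_id] = email

    def get_email(self, user_id: str) -> Optional[str]:
        return self._emails.get(user_id)


def welcome_message(store: UserStore, user_id: str) -> str:
    # Depends only on the interface, so a SQL-backed or remote store
    # could replace InMemoryUserStore without touching this function.
    email = store.get_email(user_id)
    return f"Welcome, {email}" if email else "Unknown user"


store = InMemoryUserStore()
store.add("u1", "a@example.com")
print(welcome_message(store, "u1"))  # Welcome, a@example.com
```

If `welcome_message` instead reached into `store._emails`, the dict would become part of the contract and the boundary would be decorative in exactly the sense described above.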

Prefer boring technology

Every system has a budget for novelty. Spend it deliberately.

A new datastore, a new language, and a new deployment model on the same project is not innovation — it’s three simultaneous bets you’ll be debugging at 2 a.m. The teams that move fastest over years tend to use a small set of well-understood tools and reserve their inventiveness for the problem that’s actually unique to their domain.

The goal is not the most interesting system. It’s the system you can reason about when you’re tired and something is on fire.

Design for failure, not just for success

The happy path is the easy part. What separates a system that scales from one that merely works in a demo is how it behaves when a dependency is slow, a node dies, or a message arrives twice.

A few principles that consistently pay off:

  • Make operations idempotent. If a request can be safely retried, failures like timeouts and lost responses become tolerable instead of catastrophic.
  • Set timeouts and budgets everywhere. An unbounded wait is an outage in slow motion.
  • Isolate blast radius. Bulkheads, rate limits, and circuit breakers exist so one failing component doesn’t take the rest down with it.
  • Assume retries and duplicates. At scale, “exactly once” is a story you tell; “at least once, handled idempotently” is what you ship.
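
Two of these principles, idempotent operations and bounded retries, can be sketched together in Python. Everything named here (`PaymentProcessor`, `call_with_budget`, the `order-42` key) is a hypothetical illustration, and the in-memory dict stands in for what would need to be durable storage. The budget also only bounds the retry loop; a real client would still need a per-call timeout.

```python
import time


class PaymentProcessor:
    """Hypothetical handler that tolerates duplicate deliveries."""

    def __init__(self) -> None:
        self._results = {}  # idempotency key -> stored result
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = f"charged {amount_cents}"
        self._results[idempotency_key] = result
        return result


def call_with_budget(operation, budget_seconds: float, max_attempts: int = 3):
    """Retry until success, attempts run out, or the time budget is spent."""
    deadline = time.monotonic() + budget_seconds
    last_error = None
    for attempt in range(max_attempts):
        if time.monotonic() >= deadline:
            break
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            # Back off exponentially, but never sleep past the deadline.
            remaining = deadline - time.monotonic()
            time.sleep(max(0.0, min(0.01 * 2 ** attempt, remaining)))
    raise TimeoutError("retry budget exhausted") from last_error


processor = PaymentProcessor()
calls = {"n": 0}


def flaky_charge() -> str:
    calls["n"] += 1
    result = processor.charge("order-42", 500)
    if calls["n"] == 1:
        # The charge happened, but the reply was lost on the way back.
        raise ConnectionError("response lost")
    return result


print(call_with_budget(flaky_charge, budget_seconds=1.0))  # charged 500
print(processor.charges_made)  # 1
```

The retry is only safe because `charge` is idempotent: the second attempt carries the same key, so the customer is charged once even though the request arrived twice.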

Observability is part of the design

You cannot operate what you cannot see. Logs, metrics, and traces are not an afterthought to bolt on before launch — they’re how you understand the system’s actual behavior versus the behavior you imagined.

The practical test: when something is slow, can an on-call engineer go from “users are unhappy” to “this specific dependency is the cause” in minutes, without reading the source code? If not, the system is under-instrumented, no matter how elegant the code is.
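
As a rough sketch of what passing that test requires, the Python below times every call to a named dependency. The `latencies` dict is a hypothetical stand-in for a real metrics backend, and the dependency names are made up.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-process metrics sink: dependency name -> durations in seconds.
latencies = defaultdict(list)


@contextmanager
def timed(dependency: str):
    """Record how long the wrapped block spends on one named dependency."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even when the call raises, so failures are visible too.
        latencies[dependency].append(time.perf_counter() - start)


def slowest_dependency() -> str:
    """Return the dependency with the largest total recorded time."""
    return max(latencies, key=lambda name: sum(latencies[name]))


def handle_request() -> None:
    with timed("cache"):
        time.sleep(0.001)  # stands in for a fast call
    with timed("inventory-service"):
        time.sleep(0.05)  # stands in for the slow dependency


handle_request()
print(slowest_dependency())  # inventory-service
```

The point is less the mechanism than the label: because each measurement carries the dependency's name, the question "which dependency is slow?" can be answered from the data rather than from the source code.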

Data outlives code

Application code is replaceable. Data schemas and the contracts around them are not. A clean schema with clear ownership will forgive a lot of messy code above it. A tangled data model will undermine even the most beautiful service layer.

Treat schema changes, data migrations, and event contracts with the seriousness they deserve. They’re the parts of the system that are genuinely expensive to get wrong.
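
One common way to take an event contract seriously is to version it explicitly and keep reading every version that still exists in stored data. The Python sketch below assumes a hypothetical `OrderPlaced` event whose first version stored a decimal `amount` and implied USD. None of these field names come from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Current in-code shape of a hypothetical order event (schema version 2)."""
    order_id: str
    amount_cents: int
    currency: str


def parse_order_placed(payload: dict) -> OrderPlaced:
    """Accept every version still present in stored data, not just the newest."""
    version = payload.get("schema_version", 1)
    if version == 1:
        # Assumption of this sketch: v1 stored a decimal `amount` and implied USD.
        return OrderPlaced(
            order_id=payload["order_id"],
            amount_cents=round(payload["amount"] * 100),
            currency="USD",
        )
    if version == 2:
        return OrderPlaced(
            order_id=payload["order_id"],
            amount_cents=payload["amount_cents"],
            currency=payload["currency"],
        )
    # Fail loudly on unknown versions rather than guessing at their meaning.
    raise ValueError(f"unsupported schema_version: {version}")


old = parse_order_placed({"order_id": "o1", "amount": 12.5})
new = parse_order_placed(
    {"schema_version": 2, "order_id": "o2", "amount_cents": 300, "currency": "EUR"}
)
print(old.amount_cents, new.currency)  # 1250 EUR
```

The application code around this function can be rewritten freely. The old payloads cannot, which is why the upgrade path lives at the point where data is read.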

Scale is a function of clarity

The systems that scale well are rarely the most sophisticated ones. They’re the ones where each part has a clear job, failures are contained, and the people operating them understand how they behave. Scale is less about handling more load and more about keeping the system understandable as it grows.

That’s the quiet discipline of building large systems: resisting complexity you don’t need, so you have room for the complexity you do.