DNS is Simple. DNS is Hard.

How a "simple" lookup system turns into a distributed systems problem

Posted by Adam Wespiser

DNS looks like a simple mapping:

DNS :: Domain Name → IP Address

That’s the mental model most of us carry around:

wespiser.com → 104.21.13.171

It feels like configuration. A lookup. Some project metadata you change, and then it’s changed.

But that’s not what actually happens.

When your application makes a DNS request, it doesn’t go straight to the authoritative server. It goes to a recursive resolver that is run by your ISP, your company, or a public provider like 8.8.8.8.

That resolver:

  1. Queries root servers
  2. Follows referrals to TLD servers
  3. Queries the authoritative name server
  4. Caches the result
  5. Returns the answer
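The steps above can be sketched as a toy iterative resolver. The zone data here is invented for illustration; a real resolver discovers each server over the DNS wire protocol rather than reading a dictionary.

```python
# Toy model of iterative resolution: root -> TLD -> authoritative.
# Each zone either refers the resolver onward or answers authoritatively.
ZONES = {
    ".":             {"wespiser.com.": ("referral", "com.")},
    "com.":          {"wespiser.com.": ("referral", "wespiser.com.")},
    "wespiser.com.": {"wespiser.com.": ("answer", "104.21.13.171")},
}

def resolve(name, cache=None):
    cache = {} if cache is None else cache
    if name in cache:                 # cached answers short-circuit everything
        return cache[name]
    zone = "."                        # step 1: start at the root
    while True:
        kind, value = ZONES[zone][name]
        if kind == "referral":        # step 2: follow the referral downward
            zone = value
        else:                         # step 3: authoritative answer
            cache[name] = value       # step 4: cache the result
            return value              # step 5: return the answer

cache = {}
resolve("wespiser.com.", cache)       # walks root -> com. -> authoritative
resolve("wespiser.com.", cache)       # second call is served from cache
```

The cache in step 4 is the whole story of this post: once an answer is cached, the authoritative server is out of the loop until that copy expires.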

And then every other resolver in the world does the same thing: on its own timeline, with its own cache, with no coordination.

There is no global view of DNS state. There is no control plane. There is no way to ask, “what does the system believe right now?”

When you change DNS, you are not updating configuration.

You are initiating a convergence process across a distributed system you don’t control, can’t observe, and can’t roll back.

At small scale, DNS feels like a lookup.

At internet scale, it behaves like a distributed system.

That gap is where things break.

Internet building block

For a taste of how critical DNS is: on October 21, 2016, Dyn, a DNS provider serving many of the most popular web platforms, went down for hours.

The attack was basic by modern standards: have your botnet send DNS requests that are more expensive to resolve than they are to generate. Millions of unique subdomains forced resolvers to bypass caches, triggering a flood of upstream lookups that overwhelmed Dyn’s infrastructure.
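The asymmetry is easy to see in a toy cache model (names and numbers here are illustrative): repeated lookups of one name cost the upstream one query, while a fresh random subdomain per request turns every lookup into an upstream query.

```python
import random
import string

random.seed(0)          # deterministic for the example
cache = {}
upstream_lookups = 0    # queries that actually reach the authoritative side

def lookup(name):
    global upstream_lookups
    if name not in cache:
        upstream_lookups += 1         # every cache miss costs an upstream query
        cache[name] = "203.0.113.1"   # placeholder answer
    return cache[name]

# Normal traffic: 1000 lookups of one name -> one upstream query.
for _ in range(1000):
    lookup("example.com")
after_normal = upstream_lookups       # 1

# Attack traffic: a unique random subdomain per request defeats the cache,
# so 1000 lookups -> ~1000 more upstream queries.
for _ in range(1000):
    label = "".join(random.choices(string.ascii_lowercase, k=12))
    lookup(f"{label}.example.com")
```

A cache only protects the names it has seen; the attacker simply never repeats one.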

The result? Reddit, Twitter, PayPal, and others were unavailable for hours.

The real failure wasn’t that Dyn went down.

The failure was that everyone depended on Dyn.

DNS is one of the few systems where you ship a change, or suffer a failure, and then wait for independent caches across the internet to agree with you.

DNS is hard.


Where it breaks down

Close your eyes and imagine: your phone rings. An exasperated manager pulls you into a service outage. You don’t know anything yet.

What do you check?

Are the servers turned on and getting power?

Is the network connected and are nodes receiving messages?

Does DNS work?

This was the path AWS engineers found themselves walking on the night of October 19–20, 2025, when US-EAST-1 began failing.

By 12:26 AM PDT, the team had narrowed the event to DNS resolution issues for the regional DynamoDB endpoint. The underlying problem: a race condition in DynamoDB’s DNS management system.

In simple terms: the database servers were still there, the network mostly still existed, but the naming layer that told systems how to reach DynamoDB had broken.

The failure wasn’t just a race condition.

It was a race condition in a system where partial state is globally visible—and cached.

Multiple automation paths were updating DNS without coordination. When those updates collided, DNS didn’t fail cleanly. It propagated inconsistent state outward.
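A minimal sketch of that failure mode, with invented names: two automation paths each compute a DNS plan, and nothing enforces that the newest plan applies last. A last-writer-wins update lets a stale plan clobber a newer one, and that stale state is exactly what caches then pick up.

```python
# Last-writer-wins DNS updates with no coordination between writers.
# Endpoint name and "plan" versions are invented for illustration.
dns = {"db.us-east-1.example.internal": "plan-1"}

def make_plan(version):
    return {"db.us-east-1.example.internal": f"plan-{version}"}

plan_a = make_plan(2)   # automation path A builds a newer plan
plan_b = make_plan(3)   # automation path B builds the newest plan

dns.update(plan_b)      # B applies first ...
dns.update(plan_a)      # ... then A's stale plan lands and overwrites it

stale = dns["db.us-east-1.example.internal"]   # "plan-2": newest plan lost
```

A version check before applying (refuse any plan older than what is live) would close this race; without one, the outcome depends purely on apply order.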

Once that happened, everything depending on DynamoDB couldn’t reliably find it.

DNS looks like configuration. But it behaves like a control plane.

DNS is hard.


Check the cache

A few years ago, I worked as an infrastructure engineer at a cloud database company. Our mission was straightforward: take a database, put it in the cloud, and make it reliable for our customers and cheap to run for us.

Also: pick up the phone when things weren’t working, and build the system to minimize such calls.

The DNS portion of this story starts with a desire to save money by removing expensive dependencies like ELB from a simple ingress route:

Route53 → ELB → compute clusters

to something more flexible:

Route53 → Cloudflare Tunnels → compute clusters

On paper, this wasn’t especially complicated.

From a systems perspective, this felt controlled.

From a DNS perspective, we were about to push a global change into a system we didn’t control—and couldn’t observe.


The Plan

The rollout strategy was straightforward:

We targeted a two-hour migration window during working hours, and ran a test migration using a staging environment.

From a systems perspective, this felt safe: we’d done it before, and it didn’t break!

From a DNS perspective, we were initiating a global convergence event and hoping it behaved for our control plane.


The Reality

We had no reliable way to know if the DNS change was correct: no global signal, no encompassing metrics dashboard to check. Nothing told us what the system actually believed.

Most of the migration went smoothly. Changes applied, traffic flowed, TLS held.

Then we hit an issue.

Some Kubernetes clusters were holding onto DNS state longer than expected. Even after the change, parts of the system were still resolving the old configuration.

Nothing in Route53 was wrong.
Nothing in Cloudflare was wrong.

But the system wasn’t converging.

We eventually tracked it down to DNS caching inside the clusters. We manually restarted services to clear the cached state, and the system finally converged.
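One common way this happens is a client-side cache that clamps TTLs to a minimum, so records outlive whatever TTL you actually set. A sketch, with illustrative numbers:

```python
# A client-side DNS cache that enforces a minimum TTL, so records can
# outlive the TTL the authoritative zone advertises. Values are invented.
class ClampedCache:
    def __init__(self, min_ttl):
        self.min_ttl = min_ttl
        self.entries = {}  # name -> (ip, expires_at)

    def put(self, name, ip, ttl, now):
        # The clamp: short TTLs are silently stretched to min_ttl.
        self.entries[name] = (ip, now + max(ttl, self.min_ttl))

    def get(self, name, now):
        ip, expires = self.entries.get(name, (None, 0))
        return ip if now < expires else None

cache = ClampedCache(min_ttl=1800)             # cache insists on >= 30 minutes
cache.put("db.example.com", "10.0.0.1", ttl=60, now=0)

stale = cache.get("db.example.com", now=300)   # 5 minutes later: still the
                                               # old IP, despite the 60s TTL
cache.entries.clear()                          # the "restart the service" fix
fresh = cache.get("db.example.com", now=300)   # None -> forces a new lookup
```

From Route53's side everything was correct; the stale state lived entirely in layers like this one, which is why restarting the caching process was the only lever we had.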


The Lesson

As far as our planning, review, and execution could tell, the migration was correct.

From DNS’s perspective, the system was still in transition somewhere.

That gap is where things break.

DNS doesn’t give you a clean cutover.

Instead, it gives you a period where different parts of the world believe different things about your system.

Unless you explicitly account for that, you don’t have a deployment, you have a coordination problem.

DNS is hard.


How things fail

To summarize where things break:

1. No global view of state

There is no “current DNS state,” only:
“what does resolution look like from here, right now?”
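A toy model of that per-vantage-point view, with invented resolver names and times: three independent caches last fetched the record at different moments relative to a change, so asking "what is the state?" yields a different answer from each.

```python
# A record changed from OLD to NEW at t=100. Each resolver serves
# whatever it last fetched until that copy's TTL runs out.
CHANGE_TIME = 100

resolvers = {  # name -> (time of last fetch, TTL it cached with)
    "isp-resolver":    (90, 300),   # fetched just before the change
    "office-resolver": (110, 300),  # fetched just after the change
    "k8s-node-cache":  (95, 300),
}

def view(now, fetched_at, ttl):
    if now < fetched_at + ttl:      # cached copy still live: serve it
        return "NEW" if fetched_at >= CHANGE_TIME else "OLD"
    return "NEW"                    # expired: refetch sees the new record

views = {name: view(200, t, ttl) for name, (t, ttl) in resolvers.items()}
# At t=200: isp-resolver and k8s-node-cache still answer OLD,
# office-resolver answers NEW. All three are "correct" locally.
```

There is no moment you can point to and say "the change is done"; there is only the last cache whose TTL has not yet expired.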


2. Caching

Caching happens everywhere: browsers, operating system stub resolvers, recursive resolvers, language runtimes and client libraries, cluster-local DNS caches.

You can’t find them all, and you definitely can’t clear them all.


3. Time is a hidden variable

TTL settings exist, but they are not strictly enforced.

DNS doesn’t change instantly. It converges over time—and not all at once.


4. Multi-provider complexity

Route53, Cloudflare, internal DNS—all need to work together.

Each layer adds more state and more ways to be wrong.


5. It’s part of everything

TLS validation, service discovery, load balancing, failover.

When DNS is wrong, infrastructure breaks.


DNS is hard because it’s a distributed system with no global view, pervasive caching, eventual convergence, and no rollback.


Conclusion

DNS is simple. It’s a name resolution model that fits in your head.

In reality, it’s a globe-spanning distributed system with low visibility, weak consistency, and pervasive caching.

It looks like configuration. It behaves like a control plane.

The gap between those two is where outages live.

DNS is hard.