What happens when all of your DNS name servers go down?

March 24 2016

Today I went to write a post on my website. When I pointed my browser to my editor to start writing, I got a screen saying that my domain "edit.dhariri.com" could not be found on the internet. I opened a new tab and tried Google, which loaded fine. I then tried my website (this website), "dhariri.com", and again, nothing. Since all of the pieces of my site are on one server, I thought perhaps my server had gone down, so I tried to log in to it over SSH as I always do, but still nothing!

After toggling my internet connection on and off and doing some googling, I surmised that it was a base infrastructure-level failure at Digital Ocean affecting its name servers in every region around the world.

I put that in bold because I want you to read it like I realized it in my head when I figured out what was happening. Millions of people's websites, apps and online services were not loading for a large portion of a Thursday afternoon (EST) across the globe.

That's a big bug, so let's take a look at what's happening here!

What is DNS?

DNS stands for "Domain Name System" and is conceptually a big lookup table that your computer uses to find out what IP address (a unique, public number that points directly to a server) is associated with a given name. It's like a massive phone book, except the names are domains (like "dhariri.com") and the phone numbers are IP addresses (like "159.203.9.81"). It was invented because people are lousy at remembering IP addresses (and phone numbers) and are much better at remembering names.
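To make the phone book idea concrete, here is a minimal Python sketch. It is only an illustration: real DNS is a distributed, hierarchical database rather than one dictionary, and the names PHONE_BOOK and look_up are made up for this example.

```python
# A toy "phone book": domain names mapped to IP addresses.
# The single entry is the example pair used in this post.
PHONE_BOOK = {
    "dhariri.com": "159.203.9.81",
}


def look_up(domain):
    """Return the IP address listed for a domain, or None if it is unlisted."""
    # Domain names are case-insensitive, so normalize before looking up.
    return PHONE_BOOK.get(domain.lower())


print(look_up("dhariri.com"))      # the listed address
print(look_up("example.invalid"))  # not in the book, so None
```

The interesting part of real DNS is that no single machine holds the whole book, which is what the rest of this post is about.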

What happened?

I don't know exactly what happened today with Digital Ocean, but I can give you my best, slightly educated guess. When you type "dhariri.com" into your browser, your request is bounced around to a few different "phone books" to figure out what IP address corresponds to that domain. These are called "name servers". When you buy a domain, it usually comes configured to use the domain service's name servers ("ns1.godaddy.com", etc.). When your website is hosted somewhere else on the internet and you want your host to also manage your domain's settings (Digital Ocean has a nice interface for editing your domain settings), you have to point your domain's registration to your hosting provider's name servers ("ns1.digitalocean.com"). When someone tries to access your website, the servers responsible for your domain's extension (".com") refer the lookup to this name server (which you provided at your registrar), which then responds with the correct IP address.
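The hand-off to your host's name servers can be sketched as a toy model in Python. This is a simplification under stated assumptions: the two dictionaries, the resolve function and the servers_up parameter are all invented for illustration, and only the domain, name server and IP address come from this post.

```python
# Toy model of DNS delegation. The structure is made up for illustration.

# What the ".com" servers know: which name servers are responsible for a domain.
TLD_DELEGATIONS = {
    "dhariri.com": ["ns1.digitalocean.com"],
}

# What each responsible name server knows: the actual address records.
AUTHORITATIVE_RECORDS = {
    "ns1.digitalocean.com": {"dhariri.com": "159.203.9.81"},
}


def resolve(domain, servers_up):
    """Follow the hand-off to a responsible name server.

    servers_up is the set of name servers that are currently reachable.
    Returns the IP address, or None if no responsible server answers.
    """
    for name_server in TLD_DELEGATIONS.get(domain, []):
        if name_server not in servers_up:
            continue  # this name server is down, try the next one
        records = AUTHORITATIVE_RECORDS.get(name_server, {})
        if domain in records:
            return records[domain]
    return None


print(resolve("dhariri.com", servers_up={"ns1.digitalocean.com"}))
print(resolve("dhariri.com", servers_up=set()))  # every name server down
```

Notice that in the second call my server itself is perfectly healthy. The lookup fails only because nobody is left to say where it lives.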

By now you're probably seeing the flaw in this communication pattern. The original request by the user (the person on the computer who typed "dhariri.com" into their browser) is passed along to a bunch of servers that all assume the next one is alive and well. They don't check ahead or respond to failure; they just pass the request along and hurry back to process the next one. This is to be as efficient as possible. That means that if none of the name servers can answer for the domain you're looking for, the request will just time out and fail, which results in a "could not be found" screen like the one I described at the top of this post.

If a name server goes down it's a big deal, because requests for websites will fail silently and the user who typed in the website won't know why. It's an even bigger deal because this is the front door to the internet, and a bunch of services rely on human-readable domains resolving.
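From a program's point of view, this kind of failure shows up as a resolution error with no explanation of the cause. Here is a small Python sketch of what that looks like. The function names describe_lookup and broken_lookup are made up for this example, and broken_lookup only imitates an outage; socket.gethostbyname and socket.gaierror are the real standard library pieces.

```python
import socket


def describe_lookup(domain, lookup=socket.gethostbyname):
    """Try to resolve a domain and return a human-readable result.

    lookup defaults to the real resolver; it is a parameter only so the
    failure path can be exercised without an actual outage.
    """
    try:
        return f"{domain} -> {lookup(domain)}"
    except socket.gaierror as error:
        # Raised when the name cannot be resolved, whatever the reason.
        return f"{domain} could not be resolved: {error}"


def broken_lookup(domain):
    """Stand-in for a resolver whose name servers are all unreachable."""
    raise socket.gaierror("name servers unreachable")


print(describe_lookup("dhariri.com", lookup=broken_lookup))
```

The error says that the name did not resolve, not whose fault it was, which is why I first suspected my own server.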

When name servers are created by companies like Digital Ocean, there isn't just one server to process all requests. It's a big responsibility to process these domain name lookups, so many servers containing even more "workers" will often operate under a few main "routers" to provide redundancy, so that when you ask "ns1.digitalocean.com" for a domain lookup, a worker is available immediately to process that request.

When you ask for "dhariri.com", one of five of these huge routers is asked in a priority sequence, so if one doesn't resolve within 50ms, another is asked. In other words, something very big would have to break for this to all stop working. Either the main pipe for your incoming requests or all the infrastructure supporting these five name server routers would have to fall down.
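The "ask the next one if the first is too slow" idea can be sketched in a few lines of Python. This is my own illustration, not how Digital Ocean or any real resolver is implemented: the server names, ask_in_priority_order and make_fake_query are all hypothetical, and the fake query function only pretends to talk to a network.

```python
def ask_in_priority_order(domain, name_servers, query, timeout=0.05):
    """Ask each name server in turn and return the first answer.

    query(server, domain, timeout) is a caller-supplied function that
    returns an IP address string, or raises TimeoutError if the server
    does not answer in time. The 0.05 second default mirrors the 50ms
    figure mentioned in this post.
    """
    for server in name_servers:
        try:
            return query(server, domain, timeout)
        except TimeoutError:
            continue  # no answer in time, move on to the next server
    return None  # every server timed out, so the lookup fails


def make_fake_query(healthy_servers):
    """Build a pretend query function where only healthy_servers answer."""
    def query(server, domain, timeout):
        if server not in healthy_servers:
            raise TimeoutError(f"{server} did not answer within {timeout}s")
        return "159.203.9.81"
    return query


# Hypothetical server names, in priority order.
SERVERS = ["ns-a.example", "ns-b.example", "ns-c.example"]

# Two of three servers down: the lookup still succeeds.
print(ask_in_priority_order("dhariri.com", SERVERS, make_fake_query({"ns-c.example"})))
# All servers down: nothing is left to fall back to.
print(ask_in_priority_order("dhariri.com", SERVERS, make_fake_query(set())))
```

Redundancy like this covers the loss of some servers, but it cannot help once every entry in the list is unreachable at the same time.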

Today, that happened. And you saw the effect it had on people. My own website, API, editor, image server, Volley, and ten other miscellaneous projects I had hosted all broke. Imagine if any of these had non-trivial volume or were processing payments. Imagine if this happened to Uber's API or Stripe's services.

The takeaway for me is that centralized systems like DNS have risks, as we saw today with the name server failure at Digital Ocean. A big part of the internet is propped up on these services having 100% uptime and availability. Maybe it's time to think about a decentralized DNS? It's a question for folks much more qualified than myself to explore... Maybe it's something blockchain technology could handle well?