weird dns problems are routing problems

This is a story we dealt with some months ago. After a major upgrade of the pipes that connect us with one of our upstreams (lets call them O) our support lines started getting complaints that certain sites could not be reached (RapidShare was a notable example). At first this was thought to be a temporary routing problem, but as calls started to amass within the hour, we looked further into it.

Since we already had a list of sites that were not reachable, we used the first tool one uses at such situations: traceroute. What I immediately observed was that prior to failing, traceroute took sometime to execute. Could it be that the DNS servers and not the web sites were not reachable?

Since we run a separate setup that forwards queries to OpenDNS, I used that DNS server, and traceroute worked! So it indeed was a routing problem, only it was not a routing problem to the web site itself, rather to its name servers who were unreachable by our DNS servers. Tracerouting by hand to the IP addresses of the DNS servers of the “problematic” domains verified that. Luckily we could reach OpenDNS, who in turn could reach (and cached) the DNS servers we could not.

While in fact a very small part of the Internet was not reachable at the time, due to the fact that lots of DNS servers serving other domains lived there, a significantly larger part of the web was invisible. Such issues do occur when your DNS and your hosting provider are networks appart.

We opened a ticket with O and tried to resolve the situation. It seemed that the problem was a case of asymmetric routing, since answers to our packets were returning via our other upstream (lets call them F). The weirdness of describing the problem confused at first the routing engineers at F who could not believe that we were using DNS servers that were not forwarding to theirs. Luckily their DNS master is a good friend and helped get past that quickly. They (F) located the problem to the configuration of one of their upstreams (lets call them L). They opened a ticket and we all waited. In the mean time they (F) devised a backup plan, in case L did not want to cooperate. But they did.

So when you think (or are told) that a web site is unreachable, always check whether its DNS servers are reachable first.

2 thoughts on “weird dns problems are routing problems

  1. Once upon a time, a large upstream, let’s call them O, wasn’t advertising the customers who had their own AS to, but only O’s address space. But since O had a peering with k, every DNS packet from an AS enabled customer destined to k was blackholed…. So k’s anycast was nocast :)

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s