It may be the case that when you deploy a new Rancher2 Kubernetes cluster, all pods are working fine, with the exception of cattle-cluster-agent (whose job is to connect to the Kubernetes API of Rancher-launched Kubernetes clusters), which enters a CrashLoopBackOff state (shown red in your UI under the System project).
One common error you will see from View Logs of the agent’s pod is a 404, due to an HTTP ping failing:
ERROR: https://rancher-ui.example.com/ping is not accessible (The requested URL returned error: 404)
It is a DNS problem.
The issue here is that if you watch the network traffic on your Rancher2 UI server, you will never see pings coming from the pod, yet the pod is sending traffic somewhere. Where?
Observe the contents of your pod’s /etc/resolv.conf:
search default.svc.cluster.local svc.cluster.local cluster.local example.com
Now, if you happen to have a wildcard DNS A record in the example.com zone, the HTTP ping in question becomes https://rancher-ui.example.com.example.com/ping, which happens to resolve to the wildcard’s A record (most likely not the A RR of the host where the Rancher UI runs). Hence, if that machine runs a web server, you are at the mercy of whatever that web server responds.
One quick hack is to edit your Rancher2 cluster’s YAML and instruct the kubelet to start with a different resolv.conf, one that does not contain a search path with the wildcard-carrying domain in it. The kubelet appends the search path line to the default, and in this particular case you do not want that. So you tell your Rancher2 cluster the following:
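A minimal sketch of what that cluster YAML fragment could look like; the services/kubelet/extra_args keys assume the RKE cluster configuration format, and the path matches the file discussed below:

```yaml
# Sketch, assuming the RKE cluster config format: point the kubelet
# at an alternative resolv.conf without the troublesome search path.
services:
  kubelet:
    extra_args:
      resolv-conf: "/host/etc/resolv.rancher"
```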
resolv.rancher contains only nameserver entries in my case. The path to give the kubelet is /host/etc/resolv.rancher, because you have to remember that in Rancher2 clusters the kubelet itself runs from within a container and accesses the host’s file system under /host.
Now, I am pretty certain this can also be dealt with via some CoreDNS configuration, but I did not have the time to pursue it.
We’re using unbound internally for DNS resolution. It works smoothly and allows for some DNS tricks when you want to implement some split-brain trickery, though not a complete split-brain deployment. The other day we needed to send out conditional replies based on the IP address of the querying machine. Unbound comes with a Python module, but it has some of the weirdest, most unhelpful documentation ever. I am not alone in believing this.
It is very hard to figure out the source IP address of a DNS query using the unbound Python library. My first pointer on how to do so was on ServerFault. I have uploaded my own version of an operate function at pastebin. The code in question that you need to consider is:
# Find out the source IP address of the query
rl = qstate.mesh_info.reply_list
while rl:
    if rl.query_reply:
        q = rl.query_reply
        break
    rl = rl.next
# Careful with this conditional: q may never have been assigned
try:
    addr = q.addr
except NameError:
    addr = None
The try … except handling is needed because I found out that sometimes no entry in the reply_list carries a query_reply, so q is never assigned, and further down the line you may be bitten by an abnormal exit of your script.
Update: two friends have suggested that I replace the while loop with a more Pythonic use of next() over a generator expression:
# next() returns None when no entry carries a query_reply; guard for
# that explicitly instead of catching NameError, which would miss the
# AttributeError that a None q raises
q = next((x for x in qstate.mesh_info.reply_list if x.query_reply), None)
addr = q.query_reply.addr if q is not None else None
One of them actually has a pretty cool pastebin about it.
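Outside a running unbound, the idiom can be exercised with stand-in objects; the classes below are hypothetical mocks of reply_list entries, not the real unbound API:

```python
class MockReply:
    """Hypothetical stand-in for an unbound query_reply object."""
    def __init__(self, addr):
        self.addr = addr

class MockEntry:
    """Hypothetical stand-in for one reply_list node."""
    def __init__(self, query_reply):
        self.query_reply = query_reply

def source_addr(reply_list):
    # First entry that actually carries a query_reply, else None
    q = next((x for x in reply_list if x.query_reply), None)
    return q.query_reply.addr if q is not None else None

print(source_addr([MockEntry(None), MockEntry(MockReply("192.0.2.7"))]))  # 192.0.2.7
print(source_addr([MockEntry(None)]))                                     # None
```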
At work we try to manage as much as we can with terraform. This also includes Route53 for zones and records. In a certain situation we had about 14 zones and 1476 records managed in a single state file.
As it happened, I needed a zone recreated (but not erased), and this affected about 409 records. Deleting them with terraform apply took ages, to the point that the temporary STS token expired and botched the process. So, after a little facepalming, I decided to clean up the zone from the AWS console and then issue a batch of terraform state rm commands to reconcile the state. Happily, after that, apply took its time (but a reasonable one) and all was well.
I am thinking that next time I am faced with such a situation, I should lock the state file in DynamoDB, copy it over from S3, manipulate it locally, unlock, and run a plan to see how it all plays out. Or, wherever I can, use a state file per zone instead of one encompassing a set of zones.
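The local-manipulation idea can be sketched in a few lines. The state layout below is a deliberately simplified stand-in (real Terraform state files carry many more fields), but the core moves are the same: filter out the zone's record resources and bump the serial so the backend accepts the newer state.

```python
def drop_zone_resources(state: dict, zone_label: str) -> dict:
    """Return a copy of the state without the zone's record resources,
    with the serial bumped so the backend accepts it as newer."""
    kept = [r for r in state["resources"]
            if not (r["type"] == "aws_route53_record"
                    and r["name"].startswith(zone_label))]
    return {**state, "serial": state["serial"] + 1, "resources": kept}

# Simplified stand-in for a real state file, with hypothetical names
state = {
    "serial": 41,
    "resources": [
        {"type": "aws_route53_record", "name": "examplezone_www"},
        {"type": "aws_route53_record", "name": "otherzone_mx"},
        {"type": "aws_route53_zone", "name": "otherzone"},
    ],
}
new_state = drop_zone_resources(state, "examplezone")
print(len(new_state["resources"]))  # 2
print(new_state["serial"])          # 42
```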
I moved to a new flat today, and after unpacking and general housekeeping, it was time to connect the Echo to the network. Unfortunately, it refused to play ball.
So I reset it to factory defaults. No luck. The process was hanging when it tried to connect to the net; the progress bar stopped at some point past the halfway mark. Factory defaults again. No luck. But then the old mother of all evil came to mind:
Everything is a DNS problem.
Could it be so? Of course it could. I am running dnscrypt-proxy, so my DNS server is always set to 127.0.0.1 and not whatever the DHCP server serves. So, let’s get the default from the network:
networksetup -setdnsservers Wi-Fi empty
I then pointed my browser to alexa.amazon.com (yes, I am not using the app) and the configuration completed without a hassle! I switched back to using OpenDNS FamilyShield with:
networksetup -setdnsservers Wi-Fi 127.0.0.1
For anyone interested, I brew install dnscrypt-proxy, and my dnscrypt-proxy.conf has:
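The original config is not reproduced here; the fragment below is only a sketch of what such a file could contain, assuming the classic dnscrypt-proxy 1.x conf format and the OpenDNS FamilyShield resolver name from the public dnscrypt-resolvers.csv list:

```
# Sketch, assuming the dnscrypt-proxy 1.x config format
ResolverName cisco-familyshield
LocalAddress 127.0.0.1:53
Daemonize yes
```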
I’ve started reading John Day’s Patterns in Network Architecture, and in its first pages it makes strong references to Saltzer’s 1982 paper. Why would I bring this up? Well, I just heard Surprisingly Awesome‘s episode on postal codes, where they deal with two countries (Lebanon and Mongolia) with almost non-existent addressing plans. Here is what an addressing plan should give you:
- a name identifies what you want,
- an address identifies where it is, and
- a route identifies a way to get there
Day makes the case that we usually use the address of a network element as its name too, which is an error, since by moving an element elsewhere in the network we are forced to change its name as well. You, on the other hand, do not change your name when you change your home address. You used to change your phone number, but even that has become equally portable.
In places where no stable addressing system exists, the courier is required to build a mental representation of the routes in their delivery area, based on landmarks, trees, neon signs, whatever can help make the delivery. In Mongolia this is solved differently: when something arrives at the post office, they call you and you go and pick it up.
Enter the NAC. What is it exactly? It is an effort to map longitude and latitude to a more memorable representation, using a base-30 number system of digits and capital letters. Borrowing from Wikipedia, the NAC for the centre of the city of Brussels is HBV6R RG77T. Compact, accurate, but not quite memorable.
what3words seems to be a service set on solving this: with their scheme, a unique combination of just three words identifies a 3m × 3m square anywhere on the planet. For example, roughly the same place as above is described as october.donor.outlined. I admit, this is much easier to type into a GPS (or tell Siri).
However, I am still surprised that nobody ever thought of using IPv6 for this (maybe somebody has? Please tell me). Given the abundance that 128 bits give us, we could have indexed every square metre of the planet’s surface and made it addressable. Oh, the directories we could have built on top of that. But I have no fear: it is quite probable that much of the inhabited First World’s surface will be pingable in the foreseeable future. The IoT will make sure of that.
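A back-of-the-envelope check of how much of a 128-bit address that indexing would actually need, using the commonly quoted figure of roughly 510 million km² for the Earth’s surface:

```python
import math

# Earth's surface, ~510 million km^2, expressed in square metres
EARTH_SURFACE_M2 = 510_000_000 * 1_000_000

# Bits needed to give every square metre its own index
bits_needed = math.ceil(math.log2(EARTH_SURFACE_M2))
print(bits_needed)  # 49, leaving 79 bits of a /128 to spare
```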
I was trying to install a virtual machine using the latest VirtualBox on a Windows 7 host. The host was also running the OpenDNS DNSCrypt 0.0.6 client. The guest operating system was to be Debian/LXDE. Installation went fine until the installer tried to contact the Debian mirrors to fetch missing packages.
It couldn’t find them. Like the common system administration mantra says:
Everything is a DNS problem.
So, at the OpenDNS DNSCrypt client dashboard, I (temporarily) disabled the DNS over TCP option, and the installation continued smoothly. (The same problem does not occur with OS X Mavericks as the host operating system.) After the installation is finished, you can re-enable DNS over TCP for DNSCrypt; the guest operating system’s resolver sees no issues with this.
I am posting this short note because it may bite others out there.
The Internet Society (ISOC) posted its views on DNS filtering. They are excellently summed up by the ISOC in a single phrase:
The real solution is international cooperation.
The reality, though, is that DNS filtering is here to stay. And it is here to stay because its initial deployment is far easier than attacking the problem at its source via international cooperation.
So until DNS filtering (and, mainly, supporting the affected users) starts costing Service Providers so much that international cooperation, even with the bureaucracy involved, costs less, it is a fact of everyday life that we have to get used to. Just imagine debugging why a single site is unreachable while, at the same time, every antivirus vendor runs its own private resolver, queryable only by machines running their products (a “value added service”).
Sad but true.