At around 5pm CEST on Tuesday, 19 May 2020, we noticed connectivity problems on the Renku platform: intermittent outages of services, and parts of the platform not loading correctly for some of our users. The root cause, as it turned out, was a failing DNS server managed by our cloud provider.
The alarming initial fact about this outage was that it occurred in three separate deployments, in three Kubernetes clusters spread across two zones. We started investigating and tried restarting the affected nodes and services, with little success. Towards the evening we had narrowed the connectivity problem down to individual pods in our Kubernetes deployment being unable to communicate with each other, despite network connectivity between them being available and working normally.
The reason for this was that the DNS service in our cluster failed to resolve external public hostnames. The issue only affected certain pods, and while DNS resolution with nslookup worked normally on these pods, hostname lookups inside applications were failing.
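To make that distinction concrete: most applications resolve hostnames through the C library's getaddrinfo call, which honors the search and ndots settings in /etc/resolv.conf, whereas nslookup ships its own resolver logic and can treat server errors differently, which is one plausible reason the two can disagree. A minimal Python sketch of the application-side call path (the hostname is just an example):

```python
import socket

# Applications typically resolve names via getaddrinfo, the same call
# Python's socket module uses under the hood.
try:
    addr = socket.getaddrinfo("renkulab.io", 443)
    print(addr[0][4])
except socket.gaierror as e:
    # A SERVFAIL from the nameserver typically surfaces here as
    # EAI_AGAIN ("temporary failure in name resolution"), while an
    # NXDOMAIN surfaces as EAI_NONAME ("name or service not known").
    print(f"lookup failed: {e}")
```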
We managed to narrow it down further to how some Linux systems resolve domains and how they deal with failures from nameservers. Namely, the /etc/resolv.conf file determines how a domain is resolved. In our case, that file looked something like this:
```
nameserver 1.2.3.4
search renku.svc.cluster.local svc.cluster.local cluster.local some.external.server.ch
options ndots:5
```
This tells us that the default nameserver to use for name resolution is at 1.2.3.4, that non-FQDN domains should also be tried with the suffixes listed under search, and that if the domain to be looked up has fewer than 5 dots in it (the ndots option), the lookup should start by appending those suffixes and only try the plain domain after they all fail. So for a lookup of renkulab.io, which has only one dot in it, the system would first try renkulab.io.renku.svc.cluster.local, renkulab.io.svc.cluster.local, renkulab.io.cluster.local and renkulab.io.some.external.server.ch before looking up renkulab.io itself. This is normal behavior in a Kubernetes cluster.
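For readers who like to see the mechanics spelled out, here is a minimal Python sketch of this glibc-style expansion logic, using the search list and ndots value from the example file above (illustrative only, not a real resolver implementation):

```python
# Search list and ndots value taken from the example resolv.conf above.
SEARCH = [
    "renku.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "some.external.server.ch",
]
NDOTS = 5

def candidate_lookups(name: str) -> list[str]:
    """Return the fully qualified names tried, in order, for a given name."""
    if name.endswith("."):           # already fully qualified: no expansion
        return [name]
    if name.count(".") >= NDOTS:     # enough dots: try the bare name first
        return [name] + [f"{name}.{suffix}" for suffix in SEARCH]
    # Fewer dots than ndots: try every search suffix before the bare name.
    return [f"{name}.{suffix}" for suffix in SEARCH] + [name]

print(candidate_lookups("renkulab.io"))
# ['renkulab.io.renku.svc.cluster.local', 'renkulab.io.svc.cluster.local',
#  'renkulab.io.cluster.local', 'renkulab.io.some.external.server.ch',
#  'renkulab.io']
```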
Usually, all of the earlier requests (for addresses that don't exist) would return an NXDOMAIN from the nameserver, meaning the nameserver doesn't know about the domain, so from its point of view it doesn't exist. In our case, asking for any domain under the some.external.server.ch namespace would instead return a SERVFAIL status code, which indicates that the remote DNS server for that domain hit an error while trying to process the request.
Depending on the Linux version and other factors at the OS level, a SERVFAIL either means that the next candidate is tried, or that the whole DNS lookup fails. Keep in mind that this search is performed for every name with fewer than 5 dots in it, and that as soon as one candidate resolves successfully, the search stops there and does not continue. So some DNS lookups might always work (because of the number of dots, or because an earlier candidate resolves successfully), while other lookups might work on system A (where SERVFAIL is ignored) but not on system B, despite both using the same nameserver. This made the problem very difficult to debug.
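To illustrate the failure mode, here is a small Python sketch of a resolver walking the candidate list against a nameserver that answers SERVFAIL for one search domain. The servfail_is_fatal flag models the OS-level difference described above; the zone data is made up for illustration:

```python
NXDOMAIN, SERVFAIL, OK = "NXDOMAIN", "SERVFAIL", "OK"

# Candidate order for renkulab.io, per the expansion shown earlier.
CANDIDATES = [
    "renkulab.io.renku.svc.cluster.local",
    "renkulab.io.svc.cluster.local",
    "renkulab.io.cluster.local",
    "renkulab.io.some.external.server.ch",
    "renkulab.io",
]

def fake_nameserver(fqdn: str) -> str:
    if fqdn.endswith("some.external.server.ch"):
        return SERVFAIL              # the misbehaving upstream zone
    if fqdn == "renkulab.io":
        return OK                    # the record that actually exists
    return NXDOMAIN                  # everything else is unknown

def resolve(servfail_is_fatal: bool) -> str:
    for fqdn in CANDIDATES:
        status = fake_nameserver(fqdn)
        if status == OK:
            return f"resolved {fqdn}"
        if status == SERVFAIL and servfail_is_fatal:
            return f"lookup failed at {fqdn} (SERVFAIL)"
        # NXDOMAIN, or a tolerated SERVFAIL: move on to the next candidate.
    return "lookup failed (no candidate resolved)"

print(resolve(servfail_is_fatal=False))  # resolved renkulab.io
print(resolve(servfail_is_fatal=True))   # fails at the .ch search suffix
```

The same name, the same nameserver and the same search list produce opposite outcomes depending only on how the client treats SERVFAIL.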
In the end, we identified an external DNS server managed by our cluster provider that always returned a SERVFAIL error when queried for any subdomain, ultimately causing the DNS lookup problems that prevented our pods from finding each other. We informed our provider of the issue, and they fixed it by restarting the DNS server in question, returning our platform to normal.
Moving forward, we will monitor DNS problems more closely, and we have a solution in place to patch our nodes to ignore failing external DNS servers, should this problem ever come back. This allows a much quicker response and should keep downtime to a minimum in cases like these.
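As one illustration of what such a node patch could look like (a sketch under assumptions, not the exact mechanism we deployed; the domain and file path are placeholders), a small script could strip a misbehaving domain from the resolv.conf search list:

```python
# Illustrative sketch only: drop a misbehaving domain from the search
# line of resolv.conf. BROKEN_DOMAIN and RESOLV_CONF are placeholders.
BROKEN_DOMAIN = "some.external.server.ch"
RESOLV_CONF = "/etc/resolv.conf"

def strip_broken_search_domain(text: str) -> str:
    lines = []
    for line in text.splitlines():
        if line.startswith("search"):
            domains = [d for d in line.split()[1:] if d != BROKEN_DOMAIN]
            line = "search " + " ".join(domains)
        lines.append(line)
    return "\n".join(lines) + "\n"

with open(RESOLV_CONF) as f:
    patched = strip_broken_search_domain(f.read())
with open(RESOLV_CONF, "w") as f:
    f.write(patched)
```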