renkulab interactive environments partial outage

Incident Report for renkulab

Postmortem

At around 5pm CEST on Tuesday, 19.5.2020, we noticed connectivity problems on the Renku platform: services suffered intermittent outages and parts of the platform did not load correctly for some of our users. The root cause turned out to be a failing DNS server managed by our cloud provider.

The alarming initial fact about this outage was that it occurred in three separate deployments in three Kubernetes clusters spread across two zones. We started investigating and tried restarting the affected nodes and services, with little success. Towards the evening we had narrowed down the cause to individual pods in our Kubernetes deployment not being able to communicate with each other, even though network connectivity between them was available and working normally.

The reason for this was that the DNS service in our cluster failed to resolve external public hostnames. The issue affected only certain pods, and while DNS resolution with nslookup worked normally on these pods, hostname lookups inside applications failed.

We managed to further narrow it down to an issue with how some Linux systems resolve domains and how they deal with failures from nameservers. Namely, the /etc/resolv.conf file is used to determine how a domain is resolved. In our case, that file looked something like this:

nameserver 1.2.3.4
search renku.svc.cluster.local svc.cluster.local cluster.local some.external.server.ch
options ndots:5

This tells us three things: the default nameserver for name resolution is at 1.2.3.4; names that are not fully qualified should also be tried with each of the suffixes listed in search; and if the name to be looked up has fewer than 5 dots in it (the ndots option), the lookup should try the suffixed names first and only try the plain name after those fail. So for a lookup like renkulab.io, which has only one dot in it, the system would first try renkulab.io.renku.svc.cluster.local, renkulab.io.svc.cluster.local, renkulab.io.cluster.local and renkulab.io.some.external.server.ch before looking up renkulab.io itself. This is normal behavior in a Kubernetes cluster.

Usually all of the earlier requests (for names that don’t exist) would return NXDOMAIN from the nameserver, meaning that the name doesn’t exist from the nameserver's point of view. In our case, asking for any name under some.external.server.ch would instead return a SERVFAIL status code, which indicates that the DNS server responsible for that domain hit an error while trying to process the request.
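To make the lookup order concrete, here is a minimal Python sketch of the search-list expansion described above. It is a simplified model rather than the actual resolver code; the function name candidate_names is hypothetical, and the handling of names with ndots or more dots (plain name first, then suffixes) is assumed to follow glibc-style resolvers.

```python
def candidate_names(name, search, ndots):
    """Return the names a resolver would try for `name`, in order.

    Simplified model of resolv.conf `search` / `ndots` handling.
    """
    # A trailing dot marks a fully qualified name: no suffixes are applied.
    if name.endswith("."):
        return [name.rstrip(".")]

    suffixed = [f"{name}.{suffix}" for suffix in search]

    # Fewer dots than ndots: try the suffixed names first, plain name last.
    if name.count(".") < ndots:
        return suffixed + [name]

    # Otherwise the plain name is tried first (assumed glibc-style order).
    return [name] + suffixed


# Search list and ndots value taken from the resolv.conf shown above.
SEARCH = [
    "renku.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "some.external.server.ch",
]

if __name__ == "__main__":
    for candidate in candidate_names("renkulab.io", SEARCH, ndots=5):
        print(candidate)
```

Run as a script, this prints the four suffixed names followed by renkulab.io, so the plain name is only queried after every search suffix has been tried.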

Depending on the Linux version and other factors at the OS level, a SERVFAIL either means that the next name is tried or that the whole DNS lookup fails. Keep in mind that this search is performed for all hostnames with fewer than 5 dots in them, and that as soon as one of the candidates resolves successfully, the search stops there and does not continue. So some DNS lookups might always work (due to the number of dots or an earlier candidate resolving successfully), while other lookups might work on system A (where SERVFAIL is ignored) but not on system B, despite both using the same nameserver. This made the problem very difficult to debug.
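The difference between the two behaviors can be illustrated with a small simulation. This is a sketch under stated assumptions, not a reproduction of any real resolver: fake_nameserver, lookup, the gateway service name and the IP addresses are all hypothetical, and the servfail_is_fatal flag stands in for the OS-level differences described above.

```python
NXDOMAIN, SERVFAIL = "NXDOMAIN", "SERVFAIL"

# Search list from the resolv.conf above; the last entry is the broken zone.
SEARCH = [
    "renku.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "some.external.server.ch",
]
BROKEN_ZONE = "some.external.server.ch"

# Hypothetical records: one in-cluster service and one public hostname.
KNOWN_HOSTS = {
    "gateway.renku.svc.cluster.local": "10.0.0.5",
    "renkulab.io": "192.0.2.10",
}


def fake_nameserver(name):
    """Return an IP address, NXDOMAIN, or SERVFAIL for the queried name."""
    if name.endswith("." + BROKEN_ZONE):
        return SERVFAIL  # the failing external DNS server
    return KNOWN_HOSTS.get(name, NXDOMAIN)


def lookup(name, servfail_is_fatal, ndots=5):
    """Resolve `name` through the search list; return an IP or None."""
    suffixed = [f"{name}.{suffix}" for suffix in SEARCH]
    if name.count(".") < ndots:
        candidates = suffixed + [name]
    else:
        candidates = [name] + suffixed

    for candidate in candidates:
        answer = fake_nameserver(candidate)
        if answer == SERVFAIL:
            if servfail_is_fatal:
                return None  # "system B": the whole lookup fails here
            continue  # "system A": ignore the error, try the next name
        if answer != NXDOMAIN:
            return answer  # first successful answer ends the search
    return None


if __name__ == "__main__":
    print(lookup("renkulab.io", servfail_is_fatal=False))
    print(lookup("renkulab.io", servfail_is_fatal=True))
    print(lookup("gateway", servfail_is_fatal=True))
```

In this model the in-cluster name gateway resolves on both kinds of system, because its first candidate succeeds before the broken zone is ever queried, while renkulab.io resolves only where SERVFAIL is ignored.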

In the end, we identified an external DNS server managed by our cluster provider that always returned a SERVFAIL error when queried for any subdomain. This ultimately caused the DNS lookup problems in our pods, which meant they couldn’t find each other. We informed our provider of the issue and they fixed it by restarting the DNS server in question, which returned our platform to normal.

Moving forward, we will monitor DNS more closely, and we have a solution in place to patch our nodes so that they ignore failing external DNS servers, should this problem ever come back. This allows us to respond much more quickly and should lead to minimal downtime in cases like these.

Posted May 22, 2020 - 07:44 CEST

Resolved

This incident has been resolved.

Posted May 20, 2020 - 21:39 CEST

Monitoring

A fix has been implemented by our cloud provider. We are currently monitoring the state of our components.

Posted May 20, 2020 - 17:22 CEST

Identified

The issue has been identified and we are working on a fix.

Posted May 20, 2020 - 16:01 CEST

Update

We are continuing to identify the underlying problem.

Posted May 20, 2020 - 01:55 CEST

Investigating

We are currently experiencing connectivity problems that affect listing and starting environments, as well as integration with the knowledge graph.

Posted May 19, 2020 - 20:17 CEST

This incident affected: Renkulab web UI and Knowledge Graph.