Pod DNS resolution fails with internal hostname appended
server can’t find google.com.example.com: SERVFAIL
$ nslookup google.com
Server: 10.100.0.10
Address: 10.100.0.10#53
** server can't find google.com.example.com: SERVFAIL
The Symptom
DNS resolution inside Kubernetes pods fails with the internal hostname appended to the end of the query. For example, if the internal domain is example.com, then running nslookup google.com will fail with ** server can't find google.com.example.com: SERVFAIL
TL;DR
The nameserver for example.com is not publicly accessible. The upstream DNS server is unable to find a nameserver that can answer for example.com, so it returns SERVFAIL instead of NXDOMAIN. The DNS client inside the pod then stops working through its list of search domains and returns the error server can't find google.com.example.com: SERVFAIL
The solution
CoreDNS needs to be configured to forward queries for example.com to the internal IP of the example.com nameserver. If CoreDNS forwards the query to a resolver that cannot communicate with the example.com nameserver, such as the Amazon-provided DNS at x.x.0.2 or Google at 8.8.8.8, the response will be SERVFAIL instead of NXDOMAIN and DNS queries will fail.
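A minimal sketch of such a forwarding rule, written as an extra server block in the CoreDNS Corefile (on most clusters this lives in the coredns ConfigMap in the kube-system namespace). The two upstream addresses are placeholders invented for illustration; substitute the internal IPs of your own example.com nameservers.

```
# Extra server block, placed next to the default ".:53" block.
# 10.0.0.53 and 10.0.1.53 are placeholder addresses, not real values.
example.com:53 {
    errors
    cache 30
    forward . 10.0.0.53 10.0.1.53
}
```

With a block like this in place, queries under example.com go to the internal nameservers, which can answer NXDOMAIN for names that do not exist, while everything else keeps using the default upstream.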
More context
An internal hostname example.com was added to the DHCP option set of an AWS VPC. Shortly after, many DNS resolution failures were observed in a Kubernetes cluster in this VPC. Those failed queries were for names that clearly did not exist, for example google.com.example.com.
It was discovered that NetworkManager was adding example.com to the search list of the resolv.conf on each of the worker nodes, and that the resolv.conf in each of the pods was inheriting that extra search domain. ndots was set to 5 in each resolv.conf.
When a client makes a DNS query for a name that has fewer dots than specified in ndots, it goes down the search list, appending each domain to the end of its query one by one until it reaches the end of the list. If no results have been returned by the time the client reaches the end of the search list, it makes one final query for the name as it was given (that is, without any domain from the search list appended). The observed behavior appeared to contradict this. The query would fail before making that final attempt, failing at google.com.example.com instead of making the final query for google.com.
The resolv.conf in each pod looks like this:
$ cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local region.compute.internal example.com
nameserver 10.100.0.10
options ndots:5
When a pod begins a query for google.com, it traverses down its search list and makes the following queries:
google.com.default.svc.cluster.local
google.com.svc.cluster.local
google.com.cluster.local
google.com.region.compute.internal
google.com.example.com
then finally
google.com
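The ordering above can be reproduced with a small sketch. This is a simplified model of how a stub resolver expands a name, not the code of any particular resolver library; the function name candidate_queries is made up for illustration, and real resolvers differ in detail.

```python
def candidate_queries(name, search, ndots):
    """Return the names a stub resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: skip the search list.
        return [name.rstrip(".")]
    suffixed = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search list.
        return [name] + suffixed
    # Fewer dots than ndots: search list first, the name as given last.
    return suffixed + [name]


# Search list and ndots value taken from the pod's resolv.conf shown earlier.
search = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "region.compute.internal",
    "example.com",
]

for query in candidate_queries("google.com", search, ndots=5):
    print(query)
```

Because google.com has only one dot and ndots is 5, the name as typed is tried last, after all five search domains.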
It was discovered that the pods fail querying google.com.example.com not because of a CoreDNS issue, but because CoreDNS is forwarding the query to the Amazon-provided DNS, which is unable to find an address for the example.com nameserver. The example.com nameserver is publicly advertised as xxxx.example.com but is not publicly accessible. This results in a SERVFAIL response rather than an NXDOMAIN. The DNS client treats the SERVFAIL as a final answer and terminates its query instead of continuing on to query for google.com.
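The difference between the two response codes can be illustrated with a toy simulation. It models only the behavior described in this post (NXDOMAIN moves on to the next candidate, anything else ends the walk) and does not send real DNS queries. The resolve and make_lookup helpers are hypothetical, and 192.0.2.1 is a documentation address standing in for a real answer.

```python
def resolve(name, search, lookup):
    """Walk the search list, stopping at the first response that is not NXDOMAIN.

    Assumes `name` has fewer dots than ndots, so the search list is tried first.
    `lookup` is a hypothetical stand-in for a DNS query returning (rcode, address).
    """
    tried = []
    for query in [f"{name}.{domain}" for domain in search] + [name]:
        rcode, address = lookup(query)
        tried.append((query, rcode))
        if rcode == "NXDOMAIN":
            continue  # name does not exist: move on to the next candidate
        return rcode, address, tried  # NOERROR and SERVFAIL both end the walk
    return "NXDOMAIN", None, tried


def make_lookup(internal_zone_rcode):
    """Build a fake upstream that answers *.example.com with the given rcode."""
    def lookup(query):
        if query == "google.com":
            return "NOERROR", "192.0.2.1"  # placeholder address
        if query.endswith(".example.com"):
            return internal_zone_rcode, None
        return "NXDOMAIN", None
    return lookup


search = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "region.compute.internal",
    "example.com",
]

# Upstream cannot reach the example.com nameserver: SERVFAIL ends the walk.
before = resolve("google.com", search, make_lookup("SERVFAIL"))
# Upstream can reach it and gets a proper NXDOMAIN: the walk continues.
after = resolve("google.com", search, make_lookup("NXDOMAIN"))

print(before[0], "after", len(before[2]), "queries")
print(after[0], after[1], "after", len(after[2]), "queries")
```

In the first run the walk stops at google.com.example.com, the fifth query, and never reaches google.com. In the second run the sixth query, for google.com itself, succeeds.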
The solution is thankfully a simple one: CoreDNS is now configured to forward queries for example.com to the internal IPs of the example.com nameservers.