K3s DNS Troubleshooting: Cloudflared Restarts

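Did Cloudflared on K3s suddenly fall into CrashLoopBackOff after you turned off Tailscale? This post records how I traced the DNS resolution error back to its root cause, the systemd-resolved stub address 127.0.0.53 sending CoreDNS into an infinite forwarding loop, and how to fix it.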

1. Symptom

After turning off Tailscale and restarting K3s, the cloudflared Pod stays in CrashLoopBackOff and logs the following error:

ERR Failed to fetch features, error="lookup cfd-features.argotunnel.com on 10.43.0.10:53: server misbehaving"

2. Digging Deeper

The CoreDNS logs confirm that DNS resolution is failing. But why?
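
A minimal sketch of how CoreDNS can be inspected from the command line: its recent logs, and the dnsPolicy of its pod template (which decides where the pod's /etc/resolv.conf comes from). It assumes kubectl is configured for the cluster and that CoreDNS runs as the coredns Deployment in kube-system with the label k8s-app=kube-dns, as on a default K3s install; the function names are only illustrative.

```shell
# Sketch only: assumes kubectl can reach the K3s cluster and that CoreDNS is
# the "coredns" Deployment in kube-system, labelled k8s-app=kube-dns.

# Show the most recent CoreDNS log lines (default: 50).
coredns_logs() {
  kubectl -n kube-system logs -l k8s-app=kube-dns --tail="${1:-50}"
}

# Print the dnsPolicy set on the CoreDNS pod template.
coredns_dns_policy() {
  kubectl -n kube-system get deployment coredns \
    -o jsonpath='{.spec.template.spec.dnsPolicy}'
}

# Usage:
#   coredns_logs 100
#   coredns_dns_policy
```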

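DNS resolution flow
  1. K3s internal DNS
  2. Custom hosts entries
  3. /etc/resolv.conf
So where does that final /etc/resolv.conf come from?
  • The CoreDNS pod YAML contains the line dnsPolicy: Default. According to the Kubernetes documentation, this policy makes the pod inherit the name resolution configuration of the node it runs on, i.e. the host's /etc/resolv.conf.
  • The host's /etc/resolv.conf: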
# This is /run/systemd/resolve/stub-resolv.conf managed by man:systemd-resolved(8).
# Do not edit.
#
# This file might be symlinked as /etc/resolv.conf. If you're looking at
# /etc/resolv.conf and seeing this text, you have followed the symlink.
#
# This is a dynamic resolv.conf file for connecting local clients to the
# internal DNS stub resolver of systemd-resolved. This file lists all
# configured search domains.
#
# Run "resolvectl status" to see details about the uplink DNS servers
# currently in use.
#
# Third party programs should typically not access this file directly, but only
# through the symlink at /etc/resolv.conf. To manage man:resolv.conf(5) in a
# different way, replace this symlink by a static file or a different symlink.
#
# See man:systemd-resolved.service(8) for details about the supported modes of
# operation for /etc/resolv.conf.

nameserver 127.0.0.53
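Why does it point to 127.0.0.53?
  • Ubuntu runs another service, systemd-resolved, which sits between you and the real DNS servers as a local stub resolver (stub mode).
So what happens once this configuration reaches CoreDNS?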

[FATAL] plugin/loop: Loop (127.0.0.1:55281 -> :53) detected for zone "."

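Why does a loop occur?
  • As noted above, the last hop of the resolution flow is /etc/resolv.conf. Inside the CoreDNS pod, 127.0.0.53 is the pod's own loopback interface, where systemd-resolved is not listening but CoreDNS is, so the forwarded queries come straight back to CoreDNS itself.
What is 127.0.0.53?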
127.0.0.0/8 - This block is assigned for use as the Internet host
loopback address. A datagram sent by a higher-level protocol to an
address anywhere within this block loops back inside the host. This
is ordinarily implemented using only 127.0.0.1/32 for loopback. As
described in [RFC1122], Section 3.2.1.3, addresses within the entire
127.0.0.0/8 block do not legitimately appear on any network anywhere.
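
To check whether a given resolv.conf would hand CoreDNS a loopback upstream, the file can be scanned for nameserver entries in 127.0.0.0/8 (or ::1). The following is a small sketch; the function name is hypothetical and it only inspects the file, it does not change anything.

```shell
# Print every loopback nameserver (127.0.0.0/8 or ::1) listed in a
# resolv.conf file. Exit status is 0 if at least one was found, 1 otherwise.
find_loopback_nameservers() {
  file="${1:-/etc/resolv.conf}"
  awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2; found = 1 }
       END { exit(found ? 0 : 1) }' "$file"
}

# Usage:
#   if find_loopback_nameservers /etc/resolv.conf; then
#     echo "this file would make CoreDNS forward to loopback"
#   fi
```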

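3. Why did turning off Tailscale break things?

  • While Tailscale is running, it takes over /etc/resolv.conf, and the nameserver in it is 100.100.100.100.
  • Once Tailscale is turned off, /etc/resolv.conf reverts to the stub file shown above.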
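
One way to see which state the host is in is to look at what /etc/resolv.conf actually is at that moment. The sketch below classifies the path by its symlink target, using the two systemd-resolved file names that appear in this post; the function name and the labels it prints are made up for illustration. How exactly Tailscale rewrites the file is not something verified here, so the function only reports what it finds.

```shell
# Report what a resolv.conf path currently is:
#   stub          symlink to systemd-resolved's stub-resolv.conf (127.0.0.53)
#   uplink        symlink to systemd-resolved's resolv.conf (real upstreams)
#   other-symlink symlink to something else
#   regular-file  not a symlink (written directly by some program or by hand)
#   missing       path does not exist
resolv_conf_mode() {
  path="${1:-/etc/resolv.conf}"
  if [ -L "$path" ]; then
    case "$(readlink "$path")" in
      */systemd/resolve/stub-resolv.conf) echo "stub" ;;
      */systemd/resolve/resolv.conf)      echo "uplink" ;;
      *)                                  echo "other-symlink" ;;
    esac
  elif [ -e "$path" ]; then
    echo "regular-file"
  else
    echo "missing"
  fi
}

# Usage:
#   resolv_conf_mode              # inspects /etc/resolv.conf
#   resolv_conf_mode /some/path
```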

4. Solutions

  1. Point the host's /etc/resolv.conf at the file where systemd-resolved lists the real upstream DNS servers:

sudo ln -sf /run/systemd/resolve/resolv.conf /etc/resolv.conf

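  2. Modify the CoreDNS ConfigMap (persistent!)
I have not tried this one yet.
  3. Modify the K3s startup arguments

Add the --resolv-conf flag to the K3s start command and point it at a file that does not contain 127.0.0.53.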
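
As a rough sketch of the --resolv-conf approach, the helper below writes a resolv.conf that contains only explicit upstream nameservers and refuses loopback addresses. The function name, the path /etc/k3s-resolv.conf, and the upstreams 1.1.1.1 and 8.8.8.8 are assumptions chosen for illustration, not values from my setup.

```shell
# Write a resolv.conf containing only the given upstream nameservers.
# Loopback addresses are rejected before anything is written, because they
# would recreate the forwarding loop inside the CoreDNS pod.
write_upstream_resolv_conf() {
  out="$1"
  shift
  [ "$#" -gt 0 ] || { echo "need at least one nameserver" >&2; return 1; }
  for ns in "$@"; do
    case "$ns" in
      127.*|::1) echo "refusing loopback nameserver: $ns" >&2; return 1 ;;
    esac
  done
  : > "$out" || return 1
  for ns in "$@"; do
    printf 'nameserver %s\n' "$ns" >> "$out"
  done
}

# Usage (hypothetical path and upstreams; writing under /etc needs root):
#   write_upstream_resolv_conf /etc/k3s-resolv.conf 1.1.1.1 8.8.8.8
# Then either start K3s with:
#   k3s server --resolv-conf /etc/k3s-resolv.conf
# or put this line in /etc/rancher/k3s/config.yaml:
#   resolv-conf: /etc/k3s-resolv.conf
```

After changing the flag, K3s needs a restart, and the CoreDNS pod most likely has to be recreated before it picks up the new file.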

References:

pod 的 /etc/resolv.conf 生成机制 (How a pod's /etc/resolv.conf is generated, from a source-code perspective)
DNS for Services and Pods (Kubernetes documentation)
