Ansible K8S deployment fails with TLS handshake error

  • Choerodon platform version: 0.20

  • Step where the problem occurred:
    The error occurs in step one, while installing the K8S cluster

  • Documentation link:
    http://choerodon.io/zh/docs/installation-configuration/steps/kubernetes/

  • Environment (node information):
    3 nodes, each with the same spec: 4 cores, 32 GB RAM, 160 GB disk

  • Error logs:
    The ingress and dns pods keep reporting errors:
    E0309 02:46:11.149523 1 reflector.go:126] pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94: Failed to list *v1.Endpoints: Get https://10.244.64.1:443/api/v1/endpoints?limit=500&resourceVersion=0: net/http: TLS handshake timeout
    I0309 02:46:11.408196 1 trace.go:82] Trace[1363523066]: "Reflector pkg/mod/k8s.io/client-go@v11.0.0+incompatible/tools/cache/reflector.go:94 ListAndWatch" (started: 2020-03-09 02:46:01.407168536 +0000 UTC m=+2717.775732341) (total time: 10.00098729s):
    Trace[1363523066]: [10.00098729s] [10.00098729s] END


    apiserver logs:
    I0309 10:27:36.331434 1 controller.go:127] OpenAPI AggregationController: action for item v1beta1.metrics.k8s.io: Rate Limited Requeue.
    I0309 10:27:38.538395 1 log.go:172] http: TLS handshake error from 10.244.0.2:55576: EOF
    I0309 10:27:42.986372 1 log.go:172] http: TLS handshake error from 10.244.0.3:33126: EOF
    I0309 10:27:53.833629 1 log.go:172] http: TLS handshake error from 10.244.0.3:33192: EOF
    I0309 10:27:53.994790 1 log.go:172] http: TLS handshake error from 10.244.0.3:33198: EOF
    I0309 10:28:05.015082 1 log.go:172] http: TLS handshake error from 10.2.7.193:61223: EOF
    I0309 10:28:05.015160 1 log.go:172] http: TLS handshake error from 10.2.7.193:5259: EOF
    I0309 10:28:15.754235 1 log.go:172] http: TLS handshake error from 10.244.0.3:33322: EOF
    I0309 10:28:15.895375 1 log.go:172] http: TLS handshake error from 10.2.7.193:37036: EOF
    I0309 10:28:15.994298 1 log.go:172] http: TLS handshake error from 10.244.0.3:33326: EOF
    I0309 10:28:16.023085 1 log.go:172] http: TLS handshake error from 10.2.7.193:10716: EOF
    I0309 10:28:16.023132 1 log.go:172] http: TLS handshake error from 10.2.7.193:13106: EOF
    I0309 10:28:26.903397 1 log.go:172] http: TLS handshake error from 10.2.7.193:3461: EOF
    I0309 10:28:26.986478 1 log.go:172] http: TLS handshake error from 10.244.0.3:33384: EOF
    I0309 10:28:27.031064 1 log.go:172] http: TLS handshake error from 10.2.7.193:35425: EOF
    I0309 10:28:27.033494 1 log.go:172] http: TLS handshake error from 10.2.7.193:59248: EOF

  • Preliminary analysis:

    From the messages this looks like a certificate validation error, yet certificate distribution appeared to complete normally during installation (a quick handshake check is sketched below).
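
    A minimal sketch for narrowing this down (openssl, kubectl and the curlimages/curl image are assumptions on my side; 6443 is the usual kubeadm secure port, and 10.244.64.1 is the service address from the logs above):

    # On a master node: check the certificate the apiserver actually serves on its secure port
    echo | openssl s_client -connect 127.0.0.1:6443 2>/dev/null | openssl x509 -noout -subject -dates

    # From a throwaway pod: try the in-cluster service address that coredns fails to reach
    kubectl run tls-check --rm -it --restart=Never --image=curlimages/curl -- \
      curl -kv --max-time 5 https://10.244.64.1:443/version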

Could you please help analyze this? Thanks.

Hi, please post the contents of your inventory.ini file.

; Fill in the information for all nodes here
; 1st field: the node's internal IP, which becomes the kubernetes node's nodeName once deployment completes
; 2nd field: ansible_port, the port sshd listens on for the node
; 3rd field: ansible_user, the remote login user name for the node
; 4th field: ansible_ssh_pass, the remote login password for the node
[all]
10.2.7.191 ansible_port=22 ansible_user="root" ansible_ssh_pass="CHERY#paas191"
10.2.7.192 ansible_port=22 ansible_user="root" ansible_ssh_pass="CHERY#paas192"
10.2.7.193 ansible_port=22 ansible_user="root" ansible_ssh_pass="CHERY#paas193"

; Private cloud:
; VIP load-balancing mode:
; i.e. load balancer + keepalived, for example the common haproxy + keepalived setup.
; This playbook ships nginx, haproxy and envoy as load balancers; switch between them by setting lb_mode.
; Setting lb_kube_apiserver_ip enables keepalived; first ask your infrastructure team to reserve an IP to use as lb_kube_apiserver_ip.
; Two nodes in the lb group are usually enough; the first node in the lb group becomes the keepalived master, the rest are backups.
;
; Node-local load-balancing mode:
; Only the load balancer is started and keepalived is not enabled (i.e. lb_kube_apiserver_ip is left unset).
; In this mode kubelet connects to the apiserver at 127.0.0.1:lb_kube_apiserver_port.
; When using this mode, leave the lb group empty.
;
; Public cloud:
; The slb mode is not recommended; use node-local load balancing directly instead.
; If you do need slb mode, first deploy with node-local load balancing,
; and switch to slb mode only after the deployment succeeds:
; change lb_mode to slb, set lb_kube_apiserver_ip to the internal IP of the purchased slb,
; and change lb_kube_apiserver_port to the slb listening port.
; Run the cluster initialization playbook again to switch over to slb mode.
[lb]

; Note: the etcd cluster must contain an odd number of nodes (1, 3, 5, 7, ...)
[etcd]
10.2.7.191
10.2.7.192
10.2.7.193

[kube-master]
10.2.7.191
10.2.7.192
10.2.7.193

[kube-worker]
10.2.7.191
10.2.7.192
10.2.7.193

; Reserved group, used when adding master nodes later
[new-master]

; Reserved group, used when adding worker nodes later
[new-worker]

; Reserved group, used when adding etcd nodes later
[new-etcd]

;-------------------------------------- Basic configuration below ------------------------------------;
[all:vars]
; Whether to skip verification of the nodes' physical resources; masters require at least 2c2g, workers at least 2c4g
skip_verify_node=false
; kubernetes version
kube_version="1.16.7"
; Load balancer
; Four options: nginx, haproxy, envoy and slb; nginx is the default
lb_mode="haproxy"
; Cluster apiserver IP when a load balancer is used; setting the lb_kube_apiserver_ip variable enables load balancer + keepalived
; lb_kube_apiserver_ip="10.2.7.191"
; Cluster apiserver port when a load balancer is used
lb_kube_apiserver_port="8443"

; Subnet selection: the pod and service subnets must not overlap with the server network.
; If they overlap, set the kube_pod_subnet and kube_service_subnet variables to change the pod and service subnets, for example:
; If the server network is 10.0.0.1/8
; the pod subnet can be set to 192.168.0.0/18
; and the service subnet to 192.168.64.0/18
; If the server network is 172.16.0.1/12
; the pod subnet can be set to 10.244.0.0/18
; and the service subnet to 10.244.64.0/18
; If the server network is 192.168.0.1/16
; the pod subnet can be set to 10.244.0.0/18
; and the service subnet to 10.244.64.0/18
; (a quick overlap check is sketched after this file)
; Cluster pod IP range
kube_pod_subnet="10.244.0.0/18"
; Cluster service IP range
kube_service_subnet="10.244.64.0/18"

; Cluster network plugin; currently flannel, calico and kube-ovn are supported
network_plugin=“flannel”

; If the servers use separate system and data disks, change the following paths to custom directories on the data disk.
; Kubelet root directory
kubelet_root_dir="/var/lib/kubelet"
; Docker container storage directory
docker_storage_dir="/var/lib/docker"
; Etcd data root directory
etcd_data_dir="/var/lib/etcd"
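
A quick way to double-check that the host network does not overlap the pod/service subnets configured above (a sketch run from the Ansible control node; it simply lists each node's IPv4 addresses, all of which should be outside 10.244.0.0/18 and 10.244.64.0/18):

# Print every node's addresses and compare them against kube_pod_subnet / kube_service_subnet
ansible -i inventory.ini all -m shell -a "hostname -I"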

Hi, please check the logs in the etcd pods; this may be caused by the clocks on the nodes differing by more than 1 second.

There do indeed seem to be abnormal delays:

2020/03/09 01:58:49 INFO: raft.node: 5efe9a1ec645aecb elected leader 5efe9a1ec645aecb at term 4
2020-03-09 01:58:52.347056 W | etcdserver: read-only range request "key:\"/registry/health\" " with result "error:context canceled" took too long (1.999960674s) to execute
WARNING: 2020/03/09 01:58:52 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-03-09 01:58:52.895954 W | etcdserver: read-only range request "key:\"foo\" " with result "error:context canceled" took too long (4.978077719s) to execute
WARNING: 2020/03/09 01:58:52 grpc: Server.processUnaryRPC failed to write status: connection error: desc = "transport is closing"
2020-03-09 01:58:56.017735 W | etcdserver: timed out waiting for read index response (local node might have slow network)
2020-03-09 01:58:56.017842 W | etcdserver: read-only range request "key:\"/registry/namespaces/default\" " with result "error:etcdserver: request timed out" took too long (9.076373297s) to execute
2020-03-09 01:58:56.017901 W | etcdserver: read-only range request "key:\"/registry/jobs/\" range_end:\"/registry/jobs0\" limit:500 " with result "error:etcdserver: request timed out" took too long (9.043044133s) to execute
2020-03-09 01:58:56.018687 W | etcdserver: read-only range request "key:\"/registry/controllers\" range_end:\"/registry/controllert\" count_only:true " with result "range_response_count:0 size:5" took too long (6.214095461s) to execute
2020-03-09 01:58:56.018818 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:432" took too long (4.996106295s) to execute
2020-03-09 01:58:56.018846 W | etcdserver: read-only range request "key:\"/registry/services/endpoints\" range_end:\"/registry/services/endpointt\" count_only:true " with result "range_response_count:0 size:7" took too long (3.245291313s) to execute
2020-03-09 02:01:06.971193 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 11.438453ms, to fb0abe4093c0e7f7)
2020-03-09 02:01:06.971233 W | etcdserver: server is likely overloaded
2020-03-09 02:01:06.971245 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 11.500831ms, to 8854dbc00315537a)
2020-03-09 02:01:06.971251 W | etcdserver: server is likely overloaded
2020-03-09 02:03:12.706388 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 49.748602ms, to fb0abe4093c0e7f7)
2020-03-09 02:03:12.706417 W | etcdserver: server is likely overloaded
2020-03-09 02:03:12.706428 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 49.80079ms, to 8854dbc00315537a)
2020-03-09 02:03:12.706434 W | etcdserver: server is likely overloaded
2020-03-09 02:05:07.917140 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 50.942364ms, to fb0abe4093c0e7f7)
2020-03-09 02:05:07.917161 W | etcdserver: server is likely overloaded
2020-03-09 02:05:07.917168 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 50.979696ms, to 8854dbc00315537a)
2020-03-09 02:05:07.917172 W | etcdserver: server is likely overloaded
2020-03-09 02:07:32.549048 I | mvcc: store.index: compact 1356
2020-03-09 02:07:32.564837 I | mvcc: finished scheduled compaction at 1356 (took 15.359479ms)
2020-03-09 02:12:32.556240 I | mvcc: store.index: compact 1983
2020-03-09 02:12:32.569738 I | mvcc: finished scheduled compaction at 1983 (took 13.101881ms)
2020-03-09 02:15:10.053828 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 197.265981ms, to fb0abe4093c0e7f7)
2020-03-09 02:15:10.053849 W | etcdserver: server is likely overloaded
2020-03-09 02:15:10.053857 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 197.299549ms, to 8854dbc00315537a)
2020-03-09 02:15:10.053860 W | etcdserver: server is likely overloaded

Please synchronize the time on all nodes so it is consistent, then restart docker.
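
A minimal sketch for checking the offset and applying the fix from the Ansible control node (chrony is an assumption here; adjust if the nodes run ntpd instead):

# Timestamps printed by the nodes should differ by well under one second
ansible -i inventory.ini all -m shell -a "date +%s.%N"

# Step the clocks via chrony (assumption), then restart docker on every node
ansible -i inventory.ini all -m shell -a "chronyc makestep && systemctl restart docker"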

Thanks for the reply. However, the problem does not seem to be there: I redid the time synchronization and the cluster is still in the same state. Do you have any other ideas?

Hi, you can try renewing the cluster certificates and then restarting docker:

ansible-playbook -i inventory.ini 95-certificates-renew.yml
systemctl restart docker
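
To confirm the renewal took effect, the validity dates of the apiserver certificate can be checked afterwards (a sketch; /etc/kubernetes/pki/apiserver.crt is the standard kubeadm path and is an assumption about this setup):

openssl x509 -noout -dates -in /etc/kubernetes/pki/apiserver.crt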

The problem is solved for now. I switched the cluster network plugin from flannel to calico. My current suspicion is that the network MTU setting was the cause; I will test this in detail later. Thanks to TimeBye for the patient support.
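
For anyone hitting the same symptom, a sketch of how the flannel MTU can be compared with the physical link (eth0 and the peer IP are assumptions; with the vxlan backend the flannel.1 device needs an MTU at least 50 bytes below the underlying interface to leave room for the VXLAN header):

# On a node: compare the physical NIC and the flannel VXLAN device
ip link show eth0 | grep -o 'mtu [0-9]*'
ip link show flannel.1 | grep -o 'mtu [0-9]*'

# Path-MTU probe to another node: 1472-byte payloads with DF set must pass on a 1500 MTU link
ping -M do -s 1472 -c 3 10.2.7.192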

OK, looking forward to your findings, thanks :grinning:

Hi, I ran into a similar problem. Do you know what exactly the issue with flannel was?