Rebuilding the Calico Component in Kubernetes

2024/8/26 11:12:04 Tags: kubernetes, linux, networking

Background: when I came in on Monday, the calico-node pods had each restarted more than 100 times. The logs showed the following warning:

Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused

I. Problem logs

  • Frequent restarts

[root@master ~]# kubectl get pods -A -o wide
NAMESPACE              NAME                                         READY   STATUS    RESTARTS          AGE     IP               NODE     NOMINATED NODE   READINESS GATES
aliang-cka             web-5dc86dfc-t7nrb                           1/1     Running   0                 2d16h   10.244.140.72    node02   <none>           <none>
calico-apiserver       calico-apiserver-bb689689-b5v88              1/1     Running   0                 2d19h   10.244.196.131   node01   <none>           <none>
calico-apiserver       calico-apiserver-bb689689-dwlf4              1/1     Running   0                 2d19h   10.244.140.66    node02   <none>           <none>
calico-system          calico-kube-controllers-58d9bdcc64-tfqgx     1/1     Running   0                 2d23h   10.244.219.65    master   <none>           <none>
calico-system          calico-node-dr6ch                            1/1     Running   128 (64m ago)     2d23h   192.168.0.12     node01   <none>           <none>
calico-system          calico-node-lj89c                            1/1     Running   140 (2m44s ago)   2d23h   192.168.0.13     node02   <none>           <none>
calico-system          calico-node-vrz58                            1/1     Running   138 (45s ago)     2d23h   192.168.0.11     master   <none>           <none>
calico-system          calico-typha-578cfdc69-95f9b                 1/1     Running   167 (2s ago)      2d23h   192.168.0.13     node02   <none>           <none>
calico-system          calico-typha-578cfdc69-zhffj                 1/1     Running   121 (108m ago)    2d23h   192.168.0.12     node01   <none>           <none>
calico-system          csi-node-driver-5ntdf                        2/2     Running   0                 2d23h   10.244.219.68    master   <none>           <none>
calico-system          csi-node-driver-9psnp                        2/2     Running   0                 2d23h   10.244.140.65    node02   <none>           <none>
calico-system          csi-node-driver-fz67c                        2/2     Running   0                 2d23h   10.244.196.129   node01   <none>           <none>
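
With a listing this long it can help to pick out the restart loops mechanically. The following is a minimal sketch and not part of the original procedure: `high_restarts` is a hypothetical helper name, and it assumes the input comes from `kubectl get pods -A`, where the restart count is the fifth whitespace-separated field.

```shell
# high_restarts THRESHOLD
# Reads `kubectl get pods -A` output on stdin and prints "namespace/pod restarts"
# for every pod that has restarted more than THRESHOLD times.
# Assumption: the NAMESPACE column is present, so RESTARTS is field 5.
high_restarts() {
  awk -v limit="$1" 'NR > 1 && $5 + 0 > limit { print $1 "/" $2, $5 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pods -A | high_restarts 100
```

Run against the listing above with a threshold of 100, this would report the three calico-node pods and the two calico-typha pods.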

  • calico-node Events log

Events:
  Type     Reason     Age   From     Message
  ----     ------     ----  ----     -------
  Warning  Unhealthy  23m   kubelet  Readiness probe failed: 2024-07-15 01:27:04.839 [INFO][3310] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  23m  kubelet  Readiness probe failed: 2024-07-15 01:27:14.839 [INFO][3320] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  20m  kubelet  Readiness probe failed: 2024-07-15 01:30:24.839 [INFO][3553] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  16m  kubelet  Readiness probe failed: 2024-07-15 01:34:44.839 [INFO][3867] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  9m (x1666 over 2d18h)    kubelet  Liveness probe failed: Get "http://localhost:9099/liveness": dial tcp [::1]:9099: connect: connection refused
  Warning  Unhealthy  110s (x3936 over 2d18h)  kubelet  (combined from similar events): Readiness probe failed: 2024-07-15 01:49:04.836 [INFO][4911] confd/health.go 180: Number of node(s) with BGP peering established = 2
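
Every event above fails at the same step: the probe cannot connect to felix's health port. As a rough sketch (the helper name `dialed_addresses` is hypothetical), the dialed address can be tallied from the event text to confirm that all failures hit the same endpoint. This only shows where the probes connected; it does not by itself explain why felix was not listening.

```shell
# dialed_addresses
# Reads event or log text on stdin and counts each address that appears
# after "dial tcp", e.g. "[::1]:9099".
dialed_addresses() {
  grep -o 'dial tcp [^ ]*' | sed -e 's/^dial tcp //' -e 's/:$//' | sort | uniq -c
}

# Usage (the pod name is an example taken from the listing in this article):
#   kubectl describe pod -n calico-system calico-node-dr6ch | dialed_addresses
```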

II. Solution

  • 1. Completely remove calico-node and the other Calico resources.

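# On the master node, delete the Calico pods, services, deployments and namespaces.
# Delete the custom resources first, while the operator is still running to clean
# them up, then delete the operator itself.
kubectl delete -f custom-resources.yaml
kubectl delete -f tigera-operator.yaml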

# If the commands above return errors, check which Calico pods, services, deployments and
# namespaces are left over and delete them by hand, i.e. remove everything in the
# calico-system namespace. Pod names differ per cluster; substitute your own.
kubectl delete pod -n calico-system csi-node-driver-jhdvh csi-node-driver-9nmrb csi-node-driver-2w8p8 calico-node-x7spm calico-node-8z8rm calico-node-78ffv
kubectl delete deployment -n calico-system calico-typha calico-kube-controllers
kubectl delete deployment -n calico-apiserver calico-apiserver

kubectl delete svc -n calico-system calico-typha calico-kube-controllers
kubectl delete svc -n calico-apiserver calico-apiserver

kubectl delete ns calico-apiserver
kubectl delete ns calico-system
 
# More often than not, the calico-system namespace cannot be deleted at this point and gets stuck in the Terminating state.
[root@master ]# kubectl get ns -A
NAME                   STATUS        AGE
calico-system          Terminating   3d1h
default                Active        3d1h
kube-node-lease        Active        3d1h
kube-public            Active        3d1h
kube-system            Active        3d1h
kubernetes-dashboard   Active        2d19h
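
To spot stuck namespaces without reading the table by eye, the STATUS column can be filtered. This is a small sketch; `terminating_namespaces` is a hypothetical helper name, and it assumes the default `kubectl get ns` column layout shown above.

```shell
# terminating_namespaces
# Reads `kubectl get ns` output on stdin and prints the name of every
# namespace whose STATUS column is "Terminating".
terminating_namespaces() {
  awk 'NR > 1 && $2 == "Terminating" { print $1 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get ns | terminating_namespaces
```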

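# How to remove a namespace that is stuck in Terminating:
# 1. Export the namespace definition.
kubectl get ns calico-system -o json > tmp.json

# 2. Edit the exported file: delete the finalizers entry, leave everything else unchanged, and save.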
....
        "resourceVersion": "624892",
        "uid": "fa96ef83-497e-4bc7-a98a-39660e90fd32"
    },
    "spec": {
        "finalizers": [   # delete this finalizers array
            "kubernetes"
        ]
    },
    "status": {
        "phase": "Active"
    }
}
....
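
The manual edit in step 2 can also be scripted. The sketch below is one possible way to do it, not the author's method: `strip_finalizers` is a hypothetical helper name, and it assumes `python3` is available on the machine where you run it.

```shell
# strip_finalizers
# Reads a Namespace object as JSON on stdin and writes it to stdout with
# spec.finalizers removed. All other fields are left unchanged.
# Assumption: python3 is installed; only its standard json module is used.
strip_finalizers() {
  python3 -c '
import json, sys
ns = json.load(sys.stdin)
ns.get("spec", {}).pop("finalizers", None)
json.dump(ns, sys.stdout, indent=4)
'
}

# Usage (assumes kubectl is configured and calico-system is stuck in Terminating):
#   kubectl get ns calico-system -o json | strip_finalizers > tmp.json
```

The resulting tmp.json is then submitted to the finalize endpoint exactly as in the steps that follow.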

# 3. Start a proxy in the current terminal with kubectl proxy.
[root@master ]# kubectl proxy
Starting to serve on 127.0.0.1:8001

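# 4. Open a second terminal and call the finalize API with curl. The author saw no output here;
#    the API server may instead echo the namespace object back as JSON.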
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/calico-system/finalize

# 5. Check the namespaces again: calico-system is gone.
[root@master ~]# kubectl get ns -A
NAME                   STATUS   AGE
default                Active   3d1h
kube-node-lease        Active   3d1h
kube-public            Active   3d1h
kube-system            Active   3d1h
kubernetes-dashboard   Active   2d19h

# 6. On every node, empty the /etc/cni/net.d/ directory, then restart kubelet.
rm -rf /etc/cni/net.d/*
systemctl restart kubelet
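
Because this cleanup has to run on every node, a loop over SSH saves some typing. The sketch below rests on assumptions that are not in the original: root SSH access to each node, node names that resolve from the master, and a hypothetical helper name `clean_cni`. It defaults to a dry run that only prints the commands; set `DRY_RUN=0` to execute them.

```shell
# clean_cni NODE...
# For each node, empty /etc/cni/net.d/ and restart kubelet over SSH.
# With DRY_RUN=1 (the default) the ssh commands are printed, not executed.
# Assumption: root SSH access to every node.
clean_cni() {
  remote_cmd='rm -rf /etc/cni/net.d/* && systemctl restart kubelet'
  for node in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ssh root@$node '$remote_cmd'"
    else
      ssh "root@$node" "$remote_cmd"
    fi
  done
}

# Usage, with the node names from this article:
#   clean_cni master node01 node02             # dry run, prints the commands
#   DRY_RUN=0 clean_cni master node01 node02   # actually runs them
```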

# 7. The coredns pods restart and go into the Pending state. Calico has now been removed completely!

  • 2. Rebuild the Calico components

# 2.1 Before rebuilding, check time synchronization on every node; any node that is out of sync must be synchronized first.
ntpdate ntp.aliyun.com

# 2.2 Rebuild the Calico services
# Download the manifests
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/tigera-operator.yaml
wget https://raw.githubusercontent.com/projectcalico/calico/v3.25.1/manifests/custom-resources.yaml

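# Edit the CIDR in custom-resources.yaml. The default is 192.168.0.0/16; change it to the pod network range used when the cluster was created.
# My cluster was created with 10.244.0.0/16. If your cluster's range already matches the one in the official manifest, no change is needed.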
....
calicoNetwork:
    # Note: The ipPools section cannot be modified post-install.
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16  # change this value
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
....
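
The CIDR edit can be scripted as well. This is a minimal sketch under stated assumptions: `set_pod_cidr` is a hypothetical helper name, and it rewrites every line of the form `cidr: ...` in the file, which is only appropriate when the manifest defines a single IP pool, as the excerpt above does.

```shell
# set_pod_cidr FILE CIDR
# Replaces the value of every "cidr:" line in FILE with CIDR, preserving indentation.
# Assumption: FILE contains a single IP pool, as in the excerpt above.
set_pod_cidr() {
  file=$1
  cidr=$2
  sed "s#^\( *cidr: \).*#\1$cidr#" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Usage:
#   set_pod_cidr custom-resources.yaml 10.244.0.0/16
#   grep 'cidr:' custom-resources.yaml   # confirm the change before applying
```

On kubeadm clusters the pod range usually appears as the `--cluster-cidr` flag of kube-controller-manager, which is one place to check the value before editing.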

# Apply the Calico manifests
kubectl create -f tigera-operator.yaml
kubectl create -f custom-resources.yaml

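# Wait for the pods to start. If the old images are still on the nodes the rebuild is fairly quick; otherwise the images are pulled again, which takes a while.
# Rebuild complete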
[root@master calico-operator]# kubectl get pods -n calico-system
NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-58d9bdcc64-vzm9r   1/1     Running   0          5m15s
calico-node-5p7qf                          1/1     Running   0          5m16s
calico-node-9lnmn                          1/1     Running   0          5m16s
calico-node-hpxdr                          1/1     Running   0          5m16s
calico-typha-65b4547c94-46fll              1/1     Running   0          5m8s
calico-typha-65b4547c94-qb2tx              1/1     Running   0          5m16s
csi-node-driver-jrx88                      2/2     Running   0          5m16s
csi-node-driver-kw6d6                      2/2     Running   0          5m16s
csi-node-driver-wdhk7                      2/2     Running   0          5m16s
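
To confirm the rebuild without scanning the table by hand, the READY and STATUS columns can be checked mechanically. A small sketch follows; `not_ready` is a hypothetical helper name, and it assumes single-namespace output (no NAMESPACE column), as in the listing above.

```shell
# not_ready
# Reads `kubectl get pods -n <namespace>` output on stdin and prints the name of
# every pod that is not fully ready (READY x/y with x != y) or not in Running status.
# Assumption: no NAMESPACE column, so NAME, READY, STATUS are fields 1 to 3.
not_ready() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") print $1 }'
}

# Usage (assumes kubectl is already configured for the cluster):
#   kubectl get pods -n calico-system | not_ready
# Empty output means every pod is Running and fully ready.
```

An alternative that blocks until the pods are ready is `kubectl wait --for=condition=Ready pods --all -n calico-system --timeout=300s`; the timeout value here is only an example.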
