c7n base services restart automatically

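catcallxw · November 2, 2020 02:00

The base services restart automatically because they cannot connect to the MySQL database. Once MySQL has restarted, the services come back up successfully, but after a while they restart on their own again.
The manager-service log is shown below (the other services log the same kind of error):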
2020-11-02 07:38:23.224 WARN 7 --- [ XNIO-2 task-4] o.s.b.a.jdbc.DataSourceHealthIndicator : DataSource health check failed

org.springframework.jdbc.CannotGetJdbcConnectionException: Failed to obtain JDBC Connection; nested exception is java.sql.SQLTransientConnectionException: HikariPool-1 - Connection is not available, request timed out after 30000ms.
at org.springframework.jdbc.datasource.DataSourceUtils.getConnection(DataSourceUtils.java:81) ~[spring-jdbc-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:323) ~[spring-jdbc-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
at org.springframework.boot.actuate.jdbc.DataSourceHealthIndicator.getProduct(DataSourceHealthIndicator.java:119) ~[spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.jdbc.DataSourceHealthIndicator.doDataSourceHealthCheck(DataSourceHealthIndicator.java:107) ~[spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.jdbc.DataSourceHealthIndicator.doHealthCheck(DataSourceHealthIndicator.java:102) ~[spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.health.AbstractHealthIndicator.health(AbstractHealthIndicator.java:82) ~[spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.health.CompositeHealthIndicator.health(CompositeHealthIndicator.java:95) [spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.health.HealthEndpoint.health(HealthEndpoint.java:50) [spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.health.HealthEndpointWebExtension.health(HealthEndpointWebExtension.java:53) [spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at sun.reflect.GeneratedMethodAccessor168.invoke(Unknown Source) ~[na:na]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[na:1.8.0_242]
at java.lang.reflect.Method.invoke(Unknown Source) ~[na:1.8.0_242]
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:282) [spring-core-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
at org.springframework.boot.actuate.endpoint.invoke.reflect.ReflectiveOperationInvoker.invoke(ReflectiveOperationInvoker.java:76) [spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.endpoint.annotation.AbstractDiscoveredOperation.invoke(AbstractDiscoveredOperation.java:60) [spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.endpoint.web.servlet.AbstractWebMvcEndpointHandlerMapping$ServletWebOperationAdapter.handle(AbstractWebMvcEndpointHandlerMapping.java:278) [spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at org.springframework.boot.actuate.endpoint.web.servlet.AbstractWebMvcEndpointHandlerMapping$OperationHandler.handle(AbstractWebMvcEndpointHandlerMapping.java:334) [spring-boot-actuator-2.1.6.RELEASE.jar!/:2.1.6.RELEASE]
at sun.reflect.GeneratedMethodAccessor167.invoke(Unknown Source) ~[na:na]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) ~[na:1.8.0_242]
at java.lang.reflect.Method.invoke(Unknown Source) ~[na:1.8.0_242]
at org.springframework.web.method.support.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:190) [spring-web-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]
at org.springframework.web.method.support.InvocableHandlerMethod.invokeForRequest(InvocableHandlerMethod.java:138) [spring-web-5.1.8.RELEASE.jar!/:5.1.8.RELEASE]

The MySQL log is as follows:
2020-11-01T23:46:20.026384Z 3441 [Note] Aborted connection 3441 to db: ‘asgard_service’ user: ‘choerodon’ host: ‘192.168.2.14’ (Got an error reading communication packets)

2020-11-01T23:46:20.026614Z 3462 [Note] Aborted connection 3462 to db: ‘notify_service’ user: ‘choerodon’ host: ‘192.168.2.15’ (Got an error reading communication packets)

2020-11-01T23:46:20.026742Z 3444 [Note] Aborted connection 3444 to db: ‘test_manager_service’ user: ‘choerodon’ host: ‘192.168.2.35’ (Got an error reading communication packets)

2020-11-01T23:46:20.026806Z 3464 [Note] Aborted connection 3464 to db: ‘base_service’ user: ‘choerodon’ host: ‘192.168.2.31’ (Got an error reading communication packets)

2020-11-01T23:46:20.026943Z 3511 [Note] Aborted connection 3511 to db: ‘base_service’ user: ‘choerodon’ host: ‘192.168.0.134’ (Got an error reading communication packets)

2020-11-01T23:46:20.027151Z 3516 [Note] Aborted connection 3516 to db: ‘workflow_service’ user: ‘choerodon’ host: ‘192.168.1.58’ (Got an error reading communication packets)

2020-11-01T23:46:20.027236Z 3530 [Note] Aborted connection 3530 to db: ‘workflow_service’ user: ‘choerodon’ host: ‘192.168.1.58’ (Got an error reading communication packets)

2020-11-01T23:46:20.027461Z 3533 [Note] Aborted connection 3533 to db: ‘asgard_service’ user: ‘choerodon’ host: ‘192.168.2.14’ (Got an error reading communication packets)

2020-11-01T23:46:20.027662Z 3517 [Note] Aborted connection 3517 to db: ‘base_service’ user: ‘choerodon’ host: ‘192.168.2.31’ (Got an error reading communication packets)

2020-11-01T23:46:20.027744Z 3535 [Note] Aborted connection 3535 to db: ‘agile_service’ user: ‘choerodon’ host: ‘192.168.1.44’ (Got an error reading communication packets)

2020-11-01T23:46:20.027956Z 3544 [Note] Aborted connection 3544 to db: ‘knowledgebase_service’ user: ‘choerodon’ host: ‘192.168.1.52’ (Got an error reading communication packets)

2020-11-01T23:46:20.028179Z 3542 [Note] Aborted connection 3542 to db: ‘base_service’ user: ‘choerodon’ host: ‘192.168.3.74’ (Got an error reading communication packets)

2020-11-01T23:46:20.028351Z 3538 [Note] Aborted connection 3538 to db: ‘devops_service’ user: ‘choerodon’ host: ‘192.168.2.36’ (Got an error reading communication packets)

2020-11-01T23:46:20.028593Z 3468 [Note] Aborted connection 3468 to db: ‘test_manager_service’ user: ‘choerodon’ host: ‘192.168.2.35’ (Got an error reading communication packets)

2020-11-01T23:46:20.028635Z 3548 [Note] Aborted connection 3548 to db: ‘notify_service’ user: ‘choerodon’ host: ‘192.168.2.15’ (Got an error reading communication packets)

2020-11-01T23:46:20.028899Z 3553 [Note] Aborted connection 3553 to db: ‘manager_service’ user: ‘choerodon’ host: ‘192.168.2.11’ (Got an error reading communication packets)

2020-11-01T23:46:20.029184Z 3547 [Note] Aborted connection 3547 to db: ‘base_service’ user: ‘choerodon’ host: ‘192.168.2.31’ (Got an error reading communication packets)

2020-11-01T23:46:20.029372Z 3564 [Note] Aborted connection 3564 to db: ‘manager_service’ user: ‘choerodon’ host: ‘192.168.2.11’ (Got an error reading communication packets)

2020-11-01T23:46:20.029668Z 3554 [Note] Aborted connection 3554 to db: ‘base_service’ user: ‘choerodon’ host: ‘192.168.0.134’ (Got an error reading communication packets)

2020-11-01T23:46:20.029932Z 3370 [Note] Aborted connection 3370 to db: ‘base_service’ user: ‘choerodon’ host: ‘192.168.0.134’ (Got an error reading communication packets)

2020-11-01T23:46:20.030138Z 3385 [Note] Aborted connection 3385 to db: ‘workflow_service’ user: ‘choerodon’ host: ‘192.168.1.58’ (Got an error reading communication packets)

2020-11-01T23:46:20.030255Z 3373 [Note] Aborted connection 3373 to db: ‘test_manager_service’ user: ‘choerodon’ host: ‘192.168.2.35’ (Got an error reading communication packets)

2020-11-01T23:46:20.030469Z 3351 [Note] Aborted connection 3351 to db: ‘workflow_service’ user: ‘choerodon’ host: ‘192.168.1.58’ (Got an error reading communication packets)

2020-11-01T23:46:20.215889Z 2943 [Note] Aborted connection 2943 to db: ‘devops_service’ user: ‘choerodon’ host: ‘192.168.2.36’ (Got an error reading communication packets)

2020-11-01T23:46:20.416099Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 938478ms. The settings might not be optimal. (flushed=4 and evicted=0, during the time.)

2020-11-01T23:46:36.956603Z 3589 [Warning] InnoDB: Retry attempts for reading partial data failed.

2020-11-01T23:46:36.956639Z 3589 [ERROR] InnoDB: Tried to read 16384 bytes at offset 49152, but was only able to read 0

2020-11-01T23:46:36.956648Z 3589 [ERROR] InnoDB: Operating system error number 5 in a file operation.

2020-11-01T23:46:36.956664Z 3589 [ERROR] InnoDB: Error number 5 means ‘Input/output error’

2020-11-01T23:46:36.956668Z 3589 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html

2020-11-01T23:46:36.956673Z 3589 [ERROR] InnoDB: File (unknown): ‘read’ returned OS error 105. Cannot continue operation

2020-11-01T23:46:36.956677Z 3589 [ERROR] InnoDB: Cannot continue operation.

2020-11-01T23:46:39.198976Z 3633 [Warning] InnoDB: Retry attempts for reading partial data failed.

2020-11-01T23:46:39.199000Z 3633 [ERROR] InnoDB: Tried to read 16384 bytes at offset 65536, but was only able to read 0

2020-11-01T23:46:39.199005Z 3633 [ERROR] InnoDB: Operating system error number 5 in a file operation.

2020-11-01T23:46:39.199010Z 3633 [ERROR] InnoDB: Error number 5 means ‘Input/output error’

2020-11-01T23:46:39.199014Z 3633 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html

2020-11-01T23:46:39.199019Z 3633 [ERROR] InnoDB: File (unknown): ‘read’ returned OS error 105. Cannot continue operation

2020-11-01T23:46:39.199023Z 3633 [ERROR] InnoDB: Cannot continue operation.

2020-11-01T23:46:39.949205Z 0 [Note] InnoDB: FTS optimize thread exiting.

2020-11-01T23:47:16.096242Z 3669 [Warning] InnoDB: Retry attempts for reading partial data failed.

2020-11-01T23:47:16.096287Z 3669 [ERROR] InnoDB: Tried to read 16384 bytes at offset 8257536, but was only able to read 0

2020-11-01T23:47:16.096295Z 3669 [ERROR] InnoDB: Operating system error number 5 in a file operation.

2020-11-01T23:47:16.096301Z 3669 [ERROR] InnoDB: Error number 5 means ‘Input/output error’

2020-11-01T23:47:16.096305Z 3669 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html

2020-11-01T23:47:16.096313Z 3669 [ERROR] InnoDB: File (unknown): ‘read’ returned OS error 105. Cannot continue operation

2020-11-01T23:47:16.096327Z 3669 [ERROR] InnoDB: Cannot continue operation.
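
A log like the one above is easier to read once it is tallied: which databases and client hosts lost connections, and which operating system error numbers InnoDB reported. The following is a minimal sketch, not a definitive tool. The function name `summarize` is hypothetical, and the regular expressions assume the MySQL 5.7 error-log wording seen in this thread. Both straight and typographic quotes are accepted, because pasted logs often have their quotes converted.

```python
import re
from collections import Counter

# Assumed line shape (taken from the log in this thread):
#   ... [Note] Aborted connection 3441 to db: 'asgard_service' user: 'choerodon' host: '192.168.2.14' (...)
Q = "['‘’]"  # straight or typographic single quote
ABORTED = re.compile(
    r"Aborted connection \d+ to db: " + Q + r"(?P<db>[^'‘’]+)" + Q
    + r" user: " + Q + r"[^'‘’]+" + Q
    + r" host: " + Q + r"(?P<host>[^'‘’]+)" + Q
)
OS_ERROR = re.compile(r"InnoDB: Operating system error number (?P<errno>\d+)")


def summarize(lines):
    """Count aborted connections per database and host, and InnoDB OS errors per errno."""
    by_db, by_host, os_errors = Counter(), Counter(), Counter()
    for line in lines:
        aborted = ABORTED.search(line)
        if aborted:
            by_db[aborted.group("db")] += 1
            by_host[aborted.group("host")] += 1
            continue
        os_error = OS_ERROR.search(line)
        if os_error:
            os_errors[int(os_error.group("errno"))] += 1
    return by_db, by_host, os_errors


# Three lines copied from the log above, used only as a demonstration.
sample = [
    "2020-11-01T23:46:20.026384Z 3441 [Note] Aborted connection 3441 to db: 'asgard_service' user: 'choerodon' host: '192.168.2.14' (Got an error reading communication packets)",
    "2020-11-01T23:46:20.027461Z 3533 [Note] Aborted connection 3533 to db: 'asgard_service' user: 'choerodon' host: '192.168.2.14' (Got an error reading communication packets)",
    "2020-11-01T23:46:36.956648Z 3589 [ERROR] InnoDB: Operating system error number 5 in a file operation.",
]
by_db, by_host, os_errors = summarize(sample)
print(dict(by_db), dict(by_host), dict(os_errors))
```

Passing an open log file to `summarize` instead of the sample list gives the same tallies for the whole log. In this thread the interesting signal is that every service's connections are dropped within the same second, shortly before InnoDB reports error number 5 (input/output error) on its own reads.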

ChangpingShi · November 2, 2020 03:03

I'd suggest looking into MySQL itself. Search for "Aborted connection 3441 to db"; there are plenty of answers online.
http://www.voidcn.com/article/p-bfdultwq-rs.html

catcallxw · November 2, 2020 03:10

All of the services fail to connect. Is that caused by this one manager service? Thanks for the reply.

Vista · November 2, 2020 03:10

Upgrade the kernel on all nodes to 4.x (4.4 or 4.20).

ChangpingShi · November 2, 2020 03:12

manager only connects to the one manager-service database. Its main job is refreshing API permissions, so it has nothing to do with MySQL being unreachable.

Vista · November 2, 2020 03:27

Take a look at the logs of this pod:

kubectl logs -n kube-system c7nfs-client-provisioner-xx

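catcallxw · November 2, 2020 05:07

We installed the nfs-client-provisioner-xx service in the default namespace. Does that matter? And if I restart this service now, will it affect the production environment?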

catcallxw · November 2, 2020 05:07

The whole NFS service is dead now. Restarting it doesn't help, and the mounted NFS path can't be read.

catcallxw · November 2, 2020 05:07

10.206.0.8:/ on /data type nfs4 (rw,relatime,vers=4.0,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.206.0.12,local_lock=none,addr=10.206.0.8)
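
The mount line above already shows the two settings the rest of this thread turns on: the protocol version (`vers=4.0`) and the `hard` option, under which the NFS client keeps retrying a request indefinitely while the server does not answer. The sketch below pulls those options out of such a line so they can be compared across nodes. The function name `parse_mount_line` is hypothetical, and the regular expression assumes the `source on target type fstype (options)` layout printed by `mount` on Linux.

```python
import re

# Assumed layout: "<source> on <target> type <fstype> (<opt>,<opt>=<value>,...)"
MOUNT_LINE = re.compile(
    r"^(?P<source>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<opts>[^)]*)\)$"
)


def parse_mount_line(line):
    """Split one line of `mount` output into source, target, fstype and an options dict."""
    match = MOUNT_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a recognised mount line: %r" % line)
    options = {}
    for item in match.group("opts").split(","):
        key, sep, value = item.partition("=")
        # Flags such as "rw" or "hard" carry no value; store them as True.
        options[key] = value if sep else True
    return {
        "source": match.group("source"),
        "target": match.group("target"),
        "fstype": match.group("fstype"),
        "options": options,
    }


# The mount line posted in this thread.
line = (
    "10.206.0.8:/ on /data type nfs4 (rw,relatime,vers=4.0,rsize=1048576,"
    "wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,"
    "clientaddr=10.206.0.12,local_lock=none,addr=10.206.0.8)"
)
info = parse_mount_line(line)
print(info["fstype"], info["options"]["vers"], "hard" in info["options"])
```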

catcallxw · November 2, 2020 05:08

This service has never been restarted.

catcallxw · November 2, 2020 05:08

I1030 15:25:04.002016 1 leaderelection.go:185] attempting to acquire leader lease default/choerodon.io-nfs-client-provisioner…
I1030 15:25:31.994965 1 leaderelection.go:194] successfully acquired lease default/choerodon.io-nfs-client-provisioner
I1030 15:25:31.995191 1 event.go:221] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"default", Name:"choerodon.io-nfs-client-provisioner", UID:"89a1b620-f994-48e4-b701-ac376562382a", APIVersion:"v1", ResourceVersion:"48175704", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' nfs-client-provisioner-cf46db6b7-l59nz_15688759-1ac4-11eb-88a4-4653d0feab55 became leader
I1030 15:25:31.995196 1 controller.go:631] Starting provisioner controller choerodon.io/nfs-client-provisioner_nfs-client-provisioner-cf46db6b7-l59nz_15688759-1ac4-11eb-88a4-4653d0feab55!
I1030 15:25:32.095383 1 controller.go:680] Started provisioner controller choerodon.io/nfs-client-provisioner_nfs-client-provisioner-cf46db6b7-l59nz_15688759-1ac4-11eb-88a4-4653d0feab55!
E1030 15:30:21.217564 1 leaderelection.go:268] Failed to update lock: etcdserver: request timed out
E1030 16:10:38.061597 1 leaderelection.go:268] Failed to update lock: etcdserver: request timed out
I1030 19:49:55.147441 1 controller.go:1158] delete "pvc-752f0db4-2768-4a69-87bd-e09a6e6b5666": started
I1030 19:49:55.183677 1 controller.go:1186] delete "pvc-752f0db4-2768-4a69-87bd-e09a6e6b5666": volume deleted
I1030 19:49:55.195328 1 controller.go:1196] delete "pvc-752f0db4-2768-4a69-87bd-e09a6e6b5666": persistentvolume deleted
I1030 19:49:55.195352 1 controller.go:1198] delete "pvc-752f0db4-2768-4a69-87bd-e09a6e6b5666": succeeded
E1031 10:56:55.664591 1 leaderelection.go:268] Failed to update lock: etcdserver: request timed out
I1101 18:36:49.830349 1 controller.go:987] provision "default/myclaim" class "nfs-provisioner": started
I1101 18:36:49.841343 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"myclaim", UID:"d572586c-f364-479e-bc42-bd6c566442b1", APIVersion:"v1", ResourceVersion:"48734349", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/myclaim"
E1101 18:45:35.961650 1 controller.go:816] claim "default/myclaim" in work queue no longer exists
I1101 18:49:40.904017 1 controller.go:1087] provision "default/myclaim" class "nfs-provisioner": volume "pvc-d572586c-f364-479e-bc42-bd6c566442b1" provisioned
I1101 18:49:40.904068 1 controller.go:1101] provision "default/myclaim" class "nfs-provisioner": trying to save persistentvvolume "pvc-d572586c-f364-479e-bc42-bd6c566442b1"
I1101 18:49:40.948111 1 controller.go:1108] provision "default/myclaim" class "nfs-provisioner": persistentvolume "pvc-d572586c-f364-479e-bc42-bd6c566442b1" saved
I1101 18:49:40.948142 1 controller.go:1149] provision "default/myclaim" class "nfs-provisioner": succeeded
I1101 18:49:40.948174 1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"myclaim", UID:"d572586c-f364-479e-bc42-bd6c566442b1", APIVersion:"v1", ResourceVersion:"48734349", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-d572586c-f364-479e-bc42-bd6c566442b1
I1101 18:49:40.956352 1 controller.go:1158] delete "pvc-d572586c-f364-479e-bc42-bd6c566442b1": started
I1101 18:49:41.012967 1 controller.go:1186] delete "pvc-d572586c-f364-479e-bc42-bd6c566442b1": volume deleted
I1101 18:49:41.037019 1 controller.go:1196] delete "pvc-d572586c-f364-479e-bc42-bd6c566442b1": persistentvolume deleted
I1101 18:49:41.037035 1 controller.go:1198] delete "pvc-d572586c-f364-479e-bc42-bd6c566442b1": succeeded

catcallxw · November 2, 2020 05:17

NFS is working again now. If I delete all the files under the NFS mount and then start Choerodon, will that affect other environments?

Vista · November 2, 2020 06:05

Don't do that! The data for MySQL, GitLab, and so on is all stored there.

catcallxw · November 2, 2020 06:41

Running showmount -e <IP address> just hangs. Does that mean something is wrong with the NFS service?

catcallxw · November 2, 2020 06:46

Client nfs v4:
null            read            write           commit          open            open_conf
0 0%            18611482 2%     10903576 1%     0 0%            207442594 22%   55361 0%
open_noat       open_dgrd       close           setattr         fsinfo          renew
26736 0%        2435395 0%      204985855 22%   7506667 0%      4385 0%         40010 0%
setclntid       confirm         lock            lockt           locku           access
39 0%           39 0%           1204025 0%      0 0%            1199631 0%      12716773 1%
getattr         lookup          lookup_root     remove          rename          link
356976610 38%   76418206 8%     2193 0%         8770401 0%      2197936 0%      0 0%
symlink         create          pathconf        statfs          readlink        readdir
0 0%            32054 0%        2193 0%         2071751 0%      1997 0%         7174384 0%
server_caps     delegreturn     getacl          setacl          fs_locations    rel_lkowner
6578 0%         0 0%            0 0%            0 0%            0 0%            1198291 0%
secinfo         exchange_id     create_ses      destroy_ses     sequence        get_lease_t
0 0%            0 0%            2191 0%         0 0%            0 0%            0 0%
reclaim_comp    layoutget       getdevinfo      layoutcommit    layoutreturn    getdevlist
0 0%            0 0%            0 0%            0 0%            0 0%            0 0%
(null)
0 0%

catcallxw · November 2, 2020 06:46

It looks like our frequent reads and writes have clogged NFS. Could this be caused by the NFS protocol version used for the mount?

Vista · November 2, 2020 07:05

Yes. NFSv4 support in the 3.10.x Linux kernel is incomplete and tends to cause hangs, so I recommend upgrading to 4.x.
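
To find which nodes still run a 3.x kernel, the release string reported by `uname -r` can be checked on each node. The following is a rough sketch under stated assumptions: the function name `kernel_needs_upgrade` is hypothetical, the threshold of major version 4 simply encodes the advice given in this thread, and the release strings in the demonstration are illustrative rather than taken from the poster's cluster.

```python
import platform
import re


def kernel_needs_upgrade(release, minimum_major=4):
    """Return True if a kernel release string (as printed by `uname -r`) is older than minimum_major."""
    match = re.match(r"^(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)) < minimum_major


# Illustrative release strings, not collected from the cluster in this thread.
for release in ("3.10.0-1127.el7.x86_64", "4.4.0", "4.20.13-1.el7.elrepo.x86_64"):
    print(release, "-> upgrade" if kernel_needs_upgrade(release) else "-> ok")

# On a Linux node, platform.release() returns the same string as `uname -r`.
try:
    print("this machine:", platform.release(), kernel_needs_upgrade(platform.release()))
except ValueError as exc:
    print("this machine:", exc)
```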

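catcallxw · November 2, 2020 08:52

Can I delete the nfs-client-provisioner-xx service with helm and then install a new one? Will that affect other applications? I have already stopped the Choerodon base services.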

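catcallxw · November 2, 2020 08:52

If I downgrade NFS to 3.x or upgrade it to 4.1, would that solve the problem? And would changing the NFS version affect the applications in the cluster?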