In version 0.24, the choerodon-iam service reports an OOM four or five days after startup

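  • Choerodon platform version: 0.24.0

  • Runtime environment: self-hosted

  • Problem description:
    The choerodon-iam service reports an OOM a few days after startup. The key log output is as follows: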

2021-10-27 09:25:37.951 WARN 8 --- [ool-26-thread-1] i.c.a.common.AbstractAsgardConsumer : error.asgard.scheduleRunning, msg: ScheduleConsumerClient#pollBatch(PollScheduleInstanceDTO) failed and encountered unrecoverable error.
com.netflix.hystrix.exception.HystrixRuntimeException: ScheduleConsumerClient#pollBatch(PollScheduleInstanceDTO) failed and encountered unrecoverable error.
at com.netflix.hystrix.AbstractCommand.getFallbackOrThrowException(AbstractCommand.java:769)
at com.netflix.hystrix.AbstractCommand.handleFailureViaFallback(AbstractCommand.java:1034)
at com.netflix.hystrix.AbstractCommand.access$700(AbstractCommand.java:60)
at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:622)
at com.netflix.hystrix.AbstractCommand$12.call(AbstractCommand.java:601)
at rx.internal.operators.OperatorOnErrorResumeNextViaFunction$4.onError(OperatorOnErrorResumeNextViaFunction.java:140)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at com.netflix.hystrix.AbstractCommand$HystrixObservableTimeoutOperator$2.onError(AbstractCommand.java:1194)
at rx.internal.operators.OperatorSubscribeOn$SubscribeOnSubscriber.onError(OperatorSubscribeOn.java:80)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.internal.operators.OnSubscribeDoOnEach$DoOnEachSubscriber.onError(OnSubscribeDoOnEach.java:87)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at com.netflix.hystrix.AbstractCommand$DeprecatedOnRunHookApplication$1.onError(AbstractCommand.java:1431)
at com.netflix.hystrix.AbstractCommand$ExecutionHookApplication$1.onError(AbstractCommand.java:1362)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.observers.Subscribers$5.onError(Subscribers.java:230)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:44)
at rx.internal.operators.OnSubscribeThrow.call(OnSubscribeThrow.java:28)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:51)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:35)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:41)
at rx.internal.operators.OnSubscribeDoOnEach.call(OnSubscribeDoOnEach.java:30)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
at rx.Observable.unsafeSubscribe(Observable.java:10327)
at rx.internal.operators.OperatorSubscribeOn$SubscribeOnSubscriber.call(OperatorSubscribeOn.java:100)
at com.netflix.hystrix.strategy.concurrency.HystrixContexSchedulerAction$1.call(HystrixContexSchedulerAction.java:56)
at com.netflix.hystrix.strategy.concurrency.HystrixContexSchedulerAction$1.call(HystrixContexSchedulerAction.java:47)
at org.hzero.core.hystrix.AbstractCallable.call(AbstractCallable.java:15)
at org.hzero.core.hystrix.RequestAttributeCallableWrapper$RequestAttributeCallable.call(RequestAttributeCallableWrapper.java:35)
at com.netflix.hystrix.strategy.concurrency.HystrixContexSchedulerAction.call(HystrixContexSchedulerAction.java:69)
at rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:823)
Caused by: java.lang.Exception: Throwable caught while executing.
at com.netflix.hystrix.AbstractCommand.getExceptionFromThrowable(AbstractCommand.java:1976)
at com.netflix.hystrix.AbstractCommand.wrapWithOnExecutionErrorHook(AbstractCommand.java:1490)
at com.netflix.hystrix.AbstractCommand.access$1300(AbstractCommand.java:60)
at com.netflix.hystrix.AbstractCommand$ExecutionHookApplication$1.onError(AbstractCommand.java:1361)
… 34 more
Caused by: java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 11
at java.lang.Thread.startImpl(Native Method)
at java.lang.Thread.start(Thread.java:993)
at sun.net.www.http.KeepAliveCache$1.run(KeepAliveCache.java:112)
at sun.net.www.http.KeepAliveCache$1.run(KeepAliveCache.java:96)
at java.security.AccessController.doPrivileged(AccessController.java:678)
at sun.net.www.http.KeepAliveCache.put(KeepAliveCache.java:95)
at sun.net.www.http.HttpClient.putInKeepAliveCache(HttpClient.java:438)
at sun.net.www.http.HttpClient.finished(HttpClient.java:395)
at sun.net.www.http.ChunkedInputStream.closeUnderlying(ChunkedInputStream.java:219)
at sun.net.www.http.ChunkedInputStream.processRaw(ChunkedInputStream.java:455)
at sun.net.www.http.ChunkedInputStream.readAheadBlocking(ChunkedInputStream.java:572)
at sun.net.www.http.ChunkedInputStream.readAhead(ChunkedInputStream.java:609)
at sun.net.www.http.ChunkedInputStream.read(ChunkedInputStream.java:696)
at java.io.FilterInputStream.read(FilterInputStream.java:133)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3454)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3447)
at sun.net.www.protocol.http.HttpURLConnection$HttpInputStream.read(HttpURLConnection.java:3435)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
at org.springframework.web.client.MessageBodyClientHttpResponseWrapper.hasEmptyMessageBody(MessageBodyClientHttpResponseWrapper.java:96)
at org.springframework.web.client.HttpMessageConverterExtractor.extractData(HttpMessageConverterExtractor.java:85)
at org.springframework.cloud.openfeign.support.SpringDecoder.decode(SpringDecoder.java:60)
at org.springframework.cloud.openfeign.support.ResponseEntityDecoder.decode(ResponseEntityDecoder.java:45)
at feign.optionals.OptionalDecoder.decode(OptionalDecoder.java:36)
at feign.SynchronousMethodHandler.decode(SynchronousMethodHandler.java:170)
at feign.SynchronousMethodHandler.executeAndDecode(SynchronousMethodHandler.java:134)
at feign.SynchronousMethodHandler.invoke(SynchronousMethodHandler.java:77)
at feign.hystrix.HystrixInvocationHandler$1.run(HystrixInvocationHandler.java:107)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:302)
at com.netflix.hystrix.HystrixCommand$2.call(HystrixCommand.java:298)
at rx.internal.operators.OnSubscribeDefer.call(OnSubscribeDefer.java:46)
… 28 more
2021-10-27 09:25:41.272 INFO 8 --- [freshExecutor-0] com.netflix.discovery.DiscoveryClient : Disable delta property : true

My guess is that this part of the configuration is the problem:
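hystrix:
  threadpool:
    default:
      # Core size of the command-execution thread pool, which is also the maximum concurrency for command execution
      # Default: 10
      coreSize: 1000
      # Maximum number of execution threads
      maximumSize: 1000
  command:
    default:
      execution:
        isolation:
          thread:
            # Timeout for HystrixCommand execution; once it is exceeded, the fallback logic runs. In theory an endpoint should respond within 200 ms, or a few hundred milliseconds for slower endpoints.
            # Default is 1000 ms; 2000 is plenty as an upper bound. On timeouts, first check whether the endpoint's business logic, SQL queries, etc. can be optimized. Do not blindly raise the timeout, otherwise threads pile up, the Hystrix thread pool locks up, and the service eventually becomes unavailable.
            timeoutInMilliseconds: ${HYSTRIX_COMMAND_TIMEOUT_IN_MILLISECONDS:40000}

ribbon:
  # Client read timeout. It must be smaller than the Hystrix timeout, otherwise the retry mechanism is pointless
  ReadTimeout: ${RIBBON_READ_TIMEOUT:30000}
  # Client connect timeout
  ConnectTimeout: ${RIBBON_CONNECT_TIMEOUT:3000}
  # When a call to an instance fails (times out), automatic retry is allowed. This sets the retry counts; after a failure another instance is tried. Make sure the endpoints are idempotent, otherwise retries may corrupt data.
  OkToRetryOnAllOperations: true
  MaxAutoRetries: 1
  MaxAutoRetriesNextServer: 1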
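
As a rough check on the timeouts above, here is a minimal sketch; it is not taken from the Choerodon code base. It assumes the common rule of thumb that Ribbon may make up to (MaxAutoRetries + 1) * (MaxAutoRetriesNextServer + 1) attempts, each costing at most ConnectTimeout + ReadTimeout, and compares that worst case with the Hystrix command timeout. The class and method names are hypothetical.

```java
/** Hypothetical helper: compares Ribbon's worst-case retry time with the Hystrix timeout. */
final class RetryBudget {
    private RetryBudget() {}

    /** Upper bound on attempts: tries on the same instance times the number of instances tried. */
    static int maxAttempts(int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (maxAutoRetries + 1) * (maxAutoRetriesNextServer + 1);
    }

    /** Worst case: every attempt uses its full connect and read timeout. */
    static long worstCaseMillis(long readTimeoutMs, long connectTimeoutMs,
                                int maxAutoRetries, int maxAutoRetriesNextServer) {
        return (readTimeoutMs + connectTimeoutMs)
                * maxAttempts(maxAutoRetries, maxAutoRetriesNextServer);
    }

    /** True when all retries could complete before Hystrix gives up on the command. */
    static boolean retriesFitInHystrixTimeout(long hystrixTimeoutMs, long worstCaseMs) {
        return worstCaseMs <= hystrixTimeoutMs;
    }

    public static void main(String[] args) {
        // Values from the configuration posted above.
        long worst = worstCaseMillis(30_000, 3_000, 1, 1);
        System.out.println("attempts = " + maxAttempts(1, 1));
        System.out.println("worst case ms = " + worst);
        System.out.println("fits in 40000 ms = " + retriesFitInHystrixTimeout(40_000, worst));
    }
}
```

With the posted values this gives 4 attempts and a worst case of 132000 ms, well over the 40000 ms Hystrix timeout. Under this assumption Hystrix would cut a call off before the configured retries could finish.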

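Thanks for the feedback.
Let me explain why we configure it this way.
The interface being called in the log:
ScheduleConsumerClient#pollBatch(PollScheduleInstanceDTO) failed and encountered unrecoverable error.
is used by a service to periodically pull the scheduled tasks it needs to execute. Sometimes there are no scheduled tasks to run; when the query finds none, the call waits asynchronously for a while before returning. The wait time can be changed through choerodon.asgard.time-out and defaults to 60 seconds. As a result, the service doing the pulling times out. We will optimize this waiting logic later.
For now you can lower choerodon.asgard.time-out; adjust the other timeouts yourself as needed.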
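
To illustrate the interaction described above, here is a small sketch; it is not Asgard's actual code. It assumes the poll can be held open for up to choerodon.asgard.time-out (60 seconds by default, per the explanation above) and checks whether that wait ends before the caller's Ribbon read timeout and Hystrix command timeout from the posted configuration. All names are hypothetical.

```java
/** Hypothetical check: does the server-side poll wait fit inside the caller's timeouts? */
final class LongPollCheck {
    private LongPollCheck() {}

    /** True when the server-side wait ends before either client-side timeout fires. */
    static boolean waitFitsClientTimeouts(long serverWaitMs, long readTimeoutMs,
                                          long hystrixTimeoutMs) {
        return serverWaitMs < Math.min(readTimeoutMs, hystrixTimeoutMs);
    }

    public static void main(String[] args) {
        long readTimeoutMs = 30_000;    // ribbon.ReadTimeout from the posted config
        long hystrixTimeoutMs = 40_000; // Hystrix command timeout from the posted config

        // Default wait of 60 seconds.
        System.out.println(waitFitsClientTimeouts(60_000, readTimeoutMs, hystrixTimeoutMs));
        // An arbitrary smaller wait, chosen only for illustration.
        System.out.println(waitFitsClientTimeouts(20_000, readTimeoutMs, hystrixTimeoutMs));
    }
}
```

Under these assumptions the default 60-second wait outlasts both client timeouts, while a smaller value such as 20 seconds (an arbitrary example, not a recommendation) would return before the caller gives up.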

Hello, this afternoon I looked into the issue more carefully. The log also shows similar errors for other interfaces, for example:
2021-10-27 09:26:08.456 ERROR 8 --- [ XNIO-3 task-63] o.h.core.exception.BaseExceptionHandler : Hystrix exception, Request: {URI=/v1/users/tenant-id, method=public org.springframework.http.ResponseEntity<java.util.Map<java.lang.String, java.lang.Long>> org.hzero.iam.api.controller.v1.UserDetailsSiteController.storeUserTenant(java.lang.Long)}, User: CustomUserDetails{userId=202848753701806080, username=45660311, roleId=165487611585368064, roleIds=[165487615729340416, 165487611585368064], siteRoleIds=[], tenantRoleIds=[165487615729340416, 165487611585368064], roleMergeFlag=true, tenantId=24, tenantIds=[24, 15, 1, 534], organizationId=1, isAdmin=false, clientId=null, timeZone='GMT+8, language=‘zh_CN, roleLabels=’[TENANT_ROLE, PROJECT_MEMBER, TENANT_MEMBER, GITLAB_DEVELOPER, PROJECT_ROLE], apiEncryptFlag=1}

com.netflix.hystrix.exception.HystrixRuntimeException: UserDetailsClient#storeUserTenant(String,Long) failed and encountered unrecoverable error.

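My feeling is that the Hystrix-related configuration may be at fault, causing the HystrixRuntimeException, which then reports:
Caused by: java.lang.OutOfMemoryError: Failed to create a thread: retVal -1073741830, errno 11
choerodon-iam is currently allocated 4C8G (4 CPU cores, 8 GB of memory). I will first change the configuration as you suggested and observe for a while.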
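
Assuming the service runs on Linux, errno 11 is EAGAIN, which on thread creation usually points at a thread or process limit being reached rather than at the Java heap filling up. One way to check whether the thread count keeps growing is to group live threads by name. The following is a minimal sketch with hypothetical names; it only sees threads of the JVM it runs in, so it would have to be called from inside the affected service, and an ordinary thread dump gives the same information.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical diagnostic: counts live JVM threads grouped by name with trailing digits removed. */
final class ThreadCensus {
    private ThreadCensus() {}

    /** Strips a trailing run of digits so "pool-26-thread-1" and "pool-26-thread-2" share a key. */
    static String groupKey(String threadName) {
        return threadName.replaceAll("\\d+$", "");
    }

    /** Returns the number of live threads per group, sorted by group name. */
    static Map<String, Integer> countByGroup() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(groupKey(t.getName()), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print "count<TAB>group" for every thread group in this JVM.
        countByGroup().forEach((group, n) -> System.out.println(n + "\t" + group));
    }
}
```

If one group, for example a Hystrix pool, grows steadily over several days, that would support the suspicion about the thread pool settings; if the total stays flat, the limit is more likely imposed from outside the JVM.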