记一次线上Spring和Dubbo死锁排查

问题

前不久线上发现有系统间数据没有同步,排查一通下来发现应该是 MQ 消息没有被消费,通过 MQ Console 发现,未被消费的消息全都集中的一台机器上(消息投放的queue 以及 rebalance 关系),此时我有两个怀疑,一是消费者线程挂了,二是此服务分配到的消息队列出了什么莫名的问题。由于正值业务高峰,领导第一时间重启了服务,重启后一切恢复正常。排查由于一些其它事项,也到此中断。

然而没过多久,我听到其它需求项目组在测试环境出现了相同的问题,也是消息不消费了,导致业务异常。我立马找到对应服务 dump 了线程。

我将 dump 文件通过 visualVM 打开,然后得到了很明显的提示:Found one Java-level deadlock:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
"ConsumeMessageThread_1":
waiting to lock monitor 0x00007f1350390f18 (object 0x000000008024ebc0, a java.util.concurrent.ConcurrentHashMap),
which is held by "main"
"main":
waiting to lock monitor 0x00007f1330071f18 (object 0x0000000080fb6318, a org.apache.dubbo.config.deploy.DefaultModuleDeployer),
which is held by "Thread-26"
"Thread-26":
waiting to lock monitor 0x00007f1350390f18 (object 0x000000008024ebc0, a java.util.concurrent.ConcurrentHashMap),
which is held by "main"

Java stack information for the threads listed above:
===================================================
"ConsumeMessageThread_1":
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:187)
- waiting to lock <0x000000008024ebc0> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.beans.factory.support.AbstractBeanFactory.isTypeMatch(AbstractBeanFactory.java:486)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.doGetBeanNamesForType(DefaultListableBeanFactory.java:432)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBeanNamesForType(DefaultListableBeanFactory.java:403)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBeansOfType(DefaultListableBeanFactory.java:515)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.getBeansOfType(DefaultListableBeanFactory.java:508)
at org.springframework.context.support.AbstractApplicationContext.getBeansOfType(AbstractApplicationContext.java:1186)
...
at org.apache.rocketmq.client.impl.consumer.ConsumeMessageConcurrentlyService$ConsumeRequest.run(ConsumeMessageConcurrentlyService.java:411)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"main":
at org.apache.dubbo.config.deploy.DefaultModuleDeployer.startSync(DefaultModuleDeployer.java)
- waiting to lock <0x0000000080fb6318> (a org.apache.dubbo.config.deploy.DefaultModuleDeployer)
at org.apache.dubbo.config.deploy.DefaultModuleDeployer.start(DefaultModuleDeployer.java:139)
at org.apache.dubbo.config.ReferenceConfig.get(ReferenceConfig.java:228)
...
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:162)
at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:588)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateUsingFactoryMethod(AbstractAutowireCapableBeanFactory.java:1173)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1067)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:513)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:483)
at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:306)
at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
- locked <0x000000008024ebc0> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:302)
at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:761)
at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:867)
at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:543)
- locked <0x00000000804b98b0> (a java.lang.Object)
at org.springframework.boot.context.embedded.EmbeddedWebApplicationContext.refresh(EmbeddedWebApplicationContext.java:122)
at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:693)
at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:360)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:303)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1118)
at org.springframework.boot.SpringApplication.run(SpringApplication.java:1107)
at com.internet.saasbillmanager.ApplicationMain.main(ApplicationMain.java:46)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:87)
at org.springframework.boot.loader.Launcher.launch(Launcher.java:51)
at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:52)
"Thread-26":
at org.springframework.context.event.AbstractApplicationEventMulticaster.getApplicationListeners(AbstractApplicationEventMulticaster.java:185)
- waiting to lock <0x000000008024ebc0> (a java.util.concurrent.ConcurrentHashMap)
at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:128)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:393)
at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:347)
at org.apache.dubbo.config.spring.context.DubboDeployApplicationListener.publishEvent(DubboDeployApplicationListener.java:91)
at org.apache.dubbo.config.spring.context.DubboDeployApplicationListener.access$000(DubboDeployApplicationListener.java:47)
at org.apache.dubbo.config.spring.context.DubboDeployApplicationListener$1.onStarted(DubboDeployApplicationListener.java:70)
at org.apache.dubbo.config.spring.context.DubboDeployApplicationListener$1.onStarted(DubboDeployApplicationListener.java:62)
at org.apache.dubbo.common.deploy.AbstractDeployer.setStarted(AbstractDeployer.java:121)
at org.apache.dubbo.config.deploy.DefaultApplicationDeployer.onStarted(DefaultApplicationDeployer.java:989)
at org.apache.dubbo.config.deploy.DefaultApplicationDeployer.checkState(DefaultApplicationDeployer.java:868)
- locked <0x0000000080fa6c10> (a java.lang.Object)
at org.apache.dubbo.config.deploy.DefaultApplicationDeployer.notifyModuleChanged(DefaultApplicationDeployer.java:851)
at org.apache.dubbo.config.deploy.DefaultModuleDeployer.onModuleStarted(DefaultModuleDeployer.java:264)
at org.apache.dubbo.config.deploy.DefaultModuleDeployer.startSync(DefaultModuleDeployer.java:171)
- locked <0x0000000080fb6318> (a org.apache.dubbo.config.deploy.DefaultModuleDeployer)
at org.apache.dubbo.config.deploy.DefaultModuleDeployer.start(DefaultModuleDeployer.java:139)
at org.apache.dubbo.config.ReferenceConfig.get(ReferenceConfig.java:228)
at org.apache.dubbo.config.spring.ReferenceBean.getCallProxy(ReferenceBean.java:346)
at org.apache.dubbo.config.spring.ReferenceBean.access$100(ReferenceBean.java:99)
at org.apache.dubbo.config.spring.ReferenceBean$DubboReferenceLazyInitTargetSource.createObject(ReferenceBean.java:353)
at org.springframework.aop.target.AbstractLazyCreationTargetSource.getTarget(AbstractLazyCreationTargetSource.java:86)
- locked <0x0000000085b49868> (a org.apache.dubbo.config.spring.ReferenceBean$DubboReferenceLazyInitTargetSource)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:192)
...
at com.xxl.job.core.thread.JobThread.run(JobThread.java:152)

Found 1 deadlock.

这里有 3 个线程:

  • “ConsumeMessageThread_1” MQ 消费者线程

  • “main”: 主线程

  • “Thread-26”: XXL-JOB 执行线程

其实主要是这两个线程引发的问题:

  • “main”: 主线程

  • “Thread-26”: XXL-JOB 执行线程

MQ 消费者线程是正好撞在枪口上了。

涉及两个锁:

  • DefaultModuleDeployer:object monitor lock

  • DefaultSingletonBeanRegistry.singletonObjects:object monitor lock

暂且将第一个称为 Dubbo 锁,第二个称为 Spring 锁。

“main”: 主线程做的事情是初始化一个基础服务(Dubbo Consumer)注册到 Spring 和 Dubbo 中, 而其先获取到了 Spring 锁,再去获取 Dubbo 锁。

“Thread-26”:job 线程是由 xxljob 触发后执行一个任务,需要调用一个远程的 Dubbo 服务,于是 先获取了 Dubbo 锁,再去获取 Spring 锁。

于是两个线程就互相锁死了,而 MQ 消费者线程,也要去Bean Container 中获取 Bean,也需要获取 Spring 锁,也就卡死了。

解决方案

这个问题由于是在启动过程中发生,所以我设想了两个解决方案(没有好的契机去解决这个问题,只能等到有相关需求了),其实思路是同一个,那就是将出问题的 Bean 给延迟加载:

  • 将“基础服务”Bean 给@Lazy。
  • 将“XxlJobSpringExecutor”在spring启动完成之后再注册进容器,避免启动过程中收到任务的执行命令。