记一次线上死锁排查

前段时间偶尔会收到线上MySQL死锁告警通知,由于有补偿机制,最终业务会处理成功,所以没太关心。

1
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLTransactionRollbackException: Deadlock found when trying to get lock; try restarting transaction

最近又收到了相同的告警,可能不是偶然事件,于是开始排查。
首先翻看了日志,结合代码,没有发现什么问题。事发时应该也没有什么大批量并发事件。

于是向DBA要来了deadlock log,日志内容如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
LATEST DETECTED DEADLOCK
------------------------
2021-03-22 16:02:01 0x7f1cfc289700
*** (1) TRANSACTION:
TRANSACTION 153411874, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 8 lock struct(s), heap size 1136, 23 row lock(s), undo log entries 22
MySQL thread id 8203444, OS thread handle 139762013996800, query id 766107488 updating
update test
SET status = 4,
version = version+1,
gmt_modified = '2021-03-22 16:02:01'
where id = 1
and version=14
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 639 page no 2931 n bits 128 index PRIMARY of table `test` trx id 153411874 lock_mode X locks rec but not gap waiting
Record lock, heap no 40 PHYSICAL RECORD: n_fields 45; compact format; info bits 0



*** (2) TRANSACTION:
TRANSACTION 153411876, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
8 lock struct(s), heap size 1136, 7 row lock(s), undo log entries 6
MySQL thread id 8204162, OS thread handle 139762466330368, query id 766107489 updating
update test
SET approval_status = 5,
biz_date = '2021-02-23 00:00:00',
modified_id = 123,
modified_name = 'xxx',
contact_company_id = 456,
contact_company_n
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 639 page no 2931 n bits 128 index PRIMARY of table `test` trx id 153411876 lock_mode X locks rec but not gap
Record lock, heap no 40 PHYSICAL RECORD: n_fields 45; compact format; info bits 0

*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 639 page no 2931 n bits 128 index PRIMARY of table `test` trx id 153411876 lock_mode X locks rec but not gap waiting
Record lock, heap no 28 PHYSICAL RECORD: n_fields 45; compact format; info bits 0

上面日志只保留了关键部分,表名等也进行了脱敏。
可以看到两个事务都在等待PRIMARY也就是主键索引

1
RECORD LOCKS space id 639 page no 2931 n bits 128 index PRIMARY of table `test`

然后再结合日志中的SQL和代码,发现问题的原因如下:

一共涉及3个系统,分别是S、C、F,在S系统中有一个业务操作完成后会给C,F发消息,C收到S的消息处理完成后也会给F发一个消息,如图所示:

1.png

在F中几乎同时开启了两个事务,并且两个消息在F中涉及的记录是相同的,id分别是1,2,表为test。
于是出现了事务一:

1
2
3
4
5
SQL1:
update test set status = 4 ...... where id=1

SQL2:
update test set status = 4 ...... where id=2

事务二:

1
2
3
4
5
SQL1:
update test SET approval_status = 5 ...... where id =2

SQL2:
update test SET approval_status = 5 ...... where id =1

事务一先持有id=1的锁,事务二持有id=2的锁,事务一尝试获取id=2的锁,事务二尝试获取id=1的锁,所以死锁了。最后MySQL回滚了事务二。

发生这种情况是由于产品和系统设计不合理导致的,目前正在重构中。

如果下次再发生死锁直接看deadlock log吧,没必要浪费时间翻代码看业务日志了。

作者

太阳当空赵先生

发布于

2021-03-23

更新于

2022-05-23

许可协议

评论