Distributed Transactions with Seata AT Mode: the undo_log Mechanism and the Performance Cost of Global Locks

Lao Zhang | 2026/4/30 | about 11 minutes

Audience: mid- to senior-level Java engineers | Reading time: about 20 minutes | Tech stack: Spring Boot 3.x, Seata 2.x, MySQL 8.0

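Opening story

In 2022 we migrated an e-commerce system to microservices. The biggest headache was not splitting the services but handling the distributed transactions that the split left behind.

The most typical scenario: when a user places an order, we have to write to the order database (create the order) and the inventory database (deduct stock). Before the migration both lived in one database, so a single local transaction was enough. Afterwards the order service and the inventory service each had their own database, and the two writes had to succeed or fail together.

Our tech lead decided on Seata AT mode because it is the least invasive for business code: adding one annotation, @GlobalTransactional, is enough. I was not familiar with Seata at the time, so I simply configured it by following the documentation.

It ran fine in the test environment. The first problem surfaced on the third day in production: at peak, the P99 of the order endpoint jumped from 150ms to 800ms, and database monitoring showed a large number of "Lock wait timeout exceeded" errors. Digging further, we found that Seata's global lock was causing heavy lock waits under high concurrency. Nearly 20% of our endpoints ran under @GlobalTransactional at peak, which made the lock contention severe.

That round of tuning took almost two weeks, and by the end I understood the system thoroughly. This article explains how Seata AT mode works and what it costs in performance.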

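1. Core problem analysis

Why distributed transactions are fundamentally hard

In a single database, atomicity is guaranteed by the engine's undo log and redo log, so a transaction either commits or rolls back as a whole. With multiple databases, each one only has its own local transaction, and the problem is this:

Suppose the commit to database A succeeds and the service process crashes before it can commit to database B. A is committed, B is not, and the data is inconsistent. Reversing the order does not help: if B commits first and the commit to A then fails, the result is just as inconsistent. This is the core difficulty of distributed transactions.

Classic two-phase commit (2PC) can solve this in theory, but it has costs of its own: participants hold their database locks from the prepare phase until the final decision, and if the coordinator crashes in between, participants can stay blocked until it recovers. Seata AT mode is an improved implementation of the 2PC idea.

The core innovation of Seata AT

The core innovation of Seata AT is that the local transaction is committed in phase one, so database locks are released right away instead of being held until phase two, and rollback is implemented through undo_log. This fixes the long lock-holding time of 2PC's first phase. The price is weaker isolation (once phase one commits, the data is visible to other transactions), and a global lock is used to partially compensate for that loss.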

2. How it works in depth

The two-phase flow of AT mode
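
To make the flow concrete, here is a toy in-memory model of the two phases. It is only a sketch under my own assumptions, not Seata's code: AtFlowSketch, UndoRecord and every method name are made up for illustration, and a single int column stands in for a whole row. Phase one captures the before and after images, applies the update and stores the undo record. Phase two either drops the undo record (commit) or restores the before image (rollback).

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory model of the AT two-phase flow (hypothetical names, not Seata's implementation). */
final class AtFlowSketch {

    /** Stand-in for one undo_log row: the value of a single column before and after the update. */
    static final class UndoRecord {
        final String rowKey;
        final int beforeImage;
        final int afterImage;

        UndoRecord(String rowKey, int beforeImage, int afterImage) {
            this.rowKey = rowKey;
            this.beforeImage = beforeImage;
            this.afterImage = afterImage;
        }
    }

    private final Map<String, Integer> table = new HashMap<>();      // business table: rowKey -> stock
    private final Map<String, UndoRecord> undoLog = new HashMap<>(); // undo_log, keyed by xid

    void seed(String rowKey, int stock) { table.put(rowKey, stock); }

    int read(String rowKey) { return table.get(rowKey); }

    boolean hasUndoLog(String xid) { return undoLog.containsKey(xid); }

    /** Phase one: capture both images, apply the update, store the undo record. */
    void phaseOne(String xid, String rowKey, int delta) {
        int before = table.get(rowKey);                           // 1. before image
        int after = before + delta;
        table.put(rowKey, after);                                 // 2. business update
        undoLog.put(xid, new UndoRecord(rowKey, before, after));  // 3. after image + undo record
        // 4. a real RM would register the branch (taking the global lock) and commit locally here
    }

    /** Phase two, commit: the data is already committed, only the undo record is removed. */
    void phaseTwoCommit(String xid) { undoLog.remove(xid); }

    /** Phase two, rollback: restore the before image, then remove the undo record. */
    void phaseTwoRollback(String xid) {
        UndoRecord record = undoLog.remove(xid);
        if (record != null) {
            table.put(record.rowKey, record.beforeImage);
        }
    }
}
```

The point to notice is that phaseTwoCommit never touches the business table, which is why the commit path is cheap and the rollback path is where the work happens.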

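Structure of undo_log

Seata creates an undo_log table in every database that participates in a distributed transaction. It stores snapshots of the data before and after each SQL statement: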

CREATE TABLE undo_log (
    branch_id     BIGINT NOT NULL COMMENT 'branch transaction ID',
    xid           VARCHAR(128) NOT NULL COMMENT 'global transaction ID',
    context       VARCHAR(128) NOT NULL COMMENT 'context',
    rollback_info LONGBLOB NOT NULL COMMENT 'rollback data (JSON format)',
    log_status    INT NOT NULL COMMENT '0: normal, 1: defensive insert',
    log_created   DATETIME(6) NOT NULL,
    log_modified  DATETIME(6) NOT NULL,
    UNIQUE KEY ux_undo_log (xid, branch_id)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8mb4;

The JSON content of rollback_info looks roughly like this:

{
  "branchType": "AT",
  "sqlUndoLogs": [
    {
      "tableName": "product",
      "sqlType": "UPDATE",
      "beforeImage": {
        "rows": [{"fields": [{"name": "id", "value": 1}, {"name": "stock", "value": 100}]}]
      },
      "afterImage": {
        "rows": [{"fields": [{"name": "id", "value": 1}, {"name": "stock", "value": 99}]}]
      }
    }
  ]
}

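On rollback, Seata generates the UNDO SQL from beforeImage (changing the data from afterImage back to beforeImage). Before executing it, Seata checks that the current data still matches afterImage, which guards against another transaction having modified the row in the meantime. The rollback runs only if they match; otherwise Seata reports a "dirty write" and manual intervention is needed.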
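
The check described above can be sketched as a small decision function. This is a simplified illustration, not Seata's source: UndoValidationSketch, Decision and buildUndoSql are hypothetical names, rows are modeled as plain maps, and the extra "already restored" branch (current data equals beforeImage, so nothing needs to be undone) is my own assumption about a sensible third case.

```java
import java.util.Map;
import java.util.StringJoiner;

/** Simplified sketch of the rollback-time image check (hypothetical names, not Seata's code). */
final class UndoValidationSketch {

    enum Decision { APPLY_UNDO, SKIP_ALREADY_RESTORED, DIRTY_WRITE }

    /** Compares the current row with the two images and decides what the rollback should do. */
    static Decision decide(Map<String, Object> beforeImage,
                           Map<String, Object> afterImage,
                           Map<String, Object> current) {
        if (current.equals(afterImage)) {
            return Decision.APPLY_UNDO;             // nobody touched the row since phase one
        }
        if (current.equals(beforeImage)) {
            return Decision.SKIP_ALREADY_RESTORED;  // row already holds the old values
        }
        return Decision.DIRTY_WRITE;                // someone else changed it: needs a human
    }

    /** Builds a parameterized UPDATE that writes the before image back, keyed by primary key. */
    static String buildUndoSql(String table, String pkColumn, Map<String, Object> beforeImage) {
        StringJoiner setClause = new StringJoiner(", ");
        for (String column : beforeImage.keySet()) {
            if (!column.equals(pkColumn)) {
                setClause.add(column + " = ?");
            }
        }
        return "UPDATE " + table + " SET " + setClause + " WHERE " + pkColumn + " = ?";
    }
}
```

With the images from the JSON example, a current row of {id=1, stock=99} yields APPLY_UNDO, while {id=1, stock=98} yields DIRTY_WRITE because some other writer has changed the stock after phase one.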

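How the global lock works

In AT mode, the global lock is the mechanism that prevents two global transactions from modifying the same row at the same time. Lock information is stored in the Seata TC Server (or in a database, depending on configuration).

The granularity of the global lock is table name + primary key value. For example, product:1 means the row with id=1 in the product table is locked.

Lifecycle of the global lock: it is requested when the branch is registered, as part of the phase-one local commit, and it is released when the phase-two commit or rollback completes.

This is the root of the performance problem: the global lock is held from the phase-one commit, through the wait for all other branches to finish, until phase two completes. If any branch is slow, every global transaction waiting on the same row is blocked.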
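
A minimal sketch of such a lock table, assuming the table:pk key format described above. GlobalLockTableSketch and its methods are hypothetical names for illustration; the real TC keeps this state in its configured store rather than in a local map.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Toy global lock table keyed by "table:pk" (hypothetical names, not the TC's implementation). */
final class GlobalLockTableSketch {

    private final ConcurrentMap<String, String> owners = new ConcurrentHashMap<>(); // rowKey -> xid

    /** Builds the row key, e.g. rowKey("product", 1) gives "product:1". */
    static String rowKey(String table, Object primaryKey) {
        return table + ":" + primaryKey;
    }

    /** Phase one: returns true if xid holds the lock afterwards (re-entrant for the same xid). */
    boolean tryAcquire(String xid, String rowKey) {
        String owner = owners.putIfAbsent(rowKey, xid);
        return owner == null || owner.equals(xid);
    }

    /** Phase two finished: releases every row lock held by xid and returns how many were freed. */
    int releaseAll(String xid) {
        int released = 0;
        for (String key : owners.keySet()) {
            if (owners.remove(key, xid)) {
                released++;
            }
        }
        return released;
    }
}
```

Nothing in tryAcquire depends on how long the owner will keep the lock. A second transaction simply keeps failing until releaseAll runs for the owner, which only happens once the slowest branch and phase two are done.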

3. Complete code implementation

Project dependencies

<dependency>
    <groupId>com.alibaba.cloud</groupId>
    <artifactId>spring-cloud-starter-alibaba-seata</artifactId>
    <version>2022.0.0.0</version>
</dependency>
<dependency>
    <groupId>io.seata</groupId>
    <artifactId>seata-spring-boot-starter</artifactId>
    <version>2.0.0</version>
</dependency>

Seata configuration

seata:
  enabled: true
  application-id: ${spring.application.name}
  tx-service-group: myapp-tx-group
  service:
    vgroup-mapping:
      myapp-tx-group: default
  registry:
    type: nacos
    nacos:
      server-addr: 127.0.0.1:8848
      group: SEATA_GROUP
      namespace: seata
  config:
    type: nacos
    nacos:
      server-addr: 127.0.0.1:8848
      group: SEATA_GROUP
      namespace: seata
  client:
    rm:
      report-success-enable: false
      lock:
        retry-times: 30          # number of global lock retries
        retry-interval: 10       # retry interval (ms)
        retry-policy-branch-rollback-on-conflict: true
    tm:
      default-global-transaction-timeout: 60000  # global transaction timeout (ms)
      rollback-on-commit-failure-enable: false

Order service (TM, the initiator)

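import io.seata.core.context.RootContext;
import io.seata.spring.annotation.GlobalTransactional;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.time.LocalDateTime;

@Service
@Slf4j
public class OrderService {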

    @Autowired
    private OrderMapper orderMapper;

    @Autowired
    private InventoryFeignClient inventoryFeignClient;

    @Autowired
    private AccountFeignClient accountFeignClient;

    /**
     * Place an order (entry point of the global transaction).
     * {@code @GlobalTransactional} starts the global transaction.
     */
    @GlobalTransactional(name = "create-order-tx", rollbackFor = Exception.class,
                         timeoutMills = 30000)
    public OrderResult createOrder(CreateOrderRequest request) {
        log.info("Global transaction started, XID={}", RootContext.getXID());

        // 1. Create the order (local transaction; the undo_log row is written in the same local transaction)
        Order order = buildOrder(request);
        orderMapper.insert(order);
        log.info("Order created, orderId={}", order.getId());

        // 2. Deduct stock (remote call, the XID is propagated with it)
        InventoryDeductRequest deductRequest = new InventoryDeductRequest(
            request.getProductId(), request.getQuantity());
        InventoryResult inventoryResult = inventoryFeignClient.deductStock(deductRequest);
        if (!inventoryResult.isSuccess()) {
            // Throwing an exception triggers the global rollback
            throw new BusinessException("Insufficient stock, productId=" + request.getProductId());
        }

        // 3. Deduct the account balance
        AccountDeductRequest accountRequest = new AccountDeductRequest(
            request.getUserId(), request.getTotalAmount());
        AccountResult accountResult = accountFeignClient.deductBalance(accountRequest);
        if (!accountResult.isSuccess()) {
            throw new BusinessException("Insufficient balance, userId=" + request.getUserId());
        }

        return OrderResult.success(order.getId());
    }

    private Order buildOrder(CreateOrderRequest request) {
        return Order.builder()
            .userId(request.getUserId())
            .productId(request.getProductId())
            .quantity(request.getQuantity())
            .totalAmount(request.getTotalAmount())
            .status(OrderStatus.CREATED)
            .createTime(LocalDateTime.now())
            .build();
    }
}

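Inventory service (RM, a participant)

import io.seata.core.context.RootContext;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
@Slf4j
public class InventoryService {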

    @Autowired
    private InventoryMapper inventoryMapper;

    /**
     * Deduct stock (branch transaction).
     * Note: no @GlobalTransactional is needed here, a plain @Transactional is enough.
     * Seata's data source proxy intercepts the SQL and records undo_log automatically.
     */
    @Transactional(rollbackFor = Exception.class)
    public InventoryResult deductStock(Long productId, int quantity) {
        log.info("Running stock deduction branch, XID={}, productId={}, quantity={}",
            RootContext.getXID(), productId, quantity);

        Inventory inventory = inventoryMapper.selectForUpdate(productId);
        if (inventory == null) {
            throw new BusinessException("Product not found: " + productId);
        }

        if (inventory.getStock() < quantity) {
            return InventoryResult.failure("Insufficient stock");
        }

        int rows = inventoryMapper.deductStock(productId, quantity);
        if (rows == 0) {
            throw new BusinessException("Stock deduction failed (concurrent conflict)");
        }

        return InventoryResult.success();
    }
}

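Data source proxy configuration (the core of AT mode)

import com.zaxxer.hikari.HikariDataSource;
import io.seata.rm.datasource.DataSourceProxy;
import org.springframework.boot.autoconfigure.jdbc.DataSourceProperties;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Primary;

import javax.sql.DataSource;

@Configuration
public class DataSourceProxyConfig {

    /**
     * Seata AT mode needs the data source wrapped in DataSourceProxy.
     * The proxy intercepts SQL and records undo_log automatically.
     * Note: seata-spring-boot-starter proxies data sources automatically by default.
     * When wrapping manually like this, set seata.enable-auto-data-source-proxy=false
     * so the data source is not proxied twice.
     */
    @Bean
    @Primary
    public DataSource dataSource(DataSourceProperties properties) {
        HikariDataSource hikariDataSource = properties.initializeDataSourceBuilder()
            .type(HikariDataSource.class)
            .build();
        // Set reasonable connection pool parameters
        hikariDataSource.setMaximumPoolSize(20);
        hikariDataSource.setMinimumIdle(5);
        hikariDataSource.setConnectionTimeout(3000);
        return new DataSourceProxy(hikariDataSource);
    }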
}

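Anti-hanging and empty rollback handling

/**
 * When a global transaction times out, the TC initiates a rollback, but phase one may not have run yet.
 * A defensive record (log_status=1) has to be inserted into undo_log to block a later phase-one commit.
 * Seata 2.x has this logic built in, but it is still worth understanding the principle.
 */
@Service
@Slf4j
public class AntiHangService {

    @Autowired
    private UndoLogMapper undoLogMapper;

    /**
     * Checks whether an empty rollback is needed (phase one has not run, but the TC sent a rollback request).
     */
    public boolean needEmptyRollback(String xid, long branchId) {
        // No undo_log row for this branch means phase one has not run yet, so an empty rollback is needed
        return undoLogMapper.selectByXidAndBranchId(xid, branchId) == null;
    }

    /**
     * Inserts the anti-hanging record (stops phase one from committing after the empty rollback).
     */
    public void insertSuspensionRecord(String xid, long branchId) {
        // Note: context, rollback_info, log_created and log_modified are NOT NULL in the table,
        // so the mapper's INSERT has to supply values for them as well
        UndoLog suspensionLog = UndoLog.builder()
            .xid(xid)
            .branchId(branchId)
            .logStatus(1) // 1 marks a defensive insert
            .build();
        undoLogMapper.insert(suspensionLog);
    }
}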
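
Why a single defensive row is enough comes down to the unique key on (xid, branch_id). The toy model below shows both orderings. DefensiveUndoLogSketch and its method names are hypothetical, and a HashMap stands in for the undo_log table with its unique key.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the defensive undo_log row (hypothetical names; a map plays the unique key). */
final class DefensiveUndoLogSketch {

    static final int NORMAL = 0;          // log_status = 0
    static final int GLOBAL_FINISHED = 1; // log_status = 1, the defensive insert

    private final Map<String, Integer> undoLog = new HashMap<>(); // "xid/branchId" -> log_status

    private static String key(String xid, long branchId) {
        return xid + "/" + branchId;
    }

    /** Phase one writes its undo row; false means the key already exists and the branch must abort. */
    boolean phaseOneInsert(String xid, long branchId) {
        return undoLog.putIfAbsent(key(xid, branchId), NORMAL) == null;
    }

    /** Handles a rollback request; returns true when it was an empty rollback. */
    boolean onRollback(String xid, long branchId) {
        String k = key(xid, branchId);
        Integer status = undoLog.get(k);
        if (status == null) {
            undoLog.put(k, GLOBAL_FINISHED); // phase one has not run: block it from running later
            return true;
        }
        if (status == NORMAL) {
            undoLog.remove(k);               // a real rollback would restore the before image first
        }
        return false;
    }
}
```

If the rollback arrives first, the later phaseOneInsert returns false, which corresponds to the unique key violation that makes the late branch fail instead of committing data nobody will ever roll back.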

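4. Production tuning and configuration

Global lock timeout tuning

# Seata global lock retry settings
seata.client.rm.lock.retry-times=30
seata.client.rm.lock.retry-interval=10
# Maximum wait = retry-times * retry-interval = 30 * 10ms = 300ms
# Tune against your business P99: too short produces false lock-conflict failures, too long hurts user experience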
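
The arithmetic in the comment can be written out as a small retry loop. This is a sketch of the general retry-with-fixed-interval pattern, not Seata's actual retry code. LockRetrySketch and its methods are hypothetical, and the real wait also includes the round trip to the TC on every attempt.

```java
import java.util.function.BooleanSupplier;

/** Sketch of a fixed-interval lock retry loop (hypothetical names, not Seata's retry code). */
final class LockRetrySketch {

    /** Upper bound on time spent sleeping between attempts, e.g. 30 * 10ms = 300ms. */
    static long maxWaitMillis(int retryTimes, long retryIntervalMillis) {
        return retryTimes * retryIntervalMillis;
    }

    /**
     * Tries to take the lock up to retryTimes times.
     * Returns the 1-based attempt that succeeded, or -1 if the budget ran out or the thread was interrupted.
     */
    static int acquireWithRetry(BooleanSupplier tryLock, int retryTimes, long retryIntervalMillis) {
        for (int attempt = 1; attempt <= retryTimes; attempt++) {
            if (tryLock.getAsBoolean()) {
                return attempt;
            }
            try {
                Thread.sleep(retryIntervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                return -1;
            }
        }
        return -1;
    }
}
```

Raising retry-times makes a request more likely to eventually get the lock, but every waiting request also occupies a thread and a database connection for that long, so the budget should stay well below the endpoint's latency target.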

TC Server storage mode

In development, file mode (local file storage) is fine. Production must use a shared store such as db mode or redis mode:

# seata-server configuration (store.conf)
store.mode=db
store.db.datasource=druid
store.db.dbType=mysql
store.db.driverClassName=com.mysql.cj.jdbc.Driver
store.db.url=jdbc:mysql://127.0.0.1:3306/seata
store.db.user=root
store.db.password=your_password
store.db.minConn=5
store.db.maxConn=30
store.db.globalTable=global_table
store.db.branchTable=branch_table
store.db.lockTable=lock_table
store.db.queryLimit=100

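undo_log cleanup

Seata deletes undo_log rows asynchronously after the transaction commits successfully, but if the cleanup fails (network problems, a TC Server restart), rows are left behind. In production it is worth adding a scheduled cleanup job: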

@Component
@Slf4j
public class UndoLogCleanTask {

    @Autowired
    private UndoLogMapper undoLogMapper;

    // Every day at 3 a.m., delete leftover undo_log rows older than 7 days
    @Scheduled(cron = "0 0 3 * * ?")
    public void cleanExpiredUndoLog() {
        LocalDateTime expiredTime = LocalDateTime.now().minusDays(7);
        int deleted = undoLogMapper.deleteExpired(expiredTime);
        log.info("Cleaned up expired undo_log rows, deleted={}", deleted);
    }
}

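5. Pitfalls from production

Pitfall 1: global locks cause a performance avalanche (post-mortem of the opening story)

Root cause: the "view count" update on the product detail page was also wrapped in @GlobalTransactional, because it reused a generic "transactional update" method. Every view-count update took a global lock on the product row it touched, and every product list page request triggered this operation, so global lock contention on popular products was intense.

The fix had three parts. First, remove global transactions that are not needed: the view-count update was moved to asynchronous processing through a message queue and needs no distributed transaction at all. Second, move hot data (stock of popular products) to TCC mode, which does not use the global lock (the locking we implement ourselves is finer-grained). Third, add monitoring and alerting on global transactions: alert when the average global transaction duration exceeds 200ms.

After the changes, P99 dropped from 800ms back to 160ms, and the global lock wait timeout rate fell from 12% to 0.01%.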
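
The third step, alerting on average duration, can be sketched as a sliding-window average. This is an illustration under assumptions rather than what we actually ran in production: GlobalTxLatencyMonitor is a hypothetical class, the window size is arbitrary, and a real setup would more likely export the durations to a metrics system and alert there.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window average of global transaction durations (hypothetical helper class). */
final class GlobalTxLatencyMonitor {

    private final Deque<Long> window = new ArrayDeque<>();
    private final int windowSize;
    private final long thresholdMillis;
    private long sum;

    GlobalTxLatencyMonitor(int windowSize, long thresholdMillis) {
        this.windowSize = windowSize;
        this.thresholdMillis = thresholdMillis;
    }

    /** Records one duration and returns true when the windowed average exceeds the threshold. */
    synchronized boolean record(long durationMillis) {
        window.addLast(durationMillis);
        sum += durationMillis;
        if (window.size() > windowSize) {
            sum -= window.removeFirst(); // drop the oldest sample
        }
        return (double) sum / window.size() > thresholdMillis;
    }
}
```

A caller would time each @GlobalTransactional method and pass the elapsed milliseconds to record, raising an alert whenever it returns true. With a threshold of 200 this matches the rule above.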

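Pitfall 2: dirty writes make rollback fail

We once found this error in Seata's transaction rollback log: Branch rollback failed: Has dirty records.

The cause: when Seata AT rolls back, it compares the current data with afterImage. A mismatch means that some transaction outside Seata modified the row before the global transaction finished (commonly called a dirty write). When Seata detects a dirty write it does not force the rollback; it reports an error and waits for manual intervention.

The source of the dirty write: a scheduled job used plain JDBC (bypassing the Seata data source proxy) and directly modified rows whose global lock was held by a global transaction. The fix is to route every operation on these tables through the Seata data source proxy so that it respects the global lock (for local transactions, Seata provides the @GlobalLock annotation for this check), or to make sure the scheduled job does not touch the data until the global transaction has finished.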

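Pitfall 3: undo_log bloat triggers a disk alert

We had no automatic cleanup for the undo_log table. After two months in production it had grown to 40GB and triggered a disk alert. Seata's cleanup relies on the TC Server sending a delete instruction after a transaction commits, but the TC Server once went down for about 20 minutes, and a large number of undo_log rows left over from that window were never cleaned up.

After adding the scheduled cleanup job described above, the table size stays under 500MB.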

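6. Summary

The core strength of Seata AT mode is that it is barely invasive: one annotation gives you distributed transactions. But its performance cost is real:

First, global locks are held longer than local locks, so lock waits appear easily under high concurrency. Second, every branch transaction writes undo_log, and the extra storage and I/O overhead is not negligible. Third, rollback can fail (the dirty write problem); unlike a local transaction, it is not guaranteed to roll back.

Recommendations: AT mode suits scenarios with moderate concurrency, no extreme performance requirements, and a need to ship quickly. For distributed transactions on hot data under high concurrency, consider TCC mode (next article) or an eventual-consistency approach based on messaging.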