Batch Processing for AI Applications: Handling Massive AI Workloads with Spring Batch
date: 2026-10-18 tags: [Spring Batch, Batch Processing, Large-Scale AI, Java]
Opening Story: One Million Records, from Three Months to Two Days
In November 2025, Li Ming, the data team lead at a major e-commerce platform, sat at his desk staring at a number on his screen: 1,024,618.
That was the backlog of customer records awaiting analysis.
The product manager had filed the request three months earlier: run an AI profiling pass over every active user, producing a purchase-intent score, a churn-risk level, and personalized recommendation tags. The requirement was reasonable and the data was ready, but Li Ming hadn't dared to start.
The reason was simple: his colleague Xiao Wang had done the math.
A single-threaded GPT-4o call takes about 2.3 seconds per customer record, roughly 3 seconds once retries and rate-limit waits are included. For a million records: 1,000,000 × 3s = 3,000,000s ≈ 35 days.
And 35 days was the optimistic case. With real-world network jitter, API rate limits, and service outages, a conservative estimate was closer to three months.
The product manager couldn't wait that long. The data was needed for the end-of-quarter sale.
Li Ming brought in Zhang Lei, a senior engineer with Spring Batch experience. Zhang Lei glanced at the requirements, opened his laptop, and started sketching an architecture diagram.
"Parallelize it," Zhang Lei said. "Partition the data and run 50 concurrent threads, each calling the AI service independently. In theory that compresses 35 days ÷ 50 ≈ 17 hours. Allowing for API rate limits, give it 48 hours. That's plenty."
Li Ming frowned: "We considered multithreading before, but what about losing data if the job dies halfway through?"
Zhang Lei smiled: "That's exactly what Spring Batch's JobRepository is built for. Restart from checkpoint: it resumes where it stopped, without losing a single record."
48 hours later, Li Ming received the job-completion email.
Actual runtime: 44 hours 23 minutes. 1,024,618 records processed, a 99.97% success rate, with just 612 records skipped for data-quality issues and logged for follow-up.
AI spend: about $1,847, 30% under budget, because the batch pipeline standardized the prompt template and cut token waste.
That is what Spring Batch brings to AI batch processing. This article walks through building the system from scratch.
1. Why AI Tasks Need a Batch Processing Framework
Before diving into code, let's understand the nature of the problem.
1.1 What Makes AI Batch Processing Different
In ordinary database batch processing, a single SQL statement can UPDATE a million rows. AI batch processing is an entirely different beast:
Traditional batch processing:
Database → bulk SQL update → database
Duration: minutes; no external dependencies
AI batch processing:
Database → read → call AI API (network I/O, 200ms-2000ms) → write
Duration: hours to days; hard dependency on an external service; high failure rate
The core challenges of AI batch processing:
| Challenge | Impact | Mitigation |
|---|---|---|
| API rate limiting | Requests above the QPS limit are rejected | Token-bucket algorithm, dynamic throttling |
| Network timeouts | Individual calls fail | Retry with exponential backoff |
| Service outages | Large swaths of the batch fail | Restart from checkpoint, record failures |
| Token limit exceeded | Over-long input causes errors | Pre-truncation, chunked processing |
| Runaway cost | Budget overruns | Token estimation, budget circuit breaker |
| Poor output quality | Unusable results | Output validation, human review queue |
1.2 Spring Batch Core Concepts
Three core concepts (a minimal skeleton follows the list):
- Job: one complete batch run. Analogy: one full customer-analysis job
- Step: one execution phase of a Job. Analogy: data preparation → AI analysis → result aggregation
- Chunk: the unit of chunked processing within a Step. Analogy: read 1,000 items at a time and commit one transaction per processed chunk
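A minimal skeleton showing how the three concepts fit together; bean names such as demoJob and demoStep are illustrative only, not part of the production configuration built later:

// DemoJobConfig.java (illustrative sketch only)
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class DemoJobConfig {
    @Bean
    public Job demoJob(JobRepository jobRepository, Step demoStep) {
        // Job: one complete batch run, composed of Steps
        return new JobBuilder("demoJob", jobRepository)
                .start(demoStep)
                .build();
    }

    @Bean
    public Step demoStep(JobRepository jobRepository,
                         PlatformTransactionManager txManager,
                         ItemReader<String> reader,
                         ItemWriter<String> writer) {
        // Chunk: read 1,000 items, process them, write them, commit one transaction
        return new StepBuilder("demoStep", jobRepository)
                .<String, String>chunk(1000, txManager)
                .reader(reader)
                .processor(String::toUpperCase) // trivial processor for illustration
                .writer(writer)
                .build();
    }
}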
1.3 Technology Stack
<!-- pom.xml: core dependencies -->
<dependencies>
<!-- Spring Batch core -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-batch</artifactId>
</dependency>
<!-- Spring AI (works with OpenAI and domestic Chinese models) -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Database (JobRepository storage) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>com.mysql</groupId>
<artifactId>mysql-connector-j</artifactId>
</dependency>
<!-- Elasticsearch output -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-elasticsearch</artifactId>
</dependency>
<!-- Redis (distributed locks, progress cache) -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>
<!-- Retry support -->
<dependency>
<groupId>org.springframework.retry</groupId>
<artifactId>spring-retry</artifactId>
</dependency>
<!-- Monitoring -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
</dependencies>
2. Domain Model Design
2.1 Database Schema
-- Customer source table
CREATE TABLE customer_profile (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
customer_id VARCHAR(64) NOT NULL UNIQUE,
name VARCHAR(128),
email VARCHAR(256),
purchase_history TEXT, -- purchase history as JSON
behavior_data TEXT, -- behavior data as JSON
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
INDEX idx_customer_id (customer_id)
);
-- AI analysis results table
CREATE TABLE customer_ai_analysis (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
customer_id VARCHAR(64) NOT NULL UNIQUE,
purchase_intent_score DECIMAL(5,4), -- 0.0000~1.0000
churn_risk_level VARCHAR(16), -- LOW/MEDIUM/HIGH/CRITICAL
recommend_tags TEXT, -- JSON array
analysis_summary TEXT,
model_version VARCHAR(64),
prompt_tokens INT,
completion_tokens INT,
processing_time_ms INT,
status VARCHAR(16) DEFAULT 'PENDING', -- PENDING/SUCCESS/FAILED/SKIPPED
error_message TEXT,
batch_job_id BIGINT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
INDEX idx_status (status),
INDEX idx_batch_job_id (batch_job_id)
);
-- Batch control table (custom; supplements Spring Batch's own metadata tables)
CREATE TABLE batch_job_control (
id BIGINT PRIMARY KEY AUTO_INCREMENT,
job_name VARCHAR(128),
total_count BIGINT,
processed_count BIGINT DEFAULT 0,
success_count BIGINT DEFAULT 0,
failed_count BIGINT DEFAULT 0,
estimated_cost DECIMAL(10,4),
actual_cost DECIMAL(10,4),
started_at DATETIME,
finished_at DATETIME,
status VARCHAR(16)
);
2.2 Java Domain Objects
// CustomerProfile.java
@Entity
@Table(name = "customer_profile")
@Data
@NoArgsConstructor
@AllArgsConstructor
public class CustomerProfile {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@Column(name = "customer_id", nullable = false, unique = true)
private String customerId;
private String name;
private String email;
@Column(name = "purchase_history", columnDefinition = "TEXT")
private String purchaseHistory; // JSON
@Column(name = "behavior_data", columnDefinition = "TEXT")
private String behaviorData; // JSON
@Column(name = "created_at")
private LocalDateTime createdAt;
}
// CustomerAiAnalysis.java
@Entity
@Table(name = "customer_ai_analysis")
@Data
@NoArgsConstructor
@AllArgsConstructor
@Builder
public class CustomerAiAnalysis {
@Id
@GeneratedValue(strategy = GenerationType.IDENTITY)
private Long id;
@Column(name = "customer_id", nullable = false, unique = true)
private String customerId;
@Column(name = "purchase_intent_score")
private BigDecimal purchaseIntentScore;
@Column(name = "churn_risk_level")
@Enumerated(EnumType.STRING)
private ChurnRiskLevel churnRiskLevel;
@Column(name = "recommend_tags", columnDefinition = "TEXT")
private String recommendTags; // JSON
@Column(name = "analysis_summary", columnDefinition = "TEXT")
private String analysisSummary;
@Column(name = "model_version")
private String modelVersion;
@Column(name = "prompt_tokens")
private Integer promptTokens;
@Column(name = "completion_tokens")
private Integer completionTokens;
@Column(name = "processing_time_ms")
private Integer processingTimeMs;
@Column(name = "status")
@Enumerated(EnumType.STRING)
private AnalysisStatus status;
@Column(name = "error_message", columnDefinition = "TEXT")
private String errorMessage;
@Column(name = "batch_job_id")
private Long batchJobId;
@Column(name = "created_at")
private LocalDateTime createdAt;
public enum ChurnRiskLevel {
LOW, MEDIUM, HIGH, CRITICAL
}
public enum AnalysisStatus {
PENDING, SUCCESS, FAILED, SKIPPED
}
}
// AI analysis request/response DTOs
@Data
@Builder
public class CustomerAnalysisRequest {
private String customerId;
private String purchaseHistory;
private String behaviorData;
private int maxTokens;
}
@Data
@Builder
public class CustomerAnalysisResult {
private String customerId;
private double purchaseIntentScore;
private String churnRiskLevel;
private List<String> recommendTags;
private String analysisSummary;
private String modelVersion;
private int promptTokens;
private int completionTokens;
private long processingTimeMs;
}
3. ItemReader: Reading Massive Data Efficiently
3.1 Paged Database Reads (JdbcPagingItemReader)
// CustomerItemReaderConfig.java
@Configuration
@RequiredArgsConstructor
public class CustomerItemReaderConfig {
private final DataSource dataSource;
/**
 * Paged database reader - recommended for large data volumes.
 * Uses JdbcPagingItemReader rather than JdbcCursorItemReader:
 * paged reads work in multi-threaded Steps, cursor reads do not support concurrency.
 */
@Bean
@StepScope
public JdbcPagingItemReader<CustomerProfile> customerPagingReader(
@Value("#{stepExecutionContext['minId']}") Long minId,
@Value("#{stepExecutionContext['maxId']}") Long maxId) {
// Build the SQL query (filtered to this partition's ID range)
Map<String, Object> parameterValues = new HashMap<>();
parameterValues.put("minId", minId);
parameterValues.put("maxId", maxId);
// Read only unprocessed rows (enables restart from checkpoint)
MySqlPagingQueryProvider queryProvider = new MySqlPagingQueryProvider();
queryProvider.setSelectClause("SELECT cp.*");
queryProvider.setFromClause("FROM customer_profile cp");
queryProvider.setWhereClause(
"WHERE cp.id BETWEEN :minId AND :maxId " +
"AND NOT EXISTS (" +
" SELECT 1 FROM customer_ai_analysis caa " +
" WHERE caa.customer_id = cp.customer_id " +
" AND caa.status = 'SUCCESS'" +
")"
);
Map<String, Order> sortKeys = new LinkedHashMap<>();
sortKeys.put("id", Order.ASCENDING);
queryProvider.setSortKeys(sortKeys);
JdbcPagingItemReader<CustomerProfile> reader = new JdbcPagingItemReaderBuilder<CustomerProfile>()
.name("customerPagingReader")
.dataSource(dataSource)
.queryProvider(queryProvider)
.parameterValues(parameterValues)
.pageSize(500) // 500 rows per page balances memory use and query overhead
.rowMapper(new CustomerProfileRowMapper())
.build();
return reader;
}
/**
 * File reader - reads from CSV/JSON files
 */
@Bean
@StepScope
public FlatFileItemReader<CustomerProfile> customerFileReader(
@Value("#{jobParameters['inputFile']}") String inputFile) {
return new FlatFileItemReaderBuilder<CustomerProfile>()
.name("customerFileReader")
.resource(new FileSystemResource(inputFile))
.linesToSkip(1) // skip the CSV header
.delimited()
.delimiter(",")
.names("customerId", "name", "email", "purchaseHistory", "behaviorData")
.fieldSetMapper(fieldSet -> {
CustomerProfile profile = new CustomerProfile();
profile.setCustomerId(fieldSet.readString("customerId"));
profile.setName(fieldSet.readString("name"));
profile.setEmail(fieldSet.readString("email"));
profile.setPurchaseHistory(fieldSet.readString("purchaseHistory"));
profile.setBehaviorData(fieldSet.readString("behaviorData"));
return profile;
})
.build();
}
/**
 * API reader - pulls data page by page from an external API.
 * Useful when the data source is a third-party system.
 */
@Bean
@StepScope
public ItemReader<CustomerProfile> customerApiReader(
@Value("#{jobParameters['apiBaseUrl']}") String apiBaseUrl) {
return new CustomerApiItemReader(apiBaseUrl);
}
// RowMapper implementation
private static class CustomerProfileRowMapper implements RowMapper<CustomerProfile> {
@Override
public CustomerProfile mapRow(ResultSet rs, int rowNum) throws SQLException {
CustomerProfile profile = new CustomerProfile();
profile.setId(rs.getLong("id"));
profile.setCustomerId(rs.getString("customer_id"));
profile.setName(rs.getString("name"));
profile.setEmail(rs.getString("email"));
profile.setPurchaseHistory(rs.getString("purchase_history"));
profile.setBehaviorData(rs.getString("behavior_data"));
return profile;
}
}
}
3.2 External API Reader (Full Implementation)
// CustomerApiItemReader.java
@Slf4j
public class CustomerApiItemReader implements ItemReader<CustomerProfile>, InitializingBean {
private final String apiBaseUrl;
private final RestTemplate restTemplate;
private final ObjectMapper objectMapper;
private int currentPage = 0;
private static final int PAGE_SIZE = 200;
private Queue<CustomerProfile> buffer = new LinkedList<>();
private boolean exhausted = false;
public CustomerApiItemReader(String apiBaseUrl) {
this.apiBaseUrl = apiBaseUrl;
this.restTemplate = new RestTemplate();
this.objectMapper = new ObjectMapper();
// configure timeouts
HttpComponentsClientHttpRequestFactory factory =
new HttpComponentsClientHttpRequestFactory();
factory.setConnectTimeout(5000);
factory.setReadTimeout(30000);
this.restTemplate.setRequestFactory(factory);
}
@Override
public CustomerProfile read() throws Exception {
// serve buffered items first
if (!buffer.isEmpty()) {
return buffer.poll();
}
// all pages consumed
if (exhausted) {
return null;
}
// fetch the next page from the API
fetchNextPage();
return buffer.isEmpty() ? null : buffer.poll();
}
private void fetchNextPage() {
try {
String url = String.format("%s/customers?page=%d&size=%d",
apiBaseUrl, currentPage, PAGE_SIZE);
ResponseEntity<String> response = restTemplate.getForEntity(url, String.class);
if (response.getStatusCode() == HttpStatus.OK) {
JsonNode root = objectMapper.readTree(response.getBody());
JsonNode content = root.get("content");
if (content != null && content.isArray() && content.size() > 0) {
for (JsonNode node : content) {
CustomerProfile profile = objectMapper.treeToValue(
node, CustomerProfile.class);
buffer.offer(profile);
}
currentPage++;
// check whether this was the last page
boolean isLast = root.path("last").asBoolean(true);
if (isLast || content.size() < PAGE_SIZE) {
exhausted = true;
}
} else {
exhausted = true;
}
}
} catch (Exception e) {
log.error("Failed to fetch page {} from API: {}", currentPage, e.getMessage());
exhausted = true;
}
}
@Override
public void afterPropertiesSet() {
Assert.hasText(apiBaseUrl, "apiBaseUrl must not be empty");
}
}
4. ItemProcessor: Calling the AI Service on Each Record
This is the heart of the batch system, and its most complex part.
4.1 Calling the AI Service (with Full Error Handling)
// CustomerAiProcessor.java
@Slf4j
@Component
@StepScope
@RequiredArgsConstructor
public class CustomerAiProcessor implements ItemProcessor<CustomerProfile, CustomerAiAnalysis> {
private final AiAnalysisService aiAnalysisService;
private final TokenBudgetService tokenBudgetService;
private final ProcessingMetrics metrics;
// running statistics for the current batch
private final AtomicLong processedCount = new AtomicLong(0);
private final AtomicLong successCount = new AtomicLong(0);
private final AtomicLong failedCount = new AtomicLong(0);
private final AtomicLong skippedCount = new AtomicLong(0);
@Override
public CustomerAiAnalysis process(CustomerProfile customer) throws Exception {
long startTime = System.currentTimeMillis();
long current = processedCount.incrementAndGet();
if (current % 1000 == 0) {
log.info("Processing progress: {}, success={}, failed={}, skipped={}",
current, successCount.get(), failedCount.get(), skippedCount.get());
}
// 1. Budget check
if (tokenBudgetService.isBudgetExhausted()) {
log.warn("Token budget exhausted, skipping customer: {}", customer.getCustomerId());
metrics.recordSkipped("budget_exhausted");
skippedCount.incrementAndGet();
return buildSkippedResult(customer, "BUDGET_EXHAUSTED");
}
// 2. Data-quality check
ValidationResult validation = validateCustomerData(customer);
if (!validation.isValid()) {
log.debug("Customer {} failed validation: {}",
customer.getCustomerId(), validation.getReason());
skippedCount.incrementAndGet();
return buildSkippedResult(customer, "INVALID_DATA: " + validation.getReason());
}
try {
// 3. Call the AI analysis
CustomerAnalysisResult result = aiAnalysisService.analyze(
buildRequest(customer));
// 4. Record token usage
tokenBudgetService.recordUsage(
result.getPromptTokens() + result.getCompletionTokens());
long elapsed = System.currentTimeMillis() - startTime;
metrics.recordSuccess(elapsed);
successCount.incrementAndGet();
return buildSuccessResult(customer, result, elapsed);
} catch (AiBudgetExceededException e) {
// Budget exceeded: stop the whole batch
tokenBudgetService.markBudgetExhausted();
log.error("Budget exceeded for customer {}, stopping batch",
customer.getCustomerId());
throw e;
} catch (AiServiceException e) {
// AI service failure (retries already exhausted)
long elapsed = System.currentTimeMillis() - startTime;
metrics.recordFailure(e.getClass().getSimpleName(), elapsed);
failedCount.incrementAndGet();
log.error("AI analysis failed for customer {}: {}",
customer.getCustomerId(), e.getMessage());
return buildFailedResult(customer, e.getMessage());
}
}
private CustomerAnalysisRequest buildRequest(CustomerProfile customer) {
// Token estimate: history (~500 tokens) + behavior (~300 tokens) + prompt template (~200 tokens),
// plus ~500 reserved completion tokens: roughly 1,500 tokens per record in total
return CustomerAnalysisRequest.builder()
.customerId(customer.getCustomerId())
.purchaseHistory(truncateToTokenLimit(customer.getPurchaseHistory(), 500))
.behaviorData(truncateToTokenLimit(customer.getBehaviorData(), 300))
.maxTokens(500)
.build();
}
private String truncateToTokenLimit(String text, int maxTokens) {
if (text == null) return "";
// Rough rule of thumb: 1 token ≈ 4 English characters but only ~1.5 Chinese characters,
// so use the Chinese ratio as the conservative bound for mixed-language text
int maxChars = (int) (maxTokens * 1.5);
return text.length() > maxChars ? text.substring(0, maxChars) : text;
}
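/**
 * Sketch of a more precise alternative (an assumption, not in the original article):
 * count real tokens with the jtokkit library (com.knuddels:jtokkit) instead of the
 * character heuristic above. Pick the EncodingType matching the target model
 * (CL100K_BASE shown here); Encoding.countTokens gives exact counts.
 */
private String truncateToTokenLimitExact(String text, int maxTokens) {
if (text == null) return "";
Encoding enc = Encodings.newDefaultEncodingRegistry()
.getEncoding(EncodingType.CL100K_BASE);
String result = text;
// shrink by 10% until the real token count fits; terminates because length strictly decreases
while (!result.isEmpty() && enc.countTokens(result) > maxTokens) {
result = result.substring(0, (int) (result.length() * 0.9));
}
return result;
}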
private ValidationResult validateCustomerData(CustomerProfile customer) {
if (StringUtils.isBlank(customer.getCustomerId())) {
return ValidationResult.invalid("empty customer_id");
}
if (StringUtils.isBlank(customer.getPurchaseHistory())
&& StringUtils.isBlank(customer.getBehaviorData())) {
return ValidationResult.invalid("no analyzable data");
}
return ValidationResult.valid();
}
private CustomerAiAnalysis buildSuccessResult(CustomerProfile customer,
CustomerAnalysisResult result, long elapsed) {
return CustomerAiAnalysis.builder()
.customerId(customer.getCustomerId())
.purchaseIntentScore(BigDecimal.valueOf(result.getPurchaseIntentScore()))
.churnRiskLevel(CustomerAiAnalysis.ChurnRiskLevel.valueOf(
result.getChurnRiskLevel()))
.recommendTags(JsonUtils.toJson(result.getRecommendTags()))
.analysisSummary(result.getAnalysisSummary())
.modelVersion(result.getModelVersion())
.promptTokens(result.getPromptTokens())
.completionTokens(result.getCompletionTokens())
.processingTimeMs((int) elapsed)
.status(CustomerAiAnalysis.AnalysisStatus.SUCCESS)
.createdAt(LocalDateTime.now())
.build();
}
private CustomerAiAnalysis buildFailedResult(CustomerProfile customer, String error) {
return CustomerAiAnalysis.builder()
.customerId(customer.getCustomerId())
.status(CustomerAiAnalysis.AnalysisStatus.FAILED)
.errorMessage(error)
.createdAt(LocalDateTime.now())
.build();
}
private CustomerAiAnalysis buildSkippedResult(CustomerProfile customer, String reason) {
return CustomerAiAnalysis.builder()
.customerId(customer.getCustomerId())
.status(CustomerAiAnalysis.AnalysisStatus.SKIPPED)
.errorMessage(reason)
.createdAt(LocalDateTime.now())
.build();
}
@Data
@AllArgsConstructor(staticName = "of")
private static class ValidationResult {
private boolean valid;
private String reason;
static ValidationResult valid() { return of(true, null); }
static ValidationResult invalid(String reason) { return of(false, reason); }
}
}
4.2 AI Service Layer (with Retry and Rate Limiting)
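The service below injects a Guava RateLimiter bean that the original configuration never shows being wired up. A minimal sketch of that wiring, assuming the Guava dependency and a 100-RPM API quota (both assumptions):

// RateLimiterConfig.java (hypothetical wiring, not shown in the original)
import com.google.common.util.concurrent.RateLimiter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class RateLimiterConfig {
    @Bean
    public RateLimiter rateLimiter() {
        // 100 requests/minute ≈ 1.67 permits/second; keep ~20% headroom → ~1.3 permits/second
        return RateLimiter.create(1.3);
    }
}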
// AiAnalysisService.java
@Slf4j
@Service
@RequiredArgsConstructor
public class AiAnalysisService {
private final ChatClient chatClient;
private final RateLimiter rateLimiter; // Guava RateLimiter
private final ObjectMapper objectMapper;
// System prompt (identical for every call; note it still counts toward prompt tokens on each request)
private static final String SYSTEM_PROMPT = """
You are a professional customer-analysis AI. Given the user's purchase history and behavior data, produce a structured analysis.
You MUST output JSON with the following fields:
- purchaseIntentScore: purchase-intent score, a decimal between 0.0 and 1.0
- churnRiskLevel: churn risk, one of LOW/MEDIUM/HIGH/CRITICAL
- recommendTags: array of recommendation tags, at most 5, each at most 10 characters
- analysisSummary: overall analysis summary, at most 100 characters
Output JSON only, with no extra explanation.
""";
/**
 * Analyze one customer's data.
 * @Retryable handles transient failures: up to 3 attempts with exponential backoff.
 */
@Retryable(
retryFor = {AiServiceTemporaryException.class},
maxAttempts = 3,
backoff = @Backoff(delay = 1000, multiplier = 2, maxDelay = 10000)
)
public CustomerAnalysisResult analyze(CustomerAnalysisRequest request) {
// Rate limiting: wait for a token-bucket permit.
// E.g., if the AI API allows 100 QPM, configure the rateLimiter at ~1.3 permits/second (about 20% headroom)
rateLimiter.acquire();
long startTime = System.currentTimeMillis();
String userMessage = buildUserMessage(request);
try {
ChatResponse response = chatClient.prompt()
.system(SYSTEM_PROMPT)
.user(userMessage)
.options(OpenAiChatOptions.builder()
.withModel("gpt-4o-mini") // prefer the mini model for batch jobs to cut cost
.withMaxTokens(request.getMaxTokens())
.withTemperature(0.1f) // low temperature for stable output
.build())
.call()
.chatResponse();
String content = response.getResult().getOutput().getContent();
Usage usage = response.getMetadata().getUsage();
// Parse the JSON response
AnalysisOutput output = parseOutput(content, request.getCustomerId());
return CustomerAnalysisResult.builder()
.customerId(request.getCustomerId())
.purchaseIntentScore(output.getPurchaseIntentScore())
.churnRiskLevel(output.getChurnRiskLevel())
.recommendTags(output.getRecommendTags())
.analysisSummary(output.getAnalysisSummary())
.modelVersion("gpt-4o-mini-2024-07")
.promptTokens((int) usage.getPromptTokens())
.completionTokens((int) usage.getGenerationTokens())
.processingTimeMs(System.currentTimeMillis() - startTime)
.build();
} catch (HttpClientErrorException.TooManyRequests e) {
// 429 Too Many Requests: transient error, triggers a retry
log.warn("Rate limited by AI API, will retry. Customer: {}",
request.getCustomerId());
throw new AiServiceTemporaryException("Rate limited", e);
} catch (HttpServerErrorException e) {
// 5xx server error: triggers a retry
log.warn("AI API server error {}, will retry. Customer: {}",
e.getStatusCode(), request.getCustomerId());
throw new AiServiceTemporaryException("Server error: " + e.getStatusCode(), e);
} catch (JsonProcessingException e) {
// JSON parse failure: don't retry, record as failed
log.error("Failed to parse AI response for customer {}: {}",
request.getCustomerId(), e.getMessage());
throw new AiServiceException("Invalid response format", e);
}
}
@Recover
public CustomerAnalysisResult recoverFromTemporaryException(
AiServiceTemporaryException e, CustomerAnalysisRequest request) {
// Recovery logic once all 3 attempts have failed
log.error("All retries exhausted for customer {}: {}",
request.getCustomerId(), e.getMessage());
throw new AiServiceException("Retries exhausted: " + e.getMessage(), e);
}
private String buildUserMessage(CustomerAnalysisRequest request) {
return String.format("""
Customer ID: %s
Purchase history: %s
Behavior data: %s
""",
request.getCustomerId(),
request.getPurchaseHistory(),
request.getBehaviorData()
);
}
private AnalysisOutput parseOutput(String content, String customerId)
throws JsonProcessingException {
// Strip the markdown code fences the model may wrap around the JSON
String json = content.trim();
if (json.startsWith("```json")) {
json = json.substring(7);
}
if (json.startsWith("```")) {
json = json.substring(3);
}
if (json.endsWith("```")) {
json = json.substring(0, json.length() - 3);
}
json = json.trim();
return objectMapper.readValue(json, AnalysisOutput.class);
}
@Data
@JsonIgnoreProperties(ignoreUnknown = true)
private static class AnalysisOutput {
private double purchaseIntentScore;
private String churnRiskLevel;
private List<String> recommendTags;
private String analysisSummary;
}
}
5. ItemWriter: Dual Writes to MySQL and Elasticsearch
5.1 Composite Writer (CompositeItemWriter)
// CustomerAnalysisWriterConfig.java
@Configuration
@RequiredArgsConstructor
public class CustomerAnalysisWriterConfig {
private final CustomerAiAnalysisRepository repository;
private final ElasticsearchOperations elasticsearchOperations;
/**
 * Composite writer: writes to MySQL and Elasticsearch in one pass
 */
@Bean
public CompositeItemWriter<CustomerAiAnalysis> compositeWriter() {
CompositeItemWriter<CustomerAiAnalysis> writer = new CompositeItemWriter<>();
writer.setDelegates(Arrays.asList(
mysqlWriter(),
elasticsearchWriter()
));
return writer;
}
/**
 * MySQL writer: upserts via a native query
 */
@Bean
public ItemWriter<CustomerAiAnalysis> mysqlWriter() {
return items -> {
// Upsert each item (update if the key exists, insert otherwise)
for (CustomerAiAnalysis analysis : items) {
repository.upsert(analysis);
}
};
}
/**
 * Elasticsearch writer: bulk indexing
 */
@Bean
public ItemWriter<CustomerAiAnalysis> elasticsearchWriter() {
return items -> {
// Index successful analyses only
List<CustomerAiAnalysis> successItems = items.getItems().stream()
.filter(a -> a.getStatus() == CustomerAiAnalysis.AnalysisStatus.SUCCESS)
.collect(Collectors.toList());
if (!successItems.isEmpty()) {
// Bulk index
List<IndexQuery> queries = successItems.stream()
.map(this::buildIndexQuery)
.collect(Collectors.toList());
elasticsearchOperations.bulkIndex(queries,
IndexCoordinates.of("customer-analysis"));
}
};
}
private IndexQuery buildIndexQuery(CustomerAiAnalysis analysis) {
// Build the ES document
CustomerAnalysisDocument doc = CustomerAnalysisDocument.builder()
.customerId(analysis.getCustomerId())
.purchaseIntentScore(analysis.getPurchaseIntentScore())
.churnRiskLevel(analysis.getChurnRiskLevel().name())
.recommendTags(JsonUtils.fromJson(analysis.getRecommendTags(), List.class))
.analysisSummary(analysis.getAnalysisSummary())
.analysisTime(analysis.getCreatedAt())
.build();
return new IndexQueryBuilder()
.withId(analysis.getCustomerId())
.withObject(doc)
.build();
}
}
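The writers above call a JsonUtils helper that the article references but never shows. A minimal Jackson-based sketch of what it might look like (the class itself is hypothetical, inferred from its call sites):

// JsonUtils.java (hypothetical helper)
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;

public final class JsonUtils {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    private JsonUtils() {}

    // used as JsonUtils.toJson(result.getRecommendTags())
    public static String toJson(Object value) {
        try {
            return MAPPER.writeValueAsString(value);
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("JSON serialization failed", e);
        }
    }

    // used as JsonUtils.fromJson(analysis.getRecommendTags(), List.class)
    public static <T> T fromJson(String json, Class<T> type) {
        try {
            return MAPPER.readValue(json, type);
        } catch (JsonProcessingException e) {
            throw new IllegalStateException("JSON deserialization failed", e);
        }
    }
}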
// CustomerAiAnalysisRepository.java (custom upsert)
@Repository
public interface CustomerAiAnalysisRepository extends JpaRepository<CustomerAiAnalysis, Long> {
Optional<CustomerAiAnalysis> findByCustomerId(String customerId);
@Modifying
@Query(value = """
INSERT INTO customer_ai_analysis
(customer_id, purchase_intent_score, churn_risk_level, recommend_tags,
analysis_summary, model_version, prompt_tokens, completion_tokens,
processing_time_ms, status, error_message, batch_job_id, created_at)
VALUES
(:#{#analysis.customerId}, :#{#analysis.purchaseIntentScore},
:#{#analysis.churnRiskLevel}, :#{#analysis.recommendTags},
:#{#analysis.analysisSummary}, :#{#analysis.modelVersion},
:#{#analysis.promptTokens}, :#{#analysis.completionTokens},
:#{#analysis.processingTimeMs}, :#{#analysis.status},
:#{#analysis.errorMessage}, :#{#analysis.batchJobId}, NOW())
ON DUPLICATE KEY UPDATE
purchase_intent_score = VALUES(purchase_intent_score),
churn_risk_level = VALUES(churn_risk_level),
recommend_tags = VALUES(recommend_tags),
analysis_summary = VALUES(analysis_summary),
model_version = VALUES(model_version),
prompt_tokens = VALUES(prompt_tokens),
completion_tokens = VALUES(completion_tokens),
processing_time_ms = VALUES(processing_time_ms),
status = VALUES(status),
error_message = VALUES(error_message),
batch_job_id = VALUES(batch_job_id)
""",
nativeQuery = true)
void upsert(@Param("analysis") CustomerAiAnalysis analysis);
}
6. Parallel Steps: Multi-Threaded Processing
6.1 Thread Pool Configuration
// BatchThreadPoolConfig.java
@Configuration
public class BatchThreadPoolConfig {
/**
 * Thread pool for AI batch processing.
 * Core size: 50 (50 concurrent AI requests)
 * Max size: 80 (headroom for bursts)
 * Queue: 100 (waiting buffer)
 *
 * Note: size the pool according to the AI API's rate limit.
 * GPT-4o-mini: 500 RPM / 60s ≈ 8.3 RPS → at most ~10 threads
 * Enterprise-tier API: 5000 RPM → up to ~80 threads
 */
@Bean("batchTaskExecutor")
public TaskExecutor batchTaskExecutor() {
ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
executor.setCorePoolSize(50);
executor.setMaxPoolSize(80);
executor.setQueueCapacity(100);
executor.setThreadNamePrefix("ai-batch-");
executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
executor.setWaitForTasksToCompleteOnShutdown(true);
executor.setAwaitTerminationSeconds(60);
executor.initialize();
return executor;
}
/**
 * Step-level executor for multi-threaded Steps.
 * Note: in a multi-threaded Step the ItemReader must be thread-safe.
 */
@Bean
public TaskExecutor stepTaskExecutor() {
SimpleAsyncTaskExecutor executor = new SimpleAsyncTaskExecutor("step-");
executor.setConcurrencyLimit(50);
return executor;
}
}
6.2 Multi-Threaded Step Configuration
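JdbcPagingItemReader is thread-safe, but the FlatFileItemReader from section 3.1 is not. If a file reader must run inside a multi-threaded Step, Spring Batch ships a synchronizing wrapper; a sketch, reusing the customerFileReader bean from section 3.1:

// SynchronizedReaderConfig.java (sketch; wraps the non-thread-safe file reader)
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.support.SynchronizedItemStreamReader;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SynchronizedReaderConfig {
    @Bean
    public SynchronizedItemStreamReader<CustomerProfile> synchronizedCustomerFileReader(
            FlatFileItemReader<CustomerProfile> customerFileReader) {
        // Serializes read() calls behind a lock so multiple chunk-processing
        // threads can safely share one underlying file reader.
        SynchronizedItemStreamReader<CustomerProfile> reader = new SynchronizedItemStreamReader<>();
        reader.setDelegate(customerFileReader);
        return reader;
    }
}

Reads then happen one at a time, but the expensive part, the AI call inside the processor, still runs in parallel.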
// CustomerAnalysisJobConfig.java
@Configuration
@EnableBatchProcessing
@Slf4j
@RequiredArgsConstructor
public class CustomerAnalysisJobConfig {
private final JobRepository jobRepository;
private final PlatformTransactionManager transactionManager;
private final CustomerItemReaderConfig readerConfig;
private final CustomerAiProcessor processor;
private final CustomerAnalysisWriterConfig writerConfig;
@Bean
public Job customerAnalysisJob() {
return new JobBuilder("customerAnalysisJob", jobRepository)
.start(dataPreparationStep())
.next(partitionedAnalysisStep()) // partitioned parallel Step
.next(resultAggregationStep())
.listener(new JobExecutionListener() {
@Override
public void afterJob(JobExecution jobExecution) {
// Send a notification once the job finishes
sendCompletionNotification(jobExecution);
}
})
.build();
}
/**
 * Multi-threaded Step (parallel chunk processing on a ThreadPoolTaskExecutor).
 * Suited to moderate volumes (under ~500k records).
 */
@Bean
public Step multiThreadedAnalysisStep(
@Qualifier("batchTaskExecutor") TaskExecutor taskExecutor) {
return new StepBuilder("multiThreadedAnalysisStep", jobRepository)
.<CustomerProfile, CustomerAiAnalysis>chunk(100, transactionManager)
.reader(readerConfig.customerPagingReader(null, null)) // requires a thread-safe reader
.processor(processor)
.writer(writerConfig.compositeWriter())
.taskExecutor(taskExecutor)
.throttleLimit(50) // maximum concurrent threads
.faultTolerant()
.skipLimit(10000) // tolerate up to 10,000 skipped failures
.skip(AiServiceException.class)
.retryLimit(3)
.retry(AiServiceTemporaryException.class)
.build();
}
private void sendCompletionNotification(JobExecution jobExecution) {
// Hook up email/DingTalk/Feishu notifications here
log.info("Job {} finished with status: {}",
jobExecution.getJobId(), jobExecution.getStatus());
}
}
7. Partitioning: True Parallel Scale-Out
Partitioning is a more powerful parallel model than a multi-threaded Step, and it extends to distributed processing across machines.
7.1 Implementing the Partitioner
// CustomerDataPartitioner.java
@Slf4j
@Component
@RequiredArgsConstructor
public class CustomerDataPartitioner implements Partitioner {
private final JdbcTemplate jdbcTemplate;
/**
 * Splits the data into N partitions by ID range.
 * Example: 1M rows into 20 partitions of ~50k rows each.
 */
@Override
public Map<String, ExecutionContext> partition(int gridSize) {
// Look up the ID range
Long minId = jdbcTemplate.queryForObject(
"SELECT MIN(id) FROM customer_profile", Long.class);
Long maxId = jdbcTemplate.queryForObject(
"SELECT MAX(id) FROM customer_profile", Long.class);
if (minId == null || maxId == null) {
return Collections.emptyMap();
}
long range = maxId - minId;
long partitionSize = (range / gridSize) + 1;
Map<String, ExecutionContext> partitions = new HashMap<>();
for (int i = 0; i < gridSize; i++) {
long start = minId + (long) i * partitionSize;
long end = Math.min(start + partitionSize - 1, maxId);
ExecutionContext context = new ExecutionContext();
context.putLong("minId", start);
context.putLong("maxId", end);
context.putString("name", "partition_" + i);
partitions.put("partition_" + i, context);
log.info("Created partition {}: id range [{}, {}]", i, start, end);
}
return partitions;
}
}
// Partitioned Step configuration
@Bean
public Step partitionedAnalysisStep() {
return new StepBuilder("partitionedAnalysisStep", jobRepository)
.partitioner("workerStep", customerDataPartitioner())
.step(workerStep())
.gridSize(20) // split into 20 partitions
.taskExecutor(stepTaskExecutor())
.build();
}
@Bean
public Step workerStep() {
return new StepBuilder("workerStep", jobRepository)
.<CustomerProfile, CustomerAiAnalysis>chunk(200, transactionManager)
// Note: @StepScope gives each partition its own reader instance
.reader(readerConfig.customerPagingReader(null, null))
.processor(processor)
.writer(writerConfig.compositeWriter())
.faultTolerant()
.skipLimit(5000)
.skip(AiServiceException.class)
.build();
}
7.2 Sizing the Partitions
Total records: 1,024,618
Partitions: 20
Per partition: ~51,231 records
Per-record time: ~2.5s (including the AI call)
Parallel efficiency (allowing for rate limiting): 70%
Theoretical time per partition = 51,231 × 2.5 / 0.7 / 3600 ≈ 50.8 hours (single-threaded)
Parallelized across 20 partitions = 50.8 / 20 ≈ 2.5 hours
In practice, the AI API's aggregate rate limit dominates (assume 200 RPS):
Max concurrency = 200 (API limit)
Time = 1,024,618 / 200 / 3600 ≈ 1.4 hours
Final estimate: 2-3 hours for a million records
8. Restart from Checkpoint: Seamless Recovery After Interruption
8.1 JobRepository Configuration
// BatchDataSourceConfig.java
@Configuration
public class BatchDataSourceConfig {
/**
 * Spring Batch stores Job metadata in its own data source.
 * This is the key configuration that enables restart from checkpoint.
 */
@Bean
@Primary
public DataSource batchDataSource() {
HikariDataSource ds = new HikariDataSource();
ds.setJdbcUrl("jdbc:mysql://localhost:3306/batch_metadata?useUnicode=true");
ds.setUsername("batch_user");
ds.setPassword("batch_pass");
ds.setMaximumPoolSize(20);
ds.setMinimumIdle(5);
ds.setConnectionTimeout(30000);
// Long-running batch transactions: extend the idle timeout
ds.setIdleTimeout(600000);
ds.setMaxLifetime(1800000);
return ds;
}
@Bean
public JobRepository jobRepository(DataSource dataSource,
PlatformTransactionManager transactionManager) throws Exception {
JobRepositoryFactoryBean factory = new JobRepositoryFactoryBean();
factory.setDataSource(dataSource);
factory.setTransactionManager(transactionManager);
factory.setDatabaseType("MYSQL");
factory.setTablePrefix("BATCH_"); // Spring Batch's default table prefix
factory.setMaxVarCharLength(2500);
factory.afterPropertiesSet();
return factory.getObject();
}
}
8.2 Restart Controller
// BatchJobController.java
@RestController
@RequestMapping("/api/batch")
@RequiredArgsConstructor
@Slf4j
public class BatchJobController {
private final JobLauncher jobLauncher;
private final Job customerAnalysisJob;
private final JobExplorer jobExplorer;
private final JobOperator jobOperator;
/**
 * Start a new batch job
 */
@PostMapping("/start")
public ResponseEntity<Map<String, Object>> startJob(
@RequestParam(defaultValue = "0") double budgetUsd) {
try {
JobParameters params = new JobParametersBuilder()
.addLong("timestamp", System.currentTimeMillis()) // makes each launch's JobParameters unique
.addDouble("budgetUsd", budgetUsd > 0 ? budgetUsd : 2000.0)
.toJobParameters();
JobExecution execution = jobLauncher.run(customerAnalysisJob, params);
return ResponseEntity.ok(Map.of(
"jobId", execution.getJobId(),
"status", execution.getStatus().name(),
"startTime", execution.getStartTime()
));
} catch (Exception e) {
log.error("Failed to start batch job", e);
return ResponseEntity.internalServerError()
.body(Map.of("error", e.getMessage()));
}
}
/**
 * Restart a failed job (resume from checkpoint).
 * Spring Batch automatically continues after the last committed chunk.
 */
@PostMapping("/restart/{jobExecutionId}")
public ResponseEntity<Map<String, Object>> restartJob(
@PathVariable Long jobExecutionId) {
try {
// Look up the failed JobExecution
JobExecution failedExecution = jobExplorer.getJobExecution(jobExecutionId);
if (failedExecution == null) {
return ResponseEntity.notFound().build();
}
if (failedExecution.getStatus() != BatchStatus.FAILED
&& failedExecution.getStatus() != BatchStatus.STOPPED) {
return ResponseEntity.badRequest()
.body(Map.of("error", "Job is not in FAILED or STOPPED status"));
}
// Restart with the same JobParameters
Long newExecutionId = jobOperator.restart(jobExecutionId);
return ResponseEntity.ok(Map.of(
"message", "Job restarted successfully",
"newJobExecutionId", newExecutionId,
"previousStatus", failedExecution.getStatus().name()
));
} catch (Exception e) {
log.error("Failed to restart job {}", jobExecutionId, e);
return ResponseEntity.internalServerError()
.body(Map.of("error", e.getMessage()));
}
}
/**
 * Query job progress
 */
@GetMapping("/progress/{jobExecutionId}")
public ResponseEntity<BatchProgressDTO> getProgress(
@PathVariable Long jobExecutionId) {
JobExecution execution = jobExplorer.getJobExecution(jobExecutionId);
if (execution == null) {
return ResponseEntity.notFound().build();
}
// Aggregate progress across all Steps
long totalRead = 0, totalWrite = 0, totalSkip = 0;
for (StepExecution stepExecution : execution.getStepExecutions()) {
totalRead += stepExecution.getReadCount();
totalWrite += stepExecution.getWriteCount();
totalSkip += stepExecution.getSkipCount();
}
BatchProgressDTO progress = BatchProgressDTO.builder()
.jobId(execution.getJobId())
.status(execution.getStatus().name())
.startTime(execution.getStartTime())
.endTime(execution.getEndTime())
.totalRead(totalRead)
.totalWrite(totalWrite)
.totalSkip(totalSkip)
.build();
return ResponseEntity.ok(progress);
}
/**
 * Stop a running job
 */
@PostMapping("/stop/{jobExecutionId}")
public ResponseEntity<Map<String, Object>> stopJob(
@PathVariable Long jobExecutionId) {
try {
jobOperator.stop(jobExecutionId);
return ResponseEntity.ok(Map.of(
"message", "Stop signal sent to job " + jobExecutionId
));
} catch (Exception e) {
return ResponseEntity.internalServerError()
.body(Map.of("error", e.getMessage()));
}
}
}
9. Progress Monitoring: Tracking Batch Status in Real Time
9.1 Custom Progress Monitoring
// BatchProgressMonitor.java
@Component
@RequiredArgsConstructor
@Slf4j
public class BatchProgressMonitor {
private final JobExplorer jobExplorer;
private final MeterRegistry meterRegistry;
private final StringRedisTemplate redisTemplate;
private static final String PROGRESS_KEY_PREFIX = "batch:progress:";
/**
 * Refresh progress on a schedule (every 10 seconds)
 */
@Scheduled(fixedDelay = 10000)
public void refreshProgress() {
// Find all running executions of the job
Set<JobExecution> runningJobs = jobExplorer.findRunningJobExecutions(
"customerAnalysisJob");
for (JobExecution execution : runningJobs) {
updateProgressMetrics(execution);
}
}
private void updateProgressMetrics(JobExecution execution) {
long jobId = execution.getJobId();
long totalRead = 0, totalWrite = 0, totalSkip = 0;
for (StepExecution step : execution.getStepExecutions()) {
totalRead += step.getReadCount();
totalWrite += step.getWriteCount();
totalSkip += step.getSkipCount();
}
// Publish to Prometheus
Gauge.builder("batch.progress.read", totalRead, Number::doubleValue)
.tag("jobId", String.valueOf(jobId))
.register(meterRegistry);
Gauge.builder("batch.progress.write", totalWrite, Number::doubleValue)
.tag("jobId", String.valueOf(jobId))
.register(meterRegistry);
// Cache in Redis (for front-end polling)
Map<String, String> progressData = new HashMap<>();
progressData.put("totalRead", String.valueOf(totalRead));
progressData.put("totalWrite", String.valueOf(totalWrite));
progressData.put("totalSkip", String.valueOf(totalSkip));
progressData.put("status", execution.getStatus().name());
progressData.put("updatedAt", LocalDateTime.now().toString());
redisTemplate.opsForHash().putAll(
PROGRESS_KEY_PREFIX + jobId, progressData);
redisTemplate.expire(PROGRESS_KEY_PREFIX + jobId, Duration.ofHours(24));
log.info("Job {} progress: read={}, write={}, skip={}",
jobId, totalRead, totalWrite, totalSkip);
}
/**
 * Step-level listener: logs each partition's progress
 */
@Bean
public StepExecutionListener partitionProgressListener() {
return new StepExecutionListener() {
@Override
public void beforeStep(StepExecution stepExecution) {
log.info("Starting partition: {}", stepExecution.getStepName());
}
@Override
public ExitStatus afterStep(StepExecution stepExecution) {
// StepExecution start/end times are LocalDateTime in Spring Batch 5, so use Duration
long elapsedMs = Duration.between(stepExecution.getStartTime(),
stepExecution.getEndTime()).toMillis();
log.info("Finished partition: {}, read={}, write={}, skip={}, time={}ms",
stepExecution.getStepName(),
stepExecution.getReadCount(),
stepExecution.getWriteCount(),
stepExecution.getSkipCount(),
elapsedMs
);
return stepExecution.getExitStatus();
}
};
}
}
9.2 Prometheus Alerting Rules
# prometheus-rules/batch-alerts.yml
groups:
- name: batch_job_alerts
rules:
# Job-runtime alert (still running after 4 hours)
- alert: BatchJobRunningTooLong
expr: time() - batch_job_start_time > 14400
for: 0m
labels:
severity: warning
annotations:
summary: "Batch job running too long"
description: "Job {{ $labels.job_name }} has been running for {{ $value | humanizeDuration }}"
# High failure-rate alert
- alert: BatchJobHighFailureRate
expr: rate(batch_item_skip_total[5m]) / rate(batch_item_read_total[5m]) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "Batch job failure rate too high"
description: "Failure rate is {{ $value | humanizePercentage }}"
# Throughput collapse (possibly AI API rate limiting)
- alert: BatchJobProcessingStalled
expr: rate(batch_item_write_total[10m]) < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Batch processing rate too low"十、成本控制:Token消耗估算和预算管理
10.1 Token Budget Service
// TokenBudgetService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class TokenBudgetService {
private final StringRedisTemplate redisTemplate;
private static final String BUDGET_KEY = "batch:token:budget";
private static final String USAGE_KEY = "batch:token:usage";
private static final String EXHAUSTED_KEY = "batch:token:exhausted";
// GPT-4o-mini pricing (2024)
// Input: $0.150/1M tokens, Output: $0.600/1M tokens
private static final double INPUT_COST_PER_TOKEN = 0.150 / 1_000_000;
private static final double OUTPUT_COST_PER_TOKEN = 0.600 / 1_000_000;
/**
 * Initialize the budget (called when the Job starts)
 */
public void initBudget(double budgetUsd) {
// Convert to a token quota (assuming an average input:output ratio of 3:1)
// budgetUsd = tokens × (0.75 × inputCost + 0.25 × outputCost)
double avgCostPerToken = 0.75 * INPUT_COST_PER_TOKEN + 0.25 * OUTPUT_COST_PER_TOKEN;
long tokenBudget = (long) (budgetUsd / avgCostPerToken);
redisTemplate.opsForValue().set(BUDGET_KEY, String.valueOf(tokenBudget));
redisTemplate.opsForValue().set(USAGE_KEY, "0");
redisTemplate.delete(EXHAUSTED_KEY);
log.info("Token budget initialized: ${} USD = {} tokens", budgetUsd, tokenBudget);
}
/**
 * Record token usage (thread-safe via Redis atomic increment)
 */
public void recordUsage(int tokens) {
Long currentUsage = redisTemplate.opsForValue().increment(USAGE_KEY, tokens);
// Log once per 100k tokens consumed
if (currentUsage != null && currentUsage % 100_000 < tokens) {
double costSoFar = estimateCost(currentUsage);
// SLF4J placeholders don't support printf-style formats, so pre-format the cost
log.info("Token usage: {} tokens, estimated cost: ${}",
currentUsage, String.format("%.4f", costSoFar));
}
}
/**
 * Check whether the budget is exhausted
 */
public boolean isBudgetExhausted() {
// Check the hard-exhausted flag first
if (Boolean.TRUE.equals(redisTemplate.hasKey(EXHAUSTED_KEY))) {
return true;
}
String budgetStr = redisTemplate.opsForValue().get(BUDGET_KEY);
String usageStr = redisTemplate.opsForValue().get(USAGE_KEY);
if (budgetStr == null || usageStr == null) return false;
return Long.parseLong(usageStr) >= Long.parseLong(budgetStr);
}
/**
 * Force-mark the budget as exhausted (trips the circuit breaker)
 */
public void markBudgetExhausted() {
redisTemplate.opsForValue().set(EXHAUSTED_KEY, "true");
log.warn("Budget exhausted flag set, all new processing will be stopped");
}
/**
 * Generate a cost report
 */
public CostReport generateReport() {
String usageStr = redisTemplate.opsForValue().get(USAGE_KEY);
long totalTokens = usageStr != null ? Long.parseLong(usageStr) : 0;
// Assume input:output = 3:1
long inputTokens = totalTokens * 3 / 4;
long outputTokens = totalTokens / 4;
double totalCost = inputTokens * INPUT_COST_PER_TOKEN +
outputTokens * OUTPUT_COST_PER_TOKEN;
return CostReport.builder()
.totalTokens(totalTokens)
.inputTokens(inputTokens)
.outputTokens(outputTokens)
.estimatedCostUsd(totalCost)
.build();
}
private double estimateCost(long tokens) {
return tokens * (0.75 * INPUT_COST_PER_TOKEN + 0.25 * OUTPUT_COST_PER_TOKEN);
}
@Data
@Builder
public static class CostReport {
private long totalTokens;
private long inputTokens;
private long outputTokens;
private double estimatedCostUsd;
@Override
public String toString() {
return String.format(
"Cost Report: total=%d tokens (input=%d, output=%d), " +
"estimated cost=$%.4f USD",
totalTokens, inputTokens, outputTokens, estimatedCostUsd
);
}
}
}
10.2 Batch Cost Estimator
// BatchCostEstimator.java
@Component
@Slf4j
public class BatchCostEstimator {
/**
 * Pre-run cost estimation
 */
public CostEstimate estimate(long totalRecords, EstimationConfig config) {
// Average token usage per record (from historical data or a sample run)
long avgTokensPerRecord = config.getAvgInputTokens() + config.getAvgOutputTokens();
// Total tokens
long totalTokens = totalRecords * avgTokensPerRecord;
// Cost
double inputCost = totalRecords * config.getAvgInputTokens() *
config.getInputCostPerToken();
double outputCost = totalRecords * config.getAvgOutputTokens() *
config.getOutputCostPerToken();
double totalCost = inputCost + outputCost;
// Time estimate: targetRps is treated as per-worker throughput, spread across `concurrency` workers
double avgSecondsPerRecord = 1.0 / config.getTargetRps();
double totalSeconds = totalRecords * avgSecondsPerRecord / config.getConcurrency();
long estimatedMinutes = (long) (totalSeconds / 60);
CostEstimate estimate = CostEstimate.builder()
.totalRecords(totalRecords)
.totalTokens(totalTokens)
.avgTokensPerRecord(avgTokensPerRecord)
.estimatedCostUsd(totalCost)
.estimatedMinutes(estimatedMinutes)
.concurrency(config.getConcurrency())
.build();
log.info("Batch cost estimation:\n{}", estimate.toDetailedString());
return estimate;
}
@Data
@Builder
public static class EstimationConfig {
private int avgInputTokens; // default 1000
private int avgOutputTokens; // default 300
private double inputCostPerToken; // GPT-4o-mini: 0.00000015
private double outputCostPerToken; // GPT-4o-mini: 0.00000060
private int targetRps; // target requests per second (per worker)
private int concurrency; // number of workers
public static EstimationConfig defaultConfig() {
return EstimationConfig.builder()
.avgInputTokens(1000)
.avgOutputTokens(300)
.inputCostPerToken(0.150 / 1_000_000)
.outputCostPerToken(0.600 / 1_000_000)
.targetRps(10)
.concurrency(20)
.build();
}
}
@Data
@Builder
public static class CostEstimate {
private long totalRecords;
private long totalTokens;
private long avgTokensPerRecord;
private double estimatedCostUsd;
private long estimatedMinutes;
private int concurrency;
public String toDetailedString() {
return String.format("""
┌─────────────────────────────────────────┐
│ Batch Cost Estimation │
├─────────────────────────────────────────┤
│ Total Records : %,15d │
│ Avg Tokens/Record: %,14d │
│ Total Tokens : %,15d │
│ Estimated Cost : $%,15.2f │
│ Concurrency : %,15d │
│ Estimated Time : %,12d min │
└─────────────────────────────────────────┘
""",
totalRecords, avgTokensPerRecord, totalTokens,
estimatedCostUsd, concurrency, estimatedMinutes
);
}
}
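// Usage sketch (an assumption, not in the original article): with the defaults above,
// estimate(1_024_618, EstimationConfig.defaultConfig()) yields ~1,300 tokens/record,
// ~$338 total (≈$154 input + ≈$184 output), and ~85 minutes at 10 RPS per worker
// across 20 workers; toDetailedString() prints the boxed report.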
}
11. Complete Job Configuration
// CompleteJobConfig.java
@Configuration
@EnableBatchProcessing
@EnableScheduling
@EnableRetry
@RequiredArgsConstructor
@Slf4j
public class CompleteJobConfig {
private final JobRepository jobRepository;
private final PlatformTransactionManager transactionManager;
private final CustomerDataPartitioner partitioner;
private final CustomerItemReaderConfig readerConfig;
private final CustomerAiProcessor processor;
private final CustomerAnalysisWriterConfig writerConfig;
private final BatchProgressMonitor progressMonitor;
// dependencies used below that were missing from the original field list
private final TokenBudgetService tokenBudgetService;
private final JdbcTemplate jdbcTemplate;
private final TaskExecutor stepTaskExecutor;
@Bean
public Job customerAnalysisJob(
Step dataValidationStep,
Step partitionedAnalysisStep,
Step costReportStep) {
return new JobBuilder("customerAnalysisJob", jobRepository)
.incrementer(new RunIdIncrementer())
.listener(jobExecutionListener())
.start(dataValidationStep)
.on("FAILED").fail()
.from(dataValidationStep)
.on("*").to(partitionedAnalysisStep)
.from(partitionedAnalysisStep)
.on("*").to(costReportStep)
.end()
.build();
}
@Bean
public Step dataValidationStep() {
return new StepBuilder("dataValidationStep", jobRepository)
.tasklet((contribution, chunkContext) -> {
// Data-quality check
Long totalCount = jdbcTemplate.queryForObject(
"SELECT COUNT(*) FROM customer_profile", Long.class);
Long missingDataCount = jdbcTemplate.queryForObject(
"SELECT COUNT(*) FROM customer_profile WHERE " +
"purchase_history IS NULL AND behavior_data IS NULL", Long.class);
double missingRate = (double) missingDataCount / totalCount;
if (missingRate > 0.5) {
log.error("Too many records with missing data: {}%",
String.format("%.1f", missingRate * 100));
// mark the step FAILED so the flow's .on("FAILED").fail() transition fires
contribution.setExitStatus(ExitStatus.FAILED);
return RepeatStatus.FINISHED;
}
log.info("Data validation passed: {} records, {}% missing-data rate",
totalCount, String.format("%.1f", missingRate * 100));
return RepeatStatus.FINISHED;
}, transactionManager)
.build();
}
@Bean
public Step partitionedAnalysisStep() {
return new StepBuilder("partitionedAnalysisStep", jobRepository)
.partitioner("workerStep", partitioner)
.step(workerStep())
.gridSize(20)
.taskExecutor(stepTaskExecutor)
.build();
}
@Bean
@JobScope
public Step workerStep() {
return new StepBuilder("workerStep", jobRepository)
.<CustomerProfile, CustomerAiAnalysis>chunk(200, transactionManager)
.reader(readerConfig.customerPagingReader(null, null))
.processor(processor)
.writer(writerConfig.compositeWriter())
.faultTolerant()
.skipLimit(10000)
.skip(AiServiceException.class)
.skip(DataIntegrityViolationException.class)
.retryLimit(3)
.retry(AiServiceTemporaryException.class)
.listener(progressMonitor.partitionProgressListener())
.build();
}
@Bean
public Step costReportStep() {
return new StepBuilder("costReportStep", jobRepository)
.tasklet((contribution, chunkContext) -> {
TokenBudgetService.CostReport report = tokenBudgetService.generateReport();
log.info("Final cost report: {}", report);
// Send the report to Slack/email
return RepeatStatus.FINISHED;
}, transactionManager)
.build();
}
private JobExecutionListener jobExecutionListener() {
return new JobExecutionListener() {
@Override
public void beforeJob(JobExecution jobExecution) {
double budget = jobExecution.getJobParameters()
.getDouble("budgetUsd", 2000.0);
tokenBudgetService.initBudget(budget);
log.info("Job started with budget: ${}", budget);
}
@Override
public void afterJob(JobExecution jobExecution) {
log.info("Job finished: status={}, duration={}s",
jobExecution.getStatus(),
Duration.between(jobExecution.getStartTime(),
jobExecution.getEndTime()).toSeconds()
);
}
};
}
}
12. Performance Test Results
12.1 Load Test Results
| Configuration | Records | Concurrency | Duration | Throughput | AI cost |
|---|---|---|---|---|---|
| Single thread | 10,000 | 1 | 8.3 h | 0.33 RPS | $18.6 |
| Multi-threaded Step | 10,000 | 20 | 28 min | 5.9 RPS | $18.6 |
| Partitioned (20 partitions) | 100,000 | 20×10 | 2.4 h | 11.6 RPS | $186 |
| Partitioned (20 partitions) | 1,024,618 | 20×10 | 44.4 h* | 6.4 RPS** | $1,847 |
*Constrained by the API rate limit (200 RPM). **Includes retry and rate-limit wait time.
12.2 JVM Tuning Flags
# Launch command (JVM flags tuned for the batch service)
java -server \
-Xms4g -Xmx8g \
-XX:+UseG1GC \
-XX:G1HeapRegionSize=16m \
-XX:MaxGCPauseMillis=200 \
-XX:+ParallelRefProcEnabled \
-XX:+DisableExplicitGC \
-XX:+HeapDumpOnOutOfMemoryError \
-XX:HeapDumpPath=/data/logs/batch-oom.hprof \
-jar customer-analysis-batch.jar \
--spring.profiles.active=production
FAQ
Q1: How does Spring Batch's restart-from-checkpoint work? Will data be reprocessed after a restart?
Spring Batch records each chunk's processing state in the JobRepository. A chunk is marked complete when its transaction commits, so after an interruption a restart resumes from the first uncommitted chunk. Data already written successfully is not reprocessed, though the last uncommitted chunk (at most chunkSize items) may run again. Combined with idempotent writes (such as upsert), this gives fully correct results.
Q2: In an AI batch job, how do you keep one failing partition from dragging down the whole Job?
Configure faultTolerant().skipLimit(10000) to continue past up to 10,000 failed records, and .skip(AiServiceException.class) to whitelist the skippable exception types. Failed records are written with status=FAILED, so they can be reprocessed separately after the job completes.
Q3: How do you estimate the real cost of an AI batch run?
Run a small test on about 1% of the data (10k records) first, record the actual token usage and cost, then scale up proportionally. The real cost often comes in 20-30% below the estimate, because many records have sparse data and get truncated.
Q4: Can Spring Batch handle real-time streaming data?
Spring Batch is a batch framework and is not suited to real-time workloads. For real-time AI processing, use Spring Cloud Stream + Kafka, or reactive programming (WebFlux). Spring Batch fits workloads that are scheduled, high-volume, and retryable.
Q5: How do multiple servers cooperate on the same batch (distributed batch processing)?
Replace the local Partitioner setup with remote partitioning via RemotePartitioningManagerStepBuilder (called RemotePartitioningMasterStepBuilder before Spring Batch 5). Worker nodes receive partition requests over a message broker (RabbitMQ/Kafka) and process them independently; the manager node only partitions and aggregates, and all workers share the same JobRepository (database). A manager-side sketch follows.
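A minimal sketch of the manager side, following the Spring Batch reference documentation; the outboundRequests channel and its broker binding are assumptions and not shown here:

// RemotePartitioningConfig.java (sketch; requires spring-batch-integration)
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.integration.config.annotation.EnableBatchIntegration;
import org.springframework.batch.integration.partition.RemotePartitioningManagerStepBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.MessageChannel;

@Configuration
@EnableBatchIntegration
public class RemotePartitioningConfig {
    @Bean
    public Step managerStep(JobRepository jobRepository,
                            CustomerDataPartitioner partitioner,
                            MessageChannel outboundRequests) {
        // With no reply channel configured, the manager polls the shared
        // JobRepository to detect when worker partitions have finished.
        return new RemotePartitioningManagerStepBuilder("managerStep", jobRepository)
                .partitioner("workerStep", partitioner)
                .gridSize(20)
                .outputChannel(outboundRequests) // partition requests go out via RabbitMQ/Kafka
                .build();
    }
}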
Summary
This article built a production-grade AI batch-processing system end to end. Its core capabilities: partitioned and multi-threaded parallel processing, restart from checkpoint via the JobRepository, retry and rate limiting around AI calls, token budgeting with cost estimation and a spend circuit breaker, and real-time progress monitoring.
Judged by Li Ming's team's experience, Spring Batch compressed a million-record AI analysis from three months to two days while cutting AI spend by 30%. The system has since run in production for six months and processed more than 20 million records.
