半结构化数据的 AI 处理——Excel 和 CSV 里的知识怎么进知识库

老张2026/4/30大约 12 分钟

半结构化数据的 AI 处理——Excel 和 CSV 里的知识怎么进知识库

有一次，一个做电商的朋友找我，说他们的 RAG 系统有个奇怪的问题：问"华为 Mate 60 Pro 的内存是多少"，系统给不出答案，但这个数据明明在产品参数表里。

我一看他们的知识库，发现是这么处理的：把 Excel 表格用 Python 的 pandas 直接读成文本，每一行变成一个字符串，像这样：

华为 Mate 60 Pro,6.82英寸,12GB,512GB,5000mAh,6999元

然后直接 embedding 存进向量库。

问题出在哪？用户问的是"华为 Mate 60 Pro 的内存是多少"，而向量库里存的是"华为 Mate 60 Pro,6.82英寸,12GB,512GB,5000mAh,6999元"，这串数字对 LLM 来说是没有语义的——它不知道 12GB 是内存还是存储，6999 是什么单位。

这就是表格数据处理的核心问题：表格的语义在行列的关系上，不在单元格的内容里。

表格数据和非结构化文本的本质差异

非结构化文本（文档、文章）的语义是线性的，顺序阅读就能理解。

表格数据的语义是二维的：

列头定义了每一列的含义（"内存"、"价格"）
行定义了一个实体的各个属性
单元格内容只有放在"行标题 + 列标题"的交叉点上才有意义

当你把表格行变成 12GB 这个孤立字符串，它的意义完全丢失了。

更复杂的情况是：

多级表头：列头有合并单元格，一个数据列对应两级标题（如"CPU"下面有"型号"和"主频"）
横向比较表：行是产品，列是属性，用于横向对比多个产品
纵向明细表：行是记录，列是字段，用于查询某一条记录的详情
混合表格：前几行是摘要，后面是明细，结构混合

这四种表格的处理方式完全不同，不能一刀切。

把表格关系转成 LLM 能理解的自然语言

核心思路：用自然语言重新描述每一行数据，把列头的语义嵌入到每条记录里。

原来的数据：

产品名称	屏幕尺寸	运行内存	存储容量	电池容量	价格
华为 Mate 60 Pro	6.82英寸	12GB	512GB	5000mAh	6999元

转换后的文本描述：

华为 Mate 60 Pro 的产品参数如下：
- 屏幕尺寸：6.82英寸
- 运行内存：12GB
- 存储容量：512GB
- 电池容量：5000mAh
- 价格：6999元

或者更紧凑的句子格式：

华为 Mate 60 Pro 是一款搭载 12GB 运行内存、512GB 存储容量的智能手机，屏幕尺寸 6.82英寸，电池容量 5000mAh，售价 6999元。

哪种格式更好？取决于查询场景：

属性查询（某个产品的某个参数）：列表格式更好，每个属性独占一行，向量匹配更精准
对比查询（哪个产品内存最大）：需要多行数据，需要特殊处理（后面会说）
描述性查询（介绍这个产品）：句子格式更自然

实战：产品参数表的完整 RAG 处理

我以一个真实场景来讲：电商平台的产品参数表，一张 Excel 有 2000 行（2000 个 SKU），30 列属性。

数据分析阶段

先理解数据结构，再决定处理策略。

@Service
public class ExcelAnalysisService {
    
    public ExcelTableStructure analyzeStructure(String filePath) throws IOException {
        ExcelTableStructure structure = new ExcelTableStructure();
        
        try (Workbook workbook = WorkbookFactory.create(new File(filePath))) {
            Sheet sheet = workbook.getSheetAt(0);
            
            // 检测表头行数（可能有多级表头）
            int headerRows = detectHeaderRows(sheet);
            structure.setHeaderRows(headerRows);
            
            // 提取列定义
            List<ColumnDefinition> columns = extractColumns(sheet, headerRows);
            structure.setColumns(columns);
            
            // 检测数据类型
            for (ColumnDefinition col : columns) {
                col.setDataType(detectColumnDataType(sheet, headerRows, col.getColumnIndex()));
            }
            
            // 统计数据行数
            structure.setDataRowCount(sheet.getLastRowNum() - headerRows);
            
            // 识别主键列（通常是产品名称、产品ID等唯一标识列）
            structure.setPrimaryKeyColumnIndex(detectPrimaryKeyColumn(columns));
        }
        
        return structure;
    }
    
    private int detectHeaderRows(Sheet sheet) {
        // 检查第一行和第二行是否有合并单元格
        // 有合并单元格通常意味着多级表头
        for (CellRangeAddress mergedRegion : sheet.getMergedRegions()) {
            if (mergedRegion.getFirstRow() == 0 && mergedRegion.getLastRow() > 0) {
                return mergedRegion.getLastRow() + 1;
            }
        }
        return 1;
    }
    
    private List<ColumnDefinition> extractColumns(Sheet sheet, int headerRows) {
        List<ColumnDefinition> columns = new ArrayList<>();
        Row firstRow = sheet.getRow(0);
        
        for (int i = 0; i < firstRow.getLastCellNum(); i++) {
            ColumnDefinition col = new ColumnDefinition();
            col.setColumnIndex(i);
            
            // 构建多级列名（如"CPU_型号"、"CPU_主频"）
            StringBuilder colName = new StringBuilder();
            for (int h = 0; h < headerRows; h++) {
                Row headerRow = sheet.getRow(h);
                Cell cell = headerRow.getCell(i);
                String cellValue = getCellStringValue(cell);
                
                if (!cellValue.isEmpty()) {
                    if (colName.length() > 0) colName.append("_");
                    colName.append(cellValue);
                }
            }
            col.setName(colName.toString());
            columns.add(col);
        }
        
        return columns;
    }
    
    private ColumnDataType detectColumnDataType(Sheet sheet, int headerRows, int colIndex) {
        // 采样前20行来判断数据类型
        int numericCount = 0;
        int textCount = 0;
        int sampleSize = Math.min(20, sheet.getLastRowNum() - headerRows);
        
        for (int i = headerRows; i < headerRows + sampleSize; i++) {
            Row row = sheet.getRow(i);
            if (row == null) continue;
            Cell cell = row.getCell(colIndex);
            if (cell == null) continue;
            
            if (cell.getCellType() == CellType.NUMERIC) {
                numericCount++;
            } else {
                textCount++;
            }
        }
        
        if (numericCount > textCount * 2) return ColumnDataType.NUMERIC;
        return ColumnDataType.TEXT;
    }
}

Excel 行转自然语言

@Service
public class ExcelToNaturalLanguageConverter {
    
    // 产品参数表专用转换器
    public String convertRowToDescription(Row dataRow, ExcelTableStructure structure, 
                                           String tableName, String tableDescription) {
        StringBuilder description = new StringBuilder();
        
        // 获取主键值（产品名称）
        String primaryKey = getCellValue(dataRow, structure.getPrimaryKeyColumnIndex());
        
        // 文档标题行
        description.append(primaryKey).append("的").append(tableName).append("参数：\n");
        
        // 遍历所有列，生成键值对描述
        for (ColumnDefinition col : structure.getColumns()) {
            if (col.getColumnIndex() == structure.getPrimaryKeyColumnIndex()) {
                continue;  // 跳过主键列
            }
            
            String value = getCellValue(dataRow, col.getColumnIndex());
            if (value == null || value.trim().isEmpty()) {
                continue;  // 跳过空值
            }
            
            // 处理单位（如果列名包含单位信息）
            String colName = col.getName();
            String valueWithUnit = appendUnit(value, colName);
            
            description.append("- ").append(colName).append("：").append(valueWithUnit).append("\n");
        }
        
        // 在描述末尾加上表格来源信息（有助于RAG时注明数据来源）
        description.append("\n[数据来源：").append(tableDescription).append("]");
        
        return description.toString();
    }
    
    // 生成横向对比摘要（用于支持对比类查询）
    public String generateComparisonSummary(List<Row> rows, ExcelTableStructure structure, 
                                             String comparisonColumn, List<String> keyColumns) {
        StringBuilder summary = new StringBuilder();
        summary.append("以下是各产品").append(comparisonColumn).append("的对比：\n\n");
        
        for (Row row : rows) {
            String name = getCellValue(row, structure.getPrimaryKeyColumnIndex());
            for (String keyCol : keyColumns) {
                int colIndex = structure.getColumnIndexByName(keyCol);
                if (colIndex >= 0) {
                    String value = getCellValue(row, colIndex);
                    summary.append("- ").append(name).append("的").append(keyCol)
                           .append("为").append(value).append("\n");
                }
            }
        }
        
        return summary.toString();
    }
    
    private String appendUnit(String value, String columnName) {
        // 常见单位映射
        Map<String, String> unitMap = Map.of(
            "价格", "元",
            "重量", "g",
            "电池", "mAh",
            "内存", "",  // GB已在值里
            "尺寸", ""   // 英寸已在值里
        );
        
        for (Map.Entry<String, String> entry : unitMap.entrySet()) {
            if (columnName.contains(entry.getKey()) && !entry.getValue().isEmpty()) {
                // 检查值里是否已经有单位
                if (!value.matches(".*[a-zA-Z元克千克]+.*")) {
                    return value + entry.getValue();
                }
            }
        }
        return value;
    }
    
    private String getCellValue(Row row, int colIndex) {
        Cell cell = row.getCell(colIndex);
        if (cell == null) return "";
        
        switch (cell.getCellType()) {
            case NUMERIC:
                if (DateUtil.isCellDateFormatted(cell)) {
                    return cell.getLocalDateTimeCellValue().toLocalDate().toString();
                }
                double numValue = cell.getNumericCellValue();
                // 避免 12.0 这样的浮点数显示
                if (numValue == Math.floor(numValue)) {
                    return String.valueOf((long) numValue);
                }
                return String.valueOf(numValue);
            case STRING:
                return cell.getStringCellValue().trim();
            case BOOLEAN:
                return String.valueOf(cell.getBooleanCellValue());
            default:
                return "";
        }
    }
}

完整 Excel -> 向量化处理逻辑

@Service
public class ExcelIngestionService {
    
    private final ExcelAnalysisService analysisService;
    private final ExcelToNaturalLanguageConverter converter;
    private final VectorStore vectorStore;
    private final EmbeddingModel embeddingModel;
    
    @Autowired
    public ExcelIngestionService(ExcelAnalysisService analysisService,
                                  ExcelToNaturalLanguageConverter converter,
                                  VectorStore vectorStore,
                                  EmbeddingModel embeddingModel) {
        this.analysisService = analysisService;
        this.converter = converter;
        this.vectorStore = vectorStore;
        this.embeddingModel = embeddingModel;
    }
    
    public ExcelIngestionResult ingest(ExcelIngestionRequest request) throws IOException {
        ExcelIngestionResult result = new ExcelIngestionResult();
        result.setStartTime(LocalDateTime.now());
        
        // 1. 分析表格结构
        ExcelTableStructure structure = analysisService.analyzeStructure(request.getFilePath());
        log.info("表格分析完成：{}列，{}行数据", structure.getColumns().size(), structure.getDataRowCount());
        
        List<Document> documentsToStore = new ArrayList<>();
        
        try (Workbook workbook = WorkbookFactory.create(new File(request.getFilePath()))) {
            Sheet sheet = workbook.getSheetAt(0);
            int headerRows = structure.getHeaderRows();
            
            // 2. 逐行处理数据
            for (int rowIndex = headerRows; rowIndex <= sheet.getLastRowNum(); rowIndex++) {
                Row row = sheet.getRow(rowIndex);
                if (row == null || isEmptyRow(row)) continue;
                
                // 生成单行描述（用于单条记录查询）
                String rowDescription = converter.convertRowToDescription(
                    row, structure, request.getTableName(), request.getTableDescription()
                );
                
                // 构建文档元数据
                Map<String, Object> metadata = new HashMap<>();
                metadata.put("source", request.getFilePath());
                metadata.put("tableName", request.getTableName());
                metadata.put("rowIndex", rowIndex);
                metadata.put("primaryKey", getCellValue(row, structure.getPrimaryKeyColumnIndex()));
                metadata.put("dataType", "EXCEL_ROW");
                
                documentsToStore.add(new Document(rowDescription, metadata));
                
                // 批量提交（每100条提交一次）
                if (documentsToStore.size() >= 100) {
                    vectorStore.add(documentsToStore);
                    result.addProcessedCount(documentsToStore.size());
                    documentsToStore.clear();
                    log.info("已处理 {} 行", result.getProcessedCount());
                }
            }
            
            // 3. 生成对比摘要文档（支持横向对比查询）
            if (request.isGenerateComparisonSummaries() && structure.getDataRowCount() <= 500) {
                // 数据量不太大时，生成整体对比摘要
                List<Row> allDataRows = new ArrayList<>();
                for (int i = headerRows; i <= sheet.getLastRowNum(); i++) {
                    Row row = sheet.getRow(i);
                    if (row != null && !isEmptyRow(row)) {
                        allDataRows.add(row);
                    }
                }
                
                // 为每个重要列生成对比摘要
                for (String compareCol : request.getComparisonColumns()) {
                    String summary = converter.generateComparisonSummary(
                        allDataRows, structure, compareCol, request.getKeyColumns()
                    );
                    
                    Map<String, Object> summaryMeta = new HashMap<>();
                    summaryMeta.put("source", request.getFilePath());
                    summaryMeta.put("tableName", request.getTableName());
                    summaryMeta.put("dataType", "EXCEL_COMPARISON_SUMMARY");
                    summaryMeta.put("compareColumn", compareCol);
                    
                    documentsToStore.add(new Document(summary, summaryMeta));
                }
            }
            
            // 提交剩余文档
            if (!documentsToStore.isEmpty()) {
                vectorStore.add(documentsToStore);
                result.addProcessedCount(documentsToStore.size());
            }
        }
        
        result.setEndTime(LocalDateTime.now());
        result.setStatus("SUCCESS");
        return result;
    }
    
    private boolean isEmptyRow(Row row) {
        for (Cell cell : row) {
            if (cell != null && cell.getCellType() != CellType.BLANK) {
                String value = "";
                if (cell.getCellType() == CellType.STRING) {
                    value = cell.getStringCellValue().trim();
                }
                if (!value.isEmpty()) return false;
            }
        }
        return true;
    }
    
    private String getCellValue(Row row, int colIndex) {
        Cell cell = row.getCell(colIndex);
        if (cell == null) return "";
        if (cell.getCellType() == CellType.STRING) return cell.getStringCellValue().trim();
        if (cell.getCellType() == CellType.NUMERIC) return String.valueOf((long)cell.getNumericCellValue());
        return "";
    }
}

真实场景：产品参数表的 RAG 查询

光有入库还不够，查询端也要做相应的处理。

产品参数表有一个独特的查询特点：用户经常问的是"哪个产品满足某个条件"，这类问题需要跨行搜索，纯 vector search 搞不定。

我设计了一个混合查询策略：

@Service
public class ProductQueryService {
    
    private final VectorStore vectorStore;
    private final ChatClient chatClient;
    private final JdbcTemplate jdbcTemplate;  // 结构化查询备用
    
    public String query(String userQuestion) {
        // 1. 判断查询类型
        QueryType queryType = classifyQuery(userQuestion);
        
        return switch (queryType) {
            case SINGLE_PRODUCT_ATTRIBUTE -> handleSingleProductQuery(userQuestion);
            case COMPARISON -> handleComparisonQuery(userQuestion);
            case FILTER -> handleFilterQuery(userQuestion);
        };
    }
    
    private String handleSingleProductQuery(String question) {
        // 单产品属性查询：直接用向量检索
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.query(question).withTopK(3)
        );
        
        return generateAnswer(question, docs);
    }
    
    private String handleComparisonQuery(String question) {
        // 对比查询：先检索对比摘要文档
        List<Document> summaryDocs = vectorStore.similaritySearch(
            SearchRequest.query(question)
                .withTopK(5)
                .withFilterExpression("dataType == 'EXCEL_COMPARISON_SUMMARY'")
        );
        
        // 再补充相关产品的详细信息
        List<Document> detailDocs = vectorStore.similaritySearch(
            SearchRequest.query(question)
                .withTopK(10)
                .withFilterExpression("dataType == 'EXCEL_ROW'")
        );
        
        List<Document> allDocs = new ArrayList<>(summaryDocs);
        allDocs.addAll(detailDocs);
        
        return generateAnswer(question, allDocs);
    }
    
    private String handleFilterQuery(String question) {
        // 条件筛选查询（如"价格在5000以内的手机有哪些"）
        // 这类查询纯向量搜索效果差，需要LLM辅助生成结构化查询条件
        
        // 先用LLM提取查询条件
        String conditionExtraction = chatClient.prompt()
            .user("""
                从以下问题中提取结构化查询条件，以JSON格式返回：
                问题：%s
                
                返回格式：
                {
                    "entity": "查询的实体类型",
                    "conditions": [
                        {"column": "列名", "operator": ">=/<=/=/contains", "value": "值"}
                    ]
                }
                """.formatted(question))
            .call()
            .content();
        
        // 解析条件，执行混合查询（向量 + 元数据过滤）
        // 实际实现需要根据向量库的能力来决定过滤方式
        // 这里以 metadata 过滤为例
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.query(question).withTopK(20)
        );
        
        // 用LLM从召回结果中筛选符合条件的
        return generateFilteredAnswer(question, docs, conditionExtraction);
    }
    
    private QueryType classifyQuery(String question) {
        // 关键词规则判断查询类型
        if (question.contains("比较") || question.contains("对比") || 
            question.contains("哪个好") || question.contains("区别")) {
            return QueryType.COMPARISON;
        }
        if (question.contains("有哪些") || question.contains("推荐") || 
            question.contains("以内") || question.contains("以上") || 
            question.contains("大于") || question.contains("小于")) {
            return QueryType.FILTER;
        }
        return QueryType.SINGLE_PRODUCT_ATTRIBUTE;
    }
    
    private String generateAnswer(String question, List<Document> context) {
        String contextText = context.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n---\n\n"));
        
        return chatClient.prompt()
            .system("你是一个产品信息查询助手，根据提供的产品参数信息回答用户问题。回答要准确、简洁。")
            .user("""
                用户问题：%s
                
                产品参数信息：
                %s
                
                请根据以上信息回答用户问题。如果信息中没有相关数据，请明确说明。
                """.formatted(question, contextText))
            .call()
            .content();
    }
    
    private String generateFilteredAnswer(String question, List<Document> docs, String conditions) {
        String contextText = docs.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n---\n\n"));
        
        return chatClient.prompt()
            .user("""
                用户问题：%s
                提取的查询条件：%s
                
                以下是候选产品信息（请从中筛选符合条件的）：
                %s
                
                请根据查询条件筛选并列出符合条件的产品。
                """.formatted(question, conditions, contextText))
            .call()
            .content();
    }
    
    enum QueryType {
        SINGLE_PRODUCT_ATTRIBUTE, COMPARISON, FILTER
    }
}

CSV 的特殊处理

CSV 比 Excel 简单，但也有自己的坑：

没有列宽信息：无法通过格式判断哪些是标题行
编码问题：中文 CSV 经常出现 GBK/UTF-8 混乱
数值格式不统一：有的 "1,234.56"（带千分符），有的 "1234.56"

@Service
public class CsvIngestionService {
    
    public List<Document> processCsv(String filePath, CsvIngestionConfig config) throws IOException {
        List<Document> documents = new ArrayList<>();
        
        // 自动检测编码
        Charset charset = detectCharset(filePath);
        
        try (Reader reader = new InputStreamReader(new FileInputStream(filePath), charset);
             CSVParser parser = new CSVParser(reader, CSVFormat.DEFAULT
                 .withFirstRecordAsHeader()
                 .withIgnoreEmptyLines()
                 .withTrim())) {
            
            List<String> headers = new ArrayList<>(parser.getHeaderNames());
            
            for (CSVRecord record : parser) {
                // 构建自然语言描述
                StringBuilder description = new StringBuilder();
                
                // 主键字段作为标题
                String primaryKey = config.getPrimaryKeyColumn() != null 
                    ? record.get(config.getPrimaryKeyColumn()) 
                    : record.get(0);
                description.append(primaryKey).append("：\n");
                
                for (String header : headers) {
                    if (header.equals(config.getPrimaryKeyColumn())) continue;
                    
                    String value = record.get(header);
                    if (value != null && !value.trim().isEmpty()) {
                        // 清洗数值（去除千分符）
                        value = cleanNumericValue(value);
                        description.append("- ").append(header).append("：").append(value).append("\n");
                    }
                }
                
                Map<String, Object> metadata = new HashMap<>();
                metadata.put("source", filePath);
                metadata.put("primaryKey", primaryKey);
                metadata.put("dataType", "CSV_ROW");
                
                documents.add(new Document(description.toString(), metadata));
            }
        }
        
        return documents;
    }
    
    private Charset detectCharset(String filePath) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(filePath));
        
        // BOM检测
        if (bytes.length >= 3 && bytes[0] == (byte)0xEF && bytes[1] == (byte)0xBB && bytes[2] == (byte)0xBF) {
            return StandardCharsets.UTF_8;
        }
        
        // 尝试UTF-8解码
        try {
            new String(bytes, StandardCharsets.UTF_8);
            return StandardCharsets.UTF_8;
        } catch (Exception e) {
            return Charset.forName("GBK");
        }
    }
    
    private String cleanNumericValue(String value) {
        // 去除千分符
        return value.replaceAll("(?<=\\d),(?=\\d{3})", "");
    }
}

Mermaid 图：Excel/CSV 向量化处理流程

一些踩坑记录

坑一：超大 Excel 的内存问题

Excel 有上万行时，如果用 WorkbookFactory.create() 一次性加载整个 workbook，内存会爆。要用 SXSSF（流式读取）：

// 大文件用流式读取
try (OPCPackage pkg = OPCPackage.open(new File(filePath))) {
    XSSFReader reader = new XSSFReader(pkg);
    // 使用 SAX 解析器逐行处理
    // ...
}

坑二：合并单元格的陷阱

Excel 里合并单元格，只有左上角那个格子有值，其他格子是空的。读取时要特别处理：

// 检查当前格子是否在合并区域内，是的话取合并区域左上角的值
private String getMergedCellValue(Sheet sheet, int row, int col) {
    for (CellRangeAddress range : sheet.getMergedRegions()) {
        if (range.isInRange(row, col)) {
            Row firstRow = sheet.getRow(range.getFirstRow());
            Cell firstCell = firstRow.getCell(range.getFirstColumn());
            return getCellStringValue(firstCell);
        }
    }
    return getCellStringValue(sheet.getRow(row).getCell(col));
}

坑三：数值型列存的是文字

有些 Excel，"价格"列里存的是"6999元"（字符串），有些存的是 6999（数字）。要做好兼容处理。

总结

表格数据进知识库，核心是把行列关系的语义显式化。不能把一行数据当普通文本对待。

我总结的处理原则：

先分析结构：搞清楚是产品参数表、对比表还是明细表，处理方式不同
列名即语义：每个字段的值必须和列名一起存储
生成摘要文档：在行级别之外，生成汇总级别的文档，支持对比类查询
查询端分类处理：单条查询、对比查询、条件筛选，用不同的检索策略

做好这几点，Excel 里的知识就能被 AI 真正理解和利用了。