Elasticsearch Inverted Index: How Analysis, Scoring, and Aggregations Work, with Hands-On Java API Examples

Lao Zhang · about 8 min read

Audience: intermediate to senior Java engineers | Reading time: about 20 minutes | Stack: Spring Boot 3.x, Elasticsearch 8.x, Spring Data Elasticsearch

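Opening Story

In 2020 we migrated product search from MySQL LIKE queries to Elasticsearch. Before the migration, SELECT * FROM product WHERE name LIKE '%手机%' took about 3 seconds against 5 million products ("手机" means "mobile phone"). After the migration the same search took 30 ms, a 100x speedup.

In the first week after launch, though, we ran into a problem that confused users: a search for "苹果手机" ("Apple phone") returned listings for "苹果" the fruit (apples), ranked above "iPhone 手机". The cause was tokenization. The analyzer broke the query into fragments that were matched independently of each other (the standard analyzer goes as far as single characters: ["苹", "果", "手", "机"]) instead of treating "苹果手机" as one product term. Fruit sellers repeat "苹果" many times in their shop and product names, so their term frequency, and with it their relevance score, came out higher than that of real phone listings.

We were running Elasticsearch's default standard analyzer, which is not designed for Chinese. After switching to the IK analyzer, "苹果手机" was tokenized as a single term, ["苹果手机"] (IK matches text against a dictionary, so a phrase it knows is kept whole), and result quality improved dramatically.

That experience is what made me appreciate how much tokenization and scoring matter in Elasticsearch.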

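1. Core Principles: The Inverted Index

Forward Index vs. Inverted Index

Forward index: document ID -> document content. Finding "手机" means scanning every document, which is slow.

Inverted index: term -> list of IDs of the documents that contain the term (the posting list). A query for "手机" looks the term up in the term dictionary and reads its posting list directly, without touching documents that do not contain it.

A posting list holds more than document IDs. It also stores the term frequency (TF), the term's positions (Pos), and its character offsets within the document; these feed scoring, phrase matching, and highlighting.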
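
To make the posting-list idea concrete, here is a minimal in-memory sketch. It is an illustration only, not how Lucene lays out its index on disk; the class name TinyInvertedIndex, the Posting record, and the pre-tokenized sample documents are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index: term -> postings (docId plus positions). Hypothetical, for illustration. */
public class TinyInvertedIndex {

    /** One posting: a document that contains the term, and where the term occurs in it. */
    public record Posting(int docId, List<Integer> positions) {
        /** Term frequency is simply the number of recorded positions. */
        public int termFrequency() {
            return positions.size();
        }
    }

    // A TreeMap keeps terms sorted, loosely mirroring a sorted term dictionary.
    private final Map<String, List<Posting>> postings = new TreeMap<>();

    /** Indexes an already-tokenized document. Documents must be added in increasing docId order. */
    public void add(int docId, List<String> tokens) {
        for (int pos = 0; pos < tokens.size(); pos++) {
            List<Posting> list = postings.computeIfAbsent(tokens.get(pos), t -> new ArrayList<>());
            // Start a new posting the first time this document contributes the term.
            if (list.isEmpty() || list.get(list.size() - 1).docId() != docId) {
                list.add(new Posting(docId, new ArrayList<>()));
            }
            list.get(list.size() - 1).positions().add(pos);
        }
    }

    /** Returns the posting list for a term, or an empty list if the term was never indexed. */
    public List<Posting> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, List.of("苹果", "手机", "新款"));
        index.add(2, List.of("红富士", "苹果", "苹果", "礼盒"));
        index.add(3, List.of("华为", "手机"));

        // Only the postings of the queried term are read; document contents are never scanned.
        System.out.println(index.lookup("手机"));
        System.out.println(index.lookup("苹果"));
    }
}
```

A lookup costs one dictionary access plus a walk over the matching postings, which is why query time tracks the number of matching documents rather than the size of the corpus.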

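TF-IDF and BM25 Scoring

Elasticsearch used TF-IDF as its default similarity before 5.0 and has used BM25, a refinement of TF-IDF, since 5.0. The old TF-IDF ("classic") similarity is no longer available for indices created on 7.0 or later.

TF (term frequency): how many times the term occurs in the document; more occurrences mean higher relevance.
IDF (inverse document frequency): the fewer documents a term appears in, the better it discriminates between documents and the more weight it gets.

BM25 saturates TF (each additional occurrence adds less than the previous one, so the contribution levels off) and adds document-length normalization (so long documents do not win simply because they contain more words).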
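
The saturation and length-normalization effects are easier to see with numbers. The sketch below implements the textbook single-term BM25 formula with k1 = 1.2 and b = 0.75 (Elasticsearch's documented defaults). It is an approximation for intuition only: Lucene's actual implementation differs in constant factors and in how it encodes document length, and the class name Bm25Sketch and the sample statistics are assumptions.

```java
/** Textbook BM25 for a single query term. Hypothetical sketch, not Lucene's exact scoring code. */
public class Bm25Sketch {

    static final double K1 = 1.2;  // controls how quickly term frequency saturates
    static final double B = 0.75;  // controls how strongly document length is normalized

    /** Inverse document frequency: rarer terms (small docFreq) get a larger weight. */
    static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 score of one term in one document. */
    static double score(double tf, double docLen, double avgDocLen, long docCount, long docFreq) {
        // Documents longer than average get a norm above 1, which dampens their TF.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        // This fraction approaches (K1 + 1) as tf grows, which is the saturation effect.
        double tfPart = tf * (K1 + 1.0) / (tf + K1 * lengthNorm);
        return idf(docCount, docFreq) * tfPart;
    }

    public static void main(String[] args) {
        // Same document length, growing term frequency: gains shrink with every step.
        for (int tf : new int[] {1, 2, 5, 20, 100}) {
            System.out.printf("tf=%d score=%.4f%n", tf, score(tf, 100, 100, 1000, 50));
        }
        // Same term frequency, different lengths: the shorter document scores higher.
        System.out.printf("short=%.4f long=%.4f%n",
            score(2, 50, 100, 1000, 50), score(2, 400, 100, 1000, 50));
    }
}
```

This is also why keyword stuffing helps far less under BM25 than under plain TF-IDF: a name that repeats a term 100 times scores only slightly higher than one that repeats it 20 times.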

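The Analyzer Pipeline

CharFilter: preprocesses the raw text, e.g. strips HTML tags or maps special characters.
Tokenizer: the core step; splits the text into a list of tokens.
TokenFilter: post-processes the tokens, e.g. lowercasing, stop-word removal, stemming.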
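
The three stages compose like an ordinary function pipeline. The following sketch mimics that flow in plain Java so the order of operations is visible. It does not use Elasticsearch or Lucene classes; the class name AnalyzerPipelineSketch, the regex-based HTML stripping, the whitespace tokenizer, and the small English stop-word set are simplifying assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Plain-Java imitation of CharFilter -> Tokenizer -> TokenFilter. Hypothetical, for illustration. */
public class AnalyzerPipelineSketch {

    // Assumed stop-word list for the example.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of");

    /** CharFilter stage: operates on the raw character stream (here: a crude HTML tag strip). */
    static String charFilter(String raw) {
        return raw.replaceAll("<[^>]+>", " ");
    }

    /** Tokenizer stage: turns the filtered text into tokens (here: split on whitespace). */
    static List<String> tokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.toList());
    }

    /** TokenFilter stage: rewrites or drops tokens (here: lowercase, then stop-word removal). */
    static List<String> tokenFilter(List<String> tokens) {
        return tokens.stream()
            .map(token -> token.toLowerCase(Locale.ROOT))
            .filter(token -> !STOP_WORDS.contains(token))
            .collect(Collectors.toList());
    }

    /** Full pipeline, applied in the same order an analyzer applies its stages. */
    static List<String> analyze(String raw) {
        return tokenFilter(tokenize(charFilter(raw)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("<b>The</b> Price of a Phone")); // prints [price, phone]
    }
}
```

Order matters: lowercasing runs before the stop-word check here, so "The" is removed even though the stop list only contains "the". The custom analyzer defined later in this article lists lowercase before stop_filter for the same reason.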

2. Full Code Implementation

Index Creation (IK Analyzer + Custom Mapping)

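import co.elastic.clients.elasticsearch.ElasticsearchClient;
import jakarta.annotation.PostConstruct;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Configuration;

import java.io.IOException;

@Configuration
public class ElasticsearchIndexConfig {

    @Autowired
    private ElasticsearchClient esClient;

    /**
     * Creates the product index (Chinese IK analysis plus multi-field support).
     * The IK analysis plugin must be installed on the cluster for this to succeed.
     */
    @PostConstruct
    public void createProductIndex() throws IOException {
        String indexName = "products";

        // Skip creation if the index already exists
        boolean exists = esClient.indices().exists(e -> e.index(indexName)).value();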
        if (exists) {
            return;
        }

        esClient.indices().create(c -> c
            .index(indexName)
            .settings(s -> s
                .numberOfShards("3")     // 3 primary shards
                .numberOfReplicas("1")   // 1 replica per primary shard
                .analysis(a -> a
                    // Custom analyzer: IK tokenizer + lowercase + stop-word filter
                    .analyzer("ik_smart_custom", an -> an
                        .custom(cu -> cu
                            .tokenizer("ik_smart")
                            .filter("lowercase", "stop_filter")
                        )
                    )
                    .filter("stop_filter", f -> f
                        .definition(d -> d
                            .stop(st -> st
                                .stopwords("的", "了", "在", "是", "我", "有", "和", "就")
                            )
                        )
                    )
                )
            )
            .mappings(m -> m
                .properties("id", p -> p.keyword(k -> k))
                .properties("name", p -> p
                    .text(t -> t
                        .analyzer("ik_max_word")    // fine-grained tokenization at index time
                        .searchAnalyzer("ik_smart") // coarse-grained tokenization at search time
                        .fields("keyword", f -> f.keyword(k -> k)) // sub-field for exact matching
                    )
                )
                .properties("category", p -> p.keyword(k -> k))
                .properties("description", p -> p
                    .text(t -> t.analyzer("ik_smart_custom"))
                )
                .properties("price", p -> p.double_(d -> d))
                .properties("stock", p -> p.integer(i -> i))
                .properties("createTime", p -> p.date(d -> d.format("yyyy-MM-dd HH:mm:ss")))
                .properties("tags", p -> p.keyword(k -> k)) // tags, exact match
            )
        );
    }
}

Product CRUD

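import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.SortOrder;
import co.elastic.clients.elasticsearch._types.aggregations.StringTermsAggregate;
import co.elastic.clients.elasticsearch._types.query_dsl.TextQueryType;
import co.elastic.clients.elasticsearch.core.BulkResponse;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.elasticsearch.core.bulk.BulkOperation;
import co.elastic.clients.json.JsonData;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.List;
import java.util.stream.Collectors;

// Product, ProductSearchRequest, SearchResult, ProductHitResult, CategoryStat and
// CategoryAggResult are application DTOs that are not shown in this article.
@Service
@Slf4j
public class ProductSearchService {

    @Autowired
    private ElasticsearchClient esClient;

    private static final String INDEX_NAME = "products";

    /**
     * Creates or updates a product document
     */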
    public void indexProduct(Product product) throws IOException {
        esClient.index(i -> i
            .index(INDEX_NAME)
            .id(product.getId().toString())
            .document(product)
        );
    }

    /**
     * Bulk indexing (prefer the bulk API for large batches)
     */
    public void bulkIndexProducts(List<Product> products) throws IOException {
        List<BulkOperation> operations = products.stream()
            .map(product -> BulkOperation.of(b -> b
                .index(i -> i
                    .index(INDEX_NAME)
                    .id(product.getId().toString())
                    .document(product)
                )
            ))
            .collect(Collectors.toList());

        BulkResponse response = esClient.bulk(b -> b.operations(operations));

        if (response.errors()) {
            log.error("Bulk indexing had failed items");
            response.items().stream()
                .filter(item -> item.error() != null)
                .forEach(item -> log.error("Indexing failed: id={}, error={}",
                    item.id(), item.error().reason()));
        }

        log.info("Bulk indexing finished, succeeded={}, took={}ms",
            products.size() - response.items().stream().filter(i -> i.error() != null).count(),
            response.took());
    }

    /**
     * Full-text search (multiple fields + Chinese analysis + highlighting)
     */
    public SearchResult<Product> search(ProductSearchRequest request) throws IOException {
        SearchResponse<Product> response = esClient.search(s -> s
            .index(INDEX_NAME)
            .from((request.getPage() - 1) * request.getSize())
            .size(request.getSize())
            .query(q -> q
                .bool(b -> {
                    // Full-text search
                    if (request.getKeyword() != null && !request.getKeyword().isBlank()) {
                        b.must(m -> m
                            .multiMatch(mm -> mm
                                .query(request.getKeyword())
                                .fields("name^3", "description^1") // name is weighted 3x relative to description
                                .type(TextQueryType.BestFields)
                                .minimumShouldMatch("75%")
                            )
                        );
                    }
                    // Category filter
                    if (request.getCategoryId() != null) {
                        b.filter(f -> f
                            .term(t -> t.field("category").value(request.getCategoryId()))
                        );
                    }
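                    // Price range filter.
                    // Note: this untyped range builder style matches earlier 8.x Java clients; newer
                    // client versions may require a typed variant, so check it against your version.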
                    if (request.getMinPrice() != null || request.getMaxPrice() != null) {
                        b.filter(f -> f
                            .range(r -> {
                                r.field("price");
                                if (request.getMinPrice() != null) {
                                    r.gte(JsonData.of(request.getMinPrice()));
                                }
                                if (request.getMaxPrice() != null) {
                                    r.lte(JsonData.of(request.getMaxPrice()));
                                }
                                return r;
                            })
                        );
                    }
                    return b;
                })
            )
            // Highlighting
            .highlight(h -> h
                .fields("name", f -> f
                    .preTags("<em class='highlight'>")
                    .postTags("</em>")
                    .numberOfFragments(1)
                )
                .fields("description", f -> f
                    .preTags("<em class='highlight'>")
                    .postTags("</em>")
                    .numberOfFragments(2)
                    .fragmentSize(100)
                )
            )
            // Sorting
            .sort(so -> {
                if ("price_asc".equals(request.getSortBy())) {
                    return so.field(f -> f.field("price").order(SortOrder.Asc));
                } else if ("price_desc".equals(request.getSortBy())) {
                    return so.field(f -> f.field("price").order(SortOrder.Desc));
                } else {
                    return so.score(sc -> sc.order(SortOrder.Desc)); // default: sort by relevance
                }
            }),
            Product.class
        );

        // Map hits to result objects
        List<ProductHitResult> hits = response.hits().hits().stream()
            .map(hit -> ProductHitResult.builder()
                .product(hit.source())
                .score(hit.score())
                .highlights(hit.highlight())
                .build()
            )
            .collect(Collectors.toList());

        return SearchResult.<Product>builder()
            .hits(hits)
            .total(response.hits().total().value())
            .took(response.took())
            .build();
    }

    /**
     * Aggregation query (product count per category, plus price statistics)
     */
    public CategoryAggResult aggregateByCategory(String keyword) throws IOException {
        SearchResponse<Void> response = esClient.search(s -> s
            .index(INDEX_NAME)
            .size(0) // no documents needed, only the aggregation results
            .query(q -> q
                .match(m -> m.field("name").query(keyword))
            )
            .aggregations("category_count", a -> a
                .terms(t -> t.field("category").size(20))
                .aggregations("avg_price", aa -> aa
                    .avg(avg -> avg.field("price"))
                )
                .aggregations("price_range", aa -> aa
                    .range(r -> r
                        .field("price")
                        .ranges(
                            rv -> rv.to(100.0),
                            rv -> rv.from(100.0).to(500.0),
                            rv -> rv.from(500.0).to(2000.0),
                            rv -> rv.from(2000.0)
                        )
                    )
                )
            ),
            Void.class
        );

        // Parse the aggregation results
        StringTermsAggregate categoryAgg = response.aggregations()
            .get("category_count").sterms();

        List<CategoryStat> stats = categoryAgg.buckets().array().stream()
            .map(bucket -> {
                double avgPrice = bucket.aggregations()
                    .get("avg_price").avg().value();
                return CategoryStat.builder()
                    .category(bucket.key().stringValue())
                    .count(bucket.docCount())
                    .avgPrice(avgPrice)
                    .build();
            })
            .collect(Collectors.toList());

        return CategoryAggResult.builder().stats(stats).build();
    }
}

Autocomplete (Suggest)

/**
 * Autocomplete for search terms
 */
public List<String> suggest(String prefix) throws IOException {
    SearchResponse<Product> response = esClient.search(s -> s
        .index(INDEX_NAME)
        .suggest(su -> su
            .suggesters("name_suggest", sg -> sg
                .prefix(prefix)
                .completion(c -> c
                    .field("name_suggest") // requires a completion-type field, which must be added to the mapping
                    .size(10)
                    .skipDuplicates(true)
                )
            )
        ),
        Product.class
    );

    return response.suggest().get("name_suggest").stream()
        .flatMap(suggestion -> suggestion.completion().options().stream())
        .map(option -> option.text())
        .collect(Collectors.toList());
}

3. Production Tuning

Debugging Tokenization

/**
 * Shows how a given analyzer tokenizes a text (for debugging)
 */
public List<String> analyzeText(String text, String analyzer) throws IOException {
    AnalyzeResponse response = esClient.indices().analyze(a -> a
        .index(INDEX_NAME)
        .analyzer(analyzer)
        .text(text)
    );

    return response.tokens().stream()
        .map(token -> token.token())
        .collect(Collectors.toList());
}

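The two IK analyzers most commonly used in production:

ik_smart: coarse-grained segmentation, used at search time
ik_max_word: fine-grained segmentation, used at index time

A custom dictionary (a custom.dic file created in the IK configuration directory and referenced from IKAnalyzer.cfg.xml) lets you add business vocabulary so that a phrase such as "苹果手机" is recognized as a single term.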
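
To see why a dictionary entry changes the tokens, consider a simple forward maximum matching segmenter. This is a deliberately simplified stand-in and not IK's actual algorithm (IK has its own dictionary structures and disambiguation rules); the class name DictionarySegmenterSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Forward maximum matching over a word dictionary. Simplified illustration, not IK's algorithm. */
public class DictionarySegmenterSketch {

    private final Set<String> dictionary = new HashSet<>();
    private int maxWordLength = 1;

    /** Adds a word to the dictionary, comparable to adding a line to a custom dictionary file. */
    public void addWord(String word) {
        dictionary.add(word);
        maxWordLength = Math.max(maxWordLength, word.length());
    }

    /** At each position, takes the longest dictionary word; falls back to a single character. */
    public List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxWordLength);
            // Shrink the candidate until it is a known word or only one character is left.
            while (end > start + 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        DictionarySegmenterSketch segmenter = new DictionarySegmenterSketch();
        System.out.println(segmenter.segment("苹果手机")); // empty dictionary: one token per character

        segmenter.addWord("苹果");
        segmenter.addWord("手机");
        System.out.println(segmenter.segment("苹果手机")); // two dictionary words

        segmenter.addWord("苹果手机");
        System.out.println(segmenter.segment("苹果手机")); // the whole phrase is now a single token
    }
}
```

With an empty dictionary the output degenerates to one token per character, which is what the standard analyzer produces for Chinese. Adding the phrase itself is what keeps it whole.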

Scoring Tuning

// Function Score Query: custom scoring rules
esClient.search(s -> s
    .query(q -> q
        .functionScore(fs -> fs
            .query(innerQ -> innerQ
                .multiMatch(mm -> mm.query(keyword).fields("name^3", "description"))
            )
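            // Boost new arrivals (created within the last 7 days).
            // Date math is used because createTime is mapped with the format "yyyy-MM-dd HH:mm:ss";
            // a bare LocalDate string such as "2024-01-01" would not parse against that format.
            .functions(f -> f
                .filter(filter -> filter
                    .range(r -> r.field("createTime")
                        .gte(JsonData.of("now-7d/d")))
                )
                .weight(1.5)
            )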
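            // Boost popular products by view count.
            // Assumes a numeric viewCount field, which the mapping shown earlier does not define.
            // Caution: log1p of 0 is 0, so with Multiply a product with no views ends up with score 0;
            // consider the Ln2p modifier or a Sum boost mode if that is not intended.
            .functions(f -> f
                .fieldValueFactor(fvf -> fvf
                    .field("viewCount")
                    .factor(0.1)
                    .modifier(FieldValueFactorModifier.Log1p)
                    .missing(0.0)
                )
            )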
            .boostMode(FunctionBoostMode.Multiply)
        )
    )
    .index(INDEX_NAME),
    Product.class
);

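4. Pitfalls from Production

Pitfall 1: The default analyzer is a disaster for Chinese (the opening story)

The default standard analyzer splits Chinese text character by character: "苹果手机" becomes ["苹", "果", "手", "机"] and the meaning of the phrase is lost. Search quality improved markedly once we moved to IK. The custom dictionary also has to be maintained: IK only recognizes new brand names and industry terms after they have been added to it.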

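Pitfall 2: Mapping changes are restricted

Once an index mapping exists, the type of an existing field cannot be changed; you can only add new fields. We had once mapped the product ID as long. When the business later needed string IDs, the only option was to reindex (create a new index, migrate the data, then switch the alias), which disrupted normal use for almost half a day.

Lesson: map ID fields as keyword. It accepts both numeric and string IDs, so it is the most flexible choice.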

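Pitfall 3: Deep pagination performance

from + size paging degrades badly on deep pages. With from=10000 and size=10, every shard has to collect its top 10010 hits, and the coordinating node then merges and sorts them; with 10 shards that is 100100 entries fetched and sorted. By default Elasticsearch also rejects requests where from + size exceeds 10000 (the index.max_result_window setting).

Solutions:

Use search_after (cursor-style pagination): the sort values of the last hit on the current page become the starting point of the next page, so the cost per page does not grow with page depth
Use the scroll API (for data export, not for user-facing paging)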
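
The arithmetic and the cursor idea can both be sketched without a cluster. The code below is a plain-Java simulation, not the Elasticsearch API: fromSizeCost reproduces the calculation above, and searchAfter pages through an in-memory list sorted by a unique key, the way search_after resumes from the last hit's sort value. The class and method names are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** In-memory illustration of from+size cost versus cursor-style paging. Not the ES client API. */
public class DeepPagingSketch {

    /** Entries the coordinating node must gather for one from+size page across all shards. */
    static long fromSizeCost(int from, int size, int shards) {
        return (long) (from + size) * shards;
    }

    /**
     * Returns up to `size` keys strictly greater than `afterKey` (null means first page).
     * `sortedKeys` must be sorted ascending and contain no duplicates.
     */
    static List<Long> searchAfter(List<Long> sortedKeys, Long afterKey, int size) {
        int start = 0;
        if (afterKey != null) {
            int idx = Collections.binarySearch(sortedKeys, afterKey);
            // Found: resume right after it. Not found: resume at the insertion point.
            start = idx >= 0 ? idx + 1 : -idx - 1;
        }
        int end = Math.min(sortedKeys.size(), start + size);
        return new ArrayList<>(sortedKeys.subList(start, end));
    }

    public static void main(String[] args) {
        System.out.println(fromSizeCost(10000, 10, 10)); // prints 100100

        List<Long> keys = new ArrayList<>();
        for (long k = 1; k <= 25; k++) {
            keys.add(k);
        }

        // Walk all pages: each request carries only the last key of the previous page.
        Long cursor = null;
        List<Long> page;
        while (!(page = searchAfter(keys, cursor, 10)).isEmpty()) {
            System.out.println(page);
            cursor = page.get(page.size() - 1);
        }
    }
}
```

In a real search_after request the sort must end in a unique tiebreaker field; otherwise hits that share a sort value can be skipped or repeated across pages.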

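1. The inverted index turns full-text search from an O(N) scan into a term lookup whose cost depends mainly on the number of matching documents; in our migration it was on the order of 100x faster than MySQL LIKE.
2. Chinese text needs a Chinese-aware analyzer such as IK. The default analyzer is unsuitable for Chinese, and tokenization is at the heart of search quality.
3. BM25 scores can be adjusted with boost, function_score, and similar tools so that ranking lines up with business relevance.
4. Aggregations are Elasticsearch's killer feature for statistics and analytics: counts by category, price, or brand come back in milliseconds.
5. Design the mapping up front. Field types cannot be changed afterwards, so settle them once and avoid the pain of reindexing later.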