Building a Smart Search Engine: Replacing Traditional Keyword Search with Semantic Search
1. A True Story: The E-commerce Platform That "Couldn't Find a Red Dress"
Just before the 2025 Double 11 shopping festival, the tech team at a leading e-commerce company held an emergency post-mortem.
In the meeting room, product manager Lin put the metrics dashboard on the big screen: 34% of users who searched for "红色裙子" (red skirt) bounced straight from the results page. Worse, the warehouse held 8,000 items titled "玫瑰色连衣裙" (rose dress), "酒红旗袍" (wine-red qipao), and "红酒色吊带裙" (burgundy slip dress), yet these products received almost zero organic search exposure.
Chen Lei, the Java engineer who owned search, pulled up the Elasticsearch logs and saw the problem immediately:

```
Query: "红色裙子"
Match: {"match": {"title": "红色裙子"}}
Result count: 127 items
Miss:  "玫瑰色连衣裙" ✗   "酒红旗袍" ✗   "樱桃红吊带裙" ✗
```

Traditional keyword search demands exact lexical matches. If the user says "red", the system looks for "red"; if the user says "skirt", it looks for "skirt". "玫瑰色" (rose-colored) and "红色" (red) share no characters at all, even though any human knows they belong to the same color family.
Chen Lei spent three weeks rebuilding the search system with Spring AI + PGVector. In the first full month after launch:
- Search conversion rate: 6.2% → 8.7%, a 40% lift
- Zero-result rate: 18% → 4%, down 78%
- Search satisfaction (NPS): 31 → 58
This article is the technical post-mortem of that migration.
2. The Fundamental Limit of Keyword Search: The Vocabulary Gap
2.1 What Is the Vocabulary Gap
The "vocabulary gap" is a classic problem in information retrieval: the words users choose to describe what they want do not match the words actually used in the documents.

```
User query                    Document vocabulary
────────────────────────────────────────────────
红色裙子 (red skirt)       ≠  玫瑰色连衣裙 (rose dress)
手机 (phone)              ≠  智能手机、移动电话
笔记本电脑 (laptop)        ≠  laptop、便携电脑
跑步鞋 (running shoes)     ≠  运动鞋、慢跑鞋
```

2.2 Traditional Remedies and Their Limits
Remedy 1: synonym dictionaries

```
红色 → 朱红、玫瑰色、酒红、樱桃红、砖红...
```

Problem: dictionaries are extremely expensive to maintain, require manual curation by operations staff, and never cover new terms automatically.
Remedy 2: TF-IDF + BM25
BM25, Elasticsearch's default relevance algorithm, is at heart term-frequency statistics:

```
BM25 Score(q, d) = Σ IDF(qi) × [f(qi,d) × (k1+1)] / [f(qi,d) + k1 × (1 - b + b × |d|/avgdl)]
```

However refined the formula, it cannot fix the root problem: the words are simply different.
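To make the formula concrete, here is a minimal single-term scoring sketch. The corpus statistics (document count, document frequency, lengths) are hypothetical, not taken from any real index; the point is that a term absent from a document contributes exactly zero, which is the vocabulary gap in miniature:

```java
// Single-term BM25 scoring, matching the formula above.
// All corpus statistics here are hypothetical.
public class Bm25Sketch {

    static final double K1 = 1.2;
    static final double B = 0.75;

    // Lucene-style IDF: ln(1 + (N - df + 0.5) / (df + 0.5))
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // f = term frequency in the doc, dl = doc length, avgdl = average doc length
    static double termScore(double f, double dl, double avgdl, double idf) {
        return idf * (f * (K1 + 1)) / (f + K1 * (1 - B + B * dl / avgdl));
    }

    public static void main(String[] args) {
        double rareIdf = idf(1_000_000, 1_000); // a rare term gets a large IDF
        // "红色" appears 3 times in the document -> positive score
        System.out.printf("present: %.3f%n", termScore(3, 120, 100, rareIdf));
        // "玫瑰色" never appears -> score is exactly 0, however good the formula
        System.out.printf("absent:  %.3f%n", termScore(0, 120, 100, rareIdf));
    }
}
```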
Remedy 3: pinyin search and fuzzy matching
These only fix typos; they cannot capture semantic similarity.
2.3 Quantifying the Vocabulary Gap
From the analysis by Chen Lei's team:
| Failure type | Share of failed searches |
|---|---|
| Synonym mismatch (红色 vs 玫瑰色) | 34% |
| Hypernym/hyponym mismatch (裙子 vs 连衣裙) | 28% |
| Attribute phrasing differences (显瘦 vs 修身) | 21% |
| Plain typos | 17% |
Over 83% of failed searches stem from semantic mismatches, not spelling errors. This is exactly the problem semantic search targets.
3. How Semantic Search Works: Similarity in Embedding Space
3.1 From Bag-of-Words to Vector Space
Traditional search treats text as a set of words (the bag-of-words model):

```
"红色裙子"     → {红色:1, 裙子:1}
"玫瑰色连衣裙" → {玫瑰色:1, 连衣裙:1}
```

The two sets have no overlap, so their similarity is 0.
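The zero-overlap claim is easy to verify in code. This sketch assumes a hypothetical word segmentation of the two phrases and uses Jaccard similarity as the set-overlap measure:

```java
import java.util.Set;

// Bag-of-words overlap: "红色裙子" vs "玫瑰色连衣裙" share no tokens,
// so any set-overlap similarity (Jaccard here) is 0.
// The token sets below assume a hypothetical word segmentation.
public class BagOfWordsGap {

    static double jaccard(Set<String> a, Set<String> b) {
        long inter = a.stream().filter(b::contains).count();
        long union = a.size() + b.size() - inter;
        return union == 0 ? 0.0 : (double) inter / union;
    }

    public static void main(String[] args) {
        Set<String> query = Set.of("红色", "裙子");      // "red skirt"
        Set<String> doc   = Set.of("玫瑰色", "连衣裙");   // "rose-colored dress"
        System.out.println(jaccard(query, doc)); // no shared tokens -> 0.0
    }
}
```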
Semantic search instead maps text to high-dimensional vectors (embeddings):

```
"红色裙子"     → [0.23, -0.15, 0.87, 0.42, ..., 0.31]  // 1536 dims
"玫瑰色连衣裙" → [0.25, -0.13, 0.84, 0.39, ..., 0.29]  // 1536 dims
```

The two vectors lie close together in the vector space, with a cosine similarity of ≈ 0.94.
3.2 How Embedding Models Work
An embedding model is trained so that texts that occur in similar contexts map to nearby points in vector space. Semantic relatedness becomes geometric proximity, independent of surface wording, which is exactly what bridges the vocabulary gap.
3.3 Comparing Similarity Metrics
Cosine similarity (most common):

```
cos(A, B) = (A · B) / (|A| × |B|)
```

Euclidean distance:

```
d(A, B) = √Σ(ai - bi)²
```

Dot product:

```
A · B = Σ(ai × bi)
```

For semantic search, cosine similarity is the usual choice: it measures only direction (semantics) and is unaffected by vector magnitude (term frequency).
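A minimal implementation of the cosine formula above. Note how scaling a vector leaves the similarity unchanged, which is the "direction only" property the text describes:

```java
// Cosine similarity as defined above: direction only, invariant to vector length.
public class CosineSim {

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] a = {0.23, -0.15, 0.87};
        double[] b = {0.46, -0.30, 1.74}; // a scaled by 2: same direction
        System.out.printf("%.4f%n", cosine(a, b)); // 1.0000
    }
}
```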
4. Implementing Semantic Search with Spring AI + PGVector
4.1 Dependencies

```xml
<!-- pom.xml -->
<dependencies>
    <!-- Spring AI core -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
        <version>1.0.0</version>
    </dependency>
    <!-- PGVector vector store -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
        <version>1.0.0</version>
    </dependency>
    <!-- Spring Data JPA -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <!-- PostgreSQL driver -->
    <dependency>
        <groupId>org.postgresql</groupId>
        <artifactId>postgresql</artifactId>
        <scope>runtime</scope>
    </dependency>
</dependencies>
```

4.2 Database Initialization
```sql
-- Enable the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Product table (with vector column)
CREATE TABLE products (
    id BIGSERIAL PRIMARY KEY,
    title VARCHAR(500) NOT NULL,
    description TEXT,
    category VARCHAR(100),
    price DECIMAL(10, 2),
    embedding VECTOR(1536),  -- dimension of OpenAI text-embedding-3-small
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- HNSW index (recommended for production)
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);

-- Or an IVFFlat index (for fewer than ~1M rows)
-- CREATE INDEX ON products USING ivfflat (embedding vector_cosine_ops)
-- WITH (lists = 100);
```

4.3 Core Configuration
```yaml
# application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      embedding:
        options:
          model: text-embedding-3-small
          dimensions: 1536
    vectorstore:
      pgvector:
        index-type: HNSW
        distance-type: COSINE_DISTANCE
        dimensions: 1536
        initialize-schema: false  # managed via Flyway in production
  datasource:
    url: jdbc:postgresql://localhost:5432/ecommerce
    username: ${DB_USERNAME}
    password: ${DB_PASSWORD}

# Custom semantic-search settings
search:
  semantic:
    top-k: 20                   # recall size
    similarity-threshold: 0.70  # similarity cutoff
    embedding-batch-size: 100   # embedding batch size
```

(Note: in the original draft the `vectorstore` block sat under a second `spring.ai` key, which is invalid YAML; it belongs under the single `spring.ai` node as shown.)
4.4 Product Embedding Service
```java
package com.ecommerce.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.embedding.EmbeddingRequest;
import org.springframework.ai.embedding.EmbeddingResponse;
import org.springframework.ai.openai.OpenAiEmbeddingOptions;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@Service
@Slf4j
public class ProductEmbeddingService {

    private final EmbeddingModel embeddingModel;
    private final ProductRepository productRepository;

    @Value("${search.semantic.embedding-batch-size:100}")
    private int batchSize;

    public ProductEmbeddingService(EmbeddingModel embeddingModel,
                                   ProductRepository productRepository) {
        this.embeddingModel = embeddingModel;
        this.productRepository = productRepository;
    }

    /**
     * Build the text to embed for a product.
     * Concatenating title, category, and key description enriches the semantics.
     */
    public String buildEmbeddingText(Product product) {
        StringBuilder sb = new StringBuilder();
        sb.append(product.getTitle());
        if (product.getCategory() != null) {
            sb.append(" ").append(product.getCategory());
        }
        if (product.getKeywords() != null) {
            sb.append(" ").append(String.join(" ", product.getKeywords()));
        }
        // Keep only the first 200 chars of the description to stay under the token limit
        if (product.getDescription() != null) {
            String desc = product.getDescription();
            sb.append(" ").append(desc.substring(0, Math.min(200, desc.length())));
        }
        return sb.toString();
    }

    /**
     * Embed a single product (called on create/update)
     */
    @Transactional
    public void embedProduct(Long productId) {
        Product product = productRepository.findById(productId)
                .orElseThrow(() -> new ProductNotFoundException(productId));
        String text = buildEmbeddingText(product);
        EmbeddingRequest request = new EmbeddingRequest(
                List.of(text),
                OpenAiEmbeddingOptions.builder()
                        .withModel("text-embedding-3-small")
                        .build()
        );
        EmbeddingResponse response = embeddingModel.call(request);
        float[] embedding = response.getResults().get(0).getOutput();
        product.setEmbedding(embedding);
        productRepository.save(product);
        log.info("Product [{}] embedded, vector dimension: {}", productId, embedding.length);
    }

    /**
     * Bulk product embedding (initial load / full refresh).
     * Production-grade: batching + progress logging + skip-on-error.
     */
    @Async("embeddingTaskExecutor")
    @Transactional
    public CompletableFuture<BatchEmbeddingResult> batchEmbedProducts(List<Long> productIds) {
        int total = productIds.size();
        int successCount = 0;
        int failCount = 0;
        List<List<Long>> batches = partitionList(productIds, batchSize);
        for (int batchIndex = 0; batchIndex < batches.size(); batchIndex++) {
            List<Long> batch = batches.get(batchIndex);
            try {
                List<Product> products = productRepository.findAllById(batch);
                List<String> texts = products.stream()
                        .map(this::buildEmbeddingText)
                        .collect(Collectors.toList());
                // One batched embedding call per batch (fewer network round-trips)
                EmbeddingRequest request = new EmbeddingRequest(
                        texts,
                        OpenAiEmbeddingOptions.builder()
                                .withModel("text-embedding-3-small")
                                .build()
                );
                EmbeddingResponse response = embeddingModel.call(request);
                // Write the vectors back onto the products
                for (int i = 0; i < products.size(); i++) {
                    float[] embedding = response.getResults().get(i).getOutput();
                    products.get(i).setEmbedding(embedding);
                }
                productRepository.saveAll(products);
                successCount += products.size();
                log.info("Embedding progress: {}/{} ({} batches done)",
                        successCount, total, batchIndex + 1);
                // Throttle to stay under the API rate limit (60 RPM)
                if (batchIndex < batches.size() - 1) {
                    Thread.sleep(1000);
                }
            } catch (Exception e) {
                log.error("Batch {} failed to embed: {}", batchIndex + 1, e.getMessage());
                failCount += batch.size();
            }
        }
        return CompletableFuture.completedFuture(
                new BatchEmbeddingResult(total, successCount, failCount)
        );
    }

    private <T> List<List<T>> partitionList(List<T> list, int size) {
        return IntStream.range(0, (list.size() + size - 1) / size)
                .mapToObj(i -> list.subList(i * size, Math.min((i + 1) * size, list.size())))
                .collect(Collectors.toList());
    }
}
```

4.5 Core Semantic Search Implementation
```java
package com.ecommerce.search.service;

import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.embedding.EmbeddingRequest;
import org.springframework.ai.openai.OpenAiEmbeddingOptions;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

@Service
@Slf4j
public class SemanticSearchService {

    private final EmbeddingModel embeddingModel;
    private final JdbcTemplate jdbcTemplate;

    @Value("${search.semantic.top-k:20}")
    private int topK;

    @Value("${search.semantic.similarity-threshold:0.70}")
    private double similarityThreshold;

    public SemanticSearchService(EmbeddingModel embeddingModel,
                                 JdbcTemplate jdbcTemplate) {
        this.embeddingModel = embeddingModel;
        this.jdbcTemplate = jdbcTemplate;
    }

    /**
     * Main entry point for semantic search
     */
    public SemanticSearchResult search(String query, SearchFilter filter) {
        long startTime = System.currentTimeMillis();
        // 1. Convert the query into a vector
        float[] queryEmbedding = embedQuery(query);
        // 2. Nearest-neighbor vector search
        List<ProductScore> results = vectorSearch(queryEmbedding, filter);
        // 3. Drop low-similarity results
        results = results.stream()
                .filter(ps -> ps.getSimilarity() >= similarityThreshold)
                .collect(Collectors.toList());
        long duration = System.currentTimeMillis() - startTime;
        log.info("Semantic search [{}] took {}ms, {} results", query, duration, results.size());
        return SemanticSearchResult.builder()
                .query(query)
                .results(results)
                .totalCount(results.size())
                .searchTimeMs(duration)
                .build();
    }

    /**
     * Embed the query (with caching)
     */
    @Cacheable(value = "queryEmbedding", key = "#query",
               condition = "#query.length() <= 100")
    public float[] embedQuery(String query) {
        EmbeddingRequest request = new EmbeddingRequest(
                List.of(query),
                OpenAiEmbeddingOptions.builder()
                        .withModel("text-embedding-3-small")
                        .build()
        );
        return embeddingModel.call(request).getResults().get(0).getOutput();
    }

    /**
     * Vector search against PGVector.
     * Native SQL for the best performance.
     */
    private List<ProductScore> vectorSearch(float[] queryEmbedding,
                                            SearchFilter filter) {
        // Render the vector as a PGVector literal
        String vectorStr = buildVectorString(queryEmbedding);
        StringBuilder sql = new StringBuilder("""
                SELECT
                    p.id,
                    p.title,
                    p.price,
                    p.category,
                    p.image_url,
                    1 - (p.embedding <=> ?::vector) AS similarity
                FROM products p
                WHERE p.embedding IS NOT NULL
                """);
        // Dynamic filter conditions
        List<Object> params = new ArrayList<>();
        params.add(vectorStr);
        if (filter.getCategory() != null) {
            sql.append(" AND p.category = ?");
            params.add(filter.getCategory());
        }
        if (filter.getMinPrice() != null) {
            sql.append(" AND p.price >= ?");
            params.add(filter.getMinPrice());
        }
        if (filter.getMaxPrice() != null) {
            sql.append(" AND p.price <= ?");
            params.add(filter.getMaxPrice());
        }
        sql.append("""
                ORDER BY p.embedding <=> ?::vector
                LIMIT ?
                """);
        params.add(vectorStr); // note: the ORDER BY clause needs the vector parameter again
        params.add(topK);
        return jdbcTemplate.query(
                sql.toString(),
                params.toArray(),
                (rs, rowNum) -> ProductScore.builder()
                        .productId(rs.getLong("id"))
                        .title(rs.getString("title"))
                        .price(rs.getBigDecimal("price"))
                        .category(rs.getString("category"))
                        .imageUrl(rs.getString("image_url"))
                        .similarity(rs.getDouble("similarity"))
                        .build()
        );
    }

    /**
     * Convert float[] to the PGVector literal format
     */
    private String buildVectorString(float[] embedding) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < embedding.length; i++) {
            if (i > 0) sb.append(",");
            sb.append(embedding[i]);
        }
        sb.append("]");
        return sb.toString();
    }
}
```

4.6 REST API Layer
```java
package com.ecommerce.search.controller;

import jakarta.validation.constraints.NotBlank;
import org.springframework.http.ResponseEntity;
import org.springframework.validation.annotation.Validated;
import org.springframework.web.bind.annotation.*;

import java.math.BigDecimal;

@RestController
@RequestMapping("/api/v2/search")
@Validated
public class SearchController {

    private final HybridSearchService hybridSearchService;

    public SearchController(HybridSearchService hybridSearchService) {
        this.hybridSearchService = hybridSearchService;
    }

    @GetMapping("/products")
    public ResponseEntity<SearchResponse> search(
            @RequestParam @NotBlank String query,
            @RequestParam(defaultValue = "0") int page,
            @RequestParam(defaultValue = "20") int size,
            @RequestParam(required = false) String category,
            @RequestParam(required = false) BigDecimal minPrice,
            @RequestParam(required = false) BigDecimal maxPrice) {
        SearchFilter filter = SearchFilter.builder()
                .category(category)
                .minPrice(minPrice)
                .maxPrice(maxPrice)
                .build();
        SearchResponse response = hybridSearchService.search(query, filter, page, size);
        return ResponseEntity.ok(response);
    }
}
```

5. Vector Search in Elasticsearch: Configuring kNN Queries
5.1 Why Still Elasticsearch?
PGVector works well up to around a million rows. Once the catalog reaches tens of millions of products, you need Elasticsearch's distributed scale-out.
5.2 ES 8.x Index Mapping

```json
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "ik_smart_analyzer": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": { "type": "long" },
      "title": {
        "type": "text",
        "analyzer": "ik_smart_analyzer",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "ik_smart_analyzer"
      },
      "category": { "type": "keyword" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "embedding": {
        "type": "dense_vector",
        "dims": 1536,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,
          "ef_construction": 100
        }
      }
    }
  }
}
```

5.3 kNN Search in Java
```java
package com.ecommerce.search.es;

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.elasticsearch._types.query_dsl.*;
import co.elastic.clients.elasticsearch.core.BulkRequest;
import co.elastic.clients.elasticsearch.core.BulkResponse;
import co.elastic.clients.elasticsearch.core.SearchResponse;
import co.elastic.clients.json.JsonData;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

@Service
@Slf4j
public class ElasticsearchSemanticService {

    private final ElasticsearchClient esClient;
    private final ProductEmbeddingService embeddingService;

    public ElasticsearchSemanticService(ElasticsearchClient esClient,
                                        ProductEmbeddingService embeddingService) {
        this.esClient = esClient;
        this.embeddingService = embeddingService;
    }

    /**
     * Pure semantic search in ES (kNN query)
     */
    public List<ProductScore> knnSearch(String query,
                                        SearchFilter filter,
                                        int topK) throws IOException {
        float[] queryVector = embeddingService.embedQuery(query);
        // Build the kNN query
        KnnQuery knnQuery = KnnQuery.of(k -> k
                .field("embedding")
                .queryVector(toFloatList(queryVector))
                .numCandidates(topK * 5) // candidate pool size, typically 5-10x topK
                .k(topK)
                // Filter applied inside the kNN search
                .filter(buildFilterQuery(filter))
        );
        SearchResponse<ProductDocument> response = esClient.search(s -> s
                        .index("products")
                        .knn(knnQuery)
                        .source(src -> src
                                .filter(f -> f
                                        .includes("id", "title", "price", "category", "image_url")
                                )
                        ),
                ProductDocument.class
        );
        return response.hits().hits().stream()
                .map(hit -> ProductScore.builder()
                        .productId(hit.source().getId())
                        .title(hit.source().getTitle())
                        .price(hit.source().getPrice())
                        .similarity(hit.score() != null ? hit.score() : 0.0)
                        .build()
                )
                .collect(Collectors.toList());
    }

    /**
     * Index a product document (with its vector)
     */
    public void indexProduct(Product product) throws IOException {
        float[] embedding = embeddingService.embedQuery(
                embeddingService.buildEmbeddingText(product)
        );
        ProductDocument doc = ProductDocument.builder()
                .id(product.getId())
                .title(product.getTitle())
                .description(product.getDescription())
                .category(product.getCategory())
                .price(product.getPrice())
                .embedding(toFloatList(embedding))
                .build();
        esClient.index(i -> i
                .index("products")
                .id(String.valueOf(product.getId()))
                .document(doc)
        );
    }

    /**
     * Bulk indexing (initial load)
     */
    public void bulkIndex(List<Product> products) throws IOException {
        BulkRequest.Builder br = new BulkRequest.Builder();
        for (Product product : products) {
            float[] embedding = embeddingService.embedQuery(
                    embeddingService.buildEmbeddingText(product)
            );
            ProductDocument doc = toDocument(product, embedding);
            br.operations(op -> op
                    .index(idx -> idx
                            .index("products")
                            .id(String.valueOf(product.getId()))
                            .document(doc)
                    )
            );
        }
        BulkResponse result = esClient.bulk(br.build());
        if (result.errors()) {
            log.error("Bulk indexing had errors, please inspect");
            result.items().stream()
                    .filter(item -> item.error() != null)
                    .forEach(item -> log.error("Document {} failed to index: {}",
                            item.id(), item.error().reason()));
        }
        log.info("Bulk indexing done: {} docs in {}ms",
                products.size(), result.took());
    }

    private List<Float> toFloatList(float[] array) {
        List<Float> list = new ArrayList<>(array.length);
        for (float f : array) list.add(f);
        return list;
    }

    private Query buildFilterQuery(SearchFilter filter) {
        BoolQuery.Builder bool = new BoolQuery.Builder();
        if (filter.getCategory() != null) {
            bool.filter(f -> f
                    .term(t -> t.field("category").value(filter.getCategory()))
            );
        }
        if (filter.getMinPrice() != null || filter.getMaxPrice() != null) {
            RangeQuery.Builder range = new RangeQuery.Builder().field("price");
            if (filter.getMinPrice() != null) {
                range.gte(JsonData.of(filter.getMinPrice()));
            }
            if (filter.getMaxPrice() != null) {
                range.lte(JsonData.of(filter.getMaxPrice()));
            }
            bool.filter(f -> f.range(range.build()));
        }
        return bool.build()._toQuery();
    }
}
```

5.4 ES Configuration Class
```java
import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ElasticsearchConfig {

    @Value("${elasticsearch.host:localhost}")
    private String host;

    @Value("${elasticsearch.port:9200}")
    private int port;

    @Value("${elasticsearch.username:elastic}")
    private String username;

    @Value("${elasticsearch.password}")
    private String password;

    @Bean
    public ElasticsearchClient elasticsearchClient() {
        RestClientBuilder builder = RestClient.builder(
                new HttpHost(host, port, "https")
        );
        // Authentication
        final CredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(
                AuthScope.ANY,
                new UsernamePasswordCredentials(username, password)
        );
        builder.setHttpClientConfigCallback(httpClientBuilder ->
                httpClientBuilder
                        .setDefaultCredentialsProvider(credentialsProvider)
                        // Connection pool
                        .setMaxConnTotal(200)
                        .setMaxConnPerRoute(50)
                        // Timeouts
                        .setDefaultRequestConfig(RequestConfig.custom()
                                .setConnectTimeout(5000)
                                .setSocketTimeout(30000)
                                .build()
                        )
        );
        ElasticsearchTransport transport = new RestClientTransport(
                builder.build(),
                new JacksonJsonpMapper()
        );
        return new ElasticsearchClient(transport);
    }
}
```

6. Hybrid Search: Fusing BM25 Keywords with Vector Semantics
6.1 Why Hybrid Search
Problems with pure semantic search:
- Weak at exact matches (SKUs, model numbers)
- Poor semantic discrimination for high-frequency terms ("苹果手机" the phone vs "苹果" the fruit)
Hybrid strategy: use semantic search to boost recall, and keyword search to protect precision.
6.2 The RRF Algorithm (Reciprocal Rank Fusion)
RRF is the fusion algorithm most widely used in industry today, and is built into ES 8.8+:

```
RRF_Score(d) = Σ 1 / (k + rank_i(d))
```

where k = 60 (a constant) and rank_i(d) is document d's rank in the i-th result list.
6.3 Java实现RRF融合
package com.ecommerce.search.hybrid;
import org.springframework.stereotype.Service;
import java.util.*;
import java.util.stream.Collectors;
@Service
@Slf4j
public class HybridSearchService {
private static final int RRF_K = 60;
private final ElasticsearchSemanticService esSemanticService;
private final ElasticsearchKeywordService esKeywordService;
private final SearchResultReranker reranker;
/**
* 混合搜索主流程
*/
public SearchResponse search(String query,
SearchFilter filter,
int page,
int size) {
int recallSize = Math.max(100, size * 5); // 召回数量
// 并行执行两路搜索
CompletableFuture<List<ProductScore>> semanticFuture =
CompletableFuture.supplyAsync(() -> {
try {
return esSemanticService.knnSearch(query, filter, recallSize);
} catch (Exception e) {
log.error("语义搜索失败", e);
return Collections.emptyList();
}
});
CompletableFuture<List<ProductScore>> keywordFuture =
CompletableFuture.supplyAsync(() -> {
try {
return esKeywordService.bm25Search(query, filter, recallSize);
} catch (Exception e) {
log.error("关键词搜索失败", e);
return Collections.emptyList();
}
});
// 等待两路结果
CompletableFuture.allOf(semanticFuture, keywordFuture).join();
List<ProductScore> semanticResults = semanticFuture.join();
List<ProductScore> keywordResults = keywordFuture.join();
log.info("语义召回: {}条,关键词召回: {}条",
semanticResults.size(), keywordResults.size());
// RRF融合
List<ProductScore> merged = rrfMerge(semanticResults, keywordResults);
// 分页
int startIdx = page * size;
int endIdx = Math.min(startIdx + size, merged.size());
List<ProductScore> pageResults = startIdx < merged.size()
? merged.subList(startIdx, endIdx)
: Collections.emptyList();
return SearchResponse.builder()
.query(query)
.results(pageResults)
.totalCount(merged.size())
.page(page)
.size(size)
.build();
}
/**
* RRF融合算法实现
*
* @param list1 语义搜索结果(按相似度降序)
* @param list2 关键词搜索结果(按BM25分降序)
* @return 融合后的结果(按RRF分降序)
*/
public List<ProductScore> rrfMerge(List<ProductScore> list1,
List<ProductScore> list2) {
Map<Long, Double> rrfScores = new HashMap<>();
Map<Long, ProductScore> productMap = new HashMap<>();
// 处理第一个列表(语义搜索)
for (int rank = 0; rank < list1.size(); rank++) {
ProductScore ps = list1.get(rank);
double rrfScore = 1.0 / (RRF_K + rank + 1);
rrfScores.merge(ps.getProductId(), rrfScore, Double::sum);
productMap.put(ps.getProductId(), ps);
}
// 处理第二个列表(关键词搜索)
for (int rank = 0; rank < list2.size(); rank++) {
ProductScore ps = list2.get(rank);
double rrfScore = 1.0 / (RRF_K + rank + 1);
rrfScores.merge(ps.getProductId(), rrfScore, Double::sum);
productMap.putIfAbsent(ps.getProductId(), ps);
}
// 按RRF分降序排列
return rrfScores.entrySet().stream()
.sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
.map(entry -> {
ProductScore ps = productMap.get(entry.getKey());
return ps.toBuilder()
.rrfScore(entry.getValue())
.build();
})
.collect(Collectors.toList());
}
/**
* 带权重的RRF(可调节语义/关键词权重)
*
* @param semanticWeight 语义搜索权重 (0-1)
*/
public List<ProductScore> weightedRrfMerge(List<ProductScore> semanticResults,
List<ProductScore> keywordResults,
double semanticWeight) {
double keywordWeight = 1.0 - semanticWeight;
Map<Long, Double> rrfScores = new HashMap<>();
Map<Long, ProductScore> productMap = new HashMap<>();
// 语义搜索列表(带权重)
for (int rank = 0; rank < semanticResults.size(); rank++) {
ProductScore ps = semanticResults.get(rank);
double rrfScore = semanticWeight / (RRF_K + rank + 1);
rrfScores.merge(ps.getProductId(), rrfScore, Double::sum);
productMap.put(ps.getProductId(), ps);
}
// 关键词搜索列表(带权重)
for (int rank = 0; rank < keywordResults.size(); rank++) {
ProductScore ps = keywordResults.get(rank);
double rrfScore = keywordWeight / (RRF_K + rank + 1);
rrfScores.merge(ps.getProductId(), rrfScore, Double::sum);
productMap.putIfAbsent(ps.getProductId(), ps);
}
return rrfScores.entrySet().stream()
.sorted(Map.Entry.<Long, Double>comparingByValue().reversed())
.map(entry -> {
ProductScore ps = productMap.get(entry.getKey());
return ps.toBuilder().rrfScore(entry.getValue()).build();
})
.collect(Collectors.toList());
}
}6.4 ES原生混合搜索(推荐)
ES 8.8+ ships a native hybrid search API, so no manual fusion is needed in the Java layer:

```java
/**
 * Native ES hybrid search (8.8+)
 */
public List<ProductScore> nativeHybridSearch(String query,
                                             float[] queryVector,
                                             SearchFilter filter,
                                             int topK) throws IOException {
    SearchResponse<ProductDocument> response = esClient.search(s -> s
                    .index("products")
                    // BM25 keyword search
                    .query(q -> q
                            .bool(b -> b
                                    .must(m -> m
                                            .multiMatch(mm -> mm
                                                    .query(query)
                                                    .fields("title^3", "description^1")
                                                    .analyzer("ik_smart_analyzer")
                                            )
                                    )
                                    .filter(buildFilterQuery(filter)._toQuery())
                            )
                    )
                    // kNN semantic search
                    .knn(k -> k
                            .field("embedding")
                            .queryVector(toFloatList(queryVector))
                            .numCandidates(topK * 5)
                            .k(topK)
                            .filter(buildFilterQuery(filter))
                    )
                    // Native RRF fusion
                    .rank(r -> r
                            .rrf(rrf -> rrf
                                    .rankConstant((long) RRF_K)
                                    .windowSize((long) topK * 2)
                            )
                    )
                    .size(topK),
            ProductDocument.class
    );
    return response.hits().hits().stream()
            .map(this::toProductScore)
            .collect(Collectors.toList());
}
```

7. Reranking Search Results: Cross-Encoder Models
7.1 Two-Stage Retrieval Architecture
Stage one recalls a broad candidate set cheaply (BM25 and/or ANN vector search, hundreds of items); stage two reranks only those candidates with a slower, more accurate model: a cross-encoder that scores each query-document pair jointly rather than comparing precomputed vectors.
7.2 Cross-Encoder Reranking in Java
```java
package com.ecommerce.search.rerank;

import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

@Service
@Slf4j
@RequiredArgsConstructor
public class CrossEncoderReranker {

    private final ChatClient chatClient;
    private final ObjectMapper objectMapper;

    /**
     * LLM-based reranking (fine at small to medium scale).
     * At large scale, use a dedicated cross-encoder model instead.
     */
    public List<ProductScore> rerank(String query,
                                     List<ProductScore> candidates) {
        if (candidates.size() <= 5) {
            return candidates; // too few candidates to bother reranking
        }
        // Build the rerank prompt and call the model (Spring AI 1.0 fluent API)
        String rerankPrompt = buildRerankPrompt(query, candidates);
        String response = chatClient.prompt().user(rerankPrompt).call().content();
        try {
            RerankResult result = objectMapper.readValue(response, RerankResult.class);
            return applyRerankResult(candidates, result);
        } catch (Exception e) {
            log.error("Failed to parse rerank output, keeping original order", e);
            return candidates;
        }
    }

    private String buildRerankPrompt(String query, List<ProductScore> candidates) {
        StringBuilder sb = new StringBuilder();
        // Prompt stays in Chinese for this Chinese-language catalog:
        sb.append("你是一个电商搜索相关性评估专家。\n");         // "You are an e-commerce relevance expert."
        sb.append("用户搜索词: ").append(query).append("\n\n"); // "User query:"
        sb.append("候选商品列表:\n");                           // "Candidate products:"
        for (int i = 0; i < candidates.size(); i++) {
            ProductScore ps = candidates.get(i);
            sb.append(String.format("[%d] ID:%d | %s | ¥%.2f\n",
                    i + 1, ps.getProductId(), ps.getTitle(), ps.getPrice()));
        }
        sb.append("\n请根据与搜索词的相关性,对上述商品重新排序。"); // "Re-rank by relevance."
        sb.append("返回JSON格式: {\"ranking\": [1,3,2,...], \"reason\": \"排序原因\"}");
        return sb.toString();
    }

    /**
     * Async reranking with a timeout fallback (for high-concurrency paths)
     */
    public CompletableFuture<List<ProductScore>> asyncRerank(
            String query, List<ProductScore> candidates) {
        return CompletableFuture.supplyAsync(() -> rerank(query, candidates))
                .orTimeout(3, TimeUnit.SECONDS)
                .exceptionally(ex -> {
                    log.warn("Rerank timed out, falling back to original order");
                    return candidates;
                });
    }
}
```

8. Search Experience Optimization: Spell Correction / Synonyms / Suggestions
8.1 拼写纠错
@Service
public class SpellCorrectionService {
private final ElasticsearchClient esClient;
/**
* 基于ES Suggest的拼写纠错
*/
public String correctSpelling(String query) throws IOException {
SearchResponse<Void> response = esClient.search(s -> s
.index("products")
.suggest(sg -> sg
.suggesters("spell-check", sug -> sug
.text(query)
.term(t -> t
.field("title")
.suggestMode(SuggestMode.Missing)
.sort(SuggestSort.Score)
.maxEdits(2)
.minWordLength(4)
)
)
)
.size(0),
Void.class
);
List<TermSuggestOption> options = response.suggest()
.get("spell-check")
.get(0)
.term()
.options();
if (options.isEmpty()) {
return query; // 无纠错建议
}
// 返回最高分建议
return options.get(0).text();
}
}8.2 搜索建议(Search Suggestion)
```java
@Service
@Slf4j
@RequiredArgsConstructor
public class SearchSuggestionService {

    private final ElasticsearchClient esClient;
    private final RedisTemplate<String, Object> redisTemplate;

    /**
     * Real-time suggestions (prefix match blended with hot searches)
     */
    public List<String> getSuggestions(String prefix, int maxSuggestions) {
        List<String> suggestions = new ArrayList<>();
        // 1. Completion suggestions from ES
        try {
            suggestions.addAll(getCompletionSuggestions(prefix, maxSuggestions));
        } catch (Exception e) {
            log.warn("ES suggestions unavailable", e);
        }
        // 2. Blend in hot search terms
        List<String> hotSearches = getHotSearchKeywords(prefix, 3);
        suggestions.addAll(0, hotSearches);
        // 3. Dedupe and truncate
        return suggestions.stream()
                .distinct()
                .limit(maxSuggestions)
                .collect(Collectors.toList());
    }

    private List<String> getHotSearchKeywords(String prefix, int limit) {
        // Hot searches from a Redis ZSet, ordered by search frequency
        Set<Object> hotWords = redisTemplate.opsForZSet()
                .reverseRange("hot:search:words", 0, 50);
        if (hotWords == null) return Collections.emptyList();
        return hotWords.stream()
                .map(Object::toString)
                .filter(w -> w.startsWith(prefix))
                .limit(limit)
                .collect(Collectors.toList());
    }

    /**
     * Record the search (feeds hot-term statistics)
     */
    public void recordSearch(String query) {
        if (query == null || query.isBlank()) return;
        String normalizedQuery = query.toLowerCase().trim();
        redisTemplate.opsForZSet()
                .incrementScore("hot:search:words", normalizedQuery, 1);
    }
}
```

8.3 Synonym Configuration (ES Side)
```json
PUT /products/_settings
{
  "analysis": {
    "filter": {
      "synonym_filter": {
        "type": "synonym",
        "synonyms": [
          "红色,朱红,玫瑰色,酒红,砖红",
          "裙子,连衣裙,裙装",
          "手机,智能手机,移动电话",
          "笔记本,笔记本电脑,laptop,便携电脑"
        ]
      }
    },
    "analyzer": {
      "ik_synonym_analyzer": {
        "type": "custom",
        "tokenizer": "ik_smart",
        "filter": ["lowercase", "synonym_filter"]
      }
    }
  }
}
```

Note that analysis settings are not dynamic: close the index first (POST /products/_close), apply the change, then reopen it.
9. Performance Tuning: ANN Index Parameters (HNSW Explained)
9.1 Core HNSW Parameters
HNSW (Hierarchical Navigable Small World) is today's dominant approximate nearest neighbor (ANN) algorithm. Its three key knobs are m (maximum connections per node per layer), ef_construction (candidate list size while building the graph), and ef_search (candidate list size at query time); larger values raise recall at the cost of speed and memory.
9.2 Tuning Matrix
| Scenario | M | ef_construction | ef_search | Recall | QPS |
|---|---|---|---|---|---|
| Fast, lower recall (log search) | 8 | 32 | 20 | ~90% | 5000 |
| Balanced (e-commerce) | 16 | 64 | 40 | ~95% | 2000 |
| High recall, slower (medical retrieval) | 32 | 128 | 100 | ~99% | 500 |
9.3 PGVector HNSW Parameters

```sql
-- Create the HNSW index (recommended parameters for the balanced scenario)
CREATE INDEX CONCURRENTLY idx_products_embedding_hnsw
ON products USING hnsw (embedding vector_cosine_ops)
WITH (
    m = 16,               -- max connections per layer
    ef_construction = 64  -- candidate list size during build
);

-- Override ef dynamically at query time
SET hnsw.ef_search = 40;
SELECT id, title, 1 - (embedding <=> '[...]'::vector) AS similarity
FROM products
ORDER BY embedding <=> '[...]'::vector
LIMIT 20;
```

9.4 Performance Test Code
```java
@SpringBootTest
class SemanticSearchPerformanceTest {

    @Autowired
    private SemanticSearchService searchService;

    @Test
    void benchmarkSearchLatency() throws Exception {
        String[] testQueries = {
                "红色裙子", "白色运动鞋", "蓝色牛仔裤",
                "韩版卫衣", "显瘦连衣裙", "大码女装"
        };
        // Warm-up
        for (int i = 0; i < 10; i++) {
            searchService.search(testQueries[i % testQueries.length],
                    SearchFilter.empty());
        }
        // Measurement
        List<Long> latencies = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            long start = System.nanoTime();
            searchService.search(testQueries[i % testQueries.length],
                    SearchFilter.empty());
            long latency = (System.nanoTime() - start) / 1_000_000;
            latencies.add(latency);
        }
        latencies.sort(Long::compareTo);
        System.out.printf("P50: %dms%n", latencies.get(49));
        System.out.printf("P95: %dms%n", latencies.get(94));
        System.out.printf("P99: %dms%n", latencies.get(98));
        System.out.printf("Max: %dms%n", latencies.get(99));
    }
}
```

9.5 Benchmark Results
Measurements from Chen Lei's team (5M products, 8-core/32GB server):
| Approach | P50 latency | P99 latency | QPS |
|---|---|---|---|
| PGVector + IVFFlat | 45ms | 180ms | 800 |
| PGVector + HNSW (M=16) | 28ms | 95ms | 1500 |
| ES kNN (ef=40) | 35ms | 120ms | 1200 |
| ES hybrid search | 52ms | 160ms | 900 |
Effect of the query-embedding cache: on a cache hit for a repeated query, P50 latency drops from 28ms to 8ms.
10. A/B Testing: Proving Semantic Search Beats Keyword Search
10.1 Evaluation Framework
The core online metrics, tracked per experiment group: search CTR, search conversion rate, zero-result rate, searches per user, and latency percentiles.
10.2 A/B Test Framework Implementation
```java
@Service
@Slf4j
@RequiredArgsConstructor
public class SearchAbTestService {

    private final HybridSearchService hybridSearchService;
    private final KeywordSearchService keywordService;
    private final SearchEventTracker tracker;
    private final AbTestStatsRepository statsRepository;

    /**
     * A/B test routing.
     * Group A: traditional keyword search (control)
     * Group B: hybrid semantic search (treatment)
     */
    public SearchResponse search(String query,
                                 SearchFilter filter,
                                 String userId,
                                 int page,
                                 int size) {
        // Assign the group by hashing userId, so the same user always lands in the same group
        String group = assignGroup(userId);
        long startTime = System.currentTimeMillis();
        SearchResponse response;
        if ("B".equals(group)) {
            response = hybridSearchService.search(query, filter, page, size);
            response.setSearchType("HYBRID_SEMANTIC");
        } else {
            response = keywordService.search(query, filter, page, size);
            response.setSearchType("KEYWORD");
        }
        // Track the search event for later analysis
        tracker.trackSearchEvent(SearchEvent.builder()
                .userId(userId)
                .query(query)
                .group(group)
                .searchType(response.getSearchType())
                .resultCount(response.getTotalCount())
                .latencyMs(System.currentTimeMillis() - startTime)
                .timestamp(Instant.now())
                .build()
        );
        return response;
    }

    /**
     * User assignment (50/50 split)
     */
    private String assignGroup(String userId) {
        int hash = Math.abs(userId.hashCode() % 100);
        return hash < 50 ? "A" : "B";
    }

    /**
     * A/B test report
     */
    public AbTestReport generateReport(LocalDate startDate, LocalDate endDate) {
        // Query aggregate stats from the event-tracking store
        AbTestStats groupA = statsRepository.getStats("A", startDate, endDate);
        AbTestStats groupB = statsRepository.getStats("B", startDate, endDate);
        // Compute lifts
        double ctrLift = (groupB.getCtr() - groupA.getCtr()) / groupA.getCtr() * 100;
        double conversionLift = (groupB.getConversion() - groupA.getConversion())
                / groupA.getConversion() * 100;
        double zeroResultLift = (groupA.getZeroResultRate() - groupB.getZeroResultRate())
                / groupA.getZeroResultRate() * 100;
        // Statistical significance (chi-square test)
        boolean isSignificant = chiSquareTest(
                groupA.getClickCount(), groupA.getSearchCount(),
                groupB.getClickCount(), groupB.getSearchCount()
        );
        return AbTestReport.builder()
                .periodStart(startDate)
                .periodEnd(endDate)
                .groupA(groupA)
                .groupB(groupB)
                .ctrLiftPercent(ctrLift)
                .conversionLiftPercent(conversionLift)
                .zeroResultReductionPercent(zeroResultLift)
                .statisticallySignificant(isSignificant)
                .recommendation(isSignificant && ctrLift > 0 ? "Roll out B to 100%" : "Keep observing")
                .build();
    }

    /**
     * Chi-square significance test
     */
    private boolean chiSquareTest(long clickA, long searchA,
                                  long clickB, long searchB) {
        double pooledCtr = (double) (clickA + clickB) / (searchA + searchB);
        double chiSquare =
                Math.pow(clickA - searchA * pooledCtr, 2) / (searchA * pooledCtr) +
                Math.pow(searchA - clickA - searchA * (1 - pooledCtr), 2) / (searchA * (1 - pooledCtr)) +
                Math.pow(clickB - searchB * pooledCtr, 2) / (searchB * pooledCtr) +
                Math.pow(searchB - clickB - searchB * (1 - pooledCtr), 2) / (searchB * (1 - pooledCtr));
        // p < 0.05 at 1 degree of freedom (critical value 3.841)
        return chiSquare > 3.841;
    }
}
```

10.3 A/B Test Results from Chen Lei's Team
Test window: 2025-10-01 to 2025-10-31 (full month). Traffic split: group A (keyword) 50% vs group B (hybrid semantic) 50%.
| Metric | Group A (keyword) | Group B (hybrid semantic) | Change |
|---|---|---|---|
| Search CTR | 12.3% | 16.8% | +36.6% |
| Search conversion rate | 6.2% | 8.7% | +40.3% |
| Zero-result rate | 18.1% | 4.2% | -76.8% |
| P99 search latency | 85ms | 160ms | +88% (latency cost) |
| Searches per user | 3.2 | 2.8 | -12.5% (users find items faster) |
| Significance | p < 0.001 | -- | significant |
Conclusion: hybrid semantic search lifted conversion by 40% and cut the zero-result rate by 77%, with statistical significance, so group B was rolled out to all traffic.
FAQ
Q1: OpenAI embeddings, or a locally deployed model?
Benchmark the options against your production traffic:
- OpenAI text-embedding-3-small: top quality, ~50-100ms latency, $0.02 per 1M tokens
- Alibaba Tongyi Qianwen embeddings: excellent latency inside China, well suited to Chinese text
- Self-hosted BGE-M3: no per-call cost, but needs a GPU; 20-50ms latency (on an A10 GPU)
Chen Lei's team (Chinese e-commerce) ultimately chose Alibaba Cloud's text-embedding-v3: 30% lower latency than OpenAI, with comparable quality on Chinese text.
Q2: How often do product embeddings need refreshing?
- Title or description change: update immediately (via an MQ message listener)
- Embedding model upgrade: re-embed the full catalog (offline job)
- Routine maintenance: audit embedding quality quarterly
Q3: How long does embedding 100,000 products take?
With text-embedding-3-small (batches of 100, a 60-batches-per-minute API limit):
- 100,000 products = 1,000 batches
- Theoretical time = 1000/60 ≈ 17 minutes
- Allowing for retries and network jitter: roughly 25-30 minutes
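A quick sanity check of the arithmetic above (the 60-batches-per-minute rate limit is the assumption stated in the question):

```java
// Sanity-check the back-of-envelope estimate for full-catalog embedding.
public class EmbeddingTimeEstimate {

    // Ceiling division: number of API batches needed
    static int batches(int products, int batchSize) {
        return (products + batchSize - 1) / batchSize;
    }

    // Minutes at a fixed batches-per-minute rate limit
    static double minutes(int batches, int batchesPerMinute) {
        return (double) batches / batchesPerMinute;
    }

    public static void main(String[] args) {
        int b = batches(100_000, 100);
        System.out.printf("%d batches, ~%.1f minutes%n", b, minutes(b, 60)); // 1000 batches, ~16.7 minutes
    }
}
```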
Q4: Where does the 0.70 similarity threshold come from?
It has to be calibrated against labeled business data:
- Sample 500 random queries
- Have humans label each result relevant / not relevant
- Plot the precision-recall curve and take the threshold that maximizes F1
- For e-commerce the optimum typically lands between 0.65 and 0.75
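The threshold-selection step above can be sketched as a simple sweep over labeled (similarity, relevant) pairs, picking the candidate threshold with the best F1. The labeled data below is synthetic, purely for illustration:

```java
import java.util.List;

// Sweep candidate thresholds over human-labeled pairs and pick the best F1.
// The labeled data used here is synthetic.
public class ThresholdTuning {

    record Labeled(double similarity, boolean relevant) {}

    static double f1At(List<Labeled> data, double threshold) {
        long tp = data.stream().filter(d -> d.similarity() >= threshold && d.relevant()).count();
        long fp = data.stream().filter(d -> d.similarity() >= threshold && !d.relevant()).count();
        long fn = data.stream().filter(d -> d.similarity() < threshold && d.relevant()).count();
        if (tp == 0) return 0.0;
        double precision = (double) tp / (tp + fp);
        double recall = (double) tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    static double bestThreshold(List<Labeled> data, double[] candidates) {
        double best = candidates[0], bestF1 = -1;
        for (double t : candidates) {
            double f1 = f1At(data, t);
            if (f1 > bestF1) { bestF1 = f1; best = t; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Labeled> data = List.of(
                new Labeled(0.92, true), new Labeled(0.81, true), new Labeled(0.74, true),
                new Labeled(0.72, false), new Labeled(0.60, false), new Labeled(0.55, false));
        // On this toy data the sweep lands at 0.70
        System.out.println(bestThreshold(data, new double[]{0.50, 0.60, 0.70, 0.80}));
    }
}
```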
Q5: What about exact-product queries, where semantic search underperforms?
A query like "iPhone 15 Pro 256G 钛金黑" is an exact lookup and should be served by keyword search first.
Solution: intent detection plus routing
- Model/SKU pattern detected → favor keyword search (weight 0.9)
- Ordinary descriptive query → favor semantic search (weight 0.9)
- Mixed → balance the weights (0.5/0.5)
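A sketch of such a router. The regex heuristic and the weight values are illustrative assumptions (a real system might use a trained intent classifier); the returned semantic weight would feed a weighted fusion such as the weightedRrfMerge method from section 6.3:

```java
import java.util.regex.Pattern;

// Route queries by detected intent: SKU/model-style queries go keyword-heavy,
// descriptive queries go semantic-heavy. Pattern and weights are illustrative.
public class SearchIntentRouter {

    // Crude signal for model/SKU queries: the query mixes latin letters and digits,
    // e.g. "iPhone 15 Pro 256G".
    private static final Pattern MODEL_PATTERN =
            Pattern.compile(".*[A-Za-z].*\\d.*|.*\\d.*[A-Za-z].*");

    /** Semantic weight to feed into a weighted fusion (keyword weight = 1 - semantic). */
    static double semanticWeight(String query) {
        // Model/SKU pattern -> keyword-heavy (semantic 0.1 / keyword 0.9);
        // otherwise descriptive -> semantic-heavy. Ambiguous cases could use 0.5/0.5.
        return MODEL_PATTERN.matcher(query).matches() ? 0.1 : 0.9;
    }

    public static void main(String[] args) {
        System.out.println(semanticWeight("iPhone 15 Pro 256G")); // 0.1
        System.out.println(semanticWeight("显瘦连衣裙"));           // 0.9
    }
}
```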
Summary
The migration path to semantic search:

```
Traditional keyword search
  ↓ add embeddings + a vector database
Semantic recall
  ↓ fuse with BM25 keyword search (RRF)
Hybrid search
  ↓ cross-encoder reranking
Precise ranking
  ↓ suggestions + spell correction
Complete search experience
```

Chen Lei's team finished the migration in three weeks and gained a 40% lift in conversion. The technical core: Spring AI + PGVector for catalogs up to the millions of items, or Elasticsearch kNN for tens of millions, with RRF as the fusion strategy: simple and effective.
Most important of all: quantify the impact with A/B tests. Let the data speak, not gut feeling.
