Article 1825: Building Your Own Open-Source Tool Library: Lessons from Internal Tool to GitHub Stars

Lao Zhang · 2026/4/30 · about 9 min read

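I first open-sourced a project of my own in 2022.

It was a JSON Schema validation tool we used internally. After finishing it I tidied it up, added a README, and pushed it to GitHub. For a full month afterward it had zero Stars and zero Forks. I didn't think much of it; I assumed that was how open source worked: you put the code up and wait to be discovered.

Only later did I gradually figure out that putting code online is "publishing code", which is a different thing from running an "open-source project".

This article covers what I have worked out over the past two years of building AI-related open-source tools. It is not a success formula, and my projects don't have thousands of Stars, but they include features that have genuinely helped other people, and going from zero to a few hundred Stars taught me a lot.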

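1. Deciding what to build matters more than how to build it

I have seen many "open-source" projects that are really internal business code, lightly scrubbed and pushed online. The usual result: the code is coupled to a lot of business assumptions, the README is vague, and nobody else can actually get it working.

For an open-source tool to be usable by others, it needs to meet a few conditions:

It solves a specific problem that many people share. Not "I need this", but "many people doing similar work will run into this problem".

It is clearly different from existing tools. If a mature open-source project already solves the problem, your tool must be noticeably higher quality, take a noticeably different angle, or focus on one particular scenario. Reinventing the wheel is not forbidden, but you have to spell out how your wheel differs from everyone else's.

The author uses it. This sounds obvious, but many people build an open-source tool because "this tool ought to exist" rather than because "I need this every day". With the former, the motivation to maintain it rarely lasts.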

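2. The starting point of an AI tool library I built

Last year, while building a RAG system, I found a gap in the Java ecosystem: a good document chunking tool.

On the Python side there are LangChain and LlamaIndex, with a very rich set of chunking strategies. On the Java side you either write your own or use the relatively simple implementations in LangChain4j. For complex documents (technical documents with hierarchical structure, business documents with tables, and so on), the existing tools do not chunk well.

This is a real pain point; I was rewriting this logic in every project. So I extracted it into a standalone library: java-text-chunker.

3. Design principles for the project structure

For a good tool library, the project structure is itself a form of communication.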

java-text-chunker/
├── chunker-core/               # Core module, minimal dependencies
│   ├── src/main/java/
│   │   └── io/github/laozhangt/chunker/
│   │       ├── api/            # Public interface definitions
│   │       │   ├── Chunker.java
│   │       │   ├── Chunk.java
│   │       │   └── ChunkMetadata.java
│   │       ├── impl/           # Core implementations
│   │       │   ├── FixedSizeChunker.java
│   │       │   ├── SentenceChunker.java
│   │       │   └── RecursiveChunker.java
│   │       └── config/         # Configuration classes
│   │           └── ChunkerConfig.java
│   └── src/test/
├── chunker-spring/             # Spring integration module (optional)
├── chunker-examples/           # Usage examples (separate module)
└── docs/                       # Documentation directory

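A few key design principles:

No framework dependencies in the core module. Put the core functionality in a module with no dependency on Spring or any other specific framework. Users who only need the core should not have to pull in all of Spring. This decision lets the tool be used in a much wider range of Java projects.

Interface first. Define the public API interfaces first, then write the implementations. This has two benefits: users can program against the interface without depending on a concrete implementation, and you can offer several implementations for users to choose from.

Provide runnable examples. The example code lives in its own module, and it must run as-is, with no extra configuration.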
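The interface-first principle can be illustrated with a small, self-contained sketch. The names here (TextSplitter, FixedSizeSplitter, SentenceSplitter, InterfaceFirstDemo) are hypothetical stand-ins invented for this example, not the library's actual classes; the point is only that calling code depends on the interface and the implementation can be swapped freely.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a public chunking interface.
interface TextSplitter {
    List<String> split(String text);
}

// Implementation 1: cut the text every `size` characters.
final class FixedSizeSplitter implements TextSplitter {
    private final int size;

    FixedSizeSplitter(int size) {
        if (size <= 0) throw new IllegalArgumentException("size must be positive");
        this.size = size;
    }

    @Override
    public List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < text.length(); i += size) {
            parts.add(text.substring(i, Math.min(text.length(), i + size)));
        }
        return parts;
    }
}

// Implementation 2: cut after sentence-ending punctuation (Latin or CJK).
final class SentenceSplitter implements TextSplitter {
    @Override
    public List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        for (String s : text.split("(?<=[.!?。！？])\\s*")) {
            if (!s.isBlank()) parts.add(s.trim());
        }
        return parts;
    }
}

final class InterfaceFirstDemo {
    // Caller code knows only the interface, never a concrete class.
    static int countParts(TextSplitter splitter, String text) {
        return splitter.split(text).size();
    }

    public static void main(String[] args) {
        String text = "First sentence. Second one! Third?";
        System.out.println(countParts(new FixedSizeSplitter(10), text));
        System.out.println(countParts(new SentenceSplitter(), text));
    }
}
```

Because countParts accepts the interface, adding a third strategy later would not require touching any caller.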

4. Core interface design

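// Each public type below lives in its own file; these are the imports the three listings need.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.Value;

/**
 * Core interface for text chunkers.
 * Every chunking strategy implements this interface.
 */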
public interface Chunker {

    /**
     * Splits the text into chunks.
     *
     * @param text input text
     * @return list of chunks, in their original order
     */
    List<Chunk> chunk(String text);

    /**
     * Chunking with metadata (document source, page number, etc.).
     */
    default List<Chunk> chunk(String text, ChunkMetadata parentMetadata) {
        List<Chunk> chunks = chunk(text);
        chunks.forEach(c -> c.mergeParentMetadata(parentMetadata));
        return chunks;
    }
}

/**
 * A chunking result.
 */
@Value
@Builder
public class Chunk {
    String content;
    int startIndex;          // start position in the original text
    int endIndex;            // end position in the original text
    ChunkMetadata metadata;  // metadata (source document, section info, etc.)

    public int length() {
        return content.length();
    }

    public void mergeParentMetadata(ChunkMetadata parent) {
        // logic for merging the parent metadata goes here
    }
}

/**
 * Chunk metadata.
 */
@Data
@Builder
@NoArgsConstructor
@AllArgsConstructor
public class ChunkMetadata {
    private String documentId;
    private String sourceFile;
    private Integer pageNumber;
    private String sectionTitle;
    private Map<String, String> customAttributes;

    // Fluent setter, convenient for chaining
    public ChunkMetadata withAttribute(String key, String value) {
        if (customAttributes == null) {
            customAttributes = new HashMap<>();
        }
        customAttributes.put(key, value);
        return this;
    }
}
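The body of mergeParentMetadata is left out of the listing above. One plausible way to fill it in is sketched below, under an assumed rule that the article does not state: a value already set on the chunk wins, and anything missing is inherited from the parent. Meta is a hypothetical plain-Java stand-in for ChunkMetadata (no Lombok) so the example runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ChunkMetadata, written without Lombok.
final class Meta {
    String documentId;
    String sourceFile;
    Integer pageNumber;
    String sectionTitle;
    Map<String, String> customAttributes = new HashMap<>();

    // Assumed rule: the chunk's own values win; missing ones come from the parent.
    void mergeParent(Meta parent) {
        if (parent == null) return;
        if (documentId == null) documentId = parent.documentId;
        if (sourceFile == null) sourceFile = parent.sourceFile;
        if (pageNumber == null) pageNumber = parent.pageNumber;
        if (sectionTitle == null) sectionTitle = parent.sectionTitle;
        // putIfAbsent keeps the chunk's attribute when both sides define the key.
        parent.customAttributes.forEach(customAttributes::putIfAbsent);
    }
}

final class MergeMetadataDemo {
    public static void main(String[] args) {
        Meta parent = new Meta();
        parent.documentId = "doc-1";
        parent.sectionTitle = "Document";
        parent.customAttributes.put("lang", "en");

        Meta child = new Meta();
        child.sectionTitle = "## Install";
        child.mergeParent(parent);

        // The child keeps its own section title and inherits the rest.
        System.out.println(child.documentId + " / " + child.sectionTitle
                + " / " + child.customAttributes);
    }
}
```

The opposite rule (parent values override the chunk's) would be just as easy to write; which one is right depends on whether chunk-level details such as the section title should survive the merge.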

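5. Implementing a key feature: structure-aware chunking

This is the most technically demanding part of the library, and the core of what sets it apart from other tools: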

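import java.util.ArrayList;
import java.util.List;

/**
 * Structure-aware chunker.
 * Recognizes the hierarchical structure of Markdown/plain text (headings, paragraphs, lists)
 * and splits on semantic boundaries instead of simply cutting by character count.
 * DocumentStructure, StructureNode and NodeType are helper types not shown here.
 */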
public class StructureAwareChunker implements Chunker {

    private final int maxChunkSize;
    private final int minChunkSize;
    private final int overlapSize;

    public StructureAwareChunker(int maxChunkSize, int minChunkSize, int overlapSize) {
        this.maxChunkSize = maxChunkSize;
        this.minChunkSize = minChunkSize;
        this.overlapSize = overlapSize;
    }

    @Override
    public List<Chunk> chunk(String text) {
        // Step 1: parse the document structure
        DocumentStructure structure = parseStructure(text);

        // Step 2: split along the structure
        List<Chunk> rawChunks = splitByStructure(structure);

        // Step 3: handle chunks that are too large or too small
        List<Chunk> refinedChunks = refineChunks(rawChunks);

        // Step 4: add overlap (improves RAG recall)
        return addOverlap(refinedChunks);
    }

    private DocumentStructure parseStructure(String text) {
        DocumentStructure structure = new DocumentStructure();
        String[] lines = text.split("\n");
        int currentPos = 0;

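        boolean inCodeBlock = false; // true while between an opening and a closing fence line

        for (String line : lines) {
            int lineEnd = currentPos + line.length();

            if (isCodeBlock(line)) {
                // A fence line opens or closes a code block
                inCodeBlock = !inCodeBlock;
                structure.addNode(new StructureNode(NodeType.CODE, line, currentPos, lineEnd));
            } else if (inCodeBlock) {
                // Inside a fence nothing is a heading, e.g. a "# comment" line in a shell snippet
                structure.addNode(new StructureNode(NodeType.CODE, line, currentPos, lineEnd));
            } else if (isH1(line)) {
                structure.addNode(new StructureNode(NodeType.H1, line, currentPos, lineEnd));
            } else if (isH2(line)) {
                structure.addNode(new StructureNode(NodeType.H2, line, currentPos, lineEnd));
            } else if (isH3(line)) {
                structure.addNode(new StructureNode(NodeType.H3, line, currentPos, lineEnd));
            } else if (isParagraphBreak(line)) {
                structure.addNode(new StructureNode(NodeType.PARAGRAPH_BREAK, "", currentPos, lineEnd));
            } else {
                structure.addTextContent(line, currentPos, lineEnd);
            }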

            currentPos = lineEnd + 1; // +1 for newline
        }
        return structure;
    }

    private boolean isH1(String line) {
        return line.startsWith("# ") && !line.startsWith("## ");
    }

    private boolean isH2(String line) {
        return line.startsWith("## ") && !line.startsWith("### ");
    }

    private boolean isH3(String line) {
        return line.startsWith("### ");
    }

    private boolean isCodeBlock(String line) {
        return line.startsWith("```");
    }

    private boolean isParagraphBreak(String line) {
        return line.trim().isEmpty();
    }

    private List<Chunk> splitByStructure(DocumentStructure structure) {
        List<Chunk> chunks = new ArrayList<>();
        StringBuilder currentChunk = new StringBuilder();
        int chunkStart = 0;
        String currentSection = "";

        for (StructureNode node : structure.getNodes()) {
            switch (node.type()) {
                case H1, H2 -> {
                    // On a major heading, close off the current chunk first
                    if (currentChunk.length() >= minChunkSize) {
                        chunks.add(buildChunk(currentChunk.toString(), chunkStart, node.startPos(), currentSection));
                        currentChunk = new StringBuilder();
                        chunkStart = node.startPos();
                    }
                    currentSection = node.content();
                    currentChunk.append(node.content()).append("\n");
                }
                case H3 -> {
                    // Minor heading: check whether the current chunk is already large
                    if (currentChunk.length() > maxChunkSize * 0.8) {
                        chunks.add(buildChunk(currentChunk.toString(), chunkStart, node.startPos(), currentSection));
                        currentChunk = new StringBuilder();
                        chunkStart = node.startPos();
                    }
                    currentChunk.append(node.content()).append("\n");
                }
                case CODE -> {
                    // Keep code blocks intact where possible
                    currentChunk.append(node.content()).append("\n");
                }
                case PARAGRAPH_BREAK -> {
                    // A paragraph boundary is a good place to split
                    if (currentChunk.length() >= maxChunkSize) {
                        chunks.add(buildChunk(currentChunk.toString(), chunkStart, node.startPos(), currentSection));
                        currentChunk = new StringBuilder();
                        chunkStart = node.startPos();
                    }
                }
                default -> currentChunk.append(node.content()).append("\n");
            }
        }

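        // Flush the final chunk. It is kept even when shorter than minChunkSize so that
        // trailing text is not silently dropped; refineChunks can merge it later.
        if (currentChunk.length() > 0) {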
            chunks.add(buildChunk(currentChunk.toString(), chunkStart, chunkStart + currentChunk.length(), currentSection));
        }

        return chunks;
    }

    private Chunk buildChunk(String content, int start, int end, String section) {
        return Chunk.builder()
            .content(content.trim())
            .startIndex(start)
            .endIndex(end)
            .metadata(ChunkMetadata.builder()
                .sectionTitle(section)
                .build())
            .build();
    }

    // Implementations of refineChunks and addOverlap are omitted
    private List<Chunk> refineChunks(List<Chunk> chunks) { return chunks; }
    private List<Chunk> addOverlap(List<Chunk> chunks) { return chunks; }
}
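Since refineChunks and addOverlap are left as stubs above, here is a minimal sketch of what the two steps could look like. It works on plain strings rather than Chunk objects to stay self-contained, and the rules are assumptions of this example, not the library's actual behavior: undersized chunks are merged into their predecessor, oversized ones are cut at the last space before the limit, and overlap is the tail of the previous chunk prepended to the next. ChunkPostProcessing, refine and addOverlap are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkPostProcessing {

    // Merge chunks shorter than minSize into their predecessor, then split
    // anything longer than maxSize, preferring a space as the cut point.
    static List<String> refine(List<String> chunks, int minSize, int maxSize) {
        if (maxSize <= 0) throw new IllegalArgumentException("maxSize must be positive");
        List<String> merged = new ArrayList<>();
        for (String c : chunks) {
            int last = merged.size() - 1;
            if (last >= 0 && c.length() < minSize) {
                merged.set(last, merged.get(last) + "\n" + c);
            } else {
                merged.add(c);
            }
        }
        List<String> result = new ArrayList<>();
        for (String c : merged) {
            String rest = c;
            while (rest.length() > maxSize) {
                int cut = rest.lastIndexOf(' ', maxSize);
                if (cut <= 0) cut = maxSize; // no usable space: fall back to a hard cut
                String piece = rest.substring(0, cut).trim();
                if (!piece.isEmpty()) result.add(piece);
                rest = rest.substring(cut).trim();
            }
            if (!rest.isEmpty()) result.add(rest);
        }
        return result;
    }

    // Prefix every chunk except the first with the last overlapSize
    // characters of the chunk that precedes it in the input list.
    static List<String> addOverlap(List<String> chunks, int overlapSize) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            String current = chunks.get(i);
            if (i > 0 && overlapSize > 0) {
                String prev = chunks.get(i - 1);
                String tail = prev.substring(Math.max(0, prev.length() - overlapSize));
                current = tail + current;
            }
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> refined = refine(List.of("aaaa bbbb cccc", "dd"), 3, 9);
        System.out.println(refined);
        System.out.println(addOverlap(List.of("abcdef", "ghij"), 2));
    }
}
```

Note that adding overlap makes chunks longer than maxSize by up to overlapSize characters, and it invalidates simple startIndex/endIndex bookkeeping; a real implementation would need to decide how to account for both.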

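6. Getting the project discovered: the README is the storefront

Finishing the code is only the beginning. The README is the first thing a new user sees, and it decides whether they keep reading.

The structure of my README template: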

# java-text-chunker

> A Java text chunking tool optimized for RAG, supporting multiple chunking strategies and document formats

[![Maven Central](badge)](https://central.sonatype.com/...)
[![License: Apache 2.0](badge)](LICENSE)

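## What problem it solves

[2-3 sentences that spell out the use case and the pain point]

## Quick start

[A minimal, runnable example that shows a result within 30 seconds]

## Core features

[A list, one line per item at most, highlighting what sets the tool apart]

## Documentation

[Chunking strategy comparison | Configuration reference | Spring integration guide]

## Benchmarks

[Optional but a plus: a comparison with competing tools]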

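The most important part is "Quick start". I have seen too many projects whose README has pages of introduction, yet the example code needs 5 dependencies installed, 3 environment variables configured, and an external service on top.

Ideally, your quick-start example should:

Depend only on this one library
Show output in 10 lines or fewer
Need no external configuration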

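// This is the kind of quick-start example your README should contain.
// It assumes a StructureAwareChunker(ChunkerConfig) constructor; the class listing
// above only shows the (maxChunkSize, minChunkSize, overlapSize) one.
ChunkerConfig config = ChunkerConfig.builder()
    .maxChunkSize(500)
    .overlapSize(50)
    .build();

Chunker chunker = new StructureAwareChunker(config);

List<Chunk> chunks = chunker.chunk("Your document content...");
chunks.forEach(chunk -> {
    String content = chunk.getContent();
    // Guard against chunks shorter than 50 characters
    String preview = content.substring(0, Math.min(50, content.length()));
    System.out.println("Chunk[" + chunk.length() + "]: " + preview + "...");
});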

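7. Publishing to Maven Central

Many people skip this step, which forces users to download the Jar by hand and raises the barrier to adoption considerably.

The steps for publishing to Maven Central (simplified):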

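Register a Sonatype account on the Central Portal (central.sonatype.com)
Register and verify your namespace (GroupId) there; the older Jira-ticket process has been retired
Configure GPG signing
Configure the publishing information in pom.xml
Run mvn deploy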

The key pom.xml configuration:

<build>
    <plugins>
        <!-- Sources jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-source-plugin</artifactId>
            <executions>
                <execution>
                    <id>attach-sources</id>
                    <goals><goal>jar-no-fork</goal></goals>
                </execution>
            </executions>
        </plugin>

        <!-- Javadoc -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-javadoc-plugin</artifactId>
            <executions>
                <execution>
                    <id>attach-javadocs</id>
                    <goals><goal>jar</goal></goals>
                </execution>
            </executions>
        </plugin>

        <!-- GPG signing -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-gpg-plugin</artifactId>
            <executions>
                <execution>
                    <id>sign-artifacts</id>
                    <phase>verify</phase>
                    <goals><goal>sign</goal></goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Once it is published to Central, all a user needs is:

<dependency>
    <groupId>io.github.laozhangt</groupId>
    <artifactId>java-text-chunker</artifactId>
    <version>1.0.0</version>
</dependency>

The difference in user experience is huge.

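8. How to promote it after release

Publishing is only the start; a tool gets users only if people know about it.

Share it in relevant communities. After building an AI tool, write an introductory post in places such as Java AI discussion threads, LangChain4j's Discussions, Reddit's r/java, v2ex, and Juejin. Not an ad, but "I ran into this problem and built this tool; if you have a similar use case, feel free to try it".

Reference it in your own articles. I wrote several articles about RAG systems and used my own tool in the code examples, which brought in a batch of well-targeted users.

Respond to Issues promptly. Early on, if someone opens an Issue, reply within 24 hours, even if only to say "received, looking into it". A project that is visibly maintained spreads more easily than one where nobody answers.

Keep a CHANGELOG. Write the CHANGELOG carefully for every release and tell users what changed. It respects users' time and shows that the project keeps improving.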

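9. The biggest pitfall I hit

Over-engineering.

The first version had many abstraction layers, pluggable extensions, config-center support, distributed-cache support, and more. The result: high code complexity, documentation I couldn't write clearly, and a pile of problems for users.

Later I simplified drastically, cutting 70% of the features and keeping only the core chunking logic. More people used it, the feedback improved, and maintenance became easier.

An open-source tool should not be a big, do-everything platform; it should be a sharp knife. Sharp means focused, and focused means doing one thing extremely well.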