Ollama本地LLM部署：私有化AI服务完整搭建与Spring AI集成

老张2026/4/30大约 6 分钟

Ollama本地LLM部署：私有化AI服务完整搭建与Spring AI集成

适读人群：需要私有化部署、数据不出本地的Java工程师 阅读时长：约16分钟 文章价值：从安装Ollama到Spring AI集成，完整落地私有化AI服务

先说一件真实的事

老王所在的公司是做医疗软件的，去年想给系统加AI功能。需求很明确：智能问诊辅助、病历摘要生成。

但合规部门一票否决了接入OpenAI的方案——患者数据不能出国境，这是红线。接国内模型API也不行，因为甲方医院的网络是隔离的，根本连不上外网。

必须本地部署。

当时老王找到我，说不想搞那种复杂的GPU集群，就一台配了16GB显存的工作站，能不能跑起来一个像样的LLM？

我说：可以，用Ollama，一个命令的事。三天后他们的系统就用上了本地的Qwen2.5，数据完全不出机器。

为什么选 Ollama

本地部署 LLM 有好几种方案，Ollama 是我最推荐的起点：

方案	上手难度	资源占用	适合场景
Ollama	极低（一行命令）	低（支持CPU+GPU混合）	开发测试、小规模生产
vLLM	中等	高（需要GPU）	高并发生产
LMDeploy	较高	高	国内模型优化部署
llama.cpp	较高	低（纯CPU可运行）	极低资源环境
本地GPU集群	很高	很高	大规模企业

Ollama 的核心优势：兼容 OpenAI API 格式，这意味着所有支持 OpenAI 的工具（包括 Spring AI）几乎零成本切换。

第一步：安装 Ollama

macOS / Linux：

curl -fsSL https://ollama.com/install.sh | sh

Windows：去 ollama.com 下载安装包，Next Next Finish。

安装完成后，启动服务（macOS/Linux 会自动作为后台服务运行）：

# 验证安装
ollama --version

# 启动服务（如果没有自动启动）
ollama serve

第二步：拉取并运行模型

Ollama 支持的模型非常丰富，选型参考：

模型	参数量	显存/内存需求	中文能力	推荐场景
qwen2.5:7b	7B	8GB	强	日常开发首选
qwen2.5:14b	14B	16GB	强	生产环境
llama3.1:8b	8B	8GB	中	英文场景
deepseek-r1:7b	7B	8GB	强	推理任务
nomic-embed-text	-	1GB	好	Embedding专用

# 拉取 Qwen2.5 7B（约4.7GB，中文效果好）
ollama pull qwen2.5:7b

# 拉取 Embedding 模型（RAG必备）
ollama pull nomic-embed-text

# 运行交互测试
ollama run qwen2.5:7b
# 输入：你好，介绍一下自己
# Ctrl+D 退出

验证 API 接口是否正常：

# Ollama 默认在 11434 端口提供服务
curl http://localhost:11434/api/generate -d '{
  "model": "qwen2.5:7b",
  "prompt": "用一句话介绍Spring AI",
  "stream": false
}'

第三步：Spring AI 集成 Ollama

依赖配置

<dependencies>
    <!-- Spring AI Ollama 集成 -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-starter-model-ollama</artifactId>
    </dependency>
    
    <!-- 如果同时需要向量数据库 -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pgvector-store-spring-boot-autoconfigure</artifactId>
    </dependency>
</dependencies>

application.yml 配置

spring:
  ai:
    ollama:
      base-url: http://localhost:11434
      chat:
        model: qwen2.5:7b
        options:
          temperature: 0.7
          top-p: 0.9
          num-ctx: 4096       # 上下文窗口大小
          num-predict: 1024   # 最大生成长度
      embedding:
        model: nomic-embed-text

核心代码

@Configuration
public class OllamaConfig {

    @Bean
    public ChatClient chatClient(OllamaChatModel chatModel) {
        return ChatClient.builder(chatModel)
            .defaultSystem("""
                你是一个专业的助手，请用简洁准确的中文回答问题。
                如果不确定，请直接说不确定，不要编造信息。
                """)
            .defaultAdvisors(new SimpleLoggerAdvisor())
            .build();
    }
}

@Service
@Slf4j
public class LocalLlmService {

    private final ChatClient chatClient;
    private final EmbeddingModel embeddingModel;

    public LocalLlmService(ChatClient chatClient, EmbeddingModel embeddingModel) {
        this.chatClient = chatClient;
        this.embeddingModel = embeddingModel;
    }

    /**
     * 普通对话
     */
    public String chat(String message) {
        log.info("本地LLM请求: {}", message);
        return chatClient.prompt()
            .user(message)
            .call()
            .content();
    }

    /**
     * 流式对话（适合前端实时显示）
     */
    public Flux<String> streamChat(String message) {
        return chatClient.prompt()
            .user(message)
            .stream()
            .content();
    }

    /**
     * 生成文本Embedding
     */
    public float[] embed(String text) {
        EmbeddingResponse response = embeddingModel.embedForResponse(List.of(text));
        return response.getResults().get(0).getOutput();
    }

    /**
     * 带选项的对话（动态调整参数）
     */
    public String chatWithOptions(String message, double temperature, int maxTokens) {
        return chatClient.prompt()
            .options(OllamaOptions.builder()
                .withTemperature(temperature)
                .withNumPredict(maxTokens)
                .build())
            .user(message)
            .call()
            .content();
    }
}

第四步：搭建完整的私有RAG服务

有了本地LLM和Embedding，再加上本地向量数据库，就可以搭建完全离线的RAG：

@Service
@Slf4j
public class PrivateRagService {

    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    private static final String RAG_PROMPT = """
        你是一个专业助手。请根据以下参考资料回答用户问题。
        如果参考资料中没有相关信息，请如实告知，不要编造。
        
        参考资料：
        {context}
        
        用户问题：{question}
        """;

    public String query(String question) {
        // 1. 检索相关文档
        List<Document> docs = vectorStore.similaritySearch(
            SearchRequest.query(question)
                .withTopK(5)
                .withSimilarityThreshold(0.6)
        );

        if (docs.isEmpty()) {
            return "抱歉，知识库中没有找到相关信息。";
        }

        // 2. 构建上下文
        String context = docs.stream()
            .map(doc -> "- " + doc.getText())
            .collect(Collectors.joining("\n"));

        // 3. 调用本地LLM
        return chatClient.prompt()
            .user(u -> u.text(RAG_PROMPT)
                .param("context", context)
                .param("question", question))
            .call()
            .content();
    }
}

生产部署注意事项

硬件配置参考

Ollama 服务化部署（Docker）

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # GPU支持（需要nvidia-docker）
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped
    
  # 模型初始化（拉取模型）
  ollama-init:
    image: curlimages/curl:latest
    depends_on:
      - ollama
    entrypoint: >
      sh -c "
        sleep 5 &&
        curl -X POST http://ollama:11434/api/pull -d '{\"name\":\"qwen2.5:7b\"}' &&
        curl -X POST http://ollama:11434/api/pull -d '{\"name\":\"nomic-embed-text\"}'
      "

volumes:
  ollama_data:

性能调优配置

# Ollama 环境变量
OLLAMA_NUM_PARALLEL: 4        # 最大并发请求数
OLLAMA_MAX_LOADED_MODELS: 2   # 同时加载的模型数
OLLAMA_KEEP_ALIVE: "10m"      # 模型在内存中保持时长

// Spring AI 连接池配置
@Configuration
public class OllamaConnectionConfig {

    @Bean
    public RestClient.Builder restClientBuilder() {
        return RestClient.builder()
            .requestFactory(new HttpComponentsClientHttpRequestFactory(
                HttpClients.custom()
                    .setMaxConnTotal(20)        // 最大连接数
                    .setMaxConnPerRoute(10)     // 每路由最大连接
                    .setConnectionTimeToLive(Duration.ofMinutes(5))
                    .build()
            ));
    }
}

常见问题排查

问题	原因	解决方法
首次请求很慢	模型冷加载到内存	设置 `OLLAMA_KEEP_ALIVE` 保持热加载
内存不足OOM	模型太大	换7B量化版（Q4_K_M），约4GB
GPU未被使用	驱动未安装	安装 CUDA + nvidia-docker
中文乱码	模型不支持中文	换 Qwen/DeepSeek 系列
返回速度慢	纯CPU推理	添加GPU，速度提升10-50倍
Spring AI连接失败	端口被占用	检查11434端口，或修改 `base-url`

小结

Ollama + Spring AI 的组合，让私有化AI部署从"大工程"变成了"周末项目"。关键步骤：

装Ollama，拉模型，三分钟搞定
加 Spring AI Ollama 依赖，配置 base-url，代码几乎不用改
Embedding 用 nomic-embed-text，效果不错还免费
生产环境加 Docker 容器化，配置并发参数

老王的医疗系统现在稳定运行了半年多，模型跑在院内服务器上，合规没问题，效果也让临床医生满意。他说唯一的遗憾是："早知道这么简单，早半年就做了。"