Post 2283: Exploring AI Computer Use, and the Engineering Challenges of Letting AI Drive the Browser

Lao Zhang | 2026/4/30 | about 7 minutes

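Who should read this: engineers interested in AI automation and RPA | Reading time: about 15 minutes | Key takeaway: understand what Computer Use really is from an engineering standpoint, and assess the feasibility and risks of taking it to production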

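Last year, when I saw Anthropic release the Computer Use capability for Claude, my first reaction was: this excites me more than agents do, and it also makes me more nervous.

The excitement is easy to explain. Our company has a pile of legacy systems with no API: some are twenty-year-old Java Swing interfaces, others are internal web applications reachable only over VPN. RPA integration for these systems is extremely expensive to build, because the moment the UI changes, the whole set of scripts is scrap. If an AI can operate them by looking at the screen, the problem is solved at the root.

The nervousness is just as easy to explain. An AI that can control your computer only has to be slightly off to delete the wrong file or send an email that should never have gone out, and the damage can be catastrophic.

I spent a few weeks seriously exploring the engineering feasibility of Computer Use. This post lays out what I found and where I landed.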

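How Computer Use works

Computer Use is essentially a tool-calling loop; the only difference is that the tools are now "take a screenshot" plus "mouse and keyboard actions". In each round the model requests an action, your code executes it and returns the result (usually a fresh screenshot), and the loop repeats until the model stops asking for tools.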

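Claude's Computer Use provides three core tools:

computer: screenshots, mouse actions (click, drag, scroll), keyboard input
bash: runs shell commands
text_editor: reads and writes files (the API exposes it under a versioned tool name such as str_replace_editor, so check the name for the tool version you use)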
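To make the first of these tools concrete, here is a minimal sketch of what a `computer` tool definition might look like when built as a plain map before it is serialized into the request. The field names follow Anthropic's published tool schema as I understand it. The class name `ComputerToolDefinition`, the `build` helper, and especially the versioned `type` string passed in are assumptions for illustration and should be checked against the current API documentation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ComputerToolDefinition {

    // Builds the JSON-like definition of the screen-control tool.
    // toolType is the versioned identifier (for example "computer_20250124");
    // it changes between releases, so it is passed in rather than hard-coded.
    static Map<String, Object> build(String toolType, int widthPx, int heightPx) {
        if (widthPx <= 0 || heightPx <= 0) {
            throw new IllegalArgumentException("Display size must be positive");
        }
        Map<String, Object> tool = new LinkedHashMap<>();
        tool.put("type", toolType);
        tool.put("name", "computer");
        // Size of the screenshots the model will see, in pixels
        tool.put("display_width_px", widthPx);
        tool.put("display_height_px", heightPx);
        return tool;
    }
}
```

The declared display size should match the size of the screenshots you actually send; otherwise the coordinates the model returns refer to a different space than the one you click in.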

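A basic skeleton for calling Computer Use from Java (treat the client and message types as simplified pseudocode; they will not match the official SDK's class names exactly):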

@Service
public class ComputerUseOrchestrator {
    
    private final AnthropicClient anthropicClient;
    private final ScreenCaptureService screenCapture;
    private final MouseKeyboardController inputController;
    
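    public String executeTask(String taskDescription) throws Exception {
        List<Message> messages = new ArrayList<>();
        messages.add(Message.user(taskDescription));
        
        // Computer Use tool definitions
        List<Tool> tools = buildComputerUseTools();
        
        // Task execution loop
        int maxIterations = 50;  // guard against infinite loops
        for (int i = 0; i < maxIterations; i++) {
            
            // Note: Computer Use has shipped as a beta feature, so a real request
            // may also need the matching beta flag (omitted in this skeleton)
            MessageResponse response = anthropicClient.messages().create(
                MessageCreateParams.builder()
                    .model("claude-opus-4-5")  // must be a model that supports vision
                    .maxTokens(4096)
                    .tools(tools)
                    .messages(messages)
                    .build()
            );
            
            // Append the AI's response to the conversation history
            messages.add(Message.assistant(response.getContent()));
            
            if ("end_turn".equals(response.getStopReason())) {
                // Task finished
                return extractFinalAnswer(response);
            }
            
            if ("tool_use".equals(response.getStopReason())) {
                // Execute the requested tool calls and feed the results back
                List<ToolResultContent> toolResults = executeToolCalls(response.getContent());
                messages.add(Message.user(toolResults));
                continue;
            }
            
            // Any other stop reason (e.g. max_tokens) would leave two assistant
            // messages in a row, so fail fast instead of looping
            throw new IllegalStateException("Unexpected stop reason: " + response.getStopReason());
        }
        
        throw new RuntimeException("Task did not finish within the maximum number of iterations");
    }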
    
    private List<ToolResultContent> executeToolCalls(List<ContentBlock> blocks) throws Exception {
        List<ToolResultContent> results = new ArrayList<>();
        
        for (ContentBlock block : blocks) {
            if (!(block instanceof ToolUseBlock toolUse)) continue;
            
            String toolResult = switch (toolUse.getName()) {
                case "computer" -> handleComputerTool(toolUse.getInput());
                case "bash" -> handleBashTool(toolUse.getInput());
                case "text_editor" -> handleTextEditorTool(toolUse.getInput());
                default -> "Unknown tool: " + toolUse.getName();
            };
            
            results.add(ToolResultContent.builder()
                .toolUseId(toolUse.getId())
                .content(toolResult)
                .build());
        }
        
        return results;
    }
    
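    private String handleComputerTool(Map<String, Object> input) throws Exception {
        String action = (String) input.get("action");
        
        return switch (action) {
            case "screenshot" -> {
                // Capture the screen and encode it as base64
                byte[] screenshot = screenCapture.capture();
                String base64 = Base64.getEncoder().encodeToString(screenshot);
                yield buildImageResult(base64, "image/png");
            }
            case "left_click" -> {
                // The API sends the position as "coordinate": [x, y]
                int[] xy = readCoordinate(input);
                inputController.click(xy[0], xy[1]);
                Thread.sleep(500);  // wait for the UI to respond
                yield "Click done";
            }
            case "type" -> {
                String text = (String) input.get("text");
                inputController.type(text);
                Thread.sleep(200);
                yield "Typing done";
            }
            case "scroll" -> {
                // Parameter names as in the newer tool versions; the first
                // release of the tool had no scroll action
                int[] xy = readCoordinate(input);
                String direction = (String) input.get("scroll_direction");
                int amount = ((Number) input.getOrDefault("scroll_amount", 3)).intValue();
                inputController.scroll(xy[0], xy[1], direction, amount);
                Thread.sleep(300);
                yield "Scroll done";
            }
            case "key" -> {
                // The key or combination (e.g. "ctrl+s") arrives in the "text" field
                String key = (String) input.get("text");
                inputController.pressKey(key);
                Thread.sleep(200);
                yield "Key press done";
            }
            default -> "Unsupported action: " + action;
        };
    }
    
    // Reads "coordinate": [x, y]; JSON numbers may arrive as Integer, Long or Double
    private static int[] readCoordinate(Map<String, Object> input) {
        List<?> c = (List<?>) input.get("coordinate");
        return new int[] { ((Number) c.get(0)).intValue(), ((Number) c.get(1)).intValue() };
    }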
}

Screen capture and mouse control in Java

This is where the real engineering difficulty lies:

@Component
public class ScreenCaptureService {
    
    private final Robot robot;
    
    public ScreenCaptureService() throws AWTException {
        this.robot = new Robot();
    }
    
    public byte[] capture() throws IOException {
        // Get the screen size
        Dimension screenSize = Toolkit.getDefaultToolkit().getScreenSize();
        Rectangle screenRect = new Rectangle(screenSize);
        
        // Take the screenshot
        BufferedImage screenshot = robot.createScreenCapture(screenRect);
        
        // Scale down to a reasonable size (larger images waste tokens).
        // Anthropic's guidance is roughly XGA (1024x768); 1366x768 is used
        // here as the 16:9 counterpart for widescreen displays
        BufferedImage scaled = scaleImage(screenshot, 1366, 768);
        
        // Encode as a PNG byte array
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ImageIO.write(scaled, "PNG", baos);
        return baos.toByteArray();
    }
    
    private BufferedImage scaleImage(BufferedImage original, int targetWidth, int targetHeight) {
        // Scale while preserving the aspect ratio
        double scaleX = (double) targetWidth / original.getWidth();
        double scaleY = (double) targetHeight / original.getHeight();
        double scale = Math.min(scaleX, scaleY);
        
        int newWidth = (int) (original.getWidth() * scale);
        int newHeight = (int) (original.getHeight() * scale);
        
        BufferedImage scaled = new BufferedImage(newWidth, newHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = scaled.createGraphics();
        g.setRenderingHint(RenderingHints.KEY_INTERPOLATION, 
                           RenderingHints.VALUE_INTERPOLATION_BILINEAR);
        g.drawImage(original, 0, 0, newWidth, newHeight, null);
        g.dispose();
        return scaled;
    }
}

@Component
public class MouseKeyboardController {
    
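    private final Robot robot;
    
    // Ratio of screenshot size to real screen size (the screenshot is downscaled,
    // so coordinates from the AI must be mapped back); 1.0 means no scaling
    private double scaleX = 1.0;
    private double scaleY = 1.0;
    
    public MouseKeyboardController() throws AWTException {
        this.robot = new Robot();
    }
    
    // Call after each capture with screenshotWidth / screenWidth (and likewise for height)
    public void setScale(double scaleX, double scaleY) {
        this.scaleX = scaleX;
        this.scaleY = scaleY;
    }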
    
    public void click(int scaledX, int scaledY) throws InterruptedException {
        // Convert the AI's screenshot-space coordinates to real screen coordinates
        int actualX = (int) (scaledX / scaleX);
        int actualY = (int) (scaledY / scaleY);
        
        robot.mouseMove(actualX, actualY);
        Thread.sleep(100);
        robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
        Thread.sleep(50);
        robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
    }
    
    public void type(String text) throws InterruptedException {
        // Type one character at a time (more reliable for special characters)
        for (char c : text.toCharArray()) {
            typeChar(c);
            Thread.sleep(30);  // small delay so the application can keep up
        }
    }
    
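    // Symbols that need Shift, and the unshifted key each one sits on (US layout)
    private static final String SHIFTED = "!@#$%^&*()_+{}|:\"<>?";
    private static final String UNSHIFTED = "1234567890-=[]\\;',./";
    
    private void typeChar(char c) {
        int shiftedIndex = SHIFTED.indexOf(c);
        boolean needsShift = Character.isUpperCase(c) || shiftedIndex >= 0;
        // Map the character to the physical key that produces it; characters
        // outside the US keyboard (e.g. Chinese text) cannot be typed this way
        char base = shiftedIndex >= 0 ? UNSHIFTED.charAt(shiftedIndex) : Character.toLowerCase(c);
        int keyCode = KeyEvent.getExtendedKeyCodeForChar(base);
        
        if (needsShift) robot.keyPress(KeyEvent.VK_SHIFT);
        robot.keyPress(keyCode);
        robot.keyRelease(keyCode);
        if (needsShift) robot.keyRelease(KeyEvent.VK_SHIFT);
    }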
}
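The orchestrator calls a `pressKey` method for the `key` action, which receives combinations such as "ctrl+s" rather than single characters. As a rough sketch of how such a string could be turned into AWT key codes, assuming a small hand-picked set of key names (the class `KeyComboParser` and the list of supported names are hypothetical, not part of any SDK):

```java
import java.awt.event.KeyEvent;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class KeyComboParser {

    // Parses a combination such as "ctrl+shift+s" into key codes, in press order.
    static List<Integer> parse(String combo) {
        List<Integer> codes = new ArrayList<>();
        for (String part : combo.split("\\+")) {
            codes.add(toKeyCode(part.trim().toLowerCase(Locale.ROOT)));
        }
        return codes;
    }

    // Maps one key name to its KeyEvent constant; unknown names are rejected
    // instead of being guessed, so a bad request fails loudly.
    private static int toKeyCode(String name) {
        switch (name) {
            case "ctrl": case "control": return KeyEvent.VK_CONTROL;
            case "shift": return KeyEvent.VK_SHIFT;
            case "alt": return KeyEvent.VK_ALT;
            case "enter": case "return": return KeyEvent.VK_ENTER;
            case "tab": return KeyEvent.VK_TAB;
            case "esc": case "escape": return KeyEvent.VK_ESCAPE;
            case "backspace": return KeyEvent.VK_BACK_SPACE;
            case "delete": return KeyEvent.VK_DELETE;
            default:
                if (name.length() == 1) {
                    char c = name.charAt(0);
                    if (c >= 'a' && c <= 'z') return KeyEvent.VK_A + (c - 'a');
                    if (c >= '0' && c <= '9') return KeyEvent.VK_0 + (c - '0');
                }
                throw new IllegalArgumentException("Unsupported key: " + name);
        }
    }
}
```

A `pressKey` implementation would then press the codes in list order and release them in reverse order, so that modifiers stay held while the final key is struck. The literal "+" key is not handled by this sketch.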

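The real challenges of getting this into production

After a few weeks of exploration, I boiled the obstacles to running Computer Use in production down to a few core challenges:

Challenge 1: Token cost is frighteningly high

Every round of operation needs a screenshot. A 1366×768 PNG comes to roughly 1.5-2 MB once base64-encoded, but what you pay for is tokens, and those depend on pixel dimensions rather than file size: by Anthropic's published approximation of (width × height) / 750, one such image is about 1,400 tokens. A moderately complex task (10-20 steps) therefore sends 14,000 to 28,000 image tokens even if each screenshot is counted only once, and because the whole conversation is resent on every turn, the real input volume grows much faster than that. With the accumulated multi-turn context, completing a single task can cost 5 to 10 US dollars.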
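The arithmetic behind this cost can be sketched in a few lines. The per-image estimate uses Anthropic's published approximation of (width × height) / 750 tokens. The cumulative figure assumes the worst case where history is never trimmed, so step k resends all k screenshots taken so far. The class name and the price parameter are placeholders for illustration, not real pricing.

```java
final class ScreenshotCostEstimator {

    // Approximate image tokens for one screenshot: (width * height) / 750.
    static long imageTokens(int widthPx, int heightPx) {
        return Math.round((double) widthPx * heightPx / 750.0);
    }

    // Total image input tokens over a task in which step k resends all k
    // screenshots taken so far: tokensPerImage * (1 + 2 + ... + steps).
    static long cumulativeImageTokens(long tokensPerImage, int steps) {
        return tokensPerImage * steps * (steps + 1L) / 2;
    }

    // Cost in US dollars for a given (hypothetical) price per million input tokens.
    static double costUsd(long tokens, double usdPerMillionTokens) {
        return tokens / 1_000_000.0 * usdPerMillionTokens;
    }
}
```

For a 1366×768 screenshot this gives 1,399 tokens per image, and a 20-step task with the full history resent each turn adds up to 293,790 image tokens. The growth is quadratic in the number of steps, which is why trimming old screenshots from the history matters more than shaving a few pixels off each one.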

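For real automation workloads this cost is basically unacceptable. Some ways to bring it down:

Limit screenshot frequency; not every step needs a full-screen capture
After a successful operation, send a screenshot cropped to the region of interest
Cache mature, fixed operation sequences so the same AI decisions are not paid for twice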
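The cropping idea from the list above could look roughly like this. It is a minimal sketch using only standard `java.awt` classes; the class name `RegionCropper` is made up for the example, and how you choose the region of interest is left open.

```java
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

final class RegionCropper {

    // Returns a copy of the part of `full` covered by `roi`, clamped to the image bounds.
    static BufferedImage crop(BufferedImage full, Rectangle roi) {
        Rectangle bounds = new Rectangle(0, 0, full.getWidth(), full.getHeight());
        Rectangle clipped = roi.intersection(bounds);
        if (clipped.isEmpty()) {
            throw new IllegalArgumentException("Region lies outside the screenshot: " + roi);
        }
        BufferedImage view = full.getSubimage(clipped.x, clipped.y, clipped.width, clipped.height);
        // getSubimage shares pixel data with the original, so draw into a fresh image
        BufferedImage copy = new BufferedImage(clipped.width, clipped.height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = copy.createGraphics();
        g.drawImage(view, 0, 0, null);
        g.dispose();
        return copy;
    }
}
```

One thing to watch: coordinates the model returns against a cropped image are relative to the crop, so the crop's x and y offset has to be added back before clicking.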

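Challenge 2: Operations are non-deterministic

The AI sometimes miscalculates coordinates, especially on high-resolution screens. Sometimes it gets stuck in a loop, retrying the same failed operation over and over. Sometimes it does more than you asked: you say "delete this record" and it selects everything and then deletes.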
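The "stuck in a loop" failure is cheap to detect mechanically. One possible sketch, assuming each tool call is reduced to a short signature string such as "left_click@512,300" (the class `RepeatedActionDetector` and the signature format are illustrative assumptions):

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class RepeatedActionDetector {

    private final int threshold;
    private final Deque<String> recent = new ArrayDeque<>();

    RepeatedActionDetector(int threshold) {
        if (threshold < 2) {
            throw new IllegalArgumentException("threshold must be at least 2");
        }
        this.threshold = threshold;
    }

    // Records one action signature and returns true when the last
    // `threshold` recorded actions are all identical.
    boolean recordAndCheck(String actionSignature) {
        recent.addLast(actionSignature);
        if (recent.size() > threshold) {
            recent.removeFirst();  // keep only the most recent window
        }
        if (recent.size() < threshold) {
            return false;
        }
        return recent.stream().allMatch(actionSignature::equals);
    }
}
```

When the detector fires, the orchestrator can abort the task or hand it to a person instead of burning the rest of its iteration budget on the same failing click.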

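You need a solid safety net around every operation. The guard below flags risky operations for confirmation and records the before-state so that a person can decide how to roll back: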

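@Component
public class SafetyGuard {
    
    private static final Logger log = LoggerFactory.getLogger(SafetyGuard.class);
    
    private final ScreenCaptureService screenCapture;
    private final List<ReversibleAction> actionHistory = new ArrayList<>();
    
    // Keyword blocklist: any action mentioning one of these needs human confirmation
    private static final Set<String> HIGH_RISK_OPERATIONS = Set.of(
        "delete", "remove", "drop", "truncate", "format", "overwrite"
    );
    
    public SafetyGuard(ScreenCaptureService screenCapture) {
        this.screenCapture = screenCapture;
    }
    
    public boolean shouldRequireConfirmation(String action, String context) {
        // Check whether the action text contains a high-risk keyword
        String lowerAction = action.toLowerCase();
        return HIGH_RISK_OPERATIONS.stream().anyMatch(lowerAction::contains);
    }
    
    // Snapshot mode: capture the screen before the operation and again after it
    public OperationResult executeWithRollbackSupport(
            Runnable operation, 
            String description) throws Exception {
        
        // Snapshot of the state before the operation
        byte[] beforeScreenshot = screenCapture.capture();
        
        try {
            operation.run();
            byte[] afterScreenshot = screenCapture.capture();
            
            // Record the action with its before-state (evidence for a later manual rollback)
            actionHistory.add(new ReversibleAction(description, beforeScreenshot));
            
            return OperationResult.success(afterScreenshot);
        } catch (Exception e) {
            // The operation failed: log it but do not roll back automatically (a human must decide)
            log.error("Operation failed: {}", description, e);
            return OperationResult.failure(e.getMessage());
        }
    }
}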

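Challenge 3: Multiple monitors and DPI scaling

On machines with HiDPI displays (a macOS Retina screen, for example) or several monitors, the coordinate system gets complicated. Physical pixels and logical pixels no longer match, so the coordinates the AI returns can land away from where the click actually happens: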

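// Get the display's actual scale factor
GraphicsEnvironment ge = GraphicsEnvironment.getLocalGraphicsEnvironment();
GraphicsDevice gd = ge.getDefaultScreenDevice();
GraphicsConfiguration gc = gd.getDefaultConfiguration();
AffineTransform at = gc.getDefaultTransform();

double dpiScaleX = at.getScaleX();  // typically 2.0 on a Retina display
double dpiScaleY = at.getScaleY();

// Physical pixels = logical coordinates * DPI scale; logical = physical / DPI scale.
// Check which space each API works in before converting: on Java 9+, java.awt.Robot
// generally takes logical coordinates, so converting twice shifts every click.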
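Whatever the DPI situation, the conversion itself comes down to mapping a point from screenshot space into the space your input API expects. A small sketch of that mapping, with clamping so a slightly-off coordinate never lands outside the screen; the class `CoordinateMapper` is hypothetical, and the caller is assumed to pass the screen size in whichever space (logical or physical) its click API uses.

```java
final class CoordinateMapper {

    private final int shotWidth;
    private final int shotHeight;
    private final int screenWidth;
    private final int screenHeight;

    // shot* is the size of the image sent to the model; screen* is the size of
    // the target screen in the coordinate space used by the click API.
    CoordinateMapper(int shotWidth, int shotHeight, int screenWidth, int screenHeight) {
        if (shotWidth <= 0 || shotHeight <= 0 || screenWidth <= 0 || screenHeight <= 0) {
            throw new IllegalArgumentException("All dimensions must be positive");
        }
        this.shotWidth = shotWidth;
        this.shotHeight = shotHeight;
        this.screenWidth = screenWidth;
        this.screenHeight = screenHeight;
    }

    // Maps a point from screenshot space to screen space, clamped to the screen.
    int[] toScreen(int x, int y) {
        int sx = (int) Math.round(x * (double) screenWidth / shotWidth);
        int sy = (int) Math.round(y * (double) screenHeight / shotHeight);
        return new int[] { clamp(sx, screenWidth - 1), clamp(sy, screenHeight - 1) };
    }

    private static int clamp(int value, int max) {
        return Math.max(0, Math.min(max, value));
    }
}
```

Keeping this in one place, constructed from the sizes actually measured at capture time, avoids the common mistake of applying the screenshot scale in one class and the DPI scale in another.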

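My engineering verdict: where it fits today and where it does not

After all this exploration, I have a fairly clear view of where Computer Use sits from an engineering standpoint:

Scenarios where Computer Use fits today:

Internal tools used at low frequency (dozens of runs a day rather than thousands), where the cost is acceptable
Target interfaces that are relatively stable and do not call for much creative decision-making
Work done under human supervision, where the AI proposes a plan and a person confirms it before execution (human-in-the-loop)
Dynamic interfaces that traditional RPA cannot handle (for example, report screens whose content differs every time)

Scenarios where it does not fit today:

High-frequency automation (because of cost)
Systems where the required permissions are highly sensitive (because of security risk)
Scenarios that demand 100% accuracy with no room for error
Scenarios that need a response in under 5 seconds

Computer Use represents an important direction: AI moving from "generating text" to "operating the digital world". But it is still very early, and the engineering uncertainty is high. If you plan to explore this direction, start in a controlled, low-risk internal scenario, build up experience, and only then consider expanding.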