Playwright Python in Practice: Browser Automation, Persistent Login State, and Data Collection

Lao Zhang | 2026/4/30 | about 7 minutes

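Audience: Python scraper developers, test automation engineers | Reading time: about 16 minutes | Key takeaway: a complete path for Playwright, from first steps to production use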

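It Started with Working Until 2 a.m.

A former colleague of mine, A-Lei, who had since moved to a financial data company, messaged me late one night to say he was close to breaking down.

His team had to collect data every day from a financial data platform that requires login. Their Selenium-based solution had run stably for half a year, but the platform had recently upgraded its verification, and every login now required solving a CAPTCHA. On top of that, Selenium driving an outdated ChromeDriver kept hitting compatibility problems, so the script crashed every few days and he had to fix it by hand each time. After several weeks of this, he messaged me at 2 a.m.: "Zhang-ge, I can't take it anymore. Is there a more stable option?"

I said: switch to Playwright.

He was skeptical, spent a weekend migrating, and then sent me one line: "This thing is so modern. I wasted three years on Selenium."

Playwright is a modern browser automation framework from Microsoft. It was first released in 2020, about 16 years after Selenium (2004), yet in my experience it is ahead on architecture, API ergonomics, and stability. In this post I'll walk through engineering practice with the Python version of Playwright.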

1. Playwright vs Selenium: Why I Recommend Switching

Let's make the differences clear first, otherwise there is no motivation to switch:

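Dimension	Selenium 4	Playwright
Architecture	WebDriver protocol over HTTP, one round trip per command	Persistent WebSocket connection to the browser (CDP for Chromium), very low latency
Installation	Selenium Manager (4.6+) fetches drivers automatically; older versions need manual ChromeDriver management	`playwright install` installs the browsers in one step
Waiting	Explicit `WebDriverWait` calls	Built-in auto-wait until the element is actionable
Multiple tabs	Awkward	Native and intuitive
Network interception	Limited, via CDP/BiDi; no `route`-style API	Native `route` interception
Screenshots/PDF	Supported	Supported, and more stable
Async support	No native async API in the Python bindings	Native async/await
Browsers	Chrome/Firefox/Edge/Safari	Chromium/Firefox/WebKit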

Conclusion: use Playwright for new projects, and migrate old ones when you have time.

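2. Quick Start

Installation

pip install playwright
playwright install chromium  # Chromium alone is usually enough
# Or install all browsers
playwright install

Sync vs. Async API

Playwright provides two APIs: the sync version suits simple scripts, and the async version suits high concurrency.

# Sync version (recommended for simple scripts)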
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    title = page.title()
    print(f"Page title: {title}")
    browser.close()

# Async version (recommended for high concurrency)
import asyncio
from playwright.async_api import async_playwright


async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://example.com")
        title = await page.title()
        print(f"Page title: {title}")
        await browser.close()


asyncio.run(main())

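3. Persisting Login State: The Core of This Post

Pitfall 1: Logging in again on every run is slow

The original approach ran the full login flow on every crawler start (fill in the username, fill in the password, click login). It took roughly 3-8 seconds each time and easily triggered risk control.

Symptom: frequent runs triggered account-anomaly warnings and occasionally required SMS verification.
Cause: the unusually high login frequency was flagged by the security system as abnormal behavior.
Fix: save the login state with storage_state and reuse it on later runs.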

import asyncio
from pathlib import Path
from playwright.async_api import async_playwright


AUTH_FILE = Path("auth_state.json")


async def login_and_save_state():
    """Log in once and save the authentication state."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)  # headed mode is recommended for the first login
        context = await browser.new_context()
        page = await context.new_page()

        await page.goto("https://example.com/login")

        # Fill in the login form
        await page.fill("#username", "your_username")
        await page.fill("#password", "your_password")
        await page.click("#login-btn")

        # Wait for login to succeed (a redirect or a specific element appearing)
        await page.wait_for_url("**/dashboard", timeout=15000)
        print("Login succeeded")

        # Save the context's state: cookies and localStorage
        # (sessionStorage is not included in storage_state)
        await context.storage_state(path=str(AUTH_FILE))
        print(f"Authentication state saved to {AUTH_FILE}")

        await browser.close()


async def crawl_with_saved_state():
    """Scrape data using the saved login state."""
    if not AUTH_FILE.exists():
        print("No authentication state file found, logging in first")
        await login_and_save_state()

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)

        # Load the saved authentication state
        context = await browser.new_context(
            storage_state=str(AUTH_FILE),
            viewport={"width": 1920, "height": 1080},
            # Mimic a real user environment
            user_agent=(
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
        )
        page = await context.new_page()

        # Verify that the login state is still valid
        await page.goto("https://example.com/dashboard")
        if "login" in page.url:
            print("Login state has expired, logging in again...")
            await browser.close()
            AUTH_FILE.unlink()  # delete the stale state
            # Return the retry's results; a bare call followed by `return` would discard them
            return await crawl_with_saved_state()

        # Normal scraping
        results = []
        for page_num in range(1, 11):
            await page.goto(f"https://example.com/data?page={page_num}")
            await page.wait_for_selector(".data-row", timeout=10000)

            rows = await page.query_selector_all(".data-row")
            for row in rows:
                title = await row.query_selector(".title")
                value = await row.query_selector(".value")
                results.append({
                    "title": await title.inner_text() if title else "",
                    "value": await value.inner_text() if value else "",
                })

            print(f"Page {page_num}: fetched {len(rows)} rows")

        await browser.close()
        return results


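# A more robust check for whether the login state is still valid
async def is_logged_in(page) -> bool:
    """Decide whether we are logged in by looking for a specific element."""
    try:
        # Assumes a user-avatar element exists once logged in.
        # wait_for_selector accepts a timeout (query_selector does not)
        # and raises if the element never appears.
        element = await page.wait_for_selector(".user-avatar", timeout=3000)
        return element is not None
    except Exception:
        return False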
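
Before launching a browser at all, it can be worth doing a cheap offline check on the saved file. The sketch below is only an illustration under stated assumptions: the helper name `saved_state_looks_valid` is hypothetical, and it assumes the JSON layout that `storage_state` writes, namely a top-level "cookies" list whose entries carry an "expires" Unix timestamp in seconds, with -1 marking a session cookie. It cannot tell whether the server has revoked the session, so the in-page check is still needed.

```python
import json
import time
from pathlib import Path
from typing import Optional


def saved_state_looks_valid(path: Path, now: Optional[float] = None) -> bool:
    """Return True if the file parses and holds at least one unexpired cookie.

    Hypothetical helper: a cheap pre-check only. Session cookies
    (expires == -1) are treated as possibly valid.
    """
    now = time.time() if now is None else now
    try:
        state = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return False  # missing, unreadable, or corrupt file
    if not isinstance(state, dict):
        return False
    for cookie in state.get("cookies", []):
        expires = cookie.get("expires", -1)
        if expires == -1 or expires > now:
            return True
    return False
```

If this returns False, deleting the file and running the login flow again avoids a wasted browser launch with a state that is certain to be stale.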

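4. Intercepting Network Requests: Playwright's Killer Feature

Playwright's route feature can intercept, modify, or block any network request, which is very useful for scraping.

Pitfall 2: Slow page loads caused by images and ads

Symptom: a news site took 8-10 seconds to load, with most of the time spent on images and third-party ad scripts.
Cause: the scraper only needs the text, not the images or ads, but the browser loads everything by default.
Fix: use route to block the resources you don't need.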

import asyncio
from playwright.async_api import async_playwright


async def smart_crawl():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()
        page = await context.new_page()

        # Resource types and domains to block
        BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}
        BLOCKED_DOMAINS = {
            "googletagmanager.com",
            "google-analytics.com",
            "doubleclick.net",
            "facebook.net",
        }

        async def block_unnecessary(route, request):
            if request.resource_type in BLOCKED_RESOURCE_TYPES:
                await route.abort()
                return
            if any(domain in request.url for domain in BLOCKED_DOMAINS):
                await route.abort()
                return
            await route.continue_()

        await page.route("**/*", block_unnecessary)

        # With these resources blocked, the page loads roughly 3-5x faster
        start = asyncio.get_running_loop().time()
        await page.goto("https://news.example.com", wait_until="domcontentloaded")
        elapsed = asyncio.get_running_loop().time() - start
        print(f"Page load time: {elapsed:.2f}s")

        content = await page.inner_text("article.news-content")
        print(content[:200])

        await browser.close()


# More advanced: intercept an API request and replace the response (useful for testing)
async def intercept_api():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()

        async def mock_api(route, request):
            if "/api/user/info" in request.url:
                # Return mock data
                await route.fulfill(
                    status=200,
                    content_type="application/json",
                    body='{"name": "测试用户", "level": "VIP"}'
                )
            else:
                await route.continue_()

        await page.route("**/*", mock_api)
        await page.goto("https://example.com")
        await browser.close()
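
One detail in the blocking handler is worth tightening: a substring test against the full URL also matches query strings and look-alike hosts. The sketch below keeps the same two block lists but compares against the parsed hostname instead. The function name `should_block` is hypothetical, and the lists are copied from the example above purely for illustration.

```python
from urllib.parse import urlsplit

BLOCKED_RESOURCE_TYPES = {"image", "font", "media", "stylesheet"}
BLOCKED_DOMAINS = {
    "googletagmanager.com",
    "google-analytics.com",
    "doubleclick.net",
    "facebook.net",
}


def should_block(resource_type: str, url: str) -> bool:
    """Decide whether a request should be aborted (hypothetical helper)."""
    if resource_type in BLOCKED_RESOURCE_TYPES:
        return True
    # Compare the hostname only, so "?ref=doubleclick.net" in a query
    # string or a host like "notdoubleclick.net" does not match.
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)
```

Inside a route handler this would be called as should_block(request.resource_type, request.url), followed by route.abort() or route.continue_() depending on the result. Keeping the decision in a pure function also makes it easy to unit-test without a browser.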

5. Concurrent Scraping: Multiple Pages in Parallel

import asyncio
from playwright.async_api import async_playwright


async def fetch_one(context, url: str) -> dict:
    """Scrape a single page."""
    page = await context.new_page()
    try:
        await page.goto(url, wait_until="domcontentloaded", timeout=30000)
        await page.wait_for_selector(".product-title", timeout=10000)
        return {
            "url": url,
            "title": await page.inner_text(".product-title"),
            "price": await page.inner_text(".product-price"),
        }
    except Exception as e:
        return {"url": url, "error": str(e)}
    finally:
        await page.close()


async def batch_crawl(urls: list[str], concurrency: int = 5) -> list[dict]:
    """Crawl a batch of URLs concurrently."""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        context = await browser.new_context()

        # Use a Semaphore to cap concurrency
        semaphore = asyncio.Semaphore(concurrency)

        async def fetch_with_limit(url):
            async with semaphore:
                return await fetch_one(context, url)

        results = await asyncio.gather(
            *[fetch_with_limit(url) for url in urls],
            return_exceptions=True
        )

        await browser.close()
        return [r for r in results if isinstance(r, dict)]


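# Usage example
urls = [f"https://example.com/product/{i}" for i in range(1, 51)]
results = asyncio.run(batch_crawl(urls, concurrency=5))
# fetch_one reports failures as dicts with an "error" key, so exclude them from the success count
succeeded = [r for r in results if "error" not in r]
print(f"Scraped {len(succeeded)} of {len(urls)} pages successfully")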
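
To see that the Semaphore pattern really caps the number of pages in flight, it can be exercised without a browser. The following is a minimal sketch with made-up names (`run_limited`, `demo`, `fake_fetch`); the short `asyncio.sleep` merely stands in for a real `page.goto`, and the demo records the highest number of simulated fetches that overlapped.

```python
import asyncio


async def run_limited(items, worker, concurrency: int = 5):
    """Run worker(item) for every item, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(i) for i in items))


async def demo(total: int = 20, concurrency: int = 5) -> int:
    """Return the peak number of simulated fetches running at once."""
    active = 0
    peak = 0

    async def fake_fetch(i: int) -> int:
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stands in for page.goto(...)
        active -= 1
        return i

    results = await run_limited(range(total), fake_fetch, concurrency)
    assert results == list(range(total))
    return peak
```

Calling asyncio.run(demo(20, 5)) reports a peak equal to the limit, which is the property the real crawler relies on: all 50 tasks are created up front, but only `concurrency` of them hold a page open at any moment.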

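Pitfall 3: Too much concurrency causes out-of-memory kills

Symptom: with concurrency set to 20, the process was killed by the OOM Killer after 30 minutes.
Cause: each Chromium page uses roughly 100-200 MB of memory, so 20 concurrent pages need 2-4 GB.
Fix: size concurrency to the machine's memory. Budget about 200 MB per concurrent page, keep about 1 GB free for the system, and cap the total.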

import psutil


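def get_safe_concurrency() -> int:
    """Compute a safe concurrency level from the currently available memory."""
    available_gb = psutil.virtual_memory().available / (1024 ** 3)
    # Budget about 200 MB per page and keep 1 GB free for the system
    safe_count = max(1, int((available_gb - 1) / 0.2))
    return min(safe_count, 10)  # never more than 10 concurrent pages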
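
The arithmetic behind that function is easier to check when the memory figure is passed in rather than read from the machine. Here is a small sketch of the same budget in whole megabytes; the name `safe_concurrency` and its parameters are hypothetical, and the defaults (200 MB per page, 1024 MB reserved, cap of 10) are assumptions taken from the estimates above, not measured values.

```python
def safe_concurrency(available_mb: float, per_page_mb: int = 200,
                     reserve_mb: int = 1024, cap: int = 10) -> int:
    """Pages that fit in memory after the reserve, clamped to 1..cap."""
    usable_mb = available_mb - reserve_mb
    pages = int(usable_mb // per_page_mb)
    return min(cap, max(1, pages))


# 2 GB available: 1024 MB usable, which fits 5 pages of 200 MB
print(safe_concurrency(2048))
# 512 MB available: below the reserve, so fall back to a single page
print(safe_concurrency(512))
# 16 GB available: 76 pages would fit, but the cap limits it to 10
print(safe_concurrency(16384))
```

Under these assumptions, the failing setup from the pitfall (20 pages at up to 200 MB each) needs about 4 GB on top of the reserve, which explains why a smaller box eventually hit the OOM Killer.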

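6. Anti-Detection: Getting Past Bot Checks

from playwright.async_api import async_playwright


async def stealth_browser():
    """Browser configuration that reduces automation fingerprints."""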
    async with async_playwright() as p:
        browser = await p.chromium.launch(
            headless=True,
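            args=[
                # Disables Chromium's sandbox; usually only needed in containers or when running as root
                "--no-sandbox",
                # Turns off the Blink feature that flags the browser as automation-controlled
                "--disable-blink-features=AutomationControlled",
                "--disable-infobars",
            ],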
        )
        context = await browser.new_context(
            viewport={"width": 1366, "height": 768},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/122.0.0.0 Safari/537.36"
            ),
            # Set the geolocation (if needed)
            geolocation={"latitude": 39.9042, "longitude": 116.4074},
            permissions=["geolocation"],
            # Set the time zone
            timezone_id="Asia/Shanghai",
        )

        page = await context.new_page()

        # Inject anti-detection scripts before any page script runs
        await page.add_init_script("""
            // Hide the webdriver flag
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
            // Fake plugins (real browsers have plugins)
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5]
            });
            // Fake languages
            Object.defineProperty(navigator, 'languages', {
                get: () => ['zh-CN', 'zh', 'en']
            });
        """)

        await page.goto("https://bot.sannysoft.com")
        await page.screenshot(path="bot_check.png")
        await browser.close()

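7. When to Choose It

Use Playwright when:

You need to maintain complex login state
The page has heavy JavaScript interaction
You need network request interception or mocking
Existing Selenium code breaks frequently

Playwright is a poor fit when:

The page is static or an API is available (just use requests)
You need extremely high concurrency (consider the Scrapy framework)
Server memory is severely constrained

A-Lei later migrated their whole data collection system to Playwright. It has run stably for three months, and he has not sent me another late-night distress call. That is the value of picking the right tool.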