Node.js MySQL 连接池打满事故复盘:6 大坑和真实修法

大促当晚 Node.js + MySQL 连接池打满事故全复盘:连接数公式 + 慢查询识别 + 事务持有 + prepared 复用 + 连接泄漏 + 读写分离 + max_user_connections + 监控告警。从 12% 错误率降到 0.02%,p99 5s→80ms。

2024 年初我们一个 Node.js + MySQL 服务在大促当晚崩了:DB 连接池打满,后端 500% CPU。复盘发现是连接池配置 + 慢查询 + 事务持有连接太久三个问题叠加。本文复盘全流程,讲透 Node.js MySQL 连接池的所有坑:连接数计算公式、慢查询识别、prepared statement 复用、长事务避免、连接池监控告警。

事故复盘

事故时间:2024-02-14 19:32 - 20:18
影响:大促首小时,订单 API 100% 超时
原因层级:
  1. 业务高峰:订单 QPS 从 200 飙到 3000
  2. 慢查询变多:join 查询从 50ms 升到 800ms
  3. 连接池耗尽:200 连接全占,新请求等 5s 超时
  4. 应用 CPU 100%:大量请求在 await getConnection() 排队

错误堆栈:
PoolClosedError: Pool is closed
    at Pool._purge (mysql2/lib/pool.js:235)

ER_USER_LIMIT_REACHED: User '...' has exceeded the 'max_user_connections' resource

connect ETIMEDOUT
    at Connection._handleTimeoutError (mysql2/lib/connection.js:215)

Node.js MySQL 连接池基础

// mysql2/promise 连接池
import mysql from 'mysql2/promise';

const pool = mysql.createPool({
    host: 'mysql.internal',
    user: 'app',
    password: process.env.DB_PASSWORD,
    database: 'orders',
    waitForConnections: true,        // 满了排队还是报错
    connectionLimit: 50,             // 最大连接数(关键!)
    queueLimit: 0,                   // 队列最大长度,0=无限
    enableKeepAlive: true,
    keepAliveInitialDelay: 10000,
    idleTimeout: 60000,              // 空闲连接 60s 后回收
    maxIdle: 10                      // 至多保留 10 个空闲
});

// 使用
async function getUser(id: number) {
    const [rows] = await pool.execute(
        'SELECT * FROM users WHERE id = ?',
        [id]
    );
    return rows;
}

// 事务
async function transfer(from: number, to: number, amount: number) {
    const conn = await pool.getConnection();
    try {
        await conn.beginTransaction();
        await conn.execute('UPDATE accounts SET balance = balance - ? WHERE id = ?', [amount, from]);
        await conn.execute('UPDATE accounts SET balance = balance + ? WHERE id = ?', [amount, to]);
        await conn.commit();
    } catch (err) {
        await conn.rollback();
        throw err;
    } finally {
        conn.release();   // 必须 release,不然连接泄漏!
    }
}

连接数计算公式

推荐公式:
connectionLimit = (Avg QPS × Avg Query Time) / N
其中 N = Node.js 实例数

例子:
- 业务峰值 QPS = 3000
- 平均 query 耗时 = 50ms
- Node.js 实例 = 5(K8s 5 Pod)
- 单实例 connectionLimit = (3000 × 0.05) / 5 = 30

加 50% buffer = 45 → 取 50(向上取整到 10 的倍数)

服务端 MySQL max_connections 公式:
max_connections >= Node.js 实例数 × connectionLimit × 1.2
                = 5 × 50 × 1.2 = 300

# 我们事故时配置:
# Node.js 5 实例,connectionLimit=200 → 总共 1000 连接
# MySQL max_connections=200 → 完全顶不住
# 结果:MySQL 拒绝新连接

坑 1:慢查询拖垮连接池

-- 业务查询(连接占用 800ms)
SELECT
    o.*, u.name, u.email,
    (SELECT COUNT(*) FROM order_items WHERE order_id = o.id) AS item_count
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status = 'paid'
  AND o.created_at >= '2024-02-14'
ORDER BY o.created_at DESC
LIMIT 100;

-- EXPLAIN:
-- type=ALL(全表扫),Rows=1200000
-- 缺索引,且子查询是 N+1

-- 修复 1:加复合索引
ALTER TABLE orders ADD INDEX idx_status_created (status, created_at);
ALTER TABLE order_items ADD INDEX idx_order_id (order_id);

-- 修复 2:消除 N+1,改 JOIN
SELECT
    o.*, u.name, u.email,
    COALESCE(oi.item_count, 0) AS item_count
FROM orders o
JOIN users u ON u.id = o.user_id
LEFT JOIN (
    SELECT order_id, COUNT(*) AS item_count
    FROM order_items
    GROUP BY order_id
) oi ON oi.order_id = o.id
WHERE o.status = 'paid'
  AND o.created_at >= '2024-02-14'
ORDER BY o.created_at DESC
LIMIT 100;

-- 查询从 800ms → 30ms

坑 2:事务持有连接太久

// 错误示范:事务里有外部 HTTP 调用
async function createOrder(data: OrderData) {
    const conn = await pool.getConnection();
    try {
        await conn.beginTransaction();

        // 1. 写订单
        const [result] = await conn.execute(
            'INSERT INTO orders SET ?', [data]
        );
        const orderId = (result as any).insertId;

        // 2. 调外部支付服务 ❌ 占用连接 200-500ms
        const payment = await axios.post('https://payment.internal/charge', {
            amount: data.amount, orderId
        });

        // 3. 更新订单状态
        await conn.execute(
            'UPDATE orders SET payment_id = ? WHERE id = ?',
            [payment.data.id, orderId]
        );

        await conn.commit();
    } finally {
        conn.release();
    }
}
// 单笔订单占连接 500ms,3000 QPS → 1500 连接需求
// 正确做法:事务尽量短,外部调用放事务外
async function createOrder(data: OrderData) {
    // 1. 短事务:只写订单
    let orderId: number;
    const conn = await pool.getConnection();
    try {
        await conn.beginTransaction();
        const [result] = await conn.execute(
            'INSERT INTO orders (..., status) VALUES (..., "pending")',
            [...]
        );
        orderId = (result as any).insertId;
        await conn.commit();
    } finally {
        conn.release();
    }

    // 2. 事务外调外部服务
    const payment = await axios.post('https://payment.internal/charge', {
        amount: data.amount, orderId
    });

    // 3. 短事务:更新状态(用 pool.execute 不占连接长)
    await pool.execute(
        'UPDATE orders SET payment_id = ?, status = "paid" WHERE id = ?',
        [payment.data.id, orderId]
    );
}
// 单笔订单只占连接 20ms 两次,大幅减少持有时间

坑 3:prepared statement 没复用

// 错误:每次都 prepare
async function findUser(id: number) {
    const stmt = await pool.prepare('SELECT * FROM users WHERE id = ?');
    try {
        const [rows] = await stmt.execute([id]);
        return rows;
    } finally {
        await stmt.close();   // close 后下次还要 prepare
    }
}

// 正确:用 pool.execute(底层自动缓存 prepared statement)
async function findUser(id: number) {
    const [rows] = await pool.execute(
        'SELECT * FROM users WHERE id = ?',
        [id]
    );
    return rows;
}

// mysql2 配置:启用 prepared statement 缓存
const pool = mysql.createPool({
    // ...
    maxPreparedStatements: 16000,   // 缓存上限
    namedPlaceholders: true          // 支持 :name 占位
});

坑 4:连接泄漏

// 经典错误:抛异常没 release
async function badFunction() {
    const conn = await pool.getConnection();
    const [rows] = await conn.execute('SELECT * FROM users');
    if (rows.length === 0) {
        throw new Error('no user');   // ❌ 没 release,连接泄漏!
    }
    conn.release();
    return rows;
}

// 修法 1:try/finally
async function goodFunction() {
    const conn = await pool.getConnection();
    try {
        const [rows] = await conn.execute('SELECT * FROM users');
        if (rows.length === 0) {
            throw new Error('no user');
        }
        return rows;
    } finally {
        conn.release();
    }
}

// 修法 2:封装高阶函数
async function withConnection(fn: (conn: PoolConnection) => Promise): Promise {
    const conn = await pool.getConnection();
    try {
        return await fn(conn);
    } finally {
        conn.release();
    }
}

await withConnection(async (conn) => {
    const [rows] = await conn.execute('SELECT * FROM users');
    return rows;
});

// 修法 3:事务封装
async function withTransaction(fn: (conn: PoolConnection) => Promise): Promise {
    const conn = await pool.getConnection();
    try {
        await conn.beginTransaction();
        const result = await fn(conn);
        await conn.commit();
        return result;
    } catch (err) {
        await conn.rollback();
        throw err;
    } finally {
        conn.release();
    }
}

坑 5:read replica 没用

// 主库 + 多个从库的读写分离
const masterPool = mysql.createPool({ host: 'mysql-master', ...});
const slavePool = mysql.createPool({
    host: 'mysql-slave',
    connectionLimit: 100   // 从库连接数可以更多
});

class DB {
    // 写走主库
    async insert(table: string, data: any) {
        return masterPool.execute(`INSERT INTO ${table} SET ?`, [data]);
    }

    async update(sql: string, params: any[]) {
        return masterPool.execute(sql, params);
    }

    // 读优先走从库
    async query(sql: string, params: any[] = []) {
        try {
            return await slavePool.execute(sql, params);
        } catch (err) {
            // 从库挂了降级主库
            console.warn('Slave query failed, fallback to master');
            return masterPool.execute(sql, params);
        }
    }

    // 强一致读:走主库
    async queryConsistent(sql: string, params: any[] = []) {
        return masterPool.execute(sql, params);
    }
}

// 使用
const db = new DB();
const users = await db.query('SELECT * FROM users WHERE id = ?', [123]);  // 走从库
await db.insert('orders', { user_id: 123, amount: 100 });                 // 走主库

坑 6:max_user_connections 限制

-- MySQL 用户配额
SHOW VARIABLES LIKE 'max_connections';        -- 全局上限
SHOW VARIABLES LIKE 'max_user_connections';   -- 单用户上限

-- 看当前用户配额
SELECT * FROM mysql.user WHERE User = 'app';
-- max_user_connections | max_connections | max_questions

-- 修改单用户配额
ALTER USER 'app'@'%' WITH MAX_USER_CONNECTIONS 500;

-- 当前活跃连接
SHOW PROCESSLIST;
SELECT host, user, COUNT(*) FROM information_schema.processlist GROUP BY host, user;

连接池监控

// 自定义 metrics
import { Counter, Gauge, Histogram } from 'prom-client';

const dbConnAcquired = new Counter({
    name: 'db_conn_acquired_total',
    help: 'DB connections acquired'
});

const dbConnReleased = new Counter({
    name: 'db_conn_released_total',
    help: 'DB connections released'
});

const dbConnActive = new Gauge({
    name: 'db_conn_active',
    help: 'Current active connections'
});

const dbConnWaiting = new Gauge({
    name: 'db_conn_waiting',
    help: 'Requests waiting for connection'
});

const dbQueryDuration = new Histogram({
    name: 'db_query_duration_seconds',
    help: 'Query duration',
    labelNames: ['operation', 'table'],
    buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});

// 包装查询函数
async function query(sql: string, params: any[] = [], operation = 'select', table = 'unknown') {
    const end = dbQueryDuration.startTimer({ operation, table });
    dbConnAcquired.inc();
    dbConnActive.inc();
    try {
        const [rows] = await pool.execute(sql, params);
        return rows;
    } finally {
        dbConnActive.dec();
        dbConnReleased.inc();
        end();
    }
}

// 定期采集 pool 内部状态
setInterval(() => {
    const stats = (pool as any).pool;   // mysql2 pool 内部对象
    dbConnActive.set(stats._allConnections.length);
    dbConnWaiting.set(stats._connectionQueue.length);
}, 5000);

告警规则

groups:
  - name: nodejs_db
    rules:
      # 连接池使用率高
      - alert: DBConnPoolHigh
        expr: db_conn_active / 50 > 0.8
        for: 2m
        labels: { severity: warning }

      # 有请求在排队等连接
      - alert: DBConnQueueing
        expr: db_conn_waiting > 10
        for: 1m
        labels: { severity: critical }

      # 慢查询变多
      - alert: DBSlowQuery
        expr: histogram_quantile(0.99, db_query_duration_seconds) > 0.5
        for: 5m
        labels: { severity: warning }

      # 连接失败
      - alert: DBConnError
        expr: rate(db_conn_error_total[5m]) > 0.1
        for: 2m
        labels: { severity: critical }

事故后改进

配置调整:
- connectionLimit: 200 → 50(单实例)
- MySQL max_connections: 200 → 500
- MySQL max_user_connections: 0 → 500

业务改造:
- 事务里外部调用全部挪出去(异步消息)
- 慢查询索引补齐
- 读写分离,读 80% 走从库
- 连接释放统一用 withConnection 高阶函数封装

架构调整:
- 大促前预扩容:Node.js 5 → 10 实例
- MySQL 主库 + 3 从库(从 1 变 3)
- 上 PgBouncer 风格的中间代理(ProxySQL)统一连接管理

监控告警:
- 连接池使用率告警阈值 80%
- 慢查询 p99 > 100ms 告警
- 连接等待队列 > 10 告警

效果:
- 大促当晚平稳:订单 5000 QPS,DB 连接使用率峰值 60%
- p99 延迟从 5s → 80ms
- 错误率从 12% → 0.02%

避坑清单

  1. connectionLimit 公式:QPS × 平均耗时 / 实例数 + 50% buffer
  2. MySQL max_connections 要 > Node.js 总连接数 × 1.2
  3. 事务里不要有外部 HTTP 调用,缩短连接持有
  4. 用 try/finally 或 withConnection 封装,杜绝连接泄漏
  5. 慢查询要 EXPLAIN,加索引或重写
  6. 读写分离,从库分担读流量
  7. 用 pool.execute(自动 prepared 缓存),不要手动 prepare/close
  8. monitor 连接池活跃数 / 等待队列 / 查询延迟
  9. 事故前要做压测,知道连接池的极限
  10. 设置 idleTimeout 避免长时间空闲连接被 MySQL kill

总结

Node.js MySQL 连接池是个"配低了顶不住,配高了 DB 拒绝"的精细活。connectionLimit 不是越大越好,跟 MySQL max_connections 必须协调。事故根因往往不是单点,而是几个问题叠加:慢查询 + 长事务 + 配置不当 = 雪崩。最大的认知改变:连接池不是黑盒,要持续监控连接活跃数 / 等待队列 / 查询时长。大促前一定要压测,把"理论上撑得住"变成"实测撑得住"。这次事故让我们把 DB 监控纳入了 SRE 平台的核心面板。

—— 别看了 · 2026
声明:本站所有文章,如无特殊说明或标注,均为本站原创发布。任何个人或组织,在未征得本站同意时,禁止复制、盗用、采集、发布本站内容到任何网站、书籍等各类媒体平台。如若本站内容侵犯了原著者的合法权益,可联系我们进行处理 邮箱1846861578@qq.com。
技术教程

Vue 2 升 Vue 3 + Vite + TS 五个月实录:22 万行代码 9 个坑

2026-5-19 12:27:02

技术教程

JVM G1 切 ZGC 两周调参实录:p99 GC 暂停 80ms→3ms

2026-5-19 12:32:01

0 条回复 A文章作者 M管理员
    暂无讨论,说说你的看法吧
个人中心
购物车
优惠劵
今日签到
有新私信 私信列表
搜索