2024 年初我们一个 Node.js + MySQL 服务在大促当晚崩了:DB 连接池打满,后端 500% CPU。复盘发现是连接池配置 + 慢查询 + 事务持有连接太久三个问题叠加。本文复盘全流程,讲透 Node.js MySQL 连接池的所有坑:连接数计算公式、慢查询识别、prepared statement 复用、长事务避免、连接池监控告警。
事故复盘
事故时间:2024-02-14 19:32 - 20:18
影响:大促首小时,订单 API 100% 超时
原因层级:
1. 业务高峰:订单 QPS 从 200 飙到 3000
2. 慢查询变多:join 查询从 50ms 升到 800ms
3. 连接池耗尽:200 连接全占,新请求等 5s 超时
4. 应用 CPU 100%:大量请求在 await getConnection() 排队
错误堆栈:
PoolClosedError: Pool is closed
at Pool._purge (mysql2/lib/pool.js:235)
ER_USER_LIMIT_REACHED: User '...' has exceeded the 'max_user_connections' resource
connect ETIMEDOUT
at Connection._handleTimeoutError (mysql2/lib/connection.js:215)
Node.js MySQL 连接池基础
// mysql2/promise 连接池
import mysql from 'mysql2/promise';
const pool = mysql.createPool({
host: 'mysql.internal',
user: 'app',
password: process.env.DB_PASSWORD,
database: 'orders',
waitForConnections: true, // 满了排队还是报错
connectionLimit: 50, // 最大连接数(关键!)
queueLimit: 0, // 队列最大长度,0=无限
enableKeepAlive: true,
keepAliveInitialDelay: 10000,
idleTimeout: 60000, // 空闲连接 60s 后回收
maxIdle: 10 // 至多保留 10 个空闲
});
// 使用
async function getUser(id: number) {
const [rows] = await pool.execute(
'SELECT * FROM users WHERE id = ?',
[id]
);
return rows;
}
// 事务
async function transfer(from: number, to: number, amount: number) {
const conn = await pool.getConnection();
try {
await conn.beginTransaction();
await conn.execute('UPDATE accounts SET balance = balance - ? WHERE id = ?', [amount, from]);
await conn.execute('UPDATE accounts SET balance = balance + ? WHERE id = ?', [amount, to]);
await conn.commit();
} catch (err) {
await conn.rollback();
throw err;
} finally {
conn.release(); // 必须 release,不然连接泄漏!
}
}
连接数计算公式
推荐公式:
connectionLimit = (Avg QPS × Avg Query Time) / N
其中 N = Node.js 实例数
例子:
- 业务峰值 QPS = 3000
- 平均 query 耗时 = 50ms
- Node.js 实例 = 5(K8s 5 Pod)
- 单实例 connectionLimit = (3000 × 0.05) / 5 = 30
加 50% buffer = 45 → 取 50(向上取整到 10 的倍数)
服务端 MySQL max_connections 公式:
max_connections >= Node.js 实例数 × connectionLimit × 1.2
= 5 × 50 × 1.2 = 300
# 我们事故时配置:
# Node.js 5 实例,connectionLimit=200 → 总共 1000 连接
# MySQL max_connections=200 → 完全顶不住
# 结果:MySQL 拒绝新连接
坑 1:慢查询拖垮连接池
-- 业务查询(连接占用 800ms)
SELECT
o.*, u.name, u.email,
(SELECT COUNT(*) FROM order_items WHERE order_id = o.id) AS item_count
FROM orders o
JOIN users u ON u.id = o.user_id
WHERE o.status = 'paid'
AND o.created_at >= '2024-02-14'
ORDER BY o.created_at DESC
LIMIT 100;
-- EXPLAIN:
-- type=ALL(全表扫),Rows=1200000
-- 缺索引,且子查询是 N+1
-- 修复 1:加复合索引
ALTER TABLE orders ADD INDEX idx_status_created (status, created_at);
ALTER TABLE order_items ADD INDEX idx_order_id (order_id);
-- 修复 2:消除 N+1,改 JOIN
SELECT
o.*, u.name, u.email,
COALESCE(oi.item_count, 0) AS item_count
FROM orders o
JOIN users u ON u.id = o.user_id
LEFT JOIN (
SELECT order_id, COUNT(*) AS item_count
FROM order_items
GROUP BY order_id
) oi ON oi.order_id = o.id
WHERE o.status = 'paid'
AND o.created_at >= '2024-02-14'
ORDER BY o.created_at DESC
LIMIT 100;
-- 查询从 800ms → 30ms
坑 2:事务持有连接太久
// 错误示范:事务里有外部 HTTP 调用
async function createOrder(data: OrderData) {
const conn = await pool.getConnection();
try {
await conn.beginTransaction();
// 1. 写订单
const [result] = await conn.execute(
'INSERT INTO orders SET ?', [data]
);
const orderId = (result as any).insertId;
// 2. 调外部支付服务 ❌ 占用连接 200-500ms
const payment = await axios.post('https://payment.internal/charge', {
amount: data.amount, orderId
});
// 3. 更新订单状态
await conn.execute(
'UPDATE orders SET payment_id = ? WHERE id = ?',
[payment.data.id, orderId]
);
await conn.commit();
} finally {
conn.release();
}
}
// 单笔订单占连接 500ms,3000 QPS → 1500 连接需求
// 正确做法:事务尽量短,外部调用放事务外
async function createOrder(data: OrderData) {
// 1. 短事务:只写订单
let orderId: number;
const conn = await pool.getConnection();
try {
await conn.beginTransaction();
const [result] = await conn.execute(
'INSERT INTO orders (..., status) VALUES (..., "pending")',
[...]
);
orderId = (result as any).insertId;
await conn.commit();
} finally {
conn.release();
}
// 2. 事务外调外部服务
const payment = await axios.post('https://payment.internal/charge', {
amount: data.amount, orderId
});
// 3. 短事务:更新状态(用 pool.execute 不占连接长)
await pool.execute(
'UPDATE orders SET payment_id = ?, status = "paid" WHERE id = ?',
[payment.data.id, orderId]
);
}
// 单笔订单只占连接 20ms 两次,大幅减少持有时间
坑 3:prepared statement 没复用
// 错误:每次都 prepare
async function findUser(id: number) {
const stmt = await pool.prepare('SELECT * FROM users WHERE id = ?');
try {
const [rows] = await stmt.execute([id]);
return rows;
} finally {
await stmt.close(); // close 后下次还要 prepare
}
}
// 正确:用 pool.execute(底层自动缓存 prepared statement)
async function findUser(id: number) {
const [rows] = await pool.execute(
'SELECT * FROM users WHERE id = ?',
[id]
);
return rows;
}
// mysql2 配置:启用 prepared statement 缓存
const pool = mysql.createPool({
// ...
maxPreparedStatements: 16000, // 缓存上限
namedPlaceholders: true // 支持 :name 占位
});
坑 4:连接泄漏
// 经典错误:抛异常没 release
async function badFunction() {
const conn = await pool.getConnection();
const [rows] = await conn.execute('SELECT * FROM users');
if (rows.length === 0) {
throw new Error('no user'); // ❌ 没 release,连接泄漏!
}
conn.release();
return rows;
}
// 修法 1:try/finally
async function goodFunction() {
const conn = await pool.getConnection();
try {
const [rows] = await conn.execute('SELECT * FROM users');
if (rows.length === 0) {
throw new Error('no user');
}
return rows;
} finally {
conn.release();
}
}
// 修法 2:封装高阶函数
async function withConnection(fn: (conn: PoolConnection) => Promise): Promise {
const conn = await pool.getConnection();
try {
return await fn(conn);
} finally {
conn.release();
}
}
await withConnection(async (conn) => {
const [rows] = await conn.execute('SELECT * FROM users');
return rows;
});
// 修法 3:事务封装
async function withTransaction(fn: (conn: PoolConnection) => Promise): Promise {
const conn = await pool.getConnection();
try {
await conn.beginTransaction();
const result = await fn(conn);
await conn.commit();
return result;
} catch (err) {
await conn.rollback();
throw err;
} finally {
conn.release();
}
}
坑 5:read replica 没用
// 主库 + 多个从库的读写分离
const masterPool = mysql.createPool({ host: 'mysql-master', ...});
const slavePool = mysql.createPool({
host: 'mysql-slave',
connectionLimit: 100 // 从库连接数可以更多
});
class DB {
// 写走主库
async insert(table: string, data: any) {
return masterPool.execute(`INSERT INTO ${table} SET ?`, [data]);
}
async update(sql: string, params: any[]) {
return masterPool.execute(sql, params);
}
// 读优先走从库
async query(sql: string, params: any[] = []) {
try {
return await slavePool.execute(sql, params);
} catch (err) {
// 从库挂了降级主库
console.warn('Slave query failed, fallback to master');
return masterPool.execute(sql, params);
}
}
// 强一致读:走主库
async queryConsistent(sql: string, params: any[] = []) {
return masterPool.execute(sql, params);
}
}
// 使用
const db = new DB();
const users = await db.query('SELECT * FROM users WHERE id = ?', [123]); // 走从库
await db.insert('orders', { user_id: 123, amount: 100 }); // 走主库
坑 6:max_user_connections 限制
-- MySQL 用户配额
SHOW VARIABLES LIKE 'max_connections'; -- 全局上限
SHOW VARIABLES LIKE 'max_user_connections'; -- 单用户上限
-- 看当前用户配额
SELECT * FROM mysql.user WHERE User = 'app';
-- max_user_connections | max_connections | max_questions
-- 修改单用户配额
ALTER USER 'app'@'%' WITH MAX_USER_CONNECTIONS 500;
-- 当前活跃连接
SHOW PROCESSLIST;
SELECT host, user, COUNT(*) FROM information_schema.processlist GROUP BY host, user;
连接池监控
// 自定义 metrics
import { Counter, Gauge, Histogram } from 'prom-client';
const dbConnAcquired = new Counter({
name: 'db_conn_acquired_total',
help: 'DB connections acquired'
});
const dbConnReleased = new Counter({
name: 'db_conn_released_total',
help: 'DB connections released'
});
const dbConnActive = new Gauge({
name: 'db_conn_active',
help: 'Current active connections'
});
const dbConnWaiting = new Gauge({
name: 'db_conn_waiting',
help: 'Requests waiting for connection'
});
const dbQueryDuration = new Histogram({
name: 'db_query_duration_seconds',
help: 'Query duration',
labelNames: ['operation', 'table'],
buckets: [0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
// 包装查询函数
async function query(sql: string, params: any[] = [], operation = 'select', table = 'unknown') {
const end = dbQueryDuration.startTimer({ operation, table });
dbConnAcquired.inc();
dbConnActive.inc();
try {
const [rows] = await pool.execute(sql, params);
return rows;
} finally {
dbConnActive.dec();
dbConnReleased.inc();
end();
}
}
// 定期采集 pool 内部状态
setInterval(() => {
const stats = (pool as any).pool; // mysql2 pool 内部对象
dbConnActive.set(stats._allConnections.length);
dbConnWaiting.set(stats._connectionQueue.length);
}, 5000);
告警规则
groups:
- name: nodejs_db
rules:
# 连接池使用率高
- alert: DBConnPoolHigh
expr: db_conn_active / 50 > 0.8
for: 2m
labels: { severity: warning }
# 有请求在排队等连接
- alert: DBConnQueueing
expr: db_conn_waiting > 10
for: 1m
labels: { severity: critical }
# 慢查询变多
- alert: DBSlowQuery
expr: histogram_quantile(0.99, db_query_duration_seconds) > 0.5
for: 5m
labels: { severity: warning }
# 连接失败
- alert: DBConnError
expr: rate(db_conn_error_total[5m]) > 0.1
for: 2m
labels: { severity: critical }
事故后改进
配置调整:
- connectionLimit: 200 → 50(单实例)
- MySQL max_connections: 200 → 500
- MySQL max_user_connections: 0 → 500
业务改造:
- 事务里外部调用全部挪出去(异步消息)
- 慢查询索引补齐
- 读写分离,读 80% 走从库
- 连接释放统一用 withConnection 高阶函数封装
架构调整:
- 大促前预扩容:Node.js 5 → 10 实例
- MySQL 主库 + 3 从库(从 1 变 3)
- 上 PgBouncer 风格的中间代理(ProxySQL)统一连接管理
监控告警:
- 连接池使用率告警阈值 80%
- 慢查询 p99 > 100ms 告警
- 连接等待队列 > 10 告警
效果:
- 大促当晚平稳:订单 5000 QPS,DB 连接使用率峰值 60%
- p99 延迟从 5s → 80ms
- 错误率从 12% → 0.02%
避坑清单
- connectionLimit 公式:QPS × 平均耗时 / 实例数 + 50% buffer
- MySQL max_connections 要 > Node.js 总连接数 × 1.2
- 事务里不要有外部 HTTP 调用,缩短连接持有
- 用 try/finally 或 withConnection 封装,杜绝连接泄漏
- 慢查询要 EXPLAIN,加索引或重写
- 读写分离,从库分担读流量
- 用 pool.execute(自动 prepared 缓存),不要手动 prepare/close
- monitor 连接池活跃数 / 等待队列 / 查询延迟
- 事故前要做压测,知道连接池的极限
- 设置 idleTimeout 避免长时间空闲连接被 MySQL kill
总结
Node.js MySQL 连接池是个"配低了顶不住,配高了 DB 拒绝"的精细活。connectionLimit 不是越大越好,跟 MySQL max_connections 必须协调。事故根因往往不是单点,而是几个问题叠加:慢查询 + 长事务 + 配置不当 = 雪崩。最大的认知改变:连接池不是黑盒,要持续监控连接活跃数 / 等待队列 / 查询时长。大促前一定要压测,把"理论上撑得住"变成"实测撑得住"。这次事故让我们把 DB 监控纳入了 SRE 平台的核心面板。
—— 别看了 · 2026