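A server goes down in the middle of the night and nobody notices? A website is offline for an hour before anyone finds out? I have been through that embarrassment too many times. So I built a complete monitoring and alerting stack: when CPU usage crosses a threshold, a disk fills up, or a service dies, a notification arrives by WeChat, email, or SMS within about a minute. Here is the full setup.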
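1. Choosing a monitoring stack
I tried several options:
- Zabbix: powerful, but heavyweight and complex to configure
- Nagios: long-established and stable, but with a dated interface
- Prometheus + Grafana: modern and popular, but with a steep learning curve
- Shell scripts + cron: lightweight and flexible, a good fit for small teams
Final choice: Prometheus for monitoring, Grafana for dashboards, and custom alerting. That gives professional-grade features while keeping the notification channels easy to customize.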
2. Quick Prometheus deployment
Step 1: Create a monitoring user (run the commands in this guide as root or with sudo)
useradd -m -s /bin/bash prometheus
mkdir -p /opt/prometheus
chown prometheus:prometheus /opt/prometheus
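Step 2: Download and install
The release file name contains the version number, and wget does not expand wildcards in a URL, so set the version explicitly (pick one from the GitHub releases page):
cd /tmp
PROM_VERSION="X.Y.Z"  # replace with a real release version
wget "https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz"
tar xvfz "prometheus-${PROM_VERSION}.linux-amd64.tar.gz"
mv "prometheus-${PROM_VERSION}.linux-amd64" /opt/prometheus/server
chown -R prometheus:prometheus /opt/prometheus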
Step 3: Configure the systemd service
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/server/prometheus \
  --config.file /opt/prometheus/server/prometheus.yml \
  --storage.tsdb.path /opt/prometheus/data \
  --web.console.templates=/opt/prometheus/server/consoles \
  --web.console.libraries=/opt/prometheus/server/console_libraries

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus
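3. Installing Node Exporter (collecting host metrics)
Step 1: Download Node Exporter
cd /tmp
NODE_EXPORTER_VERSION="X.Y.Z"  # replace with a real release version
wget "https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
tar xvfz "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz"
mv "node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64" /opt/prometheus/node_exporter
chown -R prometheus:prometheus /opt/prometheus/node_exporter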
Step 2: Create the service file
cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter
Step 3: Verify that it is running
curl http://localhost:9100/metrics
# You should see a long list of metrics in the output
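Scrolling through the curl output by eye gets tedious, so here is a minimal Python sketch that parses the metrics text and lists which metric names are present. The helper names (parse_metrics, metric_names, fetch_metrics) and the SAMPLE text are illustrative assumptions, and the parser assumes lines carry no trailing timestamp, which holds for Node Exporter output.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text-format samples into {series: value}.

    Assumes no trailing timestamps, so the value is always the last
    space-separated token on each sample line.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def metric_names(samples):
    """Return the set of metric names with any {label} part removed."""
    return {series.split("{", 1)[0] for series in samples}


def fetch_metrics(url="http://localhost:9100/metrics"):
    """Download the raw metrics text from a running exporter."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


# Hypothetical excerpt of exporter output, used here instead of a live host.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 12345.67
node_memory_MemTotal_bytes 8.589934592e+09
"""

parsed = parse_metrics(SAMPLE)
print(sorted(metric_names(parsed)))
```

Against a live host, replace SAMPLE with fetch_metrics() and check that names such as node_cpu_seconds_total and node_filesystem_avail_bytes appear, since the alert rules later in this article depend on them.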
4. Configuring Prometheus scrape jobs
Edit /opt/prometheus/server/prometheus.yml:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  # Additional servers
  - job_name: 'web-servers'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          env: 'production'
Restart Prometheus to apply the configuration:
systemctl restart prometheus
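After the restart it is worth confirming that every target is actually being scraped. The sketch below reads the JSON returned by the Prometheus HTTP API endpoint /api/v1/targets and lists targets whose health is not "up". The function names and the SAMPLE payload are illustrative assumptions; only the fields used here (activeTargets, labels, health, lastError) are taken from the API response format.

```python
import json
import urllib.request


def unhealthy_targets(payload):
    """Return (job, instance, lastError) for every active target that is not up."""
    bad = []
    for target in payload.get("data", {}).get("activeTargets", []):
        if target.get("health") != "up":
            labels = target.get("labels", {})
            bad.append((labels.get("job"), labels.get("instance"),
                        target.get("lastError", "")))
    return bad


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus server."""
    with urllib.request.urlopen(base_url + "/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Hypothetical response trimmed to the fields used above.
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "node", "instance": "localhost:9100"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "web-servers", "instance": "192.168.1.11:9100"},
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

problems = unhealthy_targets(SAMPLE)
for job, instance, error in problems:
    print(f"{job} {instance}: {error}")
```

With a live server, call unhealthy_targets(fetch_targets()) instead; an empty list means every configured target answered its last scrape.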
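5. Installing Grafana (visualization)
Step 1: Add the GPG key and the APT repository (the key goes first; apt-key is deprecated, so the key is stored in a keyring file instead)
apt install -y apt-transport-https software-properties-common wget
mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" > /etc/apt/sources.list.d/grafana.list
Step 2: Install Grafana
apt update
apt install grafana -y
systemctl enable grafana-server
systemctl start grafana-server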
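Step 3: Open Grafana
In a browser, open http://<your-server-IP>:3000
Default credentials: admin / admin (Grafana asks you to change the password on first login)
Step 4: Add the Prometheus data source
- Go to Configuration → Data Sources (Connections → Data sources in newer Grafana versions)
- Add data source → Prometheus
- URL: http://localhost:9090
- Save & Test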
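If you would rather script this step than click through the UI, Grafana's HTTP API accepts a POST to /api/datasources. The sketch below only builds the request; the function name, the data source name "Prometheus", and the use of basic auth with the admin account are assumptions for illustration, so adapt the credentials to your setup.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, user, password, prometheus_url):
    """Build (but do not send) a request that registers a Prometheus data source."""
    body = {
        "name": "Prometheus",      # display name in Grafana (assumed)
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",         # Grafana's backend queries Prometheus
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


req = build_datasource_request(
    "http://localhost:3000", "admin", "admin", "http://localhost:9090"
)
print(req.get_method(), req.full_url)
```

To actually create the data source, pass req to urllib.request.urlopen(req, timeout=5) once Grafana is running.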
Step 5: Import a dashboard template
- Go to Dashboards → Import
- Enter template ID 1860 (the full Node Exporter dashboard)
- Select the Prometheus data source
- Import
6. Configuring alert rules
Create the rules directory and the alert rule file:
mkdir -p /opt/prometheus/server/rules
cat > /opt/prometheus/server/rules/alerts.yml << 'EOF'
groups:
  - name: server_alerts
    interval: 30s
    rules:
      # CPU usage above 80% (rate() is preferred over irate() for alerting)
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: 'CPU usage on {{ $labels.instance }} is above 80% (current value: {{ printf "%.1f" $value }}%)'
      # Memory usage above 85%
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} is above 85%"
      # Disk usage above 90%
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Root filesystem usage on {{ $labels.instance }} is above 90%"
      # Server down (matches every scrape job, not only 'node')
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Server offline"
          description: "{{ $labels.instance }} has been offline for more than 1 minute"
EOF
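The rule expressions are easier to trust once the arithmetic is spelled out. The sketch below reproduces the CPU, memory, and disk formulas with plain numbers; the function names and the sample figures are made up for illustration and are not read from a real host.

```python
import math


def cpu_usage_percent(idle_seconds_gained, window_seconds):
    """Mirror the CPU rule: 100 - avg(per-second idle rate) * 100.

    idle_seconds_gained holds, per core, how much the idle counter
    (node_cpu_seconds_total{mode="idle"}) grew during the window.
    """
    idle_rates = [gained / window_seconds for gained in idle_seconds_gained]
    return 100 - (sum(idle_rates) / len(idle_rates)) * 100


def used_percent(available, total):
    """Mirror the memory and disk rules: (1 - available / total) * 100."""
    return (1 - available / total) * 100


# Two cores that were idle for 30 s and 60 s of a 300 s (5m) window.
cpu = cpu_usage_percent([30.0, 60.0], 300)
# 2 GiB available out of 16 GiB of RAM.
mem = used_percent(2 * 1024**3, 16 * 1024**3)
# 5 GB available on a 100 GB root filesystem.
disk = used_percent(5e9, 100e9)

print(f"cpu={cpu:.1f}% mem={mem:.1f}% disk={disk:.1f}%")
print("cpu alert:", cpu > 80, "| memory alert:", mem > 85, "| disk alert:", disk > 90)
```

All three sample values cross their thresholds, so each rule would move to pending and then fire once the condition has held for the rule's for duration.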
Enable the rules and point Prometheus at Alertmanager by adding the following to prometheus.yml, then restart Prometheus:
rule_files:
  - "rules/*.yml"
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
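To confirm that Prometheus really loaded the rule file, you can query its /api/v1/rules endpoint. This sketch extracts each alerting rule and its current state from that response; the helper names and the SAMPLE payload are assumptions for illustration, trimmed to the fields the code reads.

```python
import json
import urllib.request


def alert_rule_states(payload):
    """Map alerting rule name -> state (inactive, pending, or firing)."""
    states = {}
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting":
                states[rule["name"]] = rule.get("state", "unknown")
    return states


def fetch_rules(base_url="http://localhost:9090"):
    """Fetch the loaded rule groups from a running Prometheus server."""
    with urllib.request.urlopen(base_url + "/api/v1/rules", timeout=5) as resp:
        return json.load(resp)


# Hypothetical response for the server_alerts group.
SAMPLE = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "server_alerts",
                "rules": [
                    {"name": "HighCPUUsage", "type": "alerting", "state": "inactive"},
                    {"name": "InstanceDown", "type": "alerting", "state": "firing"},
                ],
            }
        ]
    },
}

states = alert_rule_states(SAMPLE)
print(states)
```

If a rule you expect is missing from alert_rule_states(fetch_rules()), the rule_files glob or the YAML indentation is the first thing to check.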
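7. Installing Alertmanager (sending alert notifications)
Step 1: Download Alertmanager
cd /tmp
ALERTMANAGER_VERSION="X.Y.Z"  # replace with a real release version
wget "https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
tar xvfz "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz"
mv "alertmanager-${ALERTMANAGER_VERSION}.linux-amd64" /opt/prometheus/alertmanager
chown -R prometheus:prometheus /opt/prometheus/alertmanager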
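Step 2: Configure the notification channels
cat > /opt/prometheus/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat-alert'
receivers:
  # WeCom (enterprise WeChat) notifications
  - name: 'wechat-alert'
    webhook_configs:
      - url: 'http://localhost:8080/wechat'  # custom webhook service
        send_resolved: true
  # Email notifications (not referenced by any route yet; add a child route to use it)
  - name: 'email-alert'
    email_configs:
      - to: 'ops@example.com'  # placeholder: your recipient address
        from: 'alerts@example.com'  # placeholder: your sender address
        smarthost: 'smtp.qq.com:465'
        auth_username: 'alerts@example.com'  # placeholder: your SMTP account
        auth_password: 'your_smtp_password'
        send_resolved: true
EOF
Alertmanager has to be running before any notification goes out. One way is a systemd unit modeled on node_exporter.service whose ExecStart is /opt/prometheus/alertmanager/alertmanager --config.file=/opt/prometheus/alertmanager/alertmanager.yml --storage.path=/opt/prometheus/alertmanager/data.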
8. In practice: pushing alerts to WeChat
A small Python webhook service bridges Alertmanager and WeCom (企业微信, enterprise WeChat):
#!/usr/bin/env python3
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

# WeCom group-bot webhook URL
WECHAT_WEBHOOK = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"

@app.route('/wechat', methods=['POST'])
def wechat_alert():
    # Alertmanager posts a JSON body containing a list of alerts
    data = request.get_json(silent=True) or {}
    for alert in data.get('alerts', []):
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        status = "🔴 Firing" if alert.get('status') == 'firing' else "🟢 Resolved"
        message = f"{status}\n\n"
        message += f"**Alert:** {labels.get('alertname', 'unknown')}\n"
        message += f"**Server:** {labels.get('instance', 'unknown')}\n"
        message += f"**Details:** {annotations.get('description', '')}\n"
        payload = {
            "msgtype": "markdown",
            "markdown": {"content": message}
        }
        # Time out instead of hanging if WeCom is unreachable
        requests.post(WECHAT_WEBHOOK, json=payload, timeout=10)
    return jsonify({"status": "ok"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
Run the service:
pip install flask requests
python3 wechat_webhook.py &
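Rather than waiting for a real incident, you can push a hand-made alert through the whole chain. The sketch below builds a test alert and posts it to Alertmanager's /api/v2/alerts endpoint, assuming Alertmanager is running on localhost:9093 and routes to the webhook above. The alert name ManualTest, the instance label, and the function names are invented for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(instance="test-host:9100", minutes=5, now=None):
    """Build a one-element alert list in the shape Alertmanager's v2 API accepts."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "ManualTest",   # hypothetical alert name
            "instance": instance,
            "severity": "warning",
        },
        "annotations": {"description": "Manual test alert, safe to ignore"},
        "startsAt": now.isoformat(),
        # The alert resolves by itself once endsAt has passed.
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_test_alert(alerts, base_url="http://localhost:9093"):
    """POST the alerts to a running Alertmanager and return the HTTP status."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


example = build_test_alert(now=datetime(2024, 1, 1, tzinfo=timezone.utc))
print(json.dumps(example, indent=2))
```

Calling send_test_alert(build_test_alert()) should produce a WeCom message after group_wait has elapsed; if nothing arrives, check the Alertmanager log and the webhook service output in that order.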
9. Pitfalls I ran into
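Pitfall 1: Alert storms
At first I had not configured repeat_interval, so alerts kept being sent and my phone was flooded. Fix: set a sensible repeat interval (1 hour).
Pitfall 2: Too many false positives
A momentary CPU spike triggered an alert even though it had no real impact. Fix: add a for condition so that an alert fires only after the threshold has been exceeded for 5 minutes.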
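The effect of the for condition can be sketched as a small simulation: an alert becomes pending the first time the expression is true and fires only if it stays true for the whole duration. This is a simplified model of that behavior, not Prometheus's actual code, and the function name and sample data are invented for illustration.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the evaluation time at which the alert fires, or None.

    samples is an ordered list of (timestamp_seconds, value) pairs,
    one per rule evaluation.
    """
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp      # alert enters the pending state
            if timestamp - pending_since >= for_seconds:
                return timestamp               # held long enough: firing
        else:
            pending_since = None               # condition cleared, reset
    return None


# A one-minute CPU spike evaluated every 30 s: never fires with for: 5m.
spike = [(0, 95), (30, 95), (60, 20), (90, 20)]
# Sustained load for 330 s: fires at the 300 s evaluation.
sustained = [(t, 95) for t in range(0, 331, 30)]

print(first_firing_time(spike, threshold=80, for_seconds=300))
print(first_firing_time(sustained, threshold=80, for_seconds=300))
```

The spike resets the pending state as soon as usage drops, which is why short bursts no longer page anyone.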
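Pitfall 3: Time zone issues
Grafana displayed the wrong time. Fix: set the time zone to Asia/Shanghai in the Grafana configuration file (the default_timezone option in grafana.ini).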
10. Going further: unified monitoring across multiple servers
With several servers, install Node Exporter on each one, then list every target in the Prometheus configuration file:
scrape_configs:
  - job_name: 'all-servers'
    static_configs:
      - targets:
          - '192.168.1.10:9100'  # Web server
          - '192.168.1.11:9100'  # Database server
          - '192.168.1.12:9100'  # Cache server
        labels:
          env: 'production'
          team: 'ops'
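Once the host list grows, typing targets by hand invites indentation mistakes. As a rough sketch, the snippet below renders the same scrape job from a Python list; the function name render_scrape_job and the inventory layout are assumptions for illustration, and the output is plain text meant to be pasted under scrape_configs.

```python
def render_scrape_job(job_name, hosts, labels, port=9100):
    """Render one scrape job as YAML text, indented to sit under scrape_configs."""
    lines = [
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        "      - targets:",
    ]
    for host, role in hosts:
        lines.append(f"          - '{host}:{port}'  # {role}")
    lines.append("        labels:")
    for key, value in labels.items():
        lines.append(f"          {key}: '{value}'")
    return "\n".join(lines)


# Hypothetical inventory matching the example above.
inventory = [
    ("192.168.1.10", "Web server"),
    ("192.168.1.11", "Database server"),
    ("192.168.1.12", "Cache server"),
]

snippet = render_scrape_job("all-servers", inventory,
                            {"env": "production", "team": "ops"})
print("scrape_configs:")
print(snippet)
```

Adding a server then means appending one tuple to the inventory and regenerating the block.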
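Summary
Setting up this monitoring stack took about 2 hours, and the payoff is large:
- ✅ Server status visible at a glance
- ✅ Failures detected within about a minute
- ✅ Alerts pushed through multiple channels
- ✅ Historical data available for review
- ✅ Performance bottlenecks made visible
For anyone doing operations work, monitoring is your eyes and ears, and this setup is worth the investment.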
Source: https://mjj.728.hk/