Automated Monitoring and Alerting – Know Your Server Status in Real Time

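A server went down in the middle of the night and nobody noticed? The website was offline for an hour before anyone found out? I have been through that embarrassment too many times. I eventually built a complete monitoring and alerting system: when CPU runs hot, a disk fills up, or a service dies, a notification arrives over WeChat, email, or SMS within about a minute. Here is the full setup.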

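1. Choosing a monitoring stack

I tried several options:

Zabbix – powerful but heavyweight, with complex configuration
Nagios – long-established and stable, but with a dated interface
Prometheus + Grafana – modern and popular, with a steep learning curve
Shell scripts + cron – lightweight and flexible, a good fit for small teams

Final choice: Prometheus for monitoring + Grafana for dashboards + custom alerting. It offers professional-grade features while leaving the notification channels easy to customize.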

2. Quick Prometheus deployment

Step 1: Create a monitoring user

useradd -m -s /bin/bash prometheus
mkdir -p /opt/prometheus
chown prometheus:prometheus /opt/prometheus

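Step 2: Download and install

cd /tmp
# wget cannot expand a wildcard in a URL, so pin a release explicitly.
# VERSION is an example value; check the Prometheus releases page for the current one.
VERSION=2.53.0
wget https://github.com/prometheus/prometheus/releases/download/v${VERSION}/prometheus-${VERSION}.linux-amd64.tar.gz
tar xvfz prometheus-${VERSION}.linux-amd64.tar.gz
mv prometheus-${VERSION}.linux-amd64 /opt/prometheus/server
chown -R prometheus:prometheus /opt/prometheus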

Step 3: Configure the systemd service

cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/server/prometheus \
  --config.file /opt/prometheus/server/prometheus.yml \
  --storage.tsdb.path /opt/prometheus/data \
  --web.console.templates=/opt/prometheus/server/consoles \
  --web.console.libraries=/opt/prometheus/server/console_libraries

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable prometheus
systemctl start prometheus

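3. Installing Node Exporter (collects server metrics)

Step 1: Download Node Exporter

cd /tmp
# Pin a release explicitly; VERSION is an example value, check the node_exporter releases page.
VERSION=1.8.2
wget https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz
tar xvfz node_exporter-${VERSION}.linux-amd64.tar.gz
mv node_exporter-${VERSION}.linux-amd64 /opt/prometheus/node_exporter
chown -R prometheus:prometheus /opt/prometheus/node_exporter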

Step 2: Create the service file

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/node_exporter/node_exporter

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node_exporter
systemctl start node_exporter

Step 3: Verify it is running

curl http://localhost:9100/metrics
# You should see a large amount of metrics output
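
To check specific values rather than eyeballing the output, the text returned by the /metrics endpoint can be parsed in a few lines. This is a minimal sketch, not a full parser: it assumes samples carry no trailing timestamp (which matches what Node Exporter emits), and the helper name parse_metrics and the sample text are made up for illustration.

```python
def parse_metrics(text):
    """Parse Prometheus text-format lines into {(name, label_string): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        # The value is the last space-separated field (assumption: no timestamp follows it)
        series, _, value = line.rpartition(" ")
        name, _, labels = series.partition("{")
        samples[(name, labels.rstrip("}"))] = float(value)
    return samples


# Illustrative sample; in practice this would be the body returned by
# http://localhost:9100/metrics
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 1.5e+10
"""

metrics = parse_metrics(SAMPLE)
for (name, labels), value in metrics.items():
    print(name, labels, value)
```

The same function can be fed the real output by saving the curl result to a file and reading it back.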

4. Configuring Prometheus scrape jobs

Edit /opt/prometheus/server/prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # Additional servers
  - job_name: 'web-servers'
    static_configs:
      - targets: ['192.168.1.10:9100', '192.168.1.11:9100']
        labels:
          env: 'production'

Restart Prometheus to apply the configuration:

systemctl restart prometheus
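
After a restart it is worth confirming that every target is actually being scraped. Prometheus exposes this through its HTTP API at /api/v1/targets; the sketch below summarizes such a response. The helper names and the sample response are illustrative assumptions, and fetch_targets presumes Prometheus listens on localhost:9090 as configured above.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the target list from a running Prometheus (not called in this sketch)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def summarize_targets(payload):
    """Return (job, instance, health, last_error) for each active target."""
    rows = []
    for target in payload["data"]["activeTargets"]:
        labels = target.get("labels", {})
        rows.append((
            labels.get("job", "?"),
            labels.get("instance", "?"),
            target.get("health", "unknown"),
            target.get("lastError", ""),
        ))
    return rows


# Hand-written sample shaped like the API response, for illustration only
SAMPLE = {
    "status": "success",
    "data": {"activeTargets": [
        {"labels": {"job": "node", "instance": "localhost:9100"},
         "health": "up", "lastError": ""},
        {"labels": {"job": "web-servers", "instance": "192.168.1.11:9100"},
         "health": "down", "lastError": "connection refused"},
    ]},
}

rows = summarize_targets(SAMPLE)
down = [row for row in rows if row[2] != "up"]
for row in rows:
    print(row)
```

Against a live server, summarize_targets(fetch_targets()) gives the same table; any row whose health is not "up" points at a firewall, address, or exporter problem.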

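5. Installing Grafana (visualization)

Step 1: Add the GPG key and repository

apt install -y apt-transport-https wget gnupg
mkdir -p /etc/apt/keyrings
# apt-key is deprecated; store the key in a keyring and reference it with signed-by
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor > /etc/apt/keyrings/grafana.gpg
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" > /etc/apt/sources.list.d/grafana.list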

Step 2: Install Grafana

apt update
apt install grafana -y
systemctl enable grafana-server
systemctl start grafana-server

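Step 3: Open Grafana

Open in a browser: http://<your-server-IP>:3000

Default username and password: admin / admin (Grafana prompts for a new password on first login)

Step 4: Add the Prometheus data source

Go to Configuration → Data Sources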
Add data source → Prometheus
URL: http://localhost:9090
Save & Test
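
The same data source can also be created without clicking through the UI, via Grafana's HTTP API (POST /api/datasources). The following is a rough sketch under assumptions: it uses the default admin / admin credentials and the localhost URLs from this article, and the helper names are invented for illustration. Only the payload is built when the snippet runs; the request function is defined but not called.

```python
import base64
import json
from urllib.request import Request, urlopen


def datasource_payload(name="Prometheus", url="http://localhost:9090"):
    """Build the JSON body describing a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",   # Grafana's backend queries Prometheus on the browser's behalf
        "isDefault": True,
    }


def create_datasource(payload, grafana_url="http://localhost:3000",
                      user="admin", password="admin"):
    """POST the payload to Grafana using HTTP basic auth (assumed credentials)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urlopen(req, timeout=5) as resp:
        return json.load(resp)


payload = datasource_payload()
print(json.dumps(payload, indent=2))
```

Calling create_datasource(payload) on the Grafana host would register the data source; if the admin password has already been changed, pass the new one instead.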

Step 5: Import a dashboard template

Go to Dashboards → Import
Enter template ID: 1860 (Node Exporter Full)
Select the Prometheus data source
Import

6. Configuring alert rules

Create the alert rules file:

mkdir -p /opt/prometheus/server/rules
cat > /opt/prometheus/server/rules/alerts.yml << 'EOF'
groups:
  - name: server_alerts
    interval: 30s
    rules:
      # CPU usage above 80%
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on {{ $labels.instance }} is above 80% (current value: {{ $value }}%)"

      # Memory usage above 85%
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} is above 85%"

      # Disk usage above 90%
      - alert: HighDiskUsage
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Root filesystem usage on {{ $labels.instance }} is above 90%"

      # Server down (matches every scrape job, not only 'node')
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Server offline"
          description: "{{ $labels.instance }} has been unreachable for more than 1 minute"
EOF
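
The three threshold expressions are plain arithmetic on raw metrics, and working one example by hand makes the PromQL easier to read. The sketch below mirrors them in Python; the function names and input numbers are invented for illustration, and the CPU case is simplified to a single CPU with two counter readings.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Mirror of 100 - rate(idle counter) * 100 for one CPU.

    node_cpu_seconds_total{mode="idle"} counts seconds spent idle, so its
    per-second rate is the fraction of time the CPU was idle.
    """
    idle_fraction = (idle_end - idle_start) / window_seconds
    return 100 - idle_fraction * 100


def memory_usage_percent(available_bytes, total_bytes):
    """Mirror of (1 - MemAvailable / MemTotal) * 100."""
    return (1 - available_bytes / total_bytes) * 100


def disk_usage_percent(avail_bytes, size_bytes):
    """Mirror of (1 - avail / size) * 100 for one mount point."""
    return (1 - avail_bytes / size_bytes) * 100


GIB = 1024 ** 3

# The idle counter advanced by 45 s during a 300 s (5m) window: idle 15% of the time
cpu = cpu_usage_percent(idle_start=1000.0, idle_end=1045.0, window_seconds=300)
mem = memory_usage_percent(available_bytes=2 * GIB, total_bytes=16 * GIB)
disk = disk_usage_percent(avail_bytes=5 * GIB, size_bytes=100 * GIB)

print(f"cpu={cpu:.1f}% fires={cpu > 80}")
print(f"mem={mem:.1f}% fires={mem > 85}")
print(f"disk={disk:.1f}% fires={disk > 90}")
```

With these made-up inputs all three conditions hold; in Prometheus each would still have to stay true for the whole for: duration before the alert fires.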

Enable the rules and the Alertmanager target in prometheus.yml (restart Prometheus afterwards so they are loaded):

rule_files:
  - "rules/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

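7. Installing Alertmanager (sends alert notifications)

Step 1: Download Alertmanager

cd /tmp
# Pin a release explicitly; VERSION is an example value, check the Alertmanager releases page.
VERSION=0.27.0
wget https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/alertmanager-${VERSION}.linux-amd64.tar.gz
tar xvfz alertmanager-${VERSION}.linux-amd64.tar.gz
mv alertmanager-${VERSION}.linux-amd64 /opt/prometheus/alertmanager
chown -R prometheus:prometheus /opt/prometheus/alertmanager
# Alertmanager must be running before anything is delivered, e.g. a unit modelled on node_exporter.service with
# ExecStart=/opt/prometheus/alertmanager/alertmanager --config.file=/opt/prometheus/alertmanager/alertmanager.yml --storage.path=/opt/prometheus/alertmanager/data

Step 2: Configure notification channels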

cat > /opt/prometheus/alertmanager/alertmanager.yml << 'EOF'
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'wechat-alert'

receivers:
  # WeCom (Enterprise WeChat) notification
  - name: 'wechat-alert'
    webhook_configs:
      - url: 'http://localhost:8080/wechat'  # custom webhook service
        send_resolved: true

  # Email notification (the addresses below are obscured in the source; substitute your own)
  - name: 'email-alert'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.qq.com:465'
        auth_username: '[email protected]'
        auth_password: 'your_smtp_password'
        send_resolved: true
EOF
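
Rather than waiting for a real incident, the notification path can be exercised by pushing a hand-made alert into Alertmanager through its v2 API (POST /api/v2/alerts). This is a sketch under assumptions: Alertmanager is already running on localhost:9093, and the alert name ManualTest, the instance label, and the helper names are hypothetical. The send function is defined but not called when the snippet runs.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib.request import Request, urlopen


def build_test_alert(instance="test-host:9100", minutes=5, now=None):
    """Build a one-element alert list that expires after `minutes`."""
    now = now or datetime.now(timezone.utc)
    return [{
        "labels": {
            "alertname": "ManualTest",      # hypothetical name, not one of the rules above
            "instance": instance,
            "severity": "warning",
        },
        "annotations": {
            "summary": "Manual test alert",
            "description": f"Test alert for {instance}; safe to ignore",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alerts(alerts, base_url="http://localhost:9093"):
    """POST alerts to Alertmanager and return the HTTP status code."""
    req = Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=5) as resp:
        return resp.status


fixed_now = datetime(2024, 1, 1, tzinfo=timezone.utc)
alerts = build_test_alert(now=fixed_now)
print(json.dumps(alerts, indent=2))
```

Running send_alerts(build_test_alert()) on the monitoring host should produce a notification on the default receiver after group_wait has elapsed.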

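8. In practice: pushing alerts to WeChat

A simple webhook service written in Python bridges Alertmanager to WeCom (Enterprise WeChat):

#!/usr/bin/env python3
from flask import Flask, request, jsonify
import requests

app = Flask(__name__)

# WeCom (Enterprise WeChat) group robot webhook URL
WECHAT_WEBHOOK = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY"

@app.route('/wechat', methods=['POST'])
def wechat_alert():
    # silent=True yields None instead of an error when the body is not valid JSON
    data = request.get_json(silent=True) or {}
    alerts = data.get('alerts', [])

    for alert in alerts:
        # Use .get() so an alert without a given label or annotation does not crash the handler
        labels = alert.get('labels', {})
        annotations = alert.get('annotations', {})
        status = "🔴 Firing" if alert.get('status') == 'firing' else "🟢 Resolved"
        # WeCom markdown uses **text** for bold
        message = f"{status}\n\n"
        message += f"**Alert:** {labels.get('alertname', 'unknown')}\n"
        message += f"**Server:** {labels.get('instance', 'unknown')}\n"
        message += f"**Details:** {annotations.get('description', '')}\n"

        payload = {
            "msgtype": "markdown",
            "markdown": {"content": message}
        }

        # A timeout keeps a slow WeCom endpoint from blocking the handler indefinitely
        requests.post(WECHAT_WEBHOOK, json=payload, timeout=5)

    return jsonify({"status": "ok"})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)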

Save the script as wechat_webhook.py and run the service:

pip install flask requests
python3 wechat_webhook.py &
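
To try the service without waiting for a real alert, a payload shaped like the one Alertmanager sends to webhook receivers can be posted to it by hand. The sketch below builds such a payload; the field values are invented, the helper names are hypothetical, and the posting function (defined but not called here) assumes the service listens on localhost:8080 as above.

```python
import json
from urllib.request import Request, urlopen


def sample_webhook_payload(status="firing"):
    """Build a payload in the shape Alertmanager posts to webhook receivers."""
    alert = {
        "status": status,
        "labels": {"alertname": "HighCPUUsage",
                   "instance": "192.168.1.10:9100",
                   "severity": "warning"},
        "annotations": {"summary": "High CPU usage",
                        "description": "CPU usage is above 80% (sample data)"},
        "startsAt": "2024-01-01T00:00:00Z",
        "endsAt": "0001-01-01T00:00:00Z",
    }
    return {
        "version": "4",
        "status": status,
        "receiver": "wechat-alert",
        "groupLabels": {"alertname": "HighCPUUsage"},
        "commonLabels": alert["labels"],
        "commonAnnotations": alert["annotations"],
        "alerts": [alert],
    }


def post_to_webhook(payload, url="http://localhost:8080/wechat"):
    """POST the payload to the local webhook service and return the HTTP status."""
    req = Request(url, data=json.dumps(payload).encode("utf-8"),
                  headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req, timeout=5) as resp:
        return resp.status


firing = sample_webhook_payload("firing")
resolved = sample_webhook_payload("resolved")
print(json.dumps(firing, indent=2))
```

Note that post_to_webhook(firing) makes the service forward a real message to the WeCom group if a valid key is configured, so it is best tried against a test group first.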

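9. Pitfalls I ran into

Pitfall 1: Alert storms

At first I had not configured repeat_interval sensibly, so alerts kept re-sending and my phone was flooded. Fix: set a reasonable repeat interval (1 hour).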
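
A rough count shows how much the repeat interval matters. The sketch below is a deliberately simplified model, not Alertmanager's actual scheduling: it assumes one notification when the alert starts firing and one more every repeat interval until it resolves, ignoring grouping and group_interval alignment. The function name and the 8-hour outage are made up for illustration.

```python
def notification_count(firing_seconds, repeat_interval_seconds):
    """Notifications sent at t = 0, r, 2r, ... strictly before the alert resolves."""
    # Integer ceiling division: number of multiples of r in [0, firing_seconds)
    return -(-firing_seconds // repeat_interval_seconds)


OUTAGE = 8 * 3600  # an alert that keeps firing overnight for 8 hours

hourly = notification_count(OUTAGE, 3600)   # repeat_interval: 1h
per_minute = notification_count(OUTAGE, 60)  # repeat_interval: 1m

print(f"repeat_interval=1h -> {hourly} notifications")
print(f"repeat_interval=1m -> {per_minute} notifications")
```

Under this model the same unattended alert costs a handful of messages with a one-hour interval and several hundred with a one-minute interval.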

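Pitfall 2: Too many false alarms

A momentary CPU spike also triggered an alert even though it had no real impact. Fix: add a for condition so that the alert fires only after the threshold has been exceeded continuously for 5 minutes.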
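
The effect of the for condition can be seen in a small simulation. The sketch below models the documented behaviour (an alert is pending while its condition holds, becomes firing once it has held for the for duration, and resets as soon as the condition is false at an evaluation); the function name, the one-minute evaluation interval, and the sample values are assumptions for illustration.

```python
def alert_states(values, threshold, for_seconds, eval_interval):
    """Return the alert state at each evaluation of a `value > threshold` rule."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * eval_interval
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# CPU %, evaluated once a minute: a one-off spike, then a sustained overload
cpu = [95, 50, 90, 90, 90, 90, 90, 90]
states = alert_states(cpu, threshold=80, for_seconds=300, eval_interval=60)

for minute, (value, state) in enumerate(zip(cpu, states)):
    print(f"t={minute}m cpu={value}% -> {state}")
```

The spike at the first evaluation never gets past pending, while the sustained overload turns into a firing alert only at the eighth evaluation, five minutes after it began.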

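Pitfall 3: Time zone issues

Grafana showed the wrong time. Fix: set the time zone to Asia/Shanghai in the Grafana configuration file (the default_timezone option under [date_formats] in grafana.ini).

10. Going further: unified monitoring of multiple servers

With multiple servers, install Node Exporter on each one and then add every target to the Prometheus configuration file: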

scrape_configs:
  - job_name: 'all-servers'
    static_configs:
      - targets:
          - '192.168.1.10:9100'  # Web server
          - '192.168.1.11:9100'  # Database server
          - '192.168.1.12:9100'  # Cache server
        labels:
          env: 'production'
          team: 'ops'
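
Once the host list grows, typing targets by hand gets error-prone, and the fragment can be generated from an inventory instead. The sketch below renders a static_configs job like the one above from a Python list; the function name and the inventory format are assumptions for illustration, and the output still has to be pasted under scrape_configs by hand.

```python
def render_scrape_job(job_name, hosts, labels, port=9100):
    """Render one scrape job as YAML text from (address, role) pairs."""
    lines = [
        f"  - job_name: '{job_name}'",
        "    static_configs:",
        "      - targets:",
    ]
    for address, role in hosts:
        lines.append(f"          - '{address}:{port}'  # {role}")
    lines.append("        labels:")
    for key, value in labels.items():
        lines.append(f"          {key}: '{value}'")
    return "\n".join(lines)


# Illustrative inventory matching the example addresses above
HOSTS = [
    ("192.168.1.10", "Web server"),
    ("192.168.1.11", "Database server"),
    ("192.168.1.12", "Cache server"),
]

snippet = render_scrape_job("all-servers", HOSTS, {"env": "production", "team": "ops"})
print("scrape_configs:")
print(snippet)
```

Keeping the inventory in one place also makes it easy to reuse the same list for other tasks, such as checking that port 9100 is reachable on every host.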

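Summary

Setting up this monitoring system took about 2 hours, and the payoff is substantial:

✅ Server status visible at a glance
✅ Faults noticed within about a minute
✅ Alerts pushed over multiple channels
✅ Historical data available for review
✅ Performance bottlenecks made visible

For operations work, monitoring is your eyes and ears; this setup is worth the investment.

Source: https://mjj.728.hk/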