
[Cloud Platform Monitoring] Prometheus Alertmanager YAML

Deploying Alertmanager to send email alerts

  1. Install and configure Alertmanager

    cd /opt
    tar xf alertmanager-0.24.0.linux-amd64.tar.gz
    mv alertmanager-0.24.0.linux-amd64 /usr/local/alertmanager
    
  2. Edit the configuration file alertmanager.yml

    global:
      resolve_timeout: 5m
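      smtp_smarthost: 'smtp.qq.com:465'  # implicit-TLS (SMTPS) port
      smtp_from: '[email protected]'
      smtp_auth_username: '[email protected]'
      smtp_auth_password: 'zxnlltckqkrxxxcc'  # QQ Mail authorization code, not the login password
      smtp_require_tls: false  # port 465 is TLS from connect; true would demand STARTTLS and fail here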
    route:
      group_by: ['alertname']
      group_wait: 20s
      group_interval: 5m
      repeat_interval: 20m
      receiver: 'my-email'
    receivers:
    - name: 'my-email'
      email_configs:
      - to: '[email protected]'
        send_resolved: true
    
  3. Create the systemd service and start it

    cat > /usr/lib/systemd/system/alertmanager.service <<EOF
    [Unit]
    Description=Alertmanager
    After=network.target
    
    [Service]
    Type=simple
    ExecStart=/usr/local/alertmanager/alertmanager \
      --config.file=/usr/local/alertmanager/alertmanager.yml \
      --log.level=debug
    Restart=on-failure
    
    [Install]
    WantedBy=multi-user.target
    EOF
    
    systemctl daemon-reload
    systemctl start alertmanager
    systemctl enable alertmanager
    
  4. Verify that Alertmanager is running

    netstat -tulnp | grep 9093
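
Beyond checking the listening port, it can help to validate the configuration file and probe the health endpoint. The following is a minimal sketch, assuming the install path used above and the default listen port 9093. The function name check_alertmanager and the variables AM_HOME and AM_URL are hypothetical helpers introduced here for illustration; amtool is the CLI that ships in the same release tarball, and /-/healthy is Alertmanager's health endpoint.

```shell
#!/bin/sh
# Assumed locations; adjust if Alertmanager is installed or bound elsewhere.
AM_HOME="${AM_HOME:-/usr/local/alertmanager}"
AM_URL="${AM_URL:-http://127.0.0.1:9093}"

# Hypothetical helper: returns 0 if the config parses and the endpoint answers,
# 1 if either check fails, 2 if amtool is not installed where expected.
check_alertmanager() {
    if [ ! -x "$AM_HOME/amtool" ]; then
        echo "amtool not found under $AM_HOME" >&2
        return 2
    fi
    # Static validation of alertmanager.yml (syntax, receivers, templates)
    "$AM_HOME/amtool" check-config "$AM_HOME/alertmanager.yml" || return 1
    # The health endpoint answers with HTTP 200 once the process is up
    curl -fsS "$AM_URL/-/healthy" >/dev/null || return 1
    echo "alertmanager config valid and endpoint healthy"
}
```

Calling check_alertmanager after each configuration change catches YAML mistakes before they surface as missing notifications.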
    

Configuring Prometheus alerting rules

  1. Create the alerting rule file

    mkdir -p /usr/local/prometheus/alert_rules
    vim /usr/local/prometheus/alert_rules/instance_down.yaml
    
    groups:
    - name: AllInstances
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        annotations:
          title: 'Instance down'
          description: 'Instance has been down for more than 1 minute.'
        labels:
          severity: 'critical'
    
  2. Edit the Prometheus configuration file

    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['192.168.80.30:9093']
    rule_files:
      - "/usr/local/prometheus/alert_rules/*.yaml"
    
    systemctl reload prometheus
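
A syntax error in a rule file or in the main configuration can make the reload fail, so a check beforehand is worthwhile. This is a minimal sketch, assuming promtool sits next to the Prometheus binary under /usr/local/prometheus and that the main configuration is named prometheus.yml there (the source does not state that path). validate_prometheus and PROM_HOME are hypothetical names.

```shell
#!/bin/sh
# Assumed install directory; the rule directory matches the one created above.
PROM_HOME="${PROM_HOME:-/usr/local/prometheus}"

# Hypothetical helper: returns 0 if rules and config validate,
# 1 if promtool reports a problem, 2 if promtool is missing.
validate_prometheus() {
    if [ ! -x "$PROM_HOME/promtool" ]; then
        echo "promtool not found under $PROM_HOME" >&2
        return 2
    fi
    # Check every rule file matched by the rule_files glob
    "$PROM_HOME/promtool" check rules "$PROM_HOME"/alert_rules/*.yaml || return 1
    # Check the main configuration (assumed file name: prometheus.yml)
    "$PROM_HOME/promtool" check config "$PROM_HOME/prometheus.yml" || return 1
}
```

Running validate_prometheus before the reload keeps a broken rule file from being applied to the running server.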
    

Configuring DingTalk alerts

  1. Deploy the DingTalk webhook plugin

    cd /opt
    tar xf prometheus-webhook-dingtalk-2.1.0.linux-amd64.tar.gz
    mv prometheus-webhook-dingtalk-2.1.0.linux-amd64 /usr/local/dingtalk
    cd /usr/local/dingtalk
    
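  2. Configure the DingTalk robot

    • Add a custom robot to the DingTalk group and note its Webhook URL and signing secret.
  3. Edit the plugin configuration file config.yml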

    targets:
      webhook1:
        url: https://oapi.dingtalk.com/robot/send?access_token=your_token
        secret: your_secret
    
  4. Start the DingTalk service

    nohup ./prometheus-webhook-dingtalk --config.file=config.yml > prometheus-webhook-dingtalk.log 2>&1 &
    
  5. Edit the Alertmanager configuration

    route:
      receiver: 'dingding.webhook1'
    receivers:
    - name: 'dingding.webhook1'
      webhook_configs:
      - url: 'http://192.168.80.30:8060/dingtalk/webhook1/send'
        send_resolved: true
    
    systemctl restart alertmanager  # the unit above defines no ExecReload, so reload is not available
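
The signing secret obtained with the robot is what DingTalk's "sign" security mode uses: the request carries a millisecond timestamp and an HMAC-SHA256 signature over the timestamp, a newline, and the secret, Base64-encoded and then URL-encoded. The webhook plugin is expected to compute this itself when secret is set in config.yml, so the sketch below is only for understanding or for testing the robot by hand. It assumes openssl is installed; dingtalk_sign and dingtalk_signed_url are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helper: print the URL-encoded signature.
# $1 = timestamp in milliseconds, $2 = signing secret
dingtalk_sign() {
    printf '%s\n%s' "$1" "$2" \
        | openssl dgst -sha256 -hmac "$2" -binary \
        | openssl base64 -A \
        | sed -e 's/+/%2B/g' -e 's/\//%2F/g' -e 's/=/%3D/g'
}

# Hypothetical helper: append timestamp and sign to a webhook URL.
# $1 = webhook URL including access_token, $2 = signing secret
dingtalk_signed_url() {
    ts=$(( $(date +%s) * 1000 ))   # seconds to milliseconds
    printf '%s&timestamp=%s&sign=%s\n' "$1" "$ts" "$(dingtalk_sign "$ts" "$2")"
}
```

For example, dingtalk_signed_url 'https://oapi.dingtalk.com/robot/send?access_token=your_token' your_secret prints a URL that can be used with curl to post a test message directly to the robot.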
    

Testing alerts

  1. Trigger an instance-down alert

    systemctl stop node_exporter
    
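    • Wait a little over 1 minute (the rule's for: 1m plus Alertmanager's group_wait), then check the mailbox and the DingTalk group, depending on which receiver the route uses.
  2. Check the alert status

    • Open the Prometheus UI: http://<Prometheus-IP>:9090/alerts
    • Open the Alertmanager UI: http://<Alertmanager-IP>:9093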
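
Stopping a real exporter is not the only way to exercise the notification path. A synthetic alert can also be posted straight to Alertmanager's v2 API, which skips Prometheus and the for: delay entirely. This is a minimal sketch, assuming Alertmanager is reachable on the default port; build_test_alert, send_test_alert, and the label values are hypothetical and chosen only for illustration.

```shell
#!/bin/sh
# Assumed Alertmanager address; replace with http://<Alertmanager-IP>:9093 as needed.
AM_URL="${AM_URL:-http://127.0.0.1:9093}"

# Hypothetical helper: print a one-element JSON array of alerts.
# $1 = alertname, $2 = severity (plain strings without quotes or backslashes)
build_test_alert() {
    printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"manual-test"},"annotations":{"title":"Manual test alert","description":"Sent by hand to exercise the notification route."}}]' "$1" "$2"
}

# Hypothetical helper: POST the alert to /api/v2/alerts.
send_test_alert() {
    build_test_alert "${1:-ManualTest}" "${2:-critical}" \
        | curl -fsS -X POST -H 'Content-Type: application/json' \
               --data-binary @- "$AM_URL/api/v2/alerts"
}
```

Running send_test_alert should make the alert appear in the Alertmanager UI, and a notification should follow once group_wait has elapsed. Because no endsAt is set, the alert resolves on its own after resolve_timeout.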

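Troubleshooting

  • Email not received

    • Check the SMTP settings (port, TLS, authorization code).
    • Follow the Alertmanager log: journalctl -u alertmanager -f
  • DingTalk alert fails

    • Confirm that the Webhook URL and the secret are correct.
    • Check the DingTalk plugin log (if its output is redirected to this file): tail -f /usr/local/dingtalk/prometheus-webhook-dingtalk.log
  • Alerting rules not loaded

    • Confirm that the paths in the Prometheus configuration (rule_files) are correct.
    • Reload Prometheus: systemctl reload prometheus (use systemctl restart prometheus if the unit defines no ExecReload)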
