Minimalist Prometheus Monitoring in Practice


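I. Preface and Contents

Monitoring is a key part of observability in the cloud-native era. Compared with earlier times, much has changed: technologies such as microservices and containerization keep emerging, systems evolve and update very quickly, the volume of monitoring data has grown sharply, and so has the demand for real-time visibility. Prometheus emerged in response to these changes. Its feature set, its close fit with cloud-native environments, and the ease with which it integrates third-party open-source components have made it one of the most prominent projects in this space.

This article focuses on how to build a monitoring system with Prometheus, covering probes (exporters), metrics, visualization, alerting, and container monitoring. It is an introductory tutorial and does not cover gateways or Kubernetes clusters. For the basic concepts of Prometheus, please search online; this article concentrates on the hands-on process.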

Contents:

  1. Deploying Prometheus Server
  2. Deploying monitoring probes
  3. Deploying Grafana
  4. Deploying AlertManager
  5. Deploying PrometheusAlert
  6. Configuring alert rules

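II. Deploying Prometheus Server

This section describes how to deploy Prometheus Server with docker and how to map the relevant configuration paths into the container.

2.1 Preparing the environment

  1. Create the directories and set their ownership
sudo mkdir -pv /data/docker/prometheus/{data,alert_rules,job}
sudo chown -R myusername:myusername /data/docker/prometheus/

Where:

  • the data directory stores the data produced by Prometheus
  • the alert_rules directory stores the Prometheus alert rule files
  • the job directory stores the JSON files that list the monitored targets
  • myusername should be replaced with your actual user name
  2. Run the following command to avoid a "permission denied" error
sudo chown 65534:65534 -R /data/docker/prometheus/data
  3. Save the configuration file below as /data/docker/prometheus/prometheus.yml. Pay attention to every place that contains "$ip": it is a placeholder for a real IP address. After later steps, such as adding AlertManager, remember to come back and update this file.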
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
      # - targets: ["$ip:9093"]
      # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - /etc/prometheus/alert_rules/*.rules
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/prometheus.json
        refresh_interval: 1m # how often the target files are re-read
  # Node host group
  - job_name: 'host'
    #basic_auth:
    #  username: prometheus
    #  password: prometheus
    file_sd_configs:
      - files:
          - /etc/prometheus/job/host.json
        refresh_interval: 1m
  # cadvisor container group
  - job_name: 'cadvisor'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/cadvisor.json
        refresh_interval: 1m
  # mysql exporter group
  - job_name: 'mysqld-exporter'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/mysqld-exporter.json
        refresh_interval: 1m
  # blackbox ping group
  - job_name: 'blackbox_ping'
    scrape_interval: 5s
    scrape_timeout: 2s
    metrics_path: /probe
    params:
      module: [ping]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/ping/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115
  # blackbox http get 2xx group
  - job_name: 'blackbox_http_2xx'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/http_2xx/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115
  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/tcp/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115
  - job_name: 'blackbox_ssh_banner'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/ssh_banner/*.json
        refresh_interval: 1m
    relabel_configs:
      # Ensure port is 22, pass as URL parameter
      - source_labels: [__address__]
        regex: (.*?)(:.*)?
        replacement: ${1}:22
        target_label: __param_target
      # Make instance label the target
      - source_labels: [__param_target]
        target_label: instance
      # Actually talk to the blackbox exporter though
      - target_label: __address__
        replacement: $ip:9115
  - job_name: "blackbox_dns"
    metrics_path: /probe
    params:
      module: [dns_udp]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/dns/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115
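
The relabel_configs of the blackbox_* jobs can be hard to follow on first reading. The Python sketch below only mimics the three relabel steps so the effect is visible. It is an illustration, not Prometheus code: the function names are hypothetical, and 192.0.2.10 stands in for the "$ip" placeholder.

```python
import re
from urllib.parse import urlencode


def relabel_for_blackbox(target, module, exporter_addr, force_port=None):
    """Mimic the relabel steps of the blackbox_* jobs (illustration only)."""
    labels = {"__address__": target, "__param_module": module}
    if force_port is None:
        # Step 1: copy the original target into the ?target= URL parameter.
        labels["__param_target"] = labels["__address__"]
    else:
        # ssh_banner variant: regex (.*?)(:.*)? with replacement ${1}:22
        # strips any port from the target and appends the forced port.
        match = re.fullmatch(r"(.*?)(:.*)?", labels["__address__"])
        labels["__param_target"] = f"{match.group(1)}:{force_port}"
    # Step 2: keep the probed target as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the scrape at the blackbox exporter itself.
    labels["__address__"] = exporter_addr
    return labels


def scrape_url(labels, metrics_path="/probe"):
    """Build the URL that ends up being scraped after relabeling."""
    query = urlencode(
        {"module": labels["__param_module"], "target": labels["__param_target"]}
    )
    return f"http://{labels['__address__']}{metrics_path}?{query}"


if __name__ == "__main__":
    ping = relabel_for_blackbox("www.google.cn", "ping", "192.0.2.10:9115")
    print(scrape_url(ping))
```

In other words, the address listed in the JSON target file is never scraped directly; it is handed to the blackbox exporter as the target parameter, while the instance label still shows the probed address.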

2.2 Starting the server

docker run -itd \
  -p 9090:9090 \
  -v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules \
  -v /data/docker/prometheus/job:/etc/prometheus/job \
  -v /data/docker/prometheus/data:/data/prometheus/ \
  -v /etc/timezone:/etc/timezone:ro \
  -v /etc/localtime:/etc/localtime:ro \
  --name prometheus \
  --restart=always \
  prom/prometheus:v2.28.1 \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --web.read-timeout=5m \
  --web.max-connections=10 \
  --query.max-concurrency=20 \
  --query.timeout=2m \
  --web.enable-lifecycle

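Once the container is running, open http://$ip:9090 in a browser to see the Prometheus web UI.

If the firewall is enabled on the system, you may need to open the following ports. Taking CentOS 7 as an example: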

sudo firewall-cmd --zone=public --add-port=9090/tcp --permanent
sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --zone=public --add-port=3000/tcp --permanent
sudo firewall-cmd --reload

2.3 Reference documentation for deploying Prometheus Server

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

III. Deploying Monitoring Probes

Unlike Zabbix, Prometheus mainly works in pull mode: it reads monitoring data from HTTP endpoints exposed by exporters. An exporter collects the data and can be thought of as a probe; it serves that data over HTTP for the server to scrape. For the meaning of the fields returned by each exporter that are not described in this article, readers can search online.

3.1 Deploying node_exporter

node_exporter monitors a host's CPU, memory, disk, I/O and similar information. Its focus is collecting data about the host system itself.

  1. Download node exporter and unpack it

Log in to the host to be monitored and download node exporter from its release page,

or run curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz (the -L flag is needed because GitHub redirects release downloads).

After the download finishes, run the following commands to unpack the binary package

tar xvfz node_exporter-1.2.0.linux-amd64.tar.gz
sudo mkdir -p /data/node_exporter/
sudo mv node_exporter-1.2.0.linux-amd64/* /data/node_exporter/
  2. Create the prometheus user
sudo groupadd prometheus
sudo useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus
sudo chown prometheus:prometheus -R /data/node_exporter/
  3. Create the systemd service

Create and edit the file

sudo nano /etc/systemd/system/node_exporter.service

Write the following content

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/data/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
  4. Start node exporter with systemctl
    Start the service and check that it is running normally
sudo systemctl start node_exporter
sudo systemctl status node_exporter

The output should look similar to the following (the path and timestamp depend on your installation)

● node_exporter.service - node_exporter
Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2019-06-05 09:18:56 GMT; 3s ago
Main PID: 11050 (node_exporter)
CGroup: /system.slice/node_exporter.service
└─11050 /usr/local/prometheus/node_exporter/node_exporter

Enable start on boot: sudo systemctl enable node_exporter

  5. Verify access and open the firewall
    Run curl localhost:9100. If a web page is returned, node exporter has started successfully.
    Running curl http://$ip:9100/ from another machine in the same network segment should return the same page.

If no page is returned, check whether the firewall port is still closed

sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --reload
  6. Configure Prometheus
    Log in to the Prometheus server host and edit the following file:
    nano /data/docker/prometheus/job/host.json. Use the content below as a reference and replace the IP addresses with the real ones.
[
  {
    "targets": [ "192.168.1.100:9100"],
    "labels": {
      "subject": "node_exporter",
      "hostname": "server1"
    }
  },
  {
    "targets": [ "192.168.1.101:9100"],
    "labels": {
      "subject": "node_exporter",
      "hostname": "server2"
    }
  }
]
  7. Reference documentation for deploying node_exporter
    https://github.com/prometheus/node_exporter
    https://prometheus.io/docs/guides/node-exporter/
    https://www.jianshu.com/p/7bec152d1a1f
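
When the list of hosts grows, the file_sd JSON files are easier to generate than to edit by hand. The following is a minimal Python sketch under the assumption that the hosts are kept in a simple hostname-to-IP mapping; the function names are hypothetical, and the label names follow the host.json sample used in this section.

```python
import json


def build_file_sd(hosts, port, subject):
    """Return file_sd target groups for a mapping of hostname -> IP address."""
    return [
        {
            "targets": [f"{ip}:{port}"],
            "labels": {"subject": subject, "hostname": name},
        }
        for name, ip in hosts.items()
    ]


def render(groups):
    """Serialize the target groups in the JSON layout used by the job files."""
    return json.dumps(groups, indent=2)


if __name__ == "__main__":
    hosts = {"server1": "192.168.1.100", "server2": "192.168.1.101"}
    # Redirect the output to /data/docker/prometheus/job/host.json if desired.
    print(render(build_file_sd(hosts, 9100, "node_exporter")))
```

Because the jobs use file_sd_configs, Prometheus re-reads a changed target file by itself; no restart is needed for target changes.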

3.2 Deploying mysqld-exporter

mysqld-exporter monitors the performance and other metrics of a MySQL database.

  1. Log in to the host where the MySQL database runs and start the exporter with docker
docker run -d \
  -p 9104:9104 \
  --link mysql \
  --name mysqld-exporter \
  --restart on-failure:5 \
  -e DATA_SOURCE_NAME="root:pwdpwdpwdpwdpwd@(mysql:3306)/" \
  prom/mysqld-exporter:v0.13.0

After it starts, http://127.0.0.1:9104/metrics shows the monitoring data; the same endpoint should also be reachable from the Prometheus server. The 'mysqld-exporter' job in prometheus.yml reads its targets from /etc/prometheus/job/mysqld-exporter.json, so list this exporter's address there in the same format as host.json.

  2. Reference documentation for deploying mysqld-exporter
    https://github.com/prometheus/mysqld_exporter
    https://registry.hub.docker.com/r/prom/mysqld-exporter/

3.3 Deploying cadvisor

cadvisor monitors the state of containers.

  1. Log in to the docker host and start cadvisor with the following command
docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=9101:8080 \
  --detach=true \
  --name=cadvisor \
  --restart on-failure:5 \
  --privileged \
  --device=/dev/kmsg \
  gcr.io/cadvisor/cadvisor:v0.38.6

You may come across two cadvisor images, gcr.io/cadvisor/cadvisor and google/cadvisor. gcr.io/cadvisor/cadvisor is the recommended one.

  2. Configure the Prometheus server
    Log in to the Prometheus server host and edit the file with nano /data/docker/prometheus/job/cadvisor.json, using the following content as a reference:
[
  {
    "targets": [ "192.168.1.100:9101"],
    "labels": {
      "subject": "cadvisor",
      "hostname": "server1"
    }
  },
  {
    "targets": [ "192.168.1.101:9101"],
    "labels": {
      "subject": "cadvisor",
      "hostname": "server2"
    }
  }
]
  3. If the docker host runs a firewall, remember to open the port
sudo firewall-cmd --zone=public --add-port=9101/tcp --permanent
sudo firewall-cmd --reload
  4. Reference documentation for deploying cadvisor
    https://github.com/google/cadvisor

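3.4 Deploying blackbox_exporter

blackbox_exporter is a tool that monitors targets from the outside, in a black-box fashion (ICMP, HTTP, TCP, DNS probes and so on).

  1. Create the configuration file
    Log in to the Prometheus server host and run the following commands
sudo mkdir -p /data/docker/blackbox/conf
sudo chown -R myusername:myusername /data/docker/blackbox

Then create and edit this file

nano /data/docker/blackbox/conf/blackbox.yml

A sample yml file follows: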

modules:
  ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false # no fallback to "ip6"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      preferred_ip_protocol: "ip4"
  http_post_2xx_json:
    prober: http
    timeout: 30s
    http:
      preferred_ip_protocol: "ip4"
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key1":"value1","params":{"param2":"value2"}}'
  http_basic_auth:
    prober: http
    timeout: 60s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect:
    prober: tcp
    timeout: 5s
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: SSH-2.0-blackbox-ssh-check
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  dns_udp:
    prober: dns
    timeout: 10s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: "www.example.cn"
      query_type: "A"
  2. Configure Prometheus
    Still on the Prometheus server host, run the following commands
sudo mkdir -p /data/docker/prometheus/job/blackbox/
sudo mkdir -pv /data/docker/prometheus/job/blackbox/{dns,http_2xx,ping,ssh_banner,tcp}
sudo chown -R myusername:myusername /data/docker/prometheus/job/blackbox/

Next, create a JSON file in each of the corresponding directories under /data/docker/prometheus/job/blackbox/ and fill it in based on the samples below.

In the dns directory, create dns.json. Sample:

[
  {
    "targets": [ "192.168.1.1"],
    "labels": {
      "subject": "blackbox_dns",
      "app": "my_dns"
    }
  }
]

In the http_2xx directory, create search-site.json. Sample:

[
  {
    "targets": [ "https://www.google.cn/?HealthCheck"],
    "labels": {
      "app": "google",
      "subject": "blackbox_http_2xx",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "https://cn.bing.com/?HealthCheck"],
    "labels": {
      "app": "bing",
      "subject": "blackbox_http_2xx",
      "hostname": "server-02"
    }
  }
]

In the ping directory, create search-site.json. Sample:

[
  {
    "targets": [ "www.google.cn"],
    "labels": {
      "app": "google",
      "subject": "blackbox_ping",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "cn.bing.com"],
    "labels": {
      "app": "bing",
      "subject": "blackbox_ping",
      "hostname": "server-02"
    }
  }
]

In the ssh_banner directory, create ssh-banner.json. Sample:

[
  {
    "targets": [ "192.168.1.100:22"],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-01"
    }
  },
  {
    "targets": [ "192.168.1.101:22"],
    "labels": {
      "subject": "blackbox_ssh_banner",
      "hostname": "server-02"
    }
  }
]

In the tcp directory, create tcp.json. Sample:

[
  {
    "targets": [ "$ip:3306"],
    "labels": {
      "app": "mysql.example.cn",
      "subject": "blackbox_tcp",
      "hostname": "mysql"
    }
  }
]
  3. Run blackbox_exporter

On the Prometheus server host, run the following command to start blackbox_exporter in a container

docker run -d \
  --restart on-failure:5 \
  -p 9115:9115 \
  -v /data/docker/blackbox/conf/blackbox.yml:/config/blackbox.yml:ro \
  --name blackbox_exporter \
  prom/blackbox-exporter:v0.19.0 \
  --config.file=/config/blackbox.yml

Once it is running, open http://$ip:9090/targets to see the data reported by all the probes configured so far. Their State should be UP.

  4. Reference documentation for deploying blackbox_exporter
    https://github.com/prometheus/blackbox_exporter
    https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/exporter/commonly-eporter-usage/install_blackbox_exporter

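IV. Deploying Grafana

Next we deploy the visualization tool Grafana. Grafana integrates with Prometheus quickly and, through custom settings or ready-made templates, turns the collected data into graphical dashboards.

4.1 Starting Grafana

Run the following commands to prepare for startup

sudo mkdir -p /data/docker/grafana
sudo chown 472:472 /data/docker/grafana -R

Run Grafana with docker

docker run -d \
  -p 3000:3000 \
  --name=grafana \
  -v /data/docker/grafana:/var/lib/grafana \
  -v /etc/localtime:/etc/localtime:ro \
  --restart=always \
  grafana/grafana:8.0.6

Once it is running, the UI is available at http://$ip:3000. The default username and password are admin / admin.

4.2 Configuration

  1. Configure the data source
    Click "Configuration -> Data sources" to open http://$ip:3000/datasources, add a Prometheus data source, and fill in its settings.

  2. Configure dashboards

Click "Dashboards -> Manage -> Import" to open http://$ip:3000/dashboard/import and import a Grafana dashboard template. Under "Import via grafana.com", enter the ID of the template you want to import. Commonly used template IDs:

  • node exporter ID: 8919
  • Cadvisor ID: 14282
  • mysqld-exporter ID: 7362

You can also search for dashboard templates yourself at https://grafana.com/grafana/dashboards, or build your own dashboards.

4.3 Reference documentation for deploying Grafana

https://grafana.com/docs/grafana/latest/installation/docker/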

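V. Deploying AlertManager

So far we have deployed Prometheus Server, the exporters, and the Grafana visualization component. We still need an alerting component so that, when a failure occurs, the monitoring system can notify the recipients through various channels and they can respond in time. Prometheus itself does not ship a notification tool. Instead, it evaluates preconfigured rules and sends the resulting alerts to AlertManager, which handles them centrally and notifies the recipients by email, SMS, WeChat, DingTalk and so on. Like Grafana, AlertManager is not limited to Prometheus and can also process messages from other programs.

5.1 Preparation

Run the following commands

sudo mkdir -pv /data/docker/alertmanager
sudo chown -R myusername:myusername /data/docker/alertmanager/
cd /data/docker/alertmanager

In the /data/docker/alertmanager directory, create the files alertmanager.yml and email.tmpl.

A sample alertmanager.yml follows. Be sure to set the SMTP options and the ddurl parameter of the webhook: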

global:
  resolve_timeout: 5m
  # SMTP settings for email
  smtp_smarthost: 'smtp.gmail.com:465'
  smtp_from: 'example@gmail.com'
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_password: 'xxxxx'
  smtp_require_tls: false
# Custom notification templates
templates:
  - '/etc/alertmanager/email.tmpl'
# route defines how alerts are dispatched
route:
  # Labels used as the grouping key
  group_by: ['alertname']
  # Wait time for a new group: after an alert fires, wait 10s so alerts of the same group are sent together
  group_wait: 10s
  # Interval between two notifications for the same group
  group_interval: 10s
  # Interval before a repeated alert is sent again; lowers the frequency of identical emails
  repeat_interval: 1h
  # Default receiver
  receiver: 'myreceiver'
  routes:
    # Sub-routes can decide which receiver handles which alerts
    - receiver: 'myreceiver'
      continue: true
      group_wait: 10s
receivers:
  - name: 'myreceiver'
    #send_resolved: true
    email_configs:
      # - to: 'example@gmail.com, example2@gmail.com'
      - to: 'example@gmail.com'
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "Prometheus [Warning] alert email" }
    # DingTalk settings
    webhook_configs:
      - url: 'http://$ip:18080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx'

A sample email.tmpl follows. Note the string "2006-01-02 15:04:05" in the sample: it is Go's reference time layout rather than an actual timestamp, so it must not be changed, otherwise the alert time may be displayed incorrectly. The level label matches the label set by the alert rules in section VII.

{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts.Firing }}
=========start==========<br>
Alert source: prometheus_alert <br>
Alert level: {{ .Labels.level }} <br>
Alert type: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts.Resolved }}
=========start==========<br>
Alert source: prometheus_alert <br>
Alert level: {{ .Labels.level }} <br>
Alert type: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}
{{- end }}

Both configuration files can be adapted with the help of https://prometheus.io/docs/alerting/latest/configuration/.
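
The expression (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" in email.tmpl may look cryptic. As a rough Python equivalent, assuming the alert timestamps arrive in UTC: 28800e9 is a duration in nanoseconds, i.e. 8 hours, and the layout string corresponds to a year-month-day hour:minute:second format. The names below are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# 28800e9 nanoseconds = 28800 seconds = 8 hours (the UTC+8 offset).
CST_OFFSET_NS = 28800e9


def format_cst(starts_at_utc):
    """Shift a UTC timestamp by 8 hours and format it like the template does."""
    shifted = starts_at_utc + timedelta(seconds=CST_OFFSET_NS / 1e9)
    # Go's layout "2006-01-02 15:04:05" plays the role of this strftime pattern.
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(format_cst(datetime(2021, 7, 20, 18, 30, 0, tzinfo=timezone.utc)))
```

For a deployment in another time zone, only the offset value would need to change; the layout string stays the same.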

5.2 Starting AlertManager

Run the following command

docker run -d -p 9093:9093 \
  -v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
  -v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
  --name alertmanager \
  --restart=always \
  prom/alertmanager:v0.22.2

5.3 Access

Once it is running, the AlertManager UI is available at http://$ip:9093
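
Before any Prometheus rule exists, the notification path can be exercised by posting a hand-made alert to AlertManager's /api/v2/alerts endpoint. The sketch below only builds that request; the function name, the label values, and the 192.0.2.10 address are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request


def build_test_alert_request(base_url, alertname="TestAlert", instance="demo:9100"):
    """Build a POST request carrying one hand-made alert for AlertManager."""
    alerts = [
        {
            "labels": {"alertname": alertname, "instance": instance, "level": "warning"},
            "annotations": {"summary": "manual test", "description": "sent by hand"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]
    return Request(
        url=f"{base_url.rstrip('/')}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_test_alert_request("http://192.0.2.10:9093")
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

If the receivers are configured correctly, the test alert should appear in the AlertManager UI and trigger the email and webhook notifications after group_wait has elapsed.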

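VI. Deploying PrometheusAlert

As the previous section noted, Prometheus alerting consists of two parts. We have already deployed AlertManager to process alerts and send notifications. In this section we deploy PrometheusAlert, the service that the webhook_configs entry in alertmanager.yml points to and that forwards alerts to DingTalk. The Prometheus alert rules that produce the alerts are defined in section VII.

6.1 Preparation

Run the following commands

sudo mkdir -p /data/docker/prometheus-alert/conf
sudo chown -R myusername:myusername /data/docker/prometheus-alert/

Download https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/conf/app-example.conf, save it as /data/docker/prometheus-alert/conf/app.conf, and edit it as needed

nano /data/docker/prometheus-alert/conf/app.conf

6.2 Starting PrometheusAlert

Run the following command to start prometheus-alert

docker run -d --publish=18080:8080 \
  -v /data/docker/prometheus-alert/conf/:/app/conf:ro \
  -v /data/docker/prometheus-alert/db/:/app/db \
  -v /data/docker/prometheus-alert/log/:/app/logs \
  --name prometheusalert-center \
  feiyu563/prometheus-alert:v-4.5.0

Once it is running, the prometheus-alert UI is available at http://$ip:18080. The login username and password are the ones set in app.conf.

If the firewall is enabled on the system, remember to open the port

sudo firewall-cmd --zone=public --add-port=18080/tcp --permanent
sudo firewall-cmd --reload

6.3 Configuration

  1. Configure the alert template

Click AlertTemplate to open http://$ip:18080/template, which lists templates for the various third-party systems that can be integrated.
Taking the DingTalk alert template as an example, change the template content to the following. The main purpose is to fix the displayed time being 8 hours behind and to add some extra information.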

{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
## [Prometheus recovery notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Alert level: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### End time: {{GetCSTtime $v.endsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Monitored subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{else}}
## [Prometheus alert notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Alert level: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### Hostname: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Monitored subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{end}}
{{ end }}
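  2. Set up the DingTalk robot
    In DingTalk, create a new group, then click "Group Settings -> Smart Group Assistant -> Add Robot -> Custom -> Security Settings" and add the IP address of the server that sends the messages. The Webhook address is then displayed. See also https://blog.csdn.net/knight_zhou/article/details/105583741

6.4 Reference documentation for deploying PrometheusAlert

https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/install.md

VII. Configuring Alert Rules

We still need to configure alert rules in Prometheus Server. Rule files are referenced in the rule_files section of prometheus.yml. The rule file format is YAML. Following the configuration in section 2.1, create the file in the /data/docker/prometheus/alert_rules/ directory and give it a name ending in .rules so that it matches the /etc/prometheus/alert_rules/*.rules pattern. Its content is as follows: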

groups:
  - name: Node_exporter Down
    rules:
      # node_exporter targets are scraped by the 'host' job defined in prometheus.yml
      - alert: Instance down
        expr: up{job="host"} == 0
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.job }}"
          address: "{{ $labels.instance }}"
          description: "The instance has been unreachable for 1 minute."
      - alert: High CPU usage (> 80%)
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage is high"
          description: "{{ $labels.instance }}: CPU usage is above 80%, current value {{ $value }}"
      - alert: High memory usage (> 80%)
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
        # 'for' is how long the condition must hold before the alert is sent to alertmanager
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} memory usage is high"
          description: "{{ $labels.instance }}: memory usage is above 80%, current value {{ $value }}"
      - alert: High memory pressure (> 1000)
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) memory pressure is high"
          description: "{{ $labels.instance }}: memory is under heavy pressure, current value {{ $value }}"
      - alert: Host network interfaces receiving too much data (> 2MB/s)
        expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 2
        for: 5m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) abnormal inbound traffic"
          description: "{{ $labels.instance }}: network interfaces have received too much data (> 2MB/s) for 5 minutes. Current inbound traffic: {{ $value }} MB/s."
      - alert: Host network interfaces sending too much data (> 2MB/s)
        expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 2
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) abnormal outbound traffic"
          description: "{{ $labels.instance }}: network interfaces have sent too much data (> 2MB/s) for 3 minutes. Current outbound traffic: {{ $value }} MB/s."
      - alert: Disk read rate (> 50 MB/s)
        expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) abnormal disk read IO"
          description: "{{ $labels.instance }}: disk read IO looks abnormal, current value {{ $value }} MB/s"
      - alert: Disk write rate (> 50 MB/s)
        expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) abnormal disk write IO"
          description: "{{ $labels.instance }}: disk write IO looks abnormal, current value {{ $value }} MB/s"
      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
      - alert: Disk space low (<10% left)
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host disk almost full"
          description: "{{ $labels.instance }}: less than 10% of disk space is left, currently available {{ $value }}%"
      - alert: High disk read latency (>100ms)
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high disk read latency"
          description: "{{ $labels.instance }}: disk read latency is above 100ms, current value {{ $value }}"
      - alert: High disk write latency (>100ms)
        expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) high disk write latency"
          description: "{{ $labels.instance }}: disk write latency is above 100ms, current value {{ $value }}"
      # 1000 context switches is an arbitrary number.
      # Alert threshold depends on nature of application.
      # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
      #- alert: High context switching (>1500/s)
      #  expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1500
      #  for: 3m
      #  labels:
      #    level: warning
      #  annotations:
      #    summary: "(instance {{ $labels.instance }}) high context switching on host"
      #    description: "{{ $labels.instance }}: context switching is above 1500/s, current value {{ $value }}"
      - alert: Host swap usage (> 80%)
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host swap space warning"
          description: "{{ $labels.instance }}: swap usage is above 80%, current value {{ $value }}"
      - alert: Host systemd service down
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 0m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) a systemd service is down"
          description: "{{ $labels.instance }}: service {{ $labels.name }} is in the failed state under systemd"
      - alert: Physical machine temperature too high (>75°)
        expr: node_hwmon_temp_celsius > 75
        for: 5m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host temperature warning"
          description: "{{ $labels.instance }}: abnormal host temperature (>75°), current value {{ $value }}"
      - alert: Physical node temperature alarm triggered
        expr: node_hwmon_temp_crit_alarm_celsius == 1
        for: 0m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) mainboard temperature alarm"
          description: "{{ $labels.instance }}: mainboard temperature is too high, current value {{ $value }}"
      - alert: Host receiving error packets
        expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host network received error packets"
          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes"
      - alert: Host sending error packets
        expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host network sent error packets"
          description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes"
      - alert: TCP connect time too long
        expr: probe_duration_seconds{job="blackbox_tcp"} > 5
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) TCP connect time above 5 seconds"
          description: "TCP connect time is above 5 seconds, current value {{ $value }}"
      - alert: Host TCP connection count
        expr: node_netstat_Tcp_CurrEstab > 800
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) too many TCP connections"
          description: "{{ $labels.instance }}: more than 800 established TCP connections detected, current value {{ $value }}"
      - alert: TCP connections waiting to close > 4000
        expr: node_sockstat_TCP_tw > 4000
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) TCP connections waiting to close > 4000"
          description: "{{ $labels.instance }}: too many TCP connections waiting to close, current value {{ $value }}"
      - alert: Clock skew detected
        expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) clock skew detected"
          description: "{{ $labels.instance }}: clock skew detected, the clock is out of sync, current value {{ $value }}"
      - alert: Container stopped
        expr: time() - container_last_seen > 300
        for: 0m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) a container may have stopped running"
          description: "Container {{ $labels.name }} - {{ $value }} may have stopped running"
      # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
      # If you want to exclude it from this alert, exclude the serie having an empty name: container_cpu_usage_seconds_total{name!=""}
      - alert: Container CPU usage
        expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 300
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container CPU usage is high"
          description: "{{ $labels.instance }}: container CPU usage is above 300%, current value {{ $value }}%"
      # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
      #- alert: Container memory usage
      #  expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 85
      #  for: 2m
      #  labels:
      #    level: warning
      #  annotations:
      #    summary: "(instance {{ $labels.instance }}) container memory usage is high"
      #    description: "{{ $labels.instance }}: container {{ $labels.name }} memory usage is above 85%, current value {{ $value }}%"
      - alert: Container volume usage
        expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container volume usage is high"
          description: "{{ $labels.instance }}: container volume usage is above 80%, current value {{ $value }}%"
      - alert: Container volume IO usage
        expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container volume IO is high"
          description: "{{ $labels.instance }}: container {{ $labels.name }} volume IO usage is above 80%, current value {{ $value }}%"
      - alert: Container high CPU throttle rate
        expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container is being CPU throttled"
          description: "{{ $labels.instance }}: container {{ $labels.name }} is being CPU throttled heavily"
      - alert: Blackbox probe failed
        expr: probe_success == 0
        for: 5m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox probe found a problem"
          description: "Job {{ $labels.job }} reported a failed probe"
      - alert: Blackbox slow probe
        expr: avg_over_time(probe_duration_seconds[1m]) > 15
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox probe is slow"
          description: "The blackbox probe took more than {{ $value }} seconds to complete, {{ $labels }}"
      - alert: Blackbox ping time too long
        expr: probe_duration_seconds{job="blackbox_ping"} > 5
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox ping time too long"
          description: "Ping time is above 5 seconds, {{ $value }}, {{ $labels }}"
      - alert: Blackbox HTTP probe failed
        expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
        for: 2m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox HTTP probe failed"
          description: "HTTP status code is not in 200-399, {{ $value }}, {{ $labels }}"
      - alert: SSL certificate expires within 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 60m
        labels:
          level: warning
        annotations:
          summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "The SSL certificate expires within 30 days, {{ $value }}, {{ $labels }}"
      - alert: SSL certificate expires within 3 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
        for: 60m
        labels:
          level: critical
        annotations:
          summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "The SSL certificate expires within 3 days, {{ $value }}, {{ $labels }}"
      - alert: SSL certificate has expired
        expr: probe_ssl_earliest_cert_expiry - time() <= 0
        for: 60m
        labels:
          level: critical
        annotations:
          summary: "Blackbox SSL certificate has expired (instance {{ $labels.instance }})"
          description: "The SSL certificate has expired, {{ $value }}, {{ $labels }}"
      - alert: Blackbox slow HTTP probe
        expr: avg_over_time(probe_http_duration_seconds[1m]) > 3
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "Blackbox slow HTTP probe (instance {{ $labels.instance }})"
          description: "The HTTP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"
      - alert: Blackbox slow ICMP probe
        expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 3
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "Blackbox slow ICMP probe (instance {{ $labels.instance }})"
          description: "The ICMP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"
      - alert: DNS server down
        expr: probe_dns_answer_rrs == 0
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "DNS server down"
          description: "The DNS server has not answered for 1 minute and may be down."
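
To make the CPU rule less opaque, here is a small Python sketch of the arithmetic behind 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100. It is a simplification under stated assumptions: it uses only the first and last sample of the window, whereas the real rate() function also handles counter resets and extrapolation. The function name and sample values are hypothetical.

```python
def cpu_usage_percent(idle_start, idle_end, window_seconds):
    """Approximate the CPU usage rule from per-core idle counters.

    idle_start / idle_end hold one idle-seconds counter value per core,
    taken at the beginning and the end of the window.
    """
    # Per-core idle rate: idle seconds gained per second of wall time.
    rates = [
        (end - start) / window_seconds
        for start, end in zip(idle_start, idle_end)
    ]
    # avg by (instance): average the idle fraction over all cores.
    avg_idle = sum(rates) / len(rates)
    return 100 - avg_idle * 100


if __name__ == "__main__":
    # Two cores over a 120 s window: idle for 24 s and 12 s respectively.
    usage = cpu_usage_percent([1000.0, 2000.0], [1024.0, 2012.0], 120)
    print(f"{usage:.1f}% used, alert condition met: {usage > 80}")
```

The rule fires only after the expression has stayed above 80 for the whole "for" duration, so a short spike alone would not notify anyone.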

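After the rule file is in place, restart Prometheus Server (for example with docker restart prometheus); the rules can then be viewed at http://$ip:9090/rules. It is worth reading up on the alert states (inactive, pending, firing) as well.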
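
Because the server in section 2.2 is started with --web.enable-lifecycle, a full restart is not strictly required: Prometheus also reloads its configuration and rule files when it receives an HTTP POST on /-/reload. The sketch below only builds that request; the function name and the 192.0.2.10 address are assumptions for illustration.

```python
from urllib.request import Request


def build_reload_request(base_url):
    """Build the POST request that asks Prometheus to reload its configuration."""
    return Request(url=f"{base_url.rstrip('/')}/-/reload", method="POST")


if __name__ == "__main__":
    request = build_reload_request("http://192.0.2.10:9090")
    print(request.get_method(), request.full_url)
    # To actually trigger the reload: urllib.request.urlopen(request)
```

If the new configuration or a rule file contains an error, the reload is rejected and the previous configuration stays active, which makes this a gentler way to test rule changes than restarting the container.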

Reference documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
