Minimal Prometheus Monitoring in Practice

Shared by 缥缈发卡 on 靠谱客 for reference.

Overview

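1. Preface and Contents

Monitoring is a key part of observability in the cloud-native era. Much has changed compared with earlier practice: microservices, containers and related technologies keep emerging, and systems evolve and update very quickly, so the volume of monitoring data has grown sharply and so has the demand for real-time visibility. Prometheus emerged in response to these changes. Its feature set, its close fit with cloud-native environments and the ease of integrating third-party open-source components have made it one of the most prominent tools in this space.

This article focuses on building a monitoring system with Prometheus, covering probes (exporters), metrics, visualization, alerting and container monitoring. It is an entry-level tutorial and does not cover the gateway, K8S clusters and similar topics. For Prometheus basics and concepts, please search online; this article concentrates on the hands-on process.

Deploying Prometheus Server
Deploying monitoring probes
Deploying Grafana
Deploying AlertManager
Deploying PrometheusAlert
Configuring alerting rules

2. Deploying Prometheus Server

This section describes how to deploy Prometheus Server with Docker, with the relevant configuration files and directories mapped in from the host so that they can be changed later.

2.1 Preparing the environment

Create the directories and set their ownership: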

sudo mkdir -pv /data/docker/prometheus/{data,alert_rules,job}
sudo chown -R myusername:myusername /data/docker/prometheus/

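Where:

data holds the data that Prometheus produces
alert_rules holds the Prometheus alerting rule files
job holds the JSON files that describe the monitored targets
myusername should be replaced with your actual user name

Run the following command to avoid permission denied errors when the container writes to the data directory: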

sudo chown 65534:65534 -R /data/docker/prometheus/data
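
To confirm that the ownership change took effect before starting the container, a small check like the following may help. This is a minimal sketch: owner_of and check_owner are hypothetical helper names, and stat -c assumes GNU coreutils as found on most Linux hosts.

```shell
#!/usr/bin/env bash
# Print "uid:gid" for a path (GNU coreutils stat).
owner_of() {
  stat -c '%u:%g' "$1"
}

# Succeed when the path is owned by the expected "uid:gid" pair.
check_owner() {
  local path="$1" expected="$2"
  [ "$(owner_of "$path")" = "$expected" ]
}
```

For example, check_owner /data/docker/prometheus/data 65534:65534 && echo "ownership ok" prints the message only when the directory has the owner set by the chown command above.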

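Copy the following configuration to /data/docker/prometheus/prometheus.yml (the path that section 2.2 mounts into the container). Pay attention to every "$ip" placeholder in the file and replace it with the real address. After later steps, such as adding AlertManager, remember to come back and update this file.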

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        # - targets: ["$ip:9093"]
        # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - /etc/prometheus/alert_rules/*.rules

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/prometheus.json
        refresh_interval: 1m # how often the target files are re-read

  # Node host group
  - job_name: 'host'
    #basic_auth:
    #  username: prometheus
    #  password: prometheus
    file_sd_configs:
      - files:
          - /etc/prometheus/job/host.json
        refresh_interval: 1m

  # cadvisor container group
  - job_name: 'cadvisor'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/cadvisor.json
        refresh_interval: 1m

  # mysql exporter group
  - job_name: 'mysqld-exporter'
    file_sd_configs:
      - files:
          - /etc/prometheus/job/mysqld-exporter.json
        refresh_interval: 1m

  # blackbox ping group
  - job_name: 'blackbox_ping'
    scrape_interval: 5s
    scrape_timeout: 2s
    metrics_path: /probe
    params:
      module: [ping]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/ping/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115

  # blackbox http get 2xx group
  - job_name: 'blackbox_http_2xx'
    scrape_interval: 5s
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/http_2xx/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115

  - job_name: "blackbox_tcp"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/tcp/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115

  - job_name: 'blackbox_ssh_banner'
    metrics_path: /probe
    params:
      module: [ssh_banner]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/ssh_banner/*.json
        refresh_interval: 1m
    relabel_configs:
      # Ensure port is 22, pass as URL parameter
      - source_labels: [__address__]
        regex: (.*?)(:.*)?
        replacement: ${1}:22
        target_label: __param_target
      # Make instance label the target
      - source_labels: [__param_target]
        target_label: instance
      # Actually talk to the blackbox exporter though
      - target_label: __address__
        replacement: $ip:9115

  - job_name: "blackbox_dns"
    metrics_path: /probe
    params:
      module: [dns_udp]
    file_sd_configs:
      - files:
          - /etc/prometheus/job/blackbox/dns/*.json
        refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: $ip:9115
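
Because the "$ip" placeholder appears many times in this file, replacing it by hand is error-prone. The sketch below shows one way to fill it in with sed; render_ip is a hypothetical helper name, and the template file name in the usage note is an assumption, not something the original setup prescribes. Only the literal text $ip is replaced, so relabel expressions such as ${1}:22 stay untouched.

```shell
#!/usr/bin/env bash
# Replace every literal "$ip" placeholder read from stdin with the given address.
# Other "$" expressions, for example "${1}:22", are not modified.
render_ip() {
  local ip="$1"
  sed -e 's/\$ip/'"$ip"'/g'
}
```

A possible use, assuming the configuration above was saved as prometheus.yml.tpl: render_ip 192.168.1.10 < prometheus.yml.tpl > /data/docker/prometheus/prometheus.yml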

2.2 Starting the server

docker run -itd \
  -p 9090:9090 \
  -v /data/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v /data/docker/prometheus/alert_rules:/etc/prometheus/alert_rules \
  -v /data/docker/prometheus/job:/etc/prometheus/job \
  -v /data/docker/prometheus/data:/data/prometheus/ \
  -v /etc/timezone:/etc/timezone:ro \
  -v /etc/localtime:/etc/localtime:ro \
  --name prometheus \
  --restart=always \
  prom/prometheus:v2.28.1 \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/data/prometheus/ \
  --storage.tsdb.retention.time=30d \
  --web.read-timeout=5m \
  --web.max-connections=10 \
  --query.max-concurrency=20 \
  --query.timeout=2m \
  --web.enable-lifecycle

Once the container has started, open http://$ip:9090 in a browser to see the web UI.

If a firewall is enabled on the system, you may need to allow the following ports. On CentOS 7, for example:

sudo firewall-cmd --zone=public --add-port=9090/tcp --permanent
sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --zone=public --add-port=3000/tcp --permanent
sudo firewall-cmd --reload

2.3 Reference documentation for deploying Prometheus Server

https://prometheus.io/docs/prometheus/latest/configuration/configuration/

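3. Deploying Monitoring Probes

Unlike Zabbix, Prometheus mainly works in pull mode: the server actively scrapes monitoring data from the interfaces that exporters expose. An exporter collects the data and serves it over HTTP for the server to read, so you can think of an exporter as a probe. For the meaning of fields in exporter output that this article does not explain, please search online.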
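
What an exporter returns is plain text, one sample per line, with lines starting with # carrying help and type information. As a rough illustration of reading that output by hand, the sketch below filters the samples of one metric from text on stdin; metric_samples is a hypothetical helper name, not part of any Prometheus tool.

```shell
#!/usr/bin/env bash
# Print every sample line of one metric from exporter output read on stdin.
# Argument: the exact metric name, without labels.
metric_samples() {
  awk -v name="$1" '
    /^#/ { next }                              # skip HELP and TYPE comment lines
    {
      metric = $1
      brace = index(metric, "{")               # labels start at the first brace
      if (brace > 0) metric = substr(metric, 1, brace - 1)
      if (metric == name) print
    }'
}
```

Once an exporter from the following sections is running, something like curl -s http://localhost:9100/metrics | metric_samples node_load1 would show only the node_load1 samples.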

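3.1 Deploying node_exporter

node_exporter monitors a host's CPU, memory, disk, I/O and similar information. It focuses on collecting data about the host system itself.

Download and unpack node exporter

Log in to the host to be monitored and download node exporter from its GitHub releases page,

or run curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.2.0/node_exporter-1.2.0.linux-amd64.tar.gz (the -L option follows the redirect that GitHub release downloads return)

After the download completes, run the following commands to unpack the binary package: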

tar xvfz node_exporter-1.2.0.linux-amd64.tar.gz
sudo mkdir -p /data/node_exporter/
sudo mv node_exporter-1.2.0.linux-amd64/* /data/node_exporter/

Create the prometheus user

sudo groupadd prometheus
sudo useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus
sudo chown prometheus:prometheus -R /data/node_exporter/

Create the systemd service

Create and edit the unit file:

sudo nano /etc/systemd/system/node_exporter.service

Write the following content:

[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/data/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target

Start node exporter with systemctl
Reload systemd so that it picks up the unit file, then start the service and check that it is running:

sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl status node_exporter

It should return text similar to the following:

● node_exporter.service - node_exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-06-05 09:18:56 GMT; 3s ago
 Main PID: 11050 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─11050 /data/node_exporter/node_exporter

Enable start on boot: sudo systemctl enable node_exporter

Open the firewall port
Run curl localhost:9100. If a web page is returned, node exporter has started successfully.
Running curl http://$ip:9100/ from another machine on the same network segment should return the same page.

If no page is returned, check whether the firewall port is still closed:

sudo firewall-cmd --zone=public --add-port=9100/tcp --permanent
sudo firewall-cmd --reload

Configure Prometheus
Log in to the Prometheus server and edit the following file:
nano /data/docker/prometheus/job/host.json. Sample content is shown below; replace the IP addresses with your real ones.

[
{
"targets": [ "192.168.1.100:9100"],
"labels": {
"subject": "node_exporter",
"hostname": "server1"
}
},
{
"targets": [ "192.168.1.101:9100"],
"labels": {
"subject": "node_exporter",
"hostname": "server2"
}
}
]
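
When there are many hosts, the target file can be generated instead of typed. The following is a minimal sketch under stated assumptions: target_group and host_targets are hypothetical helper names, the input is one "address hostname" pair per line, and no JSON escaping is done, so the values must not contain quotes or backslashes.

```shell
#!/usr/bin/env bash
# Print one file_sd target group as JSON.
# Arguments: address, subject label, hostname label.
target_group() {
  printf '  {\n    "targets": ["%s"],\n    "labels": {"subject": "%s", "hostname": "%s"}\n  }' "$1" "$2" "$3"
}

# Print a complete file_sd JSON document.
# Argument: subject label. Stdin: lines of "address hostname".
host_targets() {
  local subject="$1" first=1 addr host
  printf '[\n'
  while read -r addr host; do
    [ -n "$addr" ] || continue                 # ignore empty lines
    if [ "$first" -eq 0 ]; then printf ',\n'; fi
    target_group "$addr" "$subject" "$host"
    first=0
  done
  printf '\n]\n'
}
```

For example, printf '%s\n' '192.168.1.100:9100 server1' '192.168.1.101:9100 server2' | host_targets node_exporter > /data/docker/prometheus/job/host.json produces a file with the same structure as the sample above.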

Reference documentation for deploying node_exporter
https://github.com/prometheus/node_exporter
https://prometheus.io/docs/guides/node-exporter/
https://www.jianshu.com/p/7bec152d1a1f

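3.2 Deploying mysqld-exporter

mysqld-exporter monitors MySQL performance and related data.

Log in to the host running the MySQL database and start the exporter with Docker (--link mysql assumes that MySQL itself runs in a container named mysql on the same host):

docker run -d \
  -p 9104:9104 \
  --link mysql \
  --name mysqld-exporter \
  --restart on-failure:5 \
  -e DATA_SOURCE_NAME="root:pwdpwdpwdpwdpwd@(mysql:3306)/" \
  prom/mysqld-exporter:v0.13.0

Once it is running, http://127.0.0.1:9104/metrics shows the collected metrics, and the same port should also be reachable from the Prometheus server. Then list the exporter's address in /data/docker/prometheus/job/mysqld-exporter.json on the Prometheus server, in the same format as host.json, so that the mysqld-exporter job from section 2.1 picks it up.

Reference documentation for deploying mysqld-exporter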
https://github.com/prometheus/mysqld_exporter
https://registry.hub.docker.com/r/prom/mysqld-exporter/

3.3 Deploying cadvisor

cadvisor monitors the state of containers.

Log in to the Docker host and start cadvisor with the following command:

docker run \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=9101:8080 \
  --detach=true \
  --name=cadvisor \
  --restart on-failure:5 \
  --privileged \
  --device=/dev/kmsg \
  gcr.io/cadvisor/cadvisor:v0.38.6

You may come across two cadvisor images, gcr.io/cadvisor/cadvisor and google/cadvisor. Using gcr.io/cadvisor/cadvisor is recommended.

Configure the Prometheus server
Log in to the host running the Prometheus server and edit the file with nano /data/docker/prometheus/job/cadvisor.json. Sample content:

[
{
"targets": [ "192.168.1.100:9101"],
"labels": {
"subject": "cadvisor",
"hostname": "server1"
}
},
{
"targets": [ "192.168.1.101:9101"],
"labels": {
"subject": "cadvisor",
"hostname": "server2"
}
}
]

If the Docker host runs a firewall, remember to allow the port:

sudo firewall-cmd --zone=public --add-port=9101/tcp --permanent
sudo firewall-cmd --reload

Reference documentation for deploying cadvisor
https://github.com/google/cadvisor

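3.4 Deploying blackbox_exporter

blackbox_exporter monitors targets in a black-box fashion, that is, by probing them from the outside (ping, HTTP, TCP, SSH banner and DNS in this article).

Create the configuration file
Log in to the Prometheus server host and run the following commands:

sudo mkdir -p /data/docker/blackbox/conf
sudo chown -R myusername:myusername /data/docker/blackbox

Then create and edit this file:

nano /data/docker/blackbox/conf/blackbox.yml

A sample yml file: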

modules:
  ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4" # defaults to "ip6"
      ip_protocol_fallback: false # no fallback to "ip6"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      preferred_ip_protocol: "ip4"
  http_post_2xx_json:
    prober: http
    timeout: 30s
    http:
      preferred_ip_protocol: "ip4"
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key1":"value1","params":{"param2":"value2"}}'
  http_basic_auth:
    prober: http
    timeout: 60s
    http:
      method: POST
      headers:
        Host: "login.example.com"
      basic_auth:
        username: "username"
        password: "mysecret"
  tls_connect:
    prober: tcp
    timeout: 5s
    tcp:
      tls: true
  tcp_connect:
    prober: tcp
    timeout: 5s
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  dns_udp:
    prober: dns
    timeout: 10s
    dns:
      transport_protocol: udp
      preferred_ip_protocol: ip4
      query_name: "www.example.cn"
      query_type: "A"

Configure Prometheus
Still on the Prometheus server host, run the following commands:

sudo mkdir -p /data/docker/prometheus/job/blackbox/
sudo mkdir -pv /data/docker/prometheus/job/blackbox/{dns,http_2xx,ping,ssh_banner,tcp}
sudo chown -R myusername:myusername /data/docker/prometheus/job/blackbox/

Next, create a JSON file in each of the folders under /data/docker/prometheus/job/blackbox/ and fill it in based on the samples below.

In the dns folder, create dns.json. Sample:

[
{
"targets": [ "192.168.1.1"],
"labels": {
"subject": "blackbox_dns",
"app": "my_dns"
}
}
]

In the http_2xx folder, create search-site.json. Sample:

[
{
"targets": [ "https://www.google.cn/?HealthCheck"],
"labels": {
"app": "google",
"subject": "blackbox_http_2xx",
"hostname": "server-01"
}
},
{
"targets": [ "https://cn.bing.com/?HealthCheck"],
"labels": {
"app": "bing",
"subject": "blackbox_http_2xx",
"hostname": "server-02"
}
}
]

In the ping folder, create search-site.json. Sample:

[
{
"targets": [ "www.google.cn"],
"labels": {
"app": "google",
"subject": "blackbox_ping",
"hostname": "server-01"
}
},
{
"targets": [ "cn.bing.com"],
"labels": {
"app": "bing",
"subject": "blackbox_ping",
"hostname": "server-02"
}
}
]

In the ssh_banner folder, create ssh-banner.json. Sample:

[
{
"targets": [ "192.168.1.100:22"],
"labels": {
"subject": "blackbox_ssh_banner",
"hostname": "server-01"
}
},
{
"targets": [ "192.168.1.101:22"],
"labels": {
"subject": "blackbox_ssh_banner",
"hostname": "server-02"
}
}
]

In the tcp folder, create tcp.json. Sample:

[
{
"targets": [ "$ip:3306"],
"labels": {
"app": "mysql.example.cn",
"subject": "blackbox_tcp",
"hostname": "mysql"
}
}
]

Run blackbox_exporter

On the host running the Prometheus server, start blackbox_exporter in a container with the following command:

docker run -d \
  --restart on-failure:5 \
  -p 9115:9115 \
  -v /data/docker/blackbox/conf/blackbox.yml:/config/blackbox.yml:ro \
  --name blackbox_exporter \
  prom/blackbox-exporter:v0.19.0 \
  --config.file=/config/blackbox.yml

Once it has started, open http://$ip:9090/targets to see the data reported by every probe configured so far. The State column should show UP for each of them.
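
If a blackbox target stays DOWN, it can help to request the probe by hand. The relabel_configs in section 2.1 turn each target into a request to the exporter's /probe endpoint with module and target parameters; the sketch below builds that URL. probe_url is a hypothetical helper name, and it does no URL encoding, so it is only suitable for plain host names or host:port values.

```shell
#!/usr/bin/env bash
# Build the URL that Prometheus effectively scrapes for one blackbox target.
# Arguments: exporter address, module name, probe target.
# The target is not URL-encoded, so pass plain host names or host:port values.
probe_url() {
  printf 'http://%s/probe?module=%s&target=%s\n' "$1" "$2" "$3"
}
```

For example, curl -s "$(probe_url 127.0.0.1:9115 ping www.example.com)" | grep probe_success shows whether that single probe succeeded.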

Reference documentation for deploying blackbox_exporter
https://github.com/prometheus/blackbox_exporter
https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/exporter/commonly-eporter-usage/install_blackbox_exporter

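4. Deploying Grafana

Next, deploy the visualization tool Grafana. Grafana integrates with Prometheus quickly and, through manual configuration or ready-made templates, turns the collected data into graphical dashboards.

4.1 Starting

Run the following commands to prepare for startup:

sudo mkdir -p /data/docker/grafana
sudo chown 472:472 /data/docker/grafana -R

Run Grafana with Docker:

docker run -d \
  -p 3000:3000 \
  --name grafana \
  -v /data/docker/grafana:/var/lib/grafana \
  -v /etc/localtime:/etc/localtime:ro \
  --restart=always \
  grafana/grafana:8.0.6

Once it has started, the UI is available at http://$ip:3000. The default account and password are admin / admin.

4.2 Configuration

Configure the data source
Click "Configuration -> Data sources" to open http://$ip:3000/datasources, add a Prometheus data source and fill in its settings.
Configure dashboards

Click "Dashboards -> Manage -> import" to open http://$ip:3000/dashboard/import and import a Grafana dashboard template. Under "Import via grafana.com", enter the ID of the template you want to import. Commonly used template IDs:

node exporter ID: 8919
Cadvisor ID: 14282
mysqld-exporter ID: 7362

You can also search for dashboard templates at https://grafana.com/grafana/dashboards, or build your own dashboards.

4.3 Reference documentation for deploying Grafana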

https://grafana.com/docs/grafana/latest/installation/docker/

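5. Deploying AlertManager

So far we have deployed Prometheus Server, the exporters and the Grafana visualization component. We still need an alerting component, so that when a fault occurs the monitoring system can notify recipients through several channels and they can respond in time. Prometheus itself does not deliver notifications. Based on preconfigured rules, it sends alerts to AlertManager, which handles them centrally and notifies recipients by e-mail, SMS, WeChat, DingTalk and so on. Like Grafana, AlertManager is not limited to Prometheus and can also process alerts from other programs.

5.1 Preparation

Run the following commands:

sudo mkdir -pv /data/docker/alertmanager
sudo chown -R myusername:myusername /data/docker/alertmanager/
cd /data/docker/alertmanager

In the /data/docker/alertmanager folder, create the files alertmanager.yml and email.tmpl.

A sample alertmanager.yml follows. Be sure to set the SMTP options and the ddurl parameter of the webhook: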

global:
  resolve_timeout: 5m
  # SMTP settings for e-mail
  smtp_smarthost: 'smtp.gmail.com:465'
  smtp_from: 'example@gmail.com'
  smtp_auth_username: 'example@gmail.com'
  smtp_auth_password: 'xxxxx'
  smtp_require_tls: false
# Custom notification templates
templates:
  - '/etc/alertmanager/email.tmpl'
# route defines how alerts are dispatched
route:
  # Labels used to group alerts
  group_by: ['alertname']
  # How long to wait after an alert fires, so that alerts of the same group are sent together
  group_wait: 10s
  # Interval between two batches of notifications for the same group
  group_interval: 10s
  # Interval before an unchanged alert is sent again; reduces repeated e-mails
  repeat_interval: 1h
  # Default receiver
  receiver: 'myreceiver'
  routes:
    # Sub-routes decide which receiver gets which alerts
    - receiver: 'myreceiver'
      continue: true
      group_wait: 10s
receivers:
  - name: 'myreceiver'
    email_configs:
      # - to: 'example@gmail.com, example2@gmail.com'
      - to: 'example@gmail.com'
        #send_resolved: true
        html: '{{ template "email.to.html" . }}'
        headers: { Subject: "Prometheus [Warning] alert e-mail" }
    # DingTalk settings
    webhook_configs:
      - url: 'http://$ip:18080/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx'

A sample email.tmpl follows. Note the string "2006-01-02 15:04:05" in the sample: it is Go's reference time layout, not an actual date, and must not be changed, otherwise the times in the alert may be displayed incorrectly. The value 28800e9 is 8 hours expressed in nanoseconds and shifts the UTC timestamps to UTC+8.

{{ define "email.to.html" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{ range .Alerts.Firing }}
=========start==========<br>
Alert source: prometheus_alert <br>
Alert level: {{ .Labels.level }} <br>
Alert type: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{ range .Alerts.Resolved }}
=========start==========<br>
Alert source: prometheus_alert <br>
Alert level: {{ .Labels.level }} <br>
Alert type: {{ .Labels.alertname }} <br>
Application: {{ .Labels.app }} <br>
Host: {{ .Labels.instance }} <br>
Summary: {{ .Annotations.summary }} <br>
Details: {{ .Annotations.description }} <br>
Fired at: {{ (.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
Resolved at: {{ (.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }} <br>
=========end==========<br>
{{ end }}{{ end -}}
{{- end }}

Both configuration files can be adapted with the help of https://prometheus.io/docs/alerting/latest/configuration/.
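
The number 28800e9 in the template is simply 8 hours written as seconds followed by e9 (nanoseconds). For a different time zone the value has to be recomputed; the arithmetic can be sketched as follows. offset_literal is a hypothetical helper name, and it handles whole-hour offsets only.

```shell
#!/usr/bin/env bash
# Print the "<seconds>e9" literal for a whole-hour UTC offset,
# in the same notation the template passes to .StartsAt.Add.
offset_literal() {
  echo "$(( $1 * 3600 ))e9"
}
```

offset_literal 8 gives the value used in the template above; for a server audience in UTC+9 one would call offset_literal 9 and paste the result in place of 28800e9.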

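5.2 Starting AlertManager

Run the following command:

docker run -d -p 9093:9093 \
  -v /data/docker/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
  -v /data/docker/alertmanager/email.tmpl:/etc/alertmanager/email.tmpl:ro \
  --name alertmanager \
  --restart=always \
  prom/alertmanager:v0.22.2

5.3 Access

Once it has started, the AlertManager UI is available at http://$ip:9093. Now go back to prometheus.yml from section 2.1, uncomment the targets line under alerting and set it to the AlertManager address ($ip:9093).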
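
Before any real rule fires, the notification path can be exercised by posting a hand-made alert to AlertManager's v2 HTTP API. The following is a minimal sketch: test_alert_payload and send_test_alert are hypothetical helper names, the label values are placeholders, and no JSON escaping is done.

```shell
#!/usr/bin/env bash
# Print a minimal JSON payload for AlertManager's POST /api/v2/alerts endpoint.
# Arguments: alert name, level label, instance label.
test_alert_payload() {
  printf '[{"labels":{"alertname":"%s","level":"%s","instance":"%s"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2" "$3"
}

# Post the payload to an AlertManager at the given host:port (requires curl).
send_test_alert() {
  test_alert_payload TestAlert warning manual-test |
    curl -sS -X POST -H 'Content-Type: application/json' --data-binary @- "http://$1/api/v2/alerts"
}
```

Calling send_test_alert 127.0.0.1:9093 on the AlertManager host should make a TestAlert entry appear in the UI and, after group_wait, trigger the configured receivers.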

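6. Deploying PrometheusAlert

As the previous section mentioned, AlertManager handles alerts and sends the notifications. The webhook configured in alertmanager.yml points at PrometheusAlert, which forwards alerts to DingTalk. This section deploys PrometheusAlert; the rules that make Prometheus generate alerts and send them to AlertManager are defined in section 7.

6.1 Preparation

Run the following commands:

sudo mkdir -p /data/docker/prometheus-alert/conf
sudo chown -R myusername:myusername /data/docker/prometheus-alert/

Download the file from https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/conf/app-example.conf and save it as /data/docker/prometheus-alert/conf/app.conf, then edit it as needed:

nano /data/docker/prometheus-alert/conf/app.conf

6.2 Starting

Run the following command to start prometheus-alert:

docker run -d --publish=18080:8080 \
  -v /data/docker/prometheus-alert/conf/:/app/conf:ro \
  -v /data/docker/prometheus-alert/db/:/app/db \
  -v /data/docker/prometheus-alert/log/:/app/logs \
  --name prometheusalert-center \
  feiyu563/prometheus-alert:v-4.5.0

Once it has started, the prometheus-alert UI is available at http://$ip:18080. The login user name and password are the ones set in app.conf.

If the system firewall is enabled, remember to allow the port:

sudo firewall-cmd --zone=public --add-port=18080/tcp --permanent
sudo firewall-cmd --reload

6.3 Configuration

Configure the alert template

Click AlertTemplate to open http://$ip:18080/template, which lists templates for the third-party systems that can be integrated.
Taking the DingTalk alert template as an example, change its content to the following. The main changes fix the displayed time being 8 hours behind and add some extra information.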

{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
## [Prometheus recovery notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Alert level: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### End time: {{GetCSTtime $v.endsAt}}
###### Host name: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Monitored subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{else}}
## [Prometheus alert notice]({{$v.generatorURL}})
#### [{{$v.labels.alertname}}]({{$var}})
###### Alert level: {{$v.labels.level}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### Host name: {{$v.labels.hostname}}
###### Host IP: {{$v.labels.instance}}
###### Application: {{$v.labels.app}}
###### Monitored subject: {{$v.labels.subject}}
##### {{$v.annotations.description}}
![Prometheus](https://raw.githubusercontent.com/feiyu563/PrometheusAlert/master/doc/alert-center.png)
{{end}}
{{ end }}

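Set up the DingTalk robot
In DingTalk, create a new group, then go to "Group settings -> Smart group assistant -> Add robot -> Custom -> Security settings" and add the IP address of the server that sends the messages. The Webhook address is shown afterwards. See https://blog.csdn.net/knight_zhou/article/details/105583741 for reference.

6.4 Reference documentation for deploying PrometheusAlert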

https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/install.md

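7. Configuring Alerting Rules

We also need to configure alerting rules in Prometheus Server. Rule files are referenced in the rule_files section of prometheus.yml and are written in YAML. Following the configuration in section 2.1, create a file in the /data/docker/prometheus/alert_rules/ folder whose name ends in .rules (the rule_files entry only matches *.rules), with the following content: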

groups:
  - name: Node_exporter Down
    rules:
      - alert: Instance down
        expr: up{job="host"} == 0
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.job }}"
          address: "{{ $labels.instance }}"
          description: "The instance has been unreachable for 1 minute."
      - alert: CPU usage too high (> 80%)
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 80
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} CPU usage too high"
          description: "{{ $labels.instance }}: CPU usage is above 80%, current value {{ $value }}"
      - alert: Memory usage too high (> 80%)
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m # the condition must hold this long before the alert is sent to alertmanager
        labels:
          level: warning
        annotations:
          summary: "{{ $labels.instance }} memory usage too high"
          description: "{{ $labels.instance }}: memory usage is above 80%, current value {{ $value }}"
      - alert: Memory pressure too high (> 1000)
        expr: rate(node_vmstat_pgmajfault[1m]) > 1000
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) memory pressure too high"
          description: "{{ $labels.instance }}: memory is under heavy pressure, current value {{ $value }}"
      - alert: Host network interface receiving too much data (> 2MB/s)
        expr: sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 2
        for: 5m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual inbound traffic"
          description: "{{ $labels.instance }}: the network interfaces have been receiving too much data (> 2MB/s) for 5 minutes. Current inbound traffic {{ $value }}MB per second."
      - alert: Host network interface sending too much data (> 2MB/s)
        expr: sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 2
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual outbound traffic"
          description: "{{ $labels.instance }}: the network interfaces have been sending too much data (> 2MB/s) for 3 minutes. Current outbound traffic {{ $value }}MB per second."
      - alert: Disk read rate too high (> 50 MB/s)
        expr: sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50
        for: 3m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual disk read IO"
          description: "{{ $labels.instance }}: disk read IO looks abnormal, current value {{ $value }}MB per second"
      - alert: Disk write rate too high (> 50 MB/s)
        expr: sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) unusual disk write IO"
          description: "{{ $labels.instance }}: disk write IO looks abnormal, current value {{ $value }}MB per second"
      # Please add ignored mountpoints in node_exporter parameters like
      # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
      # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
      - alert: Disk space low (< 10% left)
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) disk space running out"
          description: "{{ $labels.instance }}: less than 10% disk space left, currently {{ $value }}% available"
      - alert: Disk read latency high (> 100ms)
        expr: rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) disk read latency high"
          description: "{{ $labels.instance }}: disk read latency is above 100ms, current value {{ $value }}"
      - alert: Disk write latency high (> 100ms)
        expr: rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) disk write latency high"
          description: "{{ $labels.instance }}: disk write latency is above 100ms, current value {{ $value }}"
      # 1000 context switches is an arbitrary number.
      # Alert threshold depends on nature of application.
      # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
      #- alert: Host context switching too high (> 1500/s)
      #  expr: (rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 1500
      #  for: 3m
      #  labels:
      #    level: warning
      #  annotations:
      #    summary: "(instance {{ $labels.instance }}) host context switching too high"
      #    description: "{{ $labels.instance }}: context switching is above 1500/s, current value {{ $value }}"
      - alert: Host swap usage high (> 80%)
        expr: (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host swap space warning"
          description: "{{ $labels.instance }}: swap usage is above 80%, current value {{ $value }}"
      - alert: Host systemd service down
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 0m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) a systemd service has failed"
          description: "{{ $labels.instance }}: systemd unit {{ $labels.name }} is in the failed state"
      - alert: Physical host temperature too high (> 75°)
        expr: node_hwmon_temp_celsius > 75
        for: 5m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) physical host temperature alert"
          description: "{{ $labels.instance }}: abnormal host temperature (> 75°), current value {{ $value }}"
      - alert: Physical node temperature alarm triggered
        expr: node_hwmon_temp_crit_alarm_celsius == 1
        for: 0m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) host mainboard temperature alert"
          description: "{{ $labels.instance }}: mainboard temperature is too high, current value {{ $value }}"
      - alert: Host network receive errors
        expr: rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host network is receiving error packets"
          description: 'Host {{ $labels.instance }} interface {{ $labels.device }} had a receive error ratio above 1% over the last two minutes, current ratio {{ printf "%.2f" $value }}'
      - alert: Host network transmit errors
        expr: rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) host network is sending error packets"
          description: 'Host {{ $labels.instance }} interface {{ $labels.device }} had a transmit error ratio above 1% over the last two minutes, current ratio {{ printf "%.2f" $value }}'
      - alert: TCP connect time too long
        expr: probe_duration_seconds{job="blackbox_tcp"} > 5
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) TCP connect time above 5 seconds"
          description: "TCP connect time is above 5 seconds, current value {{ $value }}"
      - alert: Host TCP connection count
        expr: node_netstat_Tcp_CurrEstab > 800
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) too many TCP connections"
          description: "{{ $labels.instance }}: too many established TCP connections (> 800), current value {{ $value }}"
      - alert: TCP connections waiting to close > 4000
        expr: node_sockstat_TCP_tw > 4000
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) TCP connections waiting to close > 4000"
          description: "{{ $labels.instance }}: too many TCP connections in TIME_WAIT, current value {{ $value }}"
      - alert: Clock skew detected
        expr: (node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) clock skew detected"
          description: "{{ $labels.instance }}: clock skew detected, the clock is out of sync, current value {{ $value }}"
      - alert: Container stopped
        expr: time() - container_last_seen > 300
        for: 0m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) a container may have stopped"
          description: "Container {{ $labels.name }} - {{ $value }} may have stopped running"
      # cAdvisor can sometimes consume a lot of CPU, so this alert will fire constantly.
      # If you want to exclude it from this alert, exclude the serie having an empty name: container_cpu_usage_seconds_total{name!=""}
      - alert: Container CPU usage
        expr: (sum(rate(container_cpu_usage_seconds_total[3m])) BY (instance, name) * 100) > 300
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container CPU usage too high"
          description: "{{ $labels.instance }}: container CPU usage > 300%, current value {{ $value }}%"
      # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
      #- alert: Container memory usage
      #  expr: (sum(container_memory_working_set_bytes) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 85
      #  for: 2m
      #  labels:
      #    level: warning
      #  annotations:
      #    summary: "(instance {{ $labels.instance }}) container memory usage too high"
      #    description: "{{ $labels.instance }}: container {{ $labels.name }} memory usage > 85%, current value {{ $value }}%"
      - alert: Container volume inode usage
        expr: (1 - (sum(container_fs_inodes_free) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container volume usage too high"
          description: "{{ $labels.instance }}: container volume inode usage > 80%, current value {{ $value }}%"
      - alert: Container volume IO usage
        expr: (sum(container_fs_io_current) BY (instance, name) * 100) > 80
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container volume IO too high"
          description: "{{ $labels.instance }}: container {{ $labels.name }} volume IO usage > 80%, current value {{ $value }}%"
      - alert: Container CPU throttling
        expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) container is being CPU throttled"
          description: "{{ $labels.instance }}: container {{ $labels.name }} is being throttled heavily"
      - alert: Blackbox probe failed
        expr: probe_success == 0
        for: 5m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox probe found a problem"
          description: "Job {{ $labels.job }} reported a failed probe"
      - alert: Blackbox slow probe
        expr: avg_over_time(probe_duration_seconds[1m]) > 15
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox probe too slow"
          description: "The blackbox probe took more than {{ $value }} seconds to complete, {{ $labels }}"
      - alert: Blackbox ping time too long
        expr: probe_duration_seconds{job="blackbox_ping"} > 5
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox ping time too long"
          description: "Ping time is above 5 seconds, {{ $value }}, {{ $labels }}"
      - alert: Blackbox HTTP probe failed
        expr: probe_http_status_code <= 199 OR probe_http_status_code >= 400
        for: 2m
        labels:
          level: critical
        annotations:
          summary: "(instance {{ $labels.instance }}) blackbox HTTP probe failed"
          description: "HTTP status code is not in 200-399, {{ $value }}, {{ $labels }}"
      - alert: SSL certificate expires within 30 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
        for: 60m
        labels:
          level: warning
        annotations:
          summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "The SSL certificate expires within 30 days, {{ $value }}, {{ $labels }}"
      - alert: SSL certificate expires within 3 days
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 3
        for: 60m
        labels:
          level: critical
        annotations:
          summary: "Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})"
          description: "The SSL certificate expires within 3 days, {{ $value }}, {{ $labels }}"
      - alert: SSL certificate has expired
        expr: probe_ssl_earliest_cert_expiry - time() <= 0
        for: 60m
        labels:
          level: critical
        annotations:
          summary: "Blackbox SSL certificate has expired (instance {{ $labels.instance }})"
          description: "The SSL certificate has expired, {{ $value }}, {{ $labels }}"
      - alert: Blackbox slow HTTP probe
        expr: avg_over_time(probe_http_duration_seconds[1m]) > 3
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "Blackbox slow HTTP probe (instance {{ $labels.instance }})"
          description: "The HTTP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"
      - alert: Blackbox slow ICMP probe
        expr: avg_over_time(probe_icmp_duration_seconds[1m]) > 3
        for: 2m
        labels:
          level: warning
        annotations:
          summary: "Blackbox slow ICMP probe (instance {{ $labels.instance }})"
          description: "The ICMP request took more than 3s, current value {{ $value }}, target {{ $labels.instance }}"
      - alert: DNS server down
        expr: probe_dns_answer_rrs == 0
        for: 1m
        labels:
          level: warning
        annotations:
          summary: "DNS server down"
          description: "The DNS server has not answered for 1 minute and may be down."

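After the rule file is in place, restart Prometheus Server; the rules can then be viewed at http://$ip:9090/rules. It is also worth reading up on alert states (inactive, pending, firing).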
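
Since the server in section 2.2 is started with --web.enable-lifecycle, a full restart should not be strictly necessary: Prometheus can be asked to re-read its configuration and rule files over HTTP. The sketch below combines that with a check that the file name matches the *.rules glob from section 2.1, which is easy to get wrong. is_rule_file and reload_prometheus are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Succeed when a file name matches the rule_files glob (*.rules) from section 2.1.
is_rule_file() {
  case "$1" in
    *.rules) return 0 ;;
    *) return 1 ;;
  esac
}

# Ask a running Prometheus at host:port to reload configuration and rules.
# This only works when the server was started with --web.enable-lifecycle.
reload_prometheus() {
  curl -sS -X POST "http://$1/-/reload"
}
```

A typical sequence would be is_rule_file node.rules && reload_prometheus 127.0.0.1:9090, followed by a look at the /rules page to confirm that the rules were loaded.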

Reference documentation: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

Category: docker
Published: 2023-10-09 03:40:51
Link: https://www.kaopuke.com/article/k-p-k_14_uzokfz_13__7__18_5.html
