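Overview
Setting up the Prometheus central management service
# Install this on the monitoring host; it collects information from the node servers
Download: https://prometheus.io/download/
Extract the archive
Run: nohup ./prometheus --config.file=./prometheus.yml &>> ./prometheus.log &
Visit http://192.168.1.24:9090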
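Setting up the node-exporter collection service on the nodes
# Install this on every server whose host metrics should be collected
Download: https://prometheus.io/download/
Extract the archive
Run: nohup ./node_exporter &>> ./node_exporter.log &
Reload: kill -1 PID (sends SIGHUP; Prometheus re-reads prometheus.yml on this signal, so use the Prometheus PID after editing that file)
Visit http://192.168.1.24:9100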
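The page on port 9100 links to /metrics, where node_exporter serves its samples in the Prometheus text format. As a rough illustration of what Prometheus reads from that endpoint, here is a minimal parsing sketch. The function name parse_metrics and the sample text are made up for this example, and the parser is deliberately simplified: it does not handle escaped quotes inside label values.

```python
import re

# Matches lines such as:  node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the HELP/TYPE comment lines
        match = _SAMPLE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical excerpt of what /metrics might return
SAMPLE_TEXT = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5
"""

if __name__ == "__main__":
    for sample in parse_metrics(SAMPLE_TEXT):
        print(sample)
```

Each parsed sample carries the metric name, its labels, and the value; Prometheus adds the job and instance labels itself when it scrapes the target.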
Add the nodes to the Prometheus monitoring group:
vim prometheus.yml
Add (under scrape_configs):
  - job_name: '21'
    static_configs:
    - targets: ['192.168.1.21:9100']
  - job_name: '24'
    static_configs:
    - targets: ['192.168.1.24:9100']
  - job_name: '20'
    static_configs:
    - targets: ['192.168.1.20:9100']
# targets lists the addresses of the metric sources; separate multiple addresses with commas
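The three jobs follow one pattern, so the fragment can be generated rather than typed by hand. The sketch below shows both layouts: one job per host, as in this article, and a single job with a comma-separated targets list. The function names and the job name 'node' are assumptions made for illustration.

```python
def per_host_jobs(hosts, port=9100):
    """One job per host, named after the last octet of its IP address."""
    lines = []
    for host in hosts:
        job = host.rsplit(".", 1)[-1]  # '192.168.1.21' -> '21'
        lines += [
            f"  - job_name: '{job}'",
            "    static_configs:",
            f"    - targets: ['{host}:{port}']",
        ]
    return "\n".join(lines)


def single_job(hosts, job="node", port=9100):
    """One job whose targets list holds every host, separated by commas."""
    targets = ", ".join(f"'{host}:{port}'" for host in hosts)
    return "\n".join([
        f"  - job_name: '{job}'",
        "    static_configs:",
        f"    - targets: [{targets}]",
    ])


if __name__ == "__main__":
    hosts = ["192.168.1.21", "192.168.1.24", "192.168.1.20"]
    print(per_host_jobs(hosts))
    print(single_job(hosts))
```

With one job per host the job label distinguishes the machines. With a single job the instance label does that instead, which is usually enough for filtering in Grafana.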
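Setting up the Alertmanager alerting service
Install this on any server; it collects alerts and delivers them as notifications to the operations staff
Download: https://prometheus.io/download/
Extract the archive
Run: nohup ./alertmanager --config.file=./alertmanager.yml &>> ./alertmanager.log &
Visit: http://192.168.1.24:9093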
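Setting up the Grafana dashboard service
A user-friendly web display that makes server performance easier to monitor
Download: https://grafana.com/get
Extract the archive
Run: nohup ./grafana-server &>> ./grafana-server.log &
Visit: http://192.168.1.24:3000
Add the monitoring host to Grafana:
Click Save
Add the Kubernetes monitoring dashboard template to Grafana
Download: https://grafana.com/dashboards
Select the downloaded template
Select the monitoring host
Add it and open it
Graphs stay empty until data has been collected for a while, so be patient
Basic Grafana usage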
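Email alerts
alertmanager.yml specifies the mailbox settings; see the configuration file details for more
prometheus.yml specifies the Alertmanager address and the rule_files path
vim first_rules.yml to define the alerting rules
Configuration files in detail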
prometheus.yml
# my global config
global:
  # How often targets are scraped: metrics are collected from every target once every 15 seconds
  scrape_interval: 15s
  # How often rules are evaluated: every 15 seconds the configured rule sets are evaluated
  evaluation_interval: 15s
  # Extra labels attached to the metrics; they distinguish one Prometheus from another, since several Prometheus servers can share one Alertmanager
  external_labels:
    monitor: 'codelab-monitor'
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    # Address of Alertmanager (its installation is described in the Alertmanager section)
    - targets: ["192.168.1.24:9093"]
    # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
# Rule files to load; each file holds rule groups written in YAML
rule_files:
  - "/work/prometheus-2.5.0.linux-amd64/first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # job_name names the job; it is added to every scraped series as a label showing which job the series belongs to
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']
  - job_name: '21'
    static_configs:
    - targets: ['192.168.1.21:9100']
  - job_name: '24'
    static_configs:
    - targets: ['192.168.1.24:9100']
  - job_name: '20'
    static_configs:
    - targets: ['192.168.1.20:9100']
# targets lists the addresses of the metric sources; separate multiple addresses with commas
alertmanager.yml
global:
  resolve_timeout: 5m
  # Mailbox (SMTP) settings
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: 'hxqxiaoqi1990@163.com'
  smtp_auth_username: 'hxqxiaoqi1990@163.com'
  # Authorization code of the mailbox
  smtp_auth_password: 'YOUR_AUTHORIZATION_CODE'
  smtp_require_tls: false
# Templates used to render the alert notifications
templates:
- '/work/alertmanager-0.15.3.linux-amd64/template/123.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'mail'
receivers:
#- name: 'web.hook'
#  webhook_configs:
#  - url: 'http://127.0.0.1:5001/'
- name: 'mail'
  email_configs:
  - to: 'hxqxiaoqi1990@163.com'
inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'dev', 'instance']
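The inhibit_rules entry is the least obvious part of this file: a firing critical alert mutes warning alerts that carry the same alertname, dev and instance values. The sketch below models that matching logic in plain Python. It illustrates the rule's semantics and is not Alertmanager's actual code; the function names and the sample alerts are made up.

```python
SOURCE_MATCH = {"severity": "critical"}
TARGET_MATCH = {"severity": "warning"}
EQUAL = ["alertname", "dev", "instance"]


def matches(labels, matcher):
    """True when every matcher key has exactly the given value in labels."""
    return all(labels.get(key) == value for key, value in matcher.items())


def is_inhibited(target, firing):
    """Return True if some firing alert mutes the target alert."""
    if not matches(target, TARGET_MATCH):
        return False
    for source in firing:
        if source is target or not matches(source, SOURCE_MATCH):
            continue
        # A missing label counts as an empty one, so two alerts that both
        # lack 'dev' are still considered equal on that label.
        if all(source.get(key, "") == target.get(key, "") for key in EQUAL):
            return True
    return False


if __name__ == "__main__":
    critical = {"alertname": "cpu", "instance": "192.168.1.21:9100", "severity": "critical"}
    warning = {"alertname": "cpu", "instance": "192.168.1.21:9100", "severity": "warning"}
    print(is_inhibited(warning, [critical, warning]))
```

Because every rule in this article sets severity to warning, the inhibit rule only takes effect once some rule emits severity critical for the same alertname and instance.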
first_rules.yml
groups:
- name: test-rule
  rules:
  - alert: clients
    expr: node_load1 > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.instance}}: Too many clients detected"
      description: "{{$labels.instance}}: Client num is above 80% (current value is: {{ $value }})"
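The for: 1m clause means the expression must stay true for a full minute before the alert fires; until then the alert is only pending. The simulation below sketches that behaviour for node_load1 > 1, assuming the 15s evaluation_interval from prometheus.yml. The function name and the sample values are hypothetical.

```python
def alert_states(samples, threshold=1.0, for_seconds=60, interval=15):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for index, value in enumerate(samples):
        now = index * interval
        if value > threshold:
            if active_since is None:
                active_since = now
            elapsed = now - active_since
            states.append("firing" if elapsed >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # One node_load1 value per evaluation, 15 seconds apart (made-up data)
    load = [0.5, 1.5, 1.5, 1.5, 1.5, 1.5, 0.5]
    print(alert_states(load))
```

Only firing alerts are sent to Alertmanager, so a short load spike that ends within the minute never produces an email.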
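set from=hxqxiaoqi1990@163.com #account used to send mail
set smtp=smtp.163.com #mail server used for sending
set smtp-auth-user=hxqxiaoqi1990@163.com #your mailbox account
set smtp-auth-password=YOUR_AUTHORIZATION_CODE #authorization code
set smtp-auth=login
cat /dev/urandom | md5sum #runs until interrupted and keeps one CPU core busy, which raises node_load1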
Memory rules
groups:
- name: test-rule
  rules:
  - alert: "Memory alert"
    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 10
    for: 1s
    labels:
      severity: warning
    annotations:
      summary: "Service name: {{$labels.alertname}}"
      description: "Service 500 alert: {{ $value }}"
      value: "{{ $value }}"
- name: test-rule2
  rules:
  - alert: "Memory alert"
    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 40
    for: 1s
    labels:
      severity: test
    annotations:
      summary: "Service name: {{$labels.alertname}}"
      description: "Service 500 alert: {{ $value }}"
      value: "{{ $value }}"
((node_memory_MemTotal_bytes -(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) )/node_memory_MemTotal_bytes ) * 100 > ${value}
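Two different memory expressions appear here: one based on MemAvailable and one based on MemFree + Buffers + Cached. The sketch below evaluates both by hand on made-up byte counts so the arithmetic is easy to follow. The function names are hypothetical; the parameters mirror the node_exporter metric names.

```python
def used_pct_from_available(mem_total, mem_available):
    """100 - ((MemAvailable * 100) / MemTotal), as in the rule groups."""
    return 100 - (mem_available * 100) / mem_total


def used_pct_from_free(mem_total, mem_free, buffers, cached):
    """((MemTotal - (MemFree + Buffers + Cached)) / MemTotal) * 100."""
    return (mem_total - (mem_free + buffers + cached)) / mem_total * 100


if __name__ == "__main__":
    # Made-up values in bytes, chosen so that both formulas agree
    total = 8000
    print(used_pct_from_available(total, mem_available=2000))
    print(used_pct_from_free(total, mem_free=1000, buffers=200, cached=800))
```

On a real host the two results usually differ slightly, because the kernel's MemAvailable estimate is not simply the sum of free, buffer and cache memory.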
CPU rule
100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > ${value}
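The CPU expression takes the per-second growth of each core's idle counter, averages it over the cores of an instance, and subtracts the result from 100. The sketch below does the same by hand. It assumes each core contributes its last two (timestamp, value) samples, which is what irate uses, and it ignores counter resets. The names and numbers are made up.

```python
def irate(prev, last):
    """Per-second increase between the last two (timestamp, value) samples."""
    (t0, v0), (t1, v1) = prev, last
    return (v1 - v0) / (t1 - t0)


def cpu_used_pct(idle_samples_per_core):
    """100 - avg(irate(idle)) * 100 across all cores of one instance."""
    rates = [irate(prev, last) for prev, last in idle_samples_per_core]
    return 100 - (sum(rates) / len(rates)) * 100


if __name__ == "__main__":
    # Two cores, samples 15 seconds apart (hypothetical counter values)
    cores = [
        ((0, 100.0), (15, 112.0)),  # idle for 12 of 15 seconds
        ((0, 200.0), (15, 203.0)),  # idle for 3 of 15 seconds
    ]
    print(cpu_used_pct(cores))
```

With a 15s scrape interval, the [30s] window in the expression holds about two samples, which is the minimum irate needs. A wider window such as [1m] tolerates an occasional missed scrape.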
Disk rule
100 - (node_filesystem_avail_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} /node_filesystem_size_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} ) * 100 > ${value}
Traffic rule:
(irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) / 1000) > ${value}
Application CPU usage
process_cpu_usage{job="${app}"} * 100 > ${value}
Alert templates
groups:
- name: down
  rules:
  - alert: "down alert"
    expr: up == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "down alert"
      description: "Alert time:"
      value: "Used: {{ $value }}"
- name: memory
  rules:
  - alert: "Memory alert"
    expr: ((node_memory_MemTotal_bytes -(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) )/node_memory_MemTotal_bytes ) * 100 > 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Memory alert"
      description: "Alert time:"
      value: "Used: {{ $value }}%"
- name: cpu
  rules:
  - alert: "CPU alert"
    expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "CPU alert"
      description: "Alert time:"
      value: "Used: {{ $value }}%"
- name: disk
  rules:
  - alert: "disk alert"
    expr: 100 - (node_filesystem_avail_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} /node_filesystem_size_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} ) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "disk alert"
      description: "Alert time:"
      value: "Used: {{ $value }}%"
- name: net
  rules:
  - alert: "net alert"
    expr: (irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) / 1000) > 80000
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "net alert"
      description: "Alert time:"
      value: "Used: {{ $value }}KB"
Reposted from: https://www.cnblogs.com/hxqxiaoqi/p/10647256.html