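prometheus+grafana: this article, collected and shared by the 靠谱客 blogger 糊涂外套, covers setting up the Prometheus central management service, the node-exporter collection service, the Alertmanager alerting service, and the Grafana dashboard service, followed by email alerting and a walkthrough of the related configuration files.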

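Overview

Setting up the Prometheus central management service

# Installed on the monitoring host; collects metrics from the node servers

Download: https://prometheus.io/download/

Extract the archive

Run: nohup ./prometheus --config.file=./prometheus.yml &>> ./prometheus.log &

Visit http://192.168.1.24:9090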

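Setting up the node-exporter collection service

# Installed on every server whose host metrics are to be collected

Download: https://prometheus.io/download/

Extract the archive

Run: nohup ./node_exporter &>> ./node_exporter.log &

Reload: kill -1 PID (sends SIGHUP; Prometheus re-reads prometheus.yml on this signal, so use the Prometheus PID after the edit below)

Visit http://192.168.1.24:9100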
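
To check from a script rather than a browser that node_exporter is serving data, the metrics page can be fetched and parsed. A minimal Python sketch, assuming the exporter publishes its metrics at the /metrics path as plain-text "name{labels} value" lines without timestamps; fetch_metrics, parse_samples, and the SAMPLE excerpt are illustrative names and data, not part of node_exporter:

```python
from urllib.request import urlopen


def fetch_metrics(url, timeout=5.0):
    """Download the plain-text metrics page from an exporter."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text, metric):
    """Return {series: value} for every sample line of one metric.

    Simplification: assumes 'name{labels} value' lines without timestamps.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank and HELP/TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        if name == metric:
            samples[series] = float(value)
    return samples


# Hypothetical excerpt of a node_exporter metrics page.
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemTotal_bytes 8.201994240e+09
"""

print(parse_samples(SAMPLE, "node_load1"))
```

On a live host, parse_samples(fetch_metrics("http://192.168.1.24:9100/metrics"), "node_load1") would return the current load instead of the sample value.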

Add the nodes to the Prometheus scrape configuration:

vim prometheus.yml

Add:

  - job_name: '21'

    static_configs:

    - targets: ['192.168.1.21:9100']

  - job_name: '24'

    static_configs:

    - targets: ['192.168.1.24:9100']

  - job_name: '20'

    static_configs:

    - targets: ['192.168.1.20:9100']

# targets lists the addresses to scrape; separate multiple addresses with commas
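
Because one targets list can hold several comma-separated addresses, the three hosts could also share a single job instead of one job each; Prometheus still tells them apart through the instance label. A small Python sketch that renders such an entry, assuming a job name of 'node' (hypothetical) and the 9100 port used above; render_scrape_job is an illustrative helper, not a Prometheus tool:

```python
def render_scrape_job(job_name, hosts, port=9100):
    """Render one scrape_configs entry that lists every host as a target."""
    targets = ", ".join(f"'{host}:{port}'" for host in hosts)
    return (
        f"  - job_name: '{job_name}'\n"
        f"    static_configs:\n"
        f"    - targets: [{targets}]\n"
    )


# The three node servers from this article, grouped under one job.
print(render_scrape_job("node", ["192.168.1.20", "192.168.1.21", "192.168.1.24"]))
```

The printed text can be pasted under scrape_configs in prometheus.yml in place of the three separate jobs.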

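Setting up the Alertmanager alerting service

Can be installed on any server; it collects alerts and sends them to the operations staff as messages

Download: https://prometheus.io/download/

Extract the archive

Run: nohup ./alertmanager --config.file=./alertmanager.yml &>> ./alertmanager.log &

Visit: http://192.168.1.24:9093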

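Setting up the Grafana dashboard service

A user-friendly web UI that makes server performance easier to monitor

Download: https://grafana.com/get

Extract the archive

Run: nohup ./grafana-server &>> ./grafana-server.log &

Visit: http://192.168.1.24:3000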

Add the monitoring host to Grafana as a data source:

Click Save

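Add the Kubernetes monitoring dashboard template to Grafana

Download: https://grafana.com/dashboards

Select the downloaded template

Select the monitoring host

Add it and open the dashboard

The graphs stay empty until data has been collected for a while, so be patient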

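Basic Grafana usage

Email alerts

alertmanager.yml holds the mailbox settings; see the configuration file walkthrough below for details

prometheus.yml specifies the Alertmanager address and the rule_files paths

vim first_rules.yml: defines the alerting rules

Configuration files in detail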

prometheus.yml

# my global config

global:

  scrape_interval:     15s

# How often Prometheus scrapes its targets; as configured here, metrics are collected every 15 seconds

  evaluation_interval: 15s

# How often rules are evaluated; as configured here, the configured rule set is evaluated every 15 seconds

  external_labels:

      monitor: 'codelab-monitor'

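# Extra labels attached to metrics and alerts sent to external systems; they distinguish one Prometheus server from another, since several Prometheus servers can share one Alertmanager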

# Alertmanager configuration

alerting:

  alertmanagers:

  - static_configs:

# Address of Alertmanager; its installation is covered in the Alertmanager section above

    - targets: ["192.168.1.24:9093"]

      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.

rule_files:

# Rule files to load; each file is YAML made up of rule groups, as in first_rules.yml below

   - "/work/prometheus-2.5.0.linux-amd64/first_rules.yml"

  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:

# Here it's Prometheus itself.

scrape_configs:

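# job_name names the scrape job; it is added to every scraped metric as the job label, identifying which job the metric belongs to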

  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    static_configs:

    - targets: ['localhost:9090']

  - job_name: '21'

    static_configs:

    - targets: ['192.168.1.21:9100']

  - job_name: '24'

    static_configs:

    - targets: ['192.168.1.24:9100']

  - job_name: '20'

    static_configs:

    - targets: ['192.168.1.20:9100']

# targets lists the addresses to scrape; separate multiple addresses with commas

alertmanager.yml

global:

  resolve_timeout: 5m

  smtp_smarthost: 'smtp.163.com:25'

  smtp_from: 'hxqxiaoqi1990@163.com'

  smtp_auth_username: 'hxqxiaoqi1990@163.com'

  smtp_auth_password: 'YOUR_AUTHORIZATION_CODE'

  smtp_require_tls: false

# SMTP server and mailbox settings

templates:

# Template files used to render alert notifications

  - '/work/alertmanager-0.15.3.linux-amd64/template/123.tmpl'

route:

  group_by: ['alertname']

  group_wait: 10s

  group_interval: 10s

  repeat_interval: 1h

  receiver: 'mail'

receivers:

#- name: 'web.hook'

#  webhook_configs:

#  - url: 'http://127.0.0.1:5001/'

- name: 'mail'

  email_configs:

  - to: 'hxqxiaoqi1990@163.com'

inhibit_rules:

  - source_match:

      severity: 'critical'

    target_match:

      severity: 'warning'

    equal: ['alertname', 'dev', 'instance']
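
The inhibit_rules block above mutes a warning alert while a critical alert with the same alertname, dev, and instance labels is firing. A simplified Python sketch of that matching logic, assuming alerts are plain dictionaries of labels; is_inhibited and the sample alerts are illustrative, and this is not Alertmanager's actual code:

```python
SOURCE_MATCH = {"severity": "critical"}
TARGET_MATCH = {"severity": "warning"}
EQUAL = ["alertname", "dev", "instance"]


def is_inhibited(target, firing_alerts, source_match=SOURCE_MATCH,
                 target_match=TARGET_MATCH, equal=EQUAL):
    """Return True if `target` is muted by some firing source alert."""
    def matches(alert, matchers):
        return all(alert.get(k) == v for k, v in matchers.items())

    if not matches(target, target_match):
        return False
    for source in firing_alerts:
        if source is target or not matches(source, source_match):
            continue
        # A missing label counts as an empty value, so two alerts that
        # both lack 'dev' still agree on it.
        if all(source.get(k, "") == target.get(k, "") for k in equal):
            return True
    return False


# Hypothetical alerts: the same alert on host .21 at two severities,
# plus a warning on host .20 with no critical counterpart.
critical = {"alertname": "disk", "instance": "192.168.1.21:9100", "severity": "critical"}
warning = {"alertname": "disk", "instance": "192.168.1.21:9100", "severity": "warning"}
other = {"alertname": "disk", "instance": "192.168.1.20:9100", "severity": "warning"}
firing = [critical, warning, other]

print(is_inhibited(warning, firing))  # muted by the critical alert
print(is_inhibited(other, firing))    # different instance, still delivered
```

Note that the rules in this article only use the severities warning and test, so this inhibit rule has nothing to act on until a rule labelled critical is added.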

first_rules.yml

groups:

- name: test-rule

  rules:

  - alert: clients

    expr: node_load1 > 1

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "{{$labels.instance}}: Too many clients detected"

      description: "{{$labels.instance}}: Client num is above 80% (current value is: {{ $value }})"
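
The for: 1m line means the expression has to stay true for a full minute before the alert fires; until then the alert is only pending. A rough Python sketch of that state machine, assuming the 15s evaluation_interval from the prometheus.yml in this article; alert_states is an illustrative helper and leaves out details of the real rule engine:

```python
def alert_states(condition_by_eval, eval_interval_s=15, for_s=60):
    """Return (time_s, state) for each rule evaluation.

    condition_by_eval: one boolean per evaluation, True when expr matched.
    """
    active_since = None
    states = []
    for i, matched in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not matched:
            active_since = None          # any gap resets the timer
            state = "inactive"
        else:
            if active_since is None:
                active_since = now       # first evaluation that matched
            state = "firing" if now - active_since >= for_s else "pending"
        states.append((now, state))
    return states


# Hypothetical run: load is high from t=15s to t=75s, then drops.
for t, state in alert_states([False, True, True, True, True, True, False]):
    print(t, state)
```

With these inputs the alert is pending from 15s through 60s and fires at 75s, one minute after the expression first matched.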

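The set lines below are mailx settings (an assumption from their syntax; such options usually live in /etc/mail.rc):

set from=hxqxiaoqi1990@163.com  # account the mail is sent from

set smtp=smtp.163.com    # outgoing mail server

set smtp-auth-user=hxqxiaoqi1990@163.com   # your mailbox account

set smtp-auth-password=YOUR_AUTHORIZATION_CODE # authorization code

set smtp-auth=login

The next command keeps a CPU core busy, which is presumably how the node_load1 rule above was triggered for testing:

cat /dev/urandom | md5sum

Memory rules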

groups:

- name: test-rule

  rules:

  - alert: "memory alert"

    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 10

    for: 1s

    labels:

      severity: warning

    annotations:

      summary: "Service name: {{$labels.alertname}}"

      description: "Business 500 alert: {{ $value }}"

      value: "{{ $value }}"

- name: test-rule2

  rules:

  - alert: "memory alert"

    expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes) > 40

    for: 1s

    labels:

      severity: test

    annotations:

      summary: "Service name: {{$labels.alertname}}"

      description: "Business 500 alert: {{ $value }}"

      value: "{{ $value }}"

 

((node_memory_MemTotal_bytes -(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) )/node_memory_MemTotal_bytes ) * 100 > ${value}
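
The article uses two ways to get the used-memory percentage: one from MemAvailable (in the two rule groups above) and one from MemFree + Buffers + Cached (the expression just above). A small Python sketch of the arithmetic, with hypothetical byte counts for an 8 GiB host; the function names are illustrative:

```python
def used_percent_available(mem_total, mem_available):
    """100 - MemAvailable * 100 / MemTotal."""
    return 100 - (mem_available * 100) / mem_total


def used_percent_free_buffers_cached(mem_total, mem_free, buffers, cached):
    """(MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100."""
    return (mem_total - (mem_free + buffers + cached)) / mem_total * 100


# Hypothetical values in bytes.
GIB = 1024 ** 3
total = 8 * GIB

print(used_percent_available(total, mem_available=5 * GIB))
print(used_percent_free_buffers_cached(total, mem_free=1 * GIB,
                                       buffers=1 * GIB, cached=2 * GIB))
```

With these numbers the first form gives 37.5, which would trip the > 10 rule but not the > 40 rule, and the second gives 50.0. The two forms generally disagree on a real host, so a threshold tuned for one should not be reused for the other without checking.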

 

CPU rule

100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > ${value}
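
What this expression computes: irate takes the last two samples of each core's idle counter and turns them into idle seconds per second, avg by averages that across the cores of one instance, and subtracting from 100 gives the busy percentage. A Python sketch of the same arithmetic with hypothetical samples; irate here is a simplified stand-in that ignores counter resets:

```python
def irate(prev, last):
    """Per-second rate from the last two (timestamp_s, counter) samples."""
    (t0, v0), (t1, v1) = prev, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_by_cpu):
    """100 - avg(irate(idle seconds)) * 100 across the CPUs of one instance."""
    rates = [irate(prev, last) for prev, last in idle_samples_by_cpu.values()]
    return 100 - (sum(rates) / len(rates)) * 100


# Hypothetical idle counters (seconds) for two cores, scraped 15 s apart.
samples = {
    "0": ((0, 1000.0), (15, 1003.0)),   # idle 3 s out of 15 s
    "1": ((0, 2000.0), (15, 2012.0)),   # idle 12 s out of 15 s
}
print(cpu_usage_percent(samples))       # about 50 percent busy
```

Since irate needs at least two samples inside the range, the [30s] window leaves almost no slack at a 15s scrape interval; a single missed scrape makes the expression return nothing for that moment.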

 

Disk rule

100 - (node_filesystem_avail_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} /node_filesystem_size_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} ) * 100 > ${value}
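
The !~ matchers drop every filesystem whose label matches the given regular expression. Prometheus anchors these expressions to the whole label value, which is easy to overlook: "/boot" excludes the /boot mount but not /boot/efi. A Python sketch of the filtering, assuming re.fullmatch behaves like the anchored match for these simple patterns; keep, used_percent, and the sample labels are illustrative:

```python
import re

# Exclusion patterns copied from the disk rule.
EXCLUDE = {
    "fstype": r"nfs|rpc_pipefs|rootfs|tmpfs",
    "device": r"/etc/auto.misc|/dev/mapper/centos-home",
    "mountpoint": r"/boot|/net|/selinux",
}


def keep(labels):
    """True if no label fully matches its exclusion regex (PromQL `!~`)."""
    return not any(
        re.fullmatch(pattern, labels.get(name, ""))
        for name, pattern in EXCLUDE.items()
    )


def used_percent(avail_bytes, size_bytes):
    """Used space as a percentage: 100 - avail / size * 100."""
    return 100 - avail_bytes / size_bytes * 100


# Hypothetical filesystems on one host.
root = {"fstype": "xfs", "device": "/dev/sda1", "mountpoint": "/"}
run = {"fstype": "tmpfs", "device": "tmpfs", "mountpoint": "/run"}
efi = {"fstype": "vfat", "device": "/dev/sda2", "mountpoint": "/boot/efi"}

for fs in (root, run, efi):
    print(fs["mountpoint"], keep(fs))
print(used_percent(avail_bytes=25, size_bytes=100))
```

used_percent follows the 100 - avail/size*100 form that the disk alert in the template at the end of this article uses. To exclude everything under /boot as well, the pattern would need to be written as "/boot.*".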

 

Traffic rule:

(irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) / 1000) > ${value}
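
The division by 1000 turns the bytes-per-second rate into KB/s, so the threshold has to be given in KB/s too. A short Python sketch of the unit arithmetic with hypothetical counter readings; transmit_kb_per_s is an illustrative helper that ignores counter resets:

```python
def transmit_kb_per_s(prev, last):
    """Rate of a byte counter over two (timestamp_s, bytes) samples, in KB/s."""
    (t0, b0), (t1, b1) = prev, last
    return (b1 - b0) / (t1 - t0) / 1000


# Hypothetical readings 15 s apart: 1.5 GB were sent in between.
rate = transmit_kb_per_s((0, 0), (15, 1_500_000_000))
print(rate)
print(rate > 80000)
```

The 80000 used as the threshold in the template at the end of this article therefore means 80,000 KB/s, that is 80 MB/s or 640 Mbit/s on a single interface.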

Application CPU usage

process_cpu_usage{job="${app}"} * 100 > ${value}

 

Alert rule template

groups:

- name: down

  rules:

  - alert: "down alert"

    expr: up == 0

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "down alert"

      description: "Alert time:"

      value: "Used: {{ $value }}"

- name: memory

  rules:

  - alert: "memory alert"

    expr: ((node_memory_MemTotal_bytes -(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes) )/node_memory_MemTotal_bytes ) * 100 > 1

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "memory alert"

      description: "Alert time:"

      value: "Used: {{ $value }}%"

- name: cpu

  rules:

  - alert: "cpu alert"

    expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 80

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "cpu alert"

      description: "Alert time:"

      value: "Used: {{ $value }}%"

- name: disk

  rules:

  - alert: "disk alert"

    expr: 100 - (node_filesystem_avail_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} /node_filesystem_size_bytes{fstype !~ "nfs|rpc_pipefs|rootfs|tmpfs",device!~"/etc/auto.misc|/dev/mapper/centos-home",mountpoint !~ "/boot|/net|/selinux"} ) * 100  > 80

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "disk alert"

      description: "Alert time:"

      value: "Used: {{ $value }}%"

- name: net

  rules:

  - alert: "net alert"

    expr: (irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) / 1000) > 80000

    for: 1m

    labels:

      severity: warning

    annotations:

      summary: "net alert"

      description: "Alert time:"

      value: "Used: {{ $value }}KB/s"

Reposted from: https://www.cnblogs.com/hxqxiaoqi/p/10647256.html
