Although dashboards let us visualize a system's runtime state, nobody can watch them around the clock. We therefore need tooling that monitors the system continuously and notifies us the moment something goes wrong, which makes alerting a critical part of any metrics-based monitoring setup.

In Prometheus, alerting is handled by Alertmanager, which supports a variety of notification channels for pushing alert messages to the people responsible when a problem occurs.

Alerting rules evaluated by the Prometheus server fire alerts to Alertmanager. Alertmanager is a standalone component that manages these alerts: it deduplicates, groups, and routes them, applies silences and inhibitions, and delivers notifications via email, real-time notification systems, and chat platforms.
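As an illustration of inhibition (a sketch only, not part of the configuration built later in this article), the snippet below mutes Warning-level notifications for an instance while a Critical alert is firing on that same instance. The severity values match the labels used by the alert rules further down; the matcher syntax is the one accepted by current Alertmanager releases:

```yaml
# Hypothetical inhibit_rules snippet for alertmanager.yml:
# while a Critical alert fires, matching Warning alerts are muted.
inhibit_rules:
  - source_matchers:
      - severity = "Critical"   # the alert that does the muting
    target_matchers:
      - severity = "Warning"    # the alerts that get muted
    equal: ['instance']         # only when both carry the same instance label
```

Silences, by contrast, are created at runtime through the Alertmanager web UI or amtool rather than in the configuration file.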
Start by creating a directory and an empty configuration file for Alertmanager:

```sh
cd /opt
mkdir alertmanager
cd alertmanager
touch alertmanager.yml
```
The configuration file has four main parts: global settings, notification templates, the routing tree, and receivers. The full content:
```yaml
global:
  resolve_timeout: 2m
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'SMTP authorization code'
templates:
  - '/etc/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname_wechat']
  group_wait: 10s
  group_interval: 10s
  receiver: 'wechat'
  repeat_interval: 1h
receivers:
  - name: 'wechat'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
    wechat_configs:
      - corp_id: 'wechat_corp_id'
        to_party: 'wechat_to_party'
        agent_id: 'wechat_agent_id'
        api_secret: 'wechat_apisecret'
        send_resolved: true
```
This configuration pushes notifications to both email and WeChat Work; Alertmanager ships with built-in WeChat Work support. The full set of options is documented at https://prometheus.io/docs/alerting/latest/configuration/.
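Before starting the container it is worth validating the file. amtool ships inside the official image, so a throwaway container can run the check; the paths below match the ones used in this article:

```sh
# Validate alertmanager.yml; prints SUCCESS plus the discovered
# receivers and templates, or a parse error with a line number.
docker run --rm \
  -v /opt/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  --entrypoint amtool \
  docker.io/prom/alertmanager:latest \
  check-config /etc/alertmanager/alertmanager.yml
```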
Next, create the template directory and a template file:

```sh
cd /opt/alertmanager
mkdir template
cd template
touch wechat.tmpl
```

Edit the file and define the message format:
```
{{ define "wechat.default.message" }}
{{ range $i, $alert := .Alerts }}
========= Monitoring alert =========
Status:        {{ .Status }}
Severity:      {{ $alert.Labels.severity }}
Alert name:    {{ $alert.Labels.alertname }}
Application:   {{ $alert.Annotations.summary }}
Host:          {{ $alert.Labels.instance }}
Details:       {{ $alert.Annotations.description }}
Trigger value: {{ $alert.Annotations.value }}
Started at:    {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
========= end =========
{{ end }}
{{ end }}
```

Note that Go's time formatting requires the reference layout "2006-01-02 15:04:05" rather than an arbitrary date, and because the template is named wechat.default.message it overrides Alertmanager's built-in default WeChat message.
Now start Alertmanager, mounting both the configuration file and the template directory:

```sh
docker run -d -p 9093:9093 --name StarCityAlertmanager \
  -v /opt/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  -v /opt/alertmanager/template:/etc/alertmanager/template \
  docker.io/prom/alertmanager:latest
```
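Once the container is up, a quick way to confirm it is serving requests (Host being the machine running the container):

```sh
# Returns HTTP 200 with version and cluster information when healthy
curl -s http://Host:9093/api/v2/status
```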
You can test the pipeline by POSTing simulated alerts directly to the Alertmanager API, which pushes notifications through the configured receivers:
```sh
curl --location 'http://Host:9093/api/v2/alerts' \
--header 'Content-Type: application/json' \
--data '[
    {
        "labels": {
            "severity": "Warning",
            "alertname": "Memory usage too high",
            "instance": "instance-1",
            "msgtype": "testing"
        },
        "annotations": {
            "summary": "node",
            "description": "Please check instance-1",
            "value": "0.95"
        }
    },
    {
        "labels": {
            "severity": "Warning",
            "alertname": "CPU usage too high",
            "instance": "instance-2",
            "msgtype": "testing"
        },
        "annotations": {
            "summary": "node",
            "description": "Please check instance-2",
            "value": "0.90"
        }
    }
]'
```
Once sent, the alert notifications arrive in WeChat Work and in the mailbox.
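The test alerts can also be inspected through the same API (they remain active until their resolve timeout passes):

```sh
# List the alerts Alertmanager currently holds; the two test alerts
# should appear with their labels and annotations.
curl -s 'http://Host:9093/api/v2/alerts?active=true'
```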
Note: if everything is configured correctly but WeChat Work never receives a message during testing, you need to set the trusted IP list for the self-built app (the menu at the bottom of the app's settings page in the WeChat Work admin console).
With Alertmanager working, define the alerting rules on the Prometheus side:

```sh
cd /opt/prometheus
touch rules.yml
```

Alert rule content:
```yaml
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10s
        labels:
          name: instance
          severity: Critical
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The service stops running.'
          value: '{{ $value }}'
  - name: Host
    rules:
      - alert: HostMemoryUsage
        expr: >-
          (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes +
          node_memory_Buffers_bytes + node_memory_Cached_bytes)) /
          node_memory_MemTotal_bytes * 100 > 80
        for: 10s
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The instance memory usage exceeds 80%.'
          value: '{{ $value }}%'
      - alert: HostCPUUsage
        expr: >-
          sum(avg without
          (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by
          (instance,appname) > 0.65
        for: 10s
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The CPU usage of the instance exceeds 65%.'
          value: '{{ $value }}'
      - alert: HostLoad
        expr: node_load5 > 4
        for: 10s
        labels:
          name: Load
          severity: Warning
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The 5-minute load of the instance exceeds the threshold.'
          value: '{{ $value }}'
      - alert: HostFilesystemUsage
        expr: 1 - (node_filesystem_free_bytes / node_filesystem_size_bytes) > 0.8
        for: 10s
        labels:
          name: Disk
          severity: Warning
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The instance partition [{{ $labels.mountpoint }}] is more than 80% full.'
          value: '{{ $value }}'
      - alert: HostDiskio
        expr: 'irate(node_disk_writes_completed_total{job=~"node"}[1m]) > 10'
        for: 10s
        labels:
          name: Diskio
          severity: Warning
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The 1-minute average write IO load on disk [{{ $labels.device }}] of the instance is high.'
          value: '{{ $value }}iops'
      - alert: Network_receive
        expr: >-
          irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m])
          / 1048576 > 3
        for: 10s
        labels:
          name: Network_receive
          severity: Warning
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The 5-minute average receive traffic on NIC [{{ $labels.device }}] of the instance exceeds 3 MiB/s.'
          value: '{{ $value }}MiB/s'
      - alert: Network_transmit
        expr: >-
          irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m])
          / 1048576 > 3
        for: 10s
        labels:
          name: Network_transmit
          severity: Warning
        annotations:
          summary: '{{ $labels.appname }}'
          description: 'The 5-minute average transmit traffic on NIC [{{ $labels.device }}] of the instance exceeds 3 MiB/s.'
          value: '{{ $value }}MiB/s'
  - name: Container
    rules:
      - alert: ContainerCPUUsage
        expr: >-
          (sum by(name,instance)
          (rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 60
        for: 10s
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: '{{ $labels.name }}'
          description: 'Container CPU usage is over 60%.'
          value: '{{ $value }}%'
      - alert: ContainerMemUsage
        expr: 'container_memory_usage_bytes{name=~".+"} / 1048576 > 1024'
        for: 10s
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: '{{ $labels.name }}'
          description: 'Container memory usage exceeds 1 GiB (1024 MiB).'
          value: '{{ $value }}MiB'
```

Note that alert names must be valid identifiers (no spaces), so the names above use the form HostMemoryUsage rather than "HostMemory Usage"; the disk IO rule also selects the node job defined in the scrape configuration below.
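Rule files are easy to get wrong, and promtool, which ships with Prometheus, can validate them before the server loads them. Assuming the same dockerized setup as above:

```sh
# Validate rules.yml; reports the number of rules found,
# or the exact parse/PromQL error if something is wrong.
docker run --rm \
  -v /opt/prometheus/rules.yml:/etc/prometheus/rules.yml \
  --entrypoint promtool \
  docker.io/prom/prometheus:latest \
  check rules /etc/prometheus/rules.yml
```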
Modify prometheus.yml to wire in alerting (the alerting and rule_files sections below) and restart Prometheus:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['Host:9093']
rule_files:
  - "rules.yml"
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets:
          - 'Prometheus Server Host:9090'
        labels:
          appname: Prometheus
  - job_name: node
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'Metrics Host:9100'
        labels:
          appname: node
  - job_name: cadvisor
    static_configs:
      - targets:
          - 'Metrics Host:58080'
  - job_name: rabbitmq
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'Metrics Host:9419'
        labels:
          appname: rabbitmq
```
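The whole configuration, including the rule file it references, can be validated the same way. A restart can even be avoided if Prometheus was started with --web.enable-lifecycle (an assumption about how your server was launched):

```sh
# Validate prometheus.yml together with the rule files it references
docker run --rm \
  -v /opt/prometheus:/etc/prometheus \
  --entrypoint promtool \
  docker.io/prom/prometheus:latest \
  check config /etc/prometheus/prometheus.yml

# Trigger a live reload instead of a restart
# (only works when Prometheus runs with --web.enable-lifecycle)
curl -X POST 'http://Prometheus Server Host:9090/-/reload'
```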
Open the Prometheus web UI again and the alerting rules appear. Each rule is in one of three states:
| State | Description |
| --- | --- |
| Inactive | The rule is idle; the metric is within the expected range. |
| Pending | The alert condition is met, but for less than the configured duration (the `for` field of the rule). No notification is sent in this state. |
| Firing | The alert condition is met and has persisted beyond the configured duration. Alerts in this state are sent to Alertmanager. |
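These states can also be queried from the expression browser through the built-in ALERTS metric, which Prometheus synthesizes for every active alert:

```promql
# All alerts currently pending or firing
ALERTS

# Only firing alerts for the CPU rule defined above
ALERTS{alertstate="firing", alertname="HostCPUUsage"}
```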
When the system meets an alert condition and stays there beyond the configured duration, the alert fires and is pushed to Alertmanager. For example, with CPU usage driven past the limit configured above, the Prometheus UI first shows the CPU rule in the Pending state; once the duration is exceeded, the state changes to Firing and the alert is pushed to Alertmanager. The alert then shows up in the Alertmanager web UI, and the notifications arrive in WeChat Work and the mailbox.
2023-02-23. Hoping that once my skills have matured, I can come back and see my own footsteps.