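Kubernetes relies heavily on certificates: the CA certificates, the certificates used by components such as kubelet, apiserver, proxy, and etcd, and the client certificates embedded in kubeconfig files.
When a certificate expires, the best case is that you can no longer log in to the Kubernetes cluster; the worst case is that the whole cluster stops working properly.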
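There are several common ways to deal with certificate expiry. This article focuses on monitoring certificate expiry in a Kubernetes cluster and presents three monitoring options:
1. Use Blackbox Exporter to probe the Kubernetes apiserver over HTTPS and read the expiry time of its certificate.
2. Use the certificate metrics exposed by the apiserver and kubelet, as collected by kube-prometheus.
3. Use x509-certificate-exporter to monitor the certificates under /etc/kubernetes/pki and /var/lib/kubelet, as well as the kubeconfig files.

Blackbox Exporter probes endpoints over HTTPS, HTTP, TCP, DNS, ICMP, and gRPC. Once you define an endpoint, Blackbox Exporter generates metrics for it, which you can visualize with tools such as Grafana. One of its most important functions is measuring endpoint availability.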
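When Blackbox Exporter probes an HTTPS endpoint, it also collects information about the certificate that the endpoint presents. The first option uses this to monitor the expiry time of the Kubernetes apiserver certificate.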
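Adjust the Blackbox Exporter configuration so that the module used for the probe sets insecure_skip_verify: true under tls_config. The apiserver certificate is signed by the cluster's own CA, which Blackbox Exporter does not trust by default; the certificate expiry metric is still reported when verification is skipped.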
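The configuration itself did not survive in this copy of the article, so the following is only a minimal sketch of what such a module could look like. It assumes the stock http_2xx module and the standard Blackbox Exporter configuration keys; the timeout and HTTP version values are illustrative assumptions, not the author's settings.

```yaml
# Sketch of a Blackbox Exporter module (values other than
# insecure_skip_verify are assumptions for illustration).
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      follow_redirects: true
      preferred_ip_protocol: ip4
      tls_config:
        # Skip chain verification: the apiserver certificate is signed
        # by the cluster CA, which the exporter does not trust by default.
        insecure_skip_verify: true
```

If Blackbox Exporter was installed from a Helm chart, this block usually lives in the chart's values rather than in a hand-written file; where exactly depends on the chart.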
Restart Blackbox Exporter: kubectl rollout restart deploy ...
Then add a probe for the in-cluster Kubernetes apiserver endpoint https://kubernetes.default.svc.cluster.local/readyz.
If you run plain Prometheus rather than Prometheus Operator, edit the ConfigMap or Secret that holds the Prometheus configuration file and add a scrape config for this probe.
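The original scrape config is missing from this copy, so here is a hedged sketch using the usual Blackbox Exporter relabeling pattern. The job name is an assumption; the exporter address and the target URL are the ones used elsewhere in this article.

```yaml
# Sketch of a scrape config for plain Prometheus (job_name is hypothetical).
scrape_configs:
  - job_name: kubernetes-apiserver
    metrics_path: /probe
    scrape_interval: 60s
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://kubernetes.default.svc.cluster.local/readyz
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter.
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label.
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to Blackbox Exporter.
      - target_label: __address__
        replacement: monitor-prometheus-blackbox-exporter.default.svc.cluster.local:9115
```

The relabeling is what makes Prometheus scrape the exporter while keeping the probed URL visible in the instance label, which the alert annotations later rely on.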
If you use Prometheus Operator, add the following Probe custom resource instead. Prometheus Operator converts it automatically and merges it into the Prometheus configuration.
apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: kubernetes-apiserver
spec:
  interval: 60s
  module: http_2xx
  prober:
    path: /probe
    url: monitor-prometheus-blackbox-exporter.default.svc.cluster.local:9115
  targets:
    staticConfig:
      static:
        - https://kubernetes.default.svc.cluster.local/readyz
Finally, add Prometheus alerting rules. The example below creates them as a PrometheusRule custom resource for Prometheus Operator:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-blackbox-exporter
spec:
  groups:
    - name: prometheus-blackbox-exporter
      rules:
        - alert: BlackboxSslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
          for: 0m
          labels:
            severity: warning
        - alert: BlackboxSslCertificateWillExpireSoon
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
          for: 0m
          labels:
            severity: critical
        - alert: BlackboxSslCertificateExpired
          annotations:
            description: |-
              SSL certificate has expired already
                VALUE = {{ $value }}
                LABELS = {{ $labels }}
            summary: SSL certificate expired (instance {{ $labels.instance }})
          expr: probe_ssl_earliest_cert_expiry - time() <= 0
          for: 0m
          labels:
            severity: emergency
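The second option relies on kube-prometheus. For the setup, see my article "Prometheus Operator 與 kube-prometheus 之二 - 如何監控 1.23+ kubeadm 叢集" (Prometheus Operator and kube-prometheus, Part 2: How to Monitor a 1.23+ kubeadm Cluster). Once it is installed, certificate monitoring works out of the box.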
The metrics used here are:
apiserver_client_certificate_expiration_seconds_count
apiserver_client_certificate_expiration_seconds_bucket
kubelet_certificate_manager_client_expiration_renew_errors
kubelet_server_expiration_renew_errors
kubelet_certificate_manager_client_ttl_seconds
kubelet_certificate_manager_server_ttl_seconds
Prometheus alerting rules are defined on top of these metrics.
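The rules themselves are missing from this copy. The sketch below shows how alerts on the metrics listed above could be written; it is modelled on the certificate alerts that ship with kube-prometheus, but the resource name, the 7-day threshold (604800 seconds), and the `for` durations are assumptions to be checked against your installed rules.

```yaml
# Sketch only: names, thresholds and durations are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-certificate-expiry   # hypothetical name
spec:
  groups:
    - name: kubernetes-certificate-expiry
      rules:
        # A client certificate used against the apiserver expires within 7 days.
        - alert: KubeClientCertificateExpiration
          expr: |
            apiserver_client_certificate_expiration_seconds_count > 0
            and on(job)
            histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket[5m]))) < 604800
          for: 5m
          labels:
            severity: warning
        # The kubelet client or serving certificate expires within 7 days.
        - alert: KubeletClientCertificateExpiration
          expr: kubelet_certificate_manager_client_ttl_seconds < 604800
          labels:
            severity: warning
        - alert: KubeletServerCertificateExpiration
          expr: kubelet_certificate_manager_server_ttl_seconds < 604800
          labels:
            severity: warning
        # The kubelet tried to renew a certificate and failed.
        - alert: KubeletClientCertificateRenewalErrors
          expr: increase(kubelet_certificate_manager_client_expiration_renew_errors[5m]) > 0
          for: 15m
          labels:
            severity: warning
        - alert: KubeletServerCertificateRenewalErrors
          expr: increase(kubelet_server_expiration_renew_errors[5m]) > 0
          for: 15m
          labels:
            severity: warning
```

Note that the apiserver histogram only observes client certificates that were actually presented to the apiserver, so certificates that are never used for a request are not covered by it.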
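The third option uses x509-certificate-exporter. This exporter reads certificate information from certificate files and kubeconfig files in specified directories or paths on every node of the cluster.
For a Kubernetes cluster built with kubeadm, the following certificate files and kubeconfig files can be monitored: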
watchFiles:
  - /var/lib/kubelet/pki/kubelet-client-current.pem
  - /etc/kubernetes/pki/apiserver.crt
  - /etc/kubernetes/pki/apiserver-etcd-client.crt
  - /etc/kubernetes/pki/apiserver-kubelet-client.crt
  - /etc/kubernetes/pki/ca.crt
  - /etc/kubernetes/pki/front-proxy-ca.crt
  - /etc/kubernetes/pki/front-proxy-client.crt
  - /etc/kubernetes/pki/etcd/ca.crt
  - /etc/kubernetes/pki/etcd/healthcheck-client.crt
  - /etc/kubernetes/pki/etcd/peer.crt
  - /etc/kubernetes/pki/etcd/server.crt
watchKubeconfFiles:
  - /etc/kubernetes/admin.conf
  - /etc/kubernetes/controller-manager.conf
  - /etc/kubernetes/scheduler.conf
Edit values.yaml:
kubeVersion: ''
extraLabels: {}
nameOverride: ''
fullnameOverride: ''
imagePullSecrets: []
image:
  registry: docker.io
  repository: enix/x509-certificate-exporter
  tag:
  pullPolicy: IfNotPresent
psp:
  create: false
rbac:
  create: true
  secretsExporter:
    serviceAccountName:
    serviceAccountAnnotations: {}
    clusterRoleAnnotations: {}
    clusterRoleBindingAnnotations: {}
  hostPathsExporter:
    serviceAccountName:
    serviceAccountAnnotations: {}
    clusterRoleAnnotations: {}
    clusterRoleBindingAnnotations: {}
podExtraLabels: {}
podAnnotations: {}
exposePerCertificateErrorMetrics: false
exposeRelativeMetrics: false
metricLabelsFilterList: null
secretsExporter:
  enabled: true
  debugMode: false
  replicas: 1
  restartPolicy: Always
  strategy: {}
  resources:
    limits:
      cpu: 200m
      memory: 150Mi
    requests:
      cpu: 20m
      memory: 20Mi
  nodeSelector: {}
  tolerations: []
  affinity: {}
  podExtraLabels: {}
  podAnnotations: {}
  podSecurityContext: {}
  securityContext:
    runAsUser: 65534
    runAsGroup: 65534
    readOnlyRootFilesystem: true
    capabilities:
      drop:
        - ALL
  secretTypes:
    - type: kubernetes.io/tls
      key: tls.crt
  includeNamespaces: []
  excludeNamespaces: []
  includeLabels: []
  excludeLabels: []
  cache:
    enabled: true
    maxDuration: 300
hostPathsExporter:
  debugMode: false
  restartPolicy: Always
  updateStrategy: {}
  resources:
    limits:
      cpu: 100m
      memory: 40Mi
    requests:
      cpu: 10m
      memory: 20Mi
  nodeSelector: {}
  tolerations: []
  affinity: {}
  podExtraLabels: {}
  podAnnotations: {}
  podSecurityContext: {}
  securityContext:
    runAsUser: 0
    runAsGroup: 0
    readOnlyRootFilesystem: true
    capabilities:
      drop:
        - ALL
  watchDirectories: []
  watchFiles: []
  watchKubeconfFiles: []
  daemonSets:
    cp:
      nodeSelector:
        node-role.kubernetes.io/master: ''
      tolerations:
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
      watchFiles:
        - /var/lib/kubelet/pki/kubelet-client-current.pem
        - /etc/kubernetes/pki/apiserver.crt
        - /etc/kubernetes/pki/apiserver-etcd-client.crt
        - /etc/kubernetes/pki/apiserver-kubelet-client.crt
        - /etc/kubernetes/pki/ca.crt
        - /etc/kubernetes/pki/front-proxy-ca.crt
        - /etc/kubernetes/pki/front-proxy-client.crt
        - /etc/kubernetes/pki/etcd/ca.crt
        - /etc/kubernetes/pki/etcd/healthcheck-client.crt
        - /etc/kubernetes/pki/etcd/peer.crt
        - /etc/kubernetes/pki/etcd/server.crt
      watchKubeconfFiles:
        - /etc/kubernetes/admin.conf
        - /etc/kubernetes/controller-manager.conf
        - /etc/kubernetes/scheduler.conf
    nodes:
      watchFiles:
        - /var/lib/kubelet/pki/kubelet-client-current.pem
        - /etc/kubernetes/pki/ca.crt
rbacProxy:
  enabled: false
podListenPort: 9793
hostNetwork: false
service:
  create: true
  port: 9793
  annotations: {}
  extraLabels: {}
prometheusServiceMonitor:
  create: true
  scrapeInterval: 60s
  scrapeTimeout: 30s
  extraLabels: {}
  relabelings: {}
prometheusPodMonitor:
  create: false
prometheusRules:
  create: true
  alertOnReadErrors: true
  readErrorsSeverity: warning
  alertOnCertificateErrors: true
  certificateErrorsSeverity: warning
  certificateRenewalsSeverity: warning
  certificateExpirationsSeverity: critical
  warningDaysLeft: 30
  criticalDaysLeft: 14
  extraLabels: {}
  alertExtraLabels: {}
  rulePrefix: ''
  disableBuiltinAlertGroup: false
  extraAlertGroups: []
extraDeploy: []
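One detail worth checking in the cp DaemonSet above: kubeadm stopped applying the node-role.kubernetes.io/master label to control-plane nodes in 1.24 and uses node-role.kubernetes.io/control-plane instead, so on newer clusters the selector as written may match no nodes. The fragment below is a sketch of how the nodeSelector and tolerations of the cp block could be adapted in values.yaml; it is an adjustment suggested here, not part of the original article.

```yaml
# Sketch for kubeadm 1.24+ clusters; replaces the nodeSelector and
# tolerations of hostPathsExporter.daemonSets.cp in values.yaml.
hostPathsExporter:
  daemonSets:
    cp:
      nodeSelector:
        node-role.kubernetes.io/control-plane: ''
      tolerations:
        # Taint used by current kubeadm releases.
        - effect: NoSchedule
          key: node-role.kubernetes.io/control-plane
          operator: Exists
        # Legacy taint, still present on clusters created before 1.25.
        - effect: NoSchedule
          key: node-role.kubernetes.io/master
          operator: Exists
      # watchFiles and watchKubeconfFiles stay as listed in values.yaml.
```

Running kubectl get nodes --show-labels shows which of the two labels your control-plane nodes actually carry.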
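Install with the Helm chart, passing the edited values file so that the settings take effect:
helm repo add enix https://charts.enix.io
helm install x509-certificate-exporter enix/x509-certificate-exporter -f values.yaml
With prometheusServiceMonitor.create and prometheusRules.create both set to true in values.yaml, the chart also creates a ServiceMonitor and a PrometheusRule automatically.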
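The metric it exposes for expiry monitoring, which holds each certificate's notAfter time as a Unix timestamp, is: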
x509_cert_not_after
The exporter also comes with a rather fancy Grafana dashboard.
Its alert rules are generated by the chart from the prometheusRules settings in values.yaml.
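The rendered rules are not reproduced in this copy. As a rough sketch of what expiry alerts on x509_cert_not_after look like with the thresholds configured above (warningDaysLeft: 30, criticalDaysLeft: 14), consider the following; the resource name, alert names, and `for` durations are assumptions and may differ from what the chart actually renders.

```yaml
# Sketch only: approximates expiry alerts on x509_cert_not_after.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: x509-certificate-expiry   # hypothetical name
spec:
  groups:
    - name: x509-certificate-expiry
      rules:
        # Fewer than 30 days left: time to renew.
        - alert: CertificateRenewal
          expr: (x509_cert_not_after - time()) / 86400 < 30
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Certificate should be renewed
            description: Certificate expires in {{ $value | humanize }} days.
        # Fewer than 14 days left: expiry is imminent.
        - alert: CertificateExpiration
          expr: (x509_cert_not_after - time()) / 86400 < 14
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: Certificate is about to expire
            description: Certificate expires in {{ $value | humanize }} days.
```

Dividing the remaining seconds by 86400 expresses the threshold in days, which keeps the expressions aligned with the warningDaysLeft and criticalDaysLeft values.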
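To monitor certificate expiry in a Kubernetes cluster, this article presented three options, each with its own strengths and weaknesses:
1. Blackbox Exporter, which probes the apiserver HTTPS endpoint and reports the expiry time of the certificate it serves.
2. kube-prometheus, which uses the certificate metrics exposed by the apiserver and kubelet.
3. x509-certificate-exporter, which monitors the certificates under /etc/kubernetes/pki and /var/lib/kubelet, as well as the kubeconfig files.
Choose whichever fits your situation.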