Kubernetes Prometheus Monitoring Deployment Guide (Native Approach)
Introduction: understanding a monitoring system through the "supermarket chain" analogy
Stage 1: Environment Preparation and Image Downloads
Confirm the environment: make sure kubectl can reach the cluster and that the cluster has at least one node:
# Check node status
kubectl get nodes
# Create the namespace
kubectl create namespace monitoring

Pre-pull the images:
# Pull the Prometheus server (pin a specific version; avoid the mutable latest tag)
docker pull prom/prometheus:v2.55.1
# Pull Node Exporter (collects host-level hardware metrics)
docker pull prom/node-exporter:v1.8.2
# Pull Grafana (visualizes the data)
docker pull grafana/grafana:10.4.2

Stage 2: Building the Infrastructure (RBAC & Namespace)
Prometheus needs permission to inspect the entire cluster. Create the file 01-rbac.yaml:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-sa
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-role
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "services", "endpoints", "pods", "configmaps"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["extensions", "networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-role
subjects:
- kind: ServiceAccount
  name: prometheus-sa
  namespace: monitoring

Run:
kubectl apply -f 01-rbac.yaml

Stage 3: Writing the Configuration (ConfigMap)
The ConfigMap tells Prometheus where to scrape data and when to raise alerts, using label-based service discovery.
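Under the hood, label-based discovery is driven by relabel_configs: regex rewrites over metadata that service discovery attaches to each target. A Python sketch of the two address rewrites used in the configuration below (illustrative only; Prometheus fully anchors its regexes and uses ${1}-style references, while Python uses \1):

```python
import re

# Node-exporter job: rewrite whatever port the discovered node address carries
# (the Kubelet port, e.g. 10250) to 9100.
# Prometheus regex '(.+):(.+)' with replacement '${1}:9100'.
def rewrite_node_address(address: str) -> str:
    return re.sub(r"(.+):(.+)", r"\1:9100", address)

# Pod jobs: join __address__ with the discovered container port.
# Prometheus regex '([^:]+)(?::\d+)?;(\d+)' with replacement '$1:$2';
# multiple source labels arrive joined by ';'.
def join_address_and_port(address: str, port: str) -> str:
    joined = f"{address};{port}"
    return re.sub(r"([^:]+)(?::\d+)?;(\d+)", r"\1:\2", joined)

print(rewrite_node_address("10.0.0.5:10250"))       # -> 10.0.0.5:9100
print(join_address_and_port("10.244.1.7", "8800"))  # -> 10.244.1.7:8800
```

The second pattern strips any existing port before appending the discovered one, which is why it also handles addresses that already carry a port.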
Create the file 02-configmap.yaml:
Note: the scrape and alerting configuration here is deliberately minimal; adapt it to your environment. In particular, the PodCrashLooping rule relies on kube_pod_container_status_restarts_total, which is exposed by kube-state-metrics and is not deployed in this guide.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  # Main configuration file prometheus.yml
  prometheus.yml: |
    global:
      scrape_interval: 15s      # Take a meter reading every 15 seconds
      evaluation_interval: 15s  # Evaluate alerting rules every 15 seconds
    # Alerting rule file location (this ConfigMap is mounted at /etc/prometheus,
    # so alert.rules sits directly under that directory)
    rule_files:
      - /etc/prometheus/alert.rules
    # Scrape configuration (jobs)
    scrape_configs:
      # 1. Monitor Prometheus itself
      - job_name: 'prometheus'
        static_configs:
          - targets: ['localhost:9090']
      # 2. Monitor the K8s API server (core component)
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # 3. Monitor every node (via Node Exporter)
      # Logic: auto-discover all K8s nodes, then scrape port 9100 on each
      - job_name: 'node-exporter'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - source_labels: [__address__]
            regex: '(.+):(.+)'
            replacement: '${1}:9100'  # Rewrite the port to 9100
            target_label: __address__
      # 4. Label-based application discovery (recommended)
      # Any Pod carrying the label monitor=true is scraped automatically
      - job_name: 'kubernetes-pods-label'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_monitor]
            action: keep
            regex: true
          - source_labels: [__address__, __meta_kubernetes_pod_container_port_number]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
            replacement: ${1}
      # 5. Annotation-based application discovery (alternative)
      # Any Pod annotated with prometheus.io/scrape: "true" is monitored automatically
      - job_name: 'kubernetes-pods-annotation'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
  # Alerting rules file alert.rules
  alert.rules: |
    groups:
    - name: basic_alerts
      rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has restarted within the last 15 minutes."
      - alert: HighNodeCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on node {{ $labels.instance }}"
          description: "CPU usage is above 80% for more than 5 minutes."

Run:
kubectl apply -f 02-configmap.yaml

Stage 4: Deploying Prometheus (with Persistent Storage)
A PersistentVolumeClaim keeps the data across Pod restarts, with retention set to 15 days.
Create the file 03-prometheus-deployment.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  # storageClassName: standard  # Uncomment if you need a specific storage class
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    app: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      # 1. Use the dedicated service account (the "pass") created earlier
      serviceAccountName: prometheus-sa
      containers:
      - name: prometheus
        # 2. Image and version
        image: prom/prometheus:v2.55.1
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9090
          name: http
        # 3. Startup flags: config file path and data storage path
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus"
        - "--storage.tsdb.retention.time=15d"
        - "--web.console.libraries=/etc/prometheus/console_libraries"
        - "--web.console.templates=/etc/prometheus/consoles"
        # 4. Mount the configuration (from the ConfigMap)
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: data-volume
          mountPath: /prometheus
        # Health checks
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: 9090
          initialDelaySeconds: 30
          periodSeconds: 15
        readinessProbe:
          httpGet:
            path: /-/ready
            port: 9090
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "2000m"
      # 5. Volume sources
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: data-volume
        persistentVolumeClaim:
          claimName: prometheus-data
---
# Expose the service so the web UI is reachable
apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
spec:
  selector:
    app: prometheus
  ports:
  - protocol: TCP
    port: 9090
    targetPort: 9090
  type: NodePort  # NodePort is convenient for quick access; prefer LoadBalancer or Ingress in production

Run:
kubectl apply -f 03-prometheus-deployment.yaml

Stage 5: Deploying the Node Exporter DaemonSet
Prometheus itself is now running, but it still knows nothing about each server's CPU and memory usage. We need node-exporter on every node.
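node-exporter exposes cumulative counters such as node_cpu_seconds_total, and the HighNodeCPU alert in Stage 3 derives usage as 100 minus the average idle percentage. A small Python sketch of that arithmetic (sample counter values are made up):

```python
# Sketch of the HighNodeCPU expression:
#   100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
# rate() over a counter is roughly (last - first) / window_seconds.

def idle_rate(first: float, last: float, window: float) -> float:
    """Per-second increase of the idle-seconds counter over the window."""
    return (last - first) / window

def cpu_usage_percent(idle_rates_per_core: list) -> float:
    """Average idle fraction across cores, inverted into a usage percentage."""
    avg_idle = sum(idle_rates_per_core) / len(idle_rates_per_core)
    return 100 - avg_idle * 100

# Two cores, idle counters sampled 300 s apart (made-up numbers):
# core0 was idle 60 of 300 s, core1 was idle 90 of 300 s.
rates = [idle_rate(1000.0, 1060.0, 300.0), idle_rate(2000.0, 2090.0, 300.0)]
print(cpu_usage_percent(rates))  # -> 75.0
```

The real rate() also handles counter resets; this sketch only shows the steady-state case.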
Create the file 04-node-exporter.yaml:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
  labels:
    app: node-exporter
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true  # Use the host network so targets report the node's real IP
      hostPID: true      # Use the host PID namespace
      tolerations:
      - effect: NoSchedule
        operator: Exists
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.8.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 9100
          hostPort: 9100
        args:
        - "--path.procfs=/host/proc"
        - "--path.sysfs=/host/sys"
        - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)"
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
        resources:
          requests:
            memory: "50Mi"
            cpu: "100m"
          limits:
            memory: "100Mi"
            cpu: "200m"
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys

Run:
kubectl apply -f 04-node-exporter.yaml

Stage 6: Installing Grafana to Visualize the Data
The built-in Prometheus UI is only good for ad-hoc queries; for dashboards it is usually paired with Grafana.
Create the file 05-grafana.yaml:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-service.monitoring.svc:9090
      isDefault: true
      access: proxy
      editable: true
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
      - name: grafana
        image: grafana/grafana:10.4.2
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 3000
        env:
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: "admin123"  # Admin password; change it for anything beyond a demo
        volumeMounts:
        - name: grafana-storage
          mountPath: /var/lib/grafana
        - name: datasources
          mountPath: /etc/grafana/provisioning/datasources
        resources:
          requests:
            memory: "256Mi"
            cpu: "200m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
      volumes:
      - name: grafana-storage
        persistentVolumeClaim:
          claimName: grafana-data
      - name: datasources
        configMap:
          name: grafana-datasources
---
apiVersion: v1
kind: Service
metadata:
  name: grafana-service
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
  - port: 3000
    targetPort: 3000
  type: NodePort

Run:
kubectl apply -f 05-grafana.yaml

Manually configuring the data source (if provisioning did not take effect)
1. Open http://<node-IP>:<Grafana-port> in a browser (find the port with kubectl get svc -n monitoring).
2. Log in (user: admin, password: admin123).
3. Click Connections → Data Sources → Add data source → select Prometheus.
4. In the URL field, enter Prometheus's in-cluster address: http://prometheus-service.monitoring.svc:9090
5. Click Save & Test; a green check mark means success.
Importing dashboards
1. Click Dashboards → New → Import.
2. Enter a popular community template ID, such as 1860 (Node Exporter Full) or 315 (Kubernetes Cluster).
3. Click Load, select the Prometheus data source configured above, then click Import.
Stage 7: Verification and Access
Check that every component is running:
kubectl get pods -n monitoring

Expected results:
- prometheus-xxxxx: status should be Running (1/1)
- node-exporter-xxxxx: one per node, each Running (1/1)
- grafana-xxx: Running (1/1)
If any Pod is Pending or in CrashLoopBackOff, inspect it with kubectl describe pod <pod-name> -n monitoring.
Get the access addresses:
kubectl get svc -n monitoring

Open the Prometheus UI in a browser and click Status → Targets in the top menu. If the kubernetes-apiservers, node-exporter, and prometheus jobs all show UP (green), the deployment succeeded.
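The same check can be scripted against Prometheus's HTTP API (GET /api/v1/targets). A sketch with Python's standard library; the base URL passed to fetch_targets is an assumption you must replace with your own node IP and NodePort:

```python
import json
import urllib.request
from collections import Counter

def summarize_targets(api_response: dict) -> Counter:
    """Count active targets by health ('up' / 'down' / 'unknown')."""
    targets = api_response["data"]["activeTargets"]
    return Counter(t["health"] for t in targets)

def fetch_targets(base_url: str) -> dict:
    """Fetch the target list from a live Prometheus, e.g. http://<node-IP>:<NodePort>."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets") as resp:
        return json.load(resp)

# Shape of the API response (abridged sample, no cluster needed):
sample = {"data": {"activeTargets": [{"health": "up"}, {"health": "down"}]}}
print(summarize_targets(sample))  # counts by health, e.g. Counter({'up': 1, 'down': 1})
```

Against a healthy deployment, summarize_targets(fetch_targets(...)) should report only "up".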
Stage 8: How Do You Monitor Your Own Applications Later?
Suppose you have an application named ezcloud-mqtt-auth. There are two ways to bring it into monitoring:
Option A: add a label (recommended; no annotation changes needed in the application YAML)
Add monitor: "true" to the Pod template's labels (spec.template.metadata.labels) in your Deployment YAML; Prometheus discovers Pod labels, not Deployment labels:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ezcloud-mqtt-auth
  labels:
    app: ezcloud-mqtt-auth
spec:
  template:
    metadata:
      labels:
        app: ezcloud-mqtt-auth
        monitor: "true"  # <--- add this on the Pod template
  # ...

Prometheus will discover the Pod automatically and scrape /metrics on the container's declared port.
Option B: add annotations (alternative)
Add the following to the Pod template's metadata.annotations:
template:
  metadata:
    annotations:
      prometheus.io/scrape: "true"
      prometheus.io/port: "8800"                  # scrape port
      prometheus.io/path: "/actuator/prometheus"  # metrics path

After redeploying the application, check the Prometheus Targets page again; the application's job should show UP.
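Either way, the application itself must serve metrics in the Prometheus text exposition format at the scrape path. A minimal, dependency-free sketch (the app_requests_total metric is hypothetical; a real service would normally use a Prometheus client library):

```python
import http.server
import threading

REQUESTS = 0  # toy counter; incremented on every scrape just for demonstration

class MetricsHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS
        if self.path != "/metrics":
            self.send_error(404)
            return
        REQUESTS += 1
        # Prometheus text exposition format: HELP/TYPE comments, then samples
        body = (
            "# HELP app_requests_total Total requests handled.\n"
            "# TYPE app_requests_total counter\n"
            f"app_requests_total {REQUESTS}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

def serve(port: int = 0) -> http.server.HTTPServer:
    """Start the metrics endpoint on a background thread; port 0 picks a free port."""
    srv = http.server.HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    return srv
```

With serve(8800) and the Option B annotations above, Prometheus would scrape this endpoint; curl http://127.0.0.1:8800/metrics shows the raw output.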