📚 Learning Kubernetes Operator Development Through the OpenTelemetry Operator
This tutorial is based on the actual code of the production-grade OpenTelemetry Operator project
and covers a complete learning path from core concepts to advanced, hands-on practice.
Contents
1. Kubernetes Operator core concepts
2. OpenTelemetry Operator architecture deep dive
3. Full CRD anatomy and practice
4. Controller/Reconciler implementation in depth
5. Manifest builders in detail
6. Webhook mechanisms in depth
7. Complete development environment setup
8. Testing strategy and practice
1. Kubernetes Operator Core Concepts
1.1 What Is an Operator?
An Operator is a Kubernetes extension pattern for automating the deployment and management of complex applications. It encodes human operational knowledge into software.
Core components
┌─────────────────────────────────────────────────────────┐
│ Kubernetes Operator │
│ │
│ ┌────────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Custom │ │ Controller │ │ Domain │ │
│ │ Resource │◄─┤ (Reconciler)├─►│ Knowledge │ │
│ │ Definition │ │ │ │ │ │
│ └────────────────┘ └──────────────┘ └─────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│   Extends the K8s API   Watches & reconciles state   Encodes ops logic │
└─────────────────────────────────────────────────────────┘
1.2 How It Works
User (kubectl apply)
│
▼
┌─────────────────────────────────────┐
│ Custom Resource (CR) │
│ e.g. OpenTelemetryCollector │
│ │
│ apiVersion: opentelemetry.io/v1beta1│
│ kind: OpenTelemetryCollector │
│ spec: │
│ mode: deployment │
│ config: {...} │
└─────────────────────────────────────┘
│ (stored in etcd)
▼
┌─────────────────────────────────────┐
│ Kubernetes API Server │
│ (emits Watch events) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Controller (Reconciler) │
│ │
│ 1. Receive the event │
│ 2. Read the CR │
│ 3. Compute the desired state │
│ 4. Reconcile current state → desired state │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Kubernetes Resources │
│ - Deployment │
│ - Service │
│ - ConfigMap │
│ - ServiceAccount │
│ - ... │
└─────────────────────────────────────┘
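This watch-and-reconcile loop is the pattern every controller in this guide follows. As a minimal, self-contained sketch using controller-runtime (the reconciler name and body are illustrative, not taken from the operator):

package main

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ExampleReconciler is a hypothetical reconciler that illustrates the
// pattern: receive an event, read the CR, compute the desired state,
// and converge the actual state toward it.
type ExampleReconciler struct {
	client.Client
}

func (r *ExampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the CR named by req.NamespacedName (elided in this sketch).
	// 2. Compute the desired child resources from the CR spec.
	// 3. Create/update/delete actual resources until they match.
	// An empty Result means "done until the next watch event arrives".
	return ctrl.Result{}, nil
}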
1.3 Operator Capability Levels
| Level | Capability | OpenTelemetry Operator support |
| --- | --- | --- |
| 1 | Basic install | ✅ |
| 2 | Seamless upgrades | ✅ (automatic version upgrades) |
| 3 | Full lifecycle | ✅ (finalizers, config backups) |
| 4 | Deep insights | ✅ (metrics, events) |
| 5 | Auto-tuning | ✅ (HPA, Target Allocator) |
2. OpenTelemetry Operator Architecture Deep Dive
2.1 Complete Project Layout
opentelemetry-operator/
├── main.go                                 # Program entry point (200+ lines)
│
├── apis/                                   # CRD API definitions
│   ├── v1alpha1/                           # Alpha API
│   │   ├── instrumentation_types.go        # Auto-instrumentation CRD
│   │   ├── opampbridge_types.go            # OpAMP Bridge CRD
│   │   └── targetallocator_types.go        # Target Allocator CRD (deprecated)
│   └── v1beta1/                            # Beta API (stable)
│       ├── opentelemetrycollector_types.go # Collector CRD (primary)
│       ├── targetallocator_types.go        # Target Allocator CRD
│       └── config.go                       # Config parsing
│
├── internal/                               # Internal implementation
│   ├── controllers/                        # Controller implementations
│   │   ├── opentelemetrycollector_controller.go # Main controller (400+ lines)
│   │   ├── targetallocator_controller.go
│   │   ├── opampbridge_controller.go
│   │   └── reconcile_test.go               # Tests (56 KB!)
│   │
│   ├── manifests/                          # K8s resource builders
│   │   ├── collector/                      # Collector resource builders
│   │   │   ├── deployment.go               # Deployment builder
│   │   │   ├── daemonset.go                # DaemonSet builder
│   │   │   ├── statefulset.go              # StatefulSet builder
│   │   │   ├── container.go                # Container builder (core logic)
│   │   │   ├── service.go                  # Service builder
│   │   │   ├── configmap.go                # ConfigMap builder
│   │   │   └── ...
│   │   ├── targetallocator/                # Target Allocator resource builders
│   │   └── manifestutils/                  # Utility functions
│   │
│   ├── webhook/                            # Admission webhooks
│   │   ├── podmutation/                    # Pod mutation webhook
│   │   │   └── webhookhandler.go           # Sidecar/instrumentation injection
│   │   └── validation/                     # Validation webhook
│   │
│   ├── config/                             # Operator configuration management
│   ├── rbac/                               # RBAC utilities
│   ├── autodetect/                         # Environment auto-detection
│   │   ├── prometheus/                     # Prometheus Operator detection
│   │   ├── openshift/                      # OpenShift detection
│   │   └── certmanager/                    # cert-manager detection
│   └── status/                             # Status update logic
│
├── pkg/                                    # Public packages
│   ├── collector/                          # Collector utilities
│   │   └── upgrade/                        # Version upgrade logic
│   ├── sidecar/                            # Sidecar injection logic
│   ├── featuregate/                        # Feature gate management
│   └── constants/                          # Constant definitions
│
├── config/                                 # Kustomize deployment configs
│   ├── crd/                                # CRD YAML files
│   │   └── bases/                          # Base CRDs
│   ├── rbac/                               # RBAC rules
│   ├── manager/                            # Operator Deployment
│   ├── webhook/                            # Webhook configuration
│   └── samples/                            # Sample CRs
│
├── tests/                                  # Test suites
│   ├── e2e/                                # Basic E2E tests
│   ├── e2e-instrumentation/                # Auto-instrumentation tests
│   ├── e2e-targetallocator/                # Target Allocator tests
│   ├── e2e-upgrade/                        # Upgrade tests
│   └── test-e2e-apps/                      # Test applications
│
├── cmd/                                    # Additional executables
│   ├── otel-allocator/                     # Target Allocator service
│   ├── operator-opamp-bridge/              # OpAMP Bridge service
│   └── gather/                             # Troubleshooting tool
│
├── Makefile                                # Build scripts
├── go.mod                                  # Go dependency management
└── versions.txt                            # Version management file
2.2 The Four Core CRDs in Detail
2.2.1 OpenTelemetryCollector (the primary CRD)
File: apis/v1beta1/opentelemetrycollector_types.go
Purpose: manages the deployment of the OpenTelemetry Collector
Supported deployment modes:
const (
    ModeDeployment  Mode = "deployment"  // standard deployment
    ModeDaemonSet   Mode = "daemonset"   // one instance per node
    ModeStatefulSet Mode = "statefulset" // stateful deployment
    ModeSidecar     Mode = "sidecar"     // injected into other pods
)
Feature highlights:
✅ Multiple deployment modes
✅ Automatic config management (versioned ConfigMaps)
✅ Target Allocator integration
✅ HPA autoscaling
✅ Automatic version upgrades
✅ Ingress/Route support
✅ PodMonitor/ServiceMonitor support
2.2.2 Instrumentation
File: apis/v1alpha1/instrumentation_types.go
Purpose: configures auto-instrumentation
Supported languages:
Java (OpenTelemetry Java Agent)
Node.js (OpenTelemetry Node.js)
Python (OpenTelemetry Python)
.NET (OpenTelemetry .NET)
Go (eBPF-based)
Apache HTTPD
Nginx
2.2.3 TargetAllocator
File: apis/v1beta1/targetallocator_types.go
Purpose: distributes Prometheus scrape targets across multiple Collector instances
Allocation strategies (consistent hashing is sketched below):
least-weighted: least-weighted (assign to the collector with the fewest targets)
consistent-hashing: consistent hashing
per-node: one allocator per node
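To make consistent hashing concrete, here is a small self-contained sketch (an illustration only, not the Target Allocator's actual implementation, which uses a more elaborate ring with virtual nodes): collectors are placed on a hash ring, and each target is assigned to the first collector at or after its own hash position, so adding or removing a collector only moves a small fraction of targets.

package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// assignTarget places collectors on a hash ring and assigns the target
// to the first collector clockwise from the target's hash position.
func assignTarget(target string, collectors []string) string {
	type point struct {
		pos  uint32
		name string
	}
	ring := make([]point, 0, len(collectors))
	for _, c := range collectors {
		h := fnv.New32a()
		h.Write([]byte(c))
		ring = append(ring, point{h.Sum32(), c})
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].pos < ring[j].pos })

	h := fnv.New32a()
	h.Write([]byte(target))
	t := h.Sum32()
	for _, p := range ring {
		if t <= p.pos {
			return p.name
		}
	}
	return ring[0].name // wrap around the ring
}

func main() {
	collectors := []string{"collector-0", "collector-1", "collector-2"}
	fmt.Println(assignTarget("node-a:9100", collectors))
}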
2.2.4 OpAMPBridge
File: apis/v1alpha1/opampbridge_types.go
Purpose: implements the OpAMP protocol for remote management of Collectors
3. Full CRD Anatomy and Practice
3.1 CRD Structure Deep Dive
3.1.1 The Complete OpenTelemetryCollector CRD Definition
File: apis/v1beta1/opentelemetrycollector_types.go:32-145
// +kubebuilder:object:root=true
// +kubebuilder:resource:shortName=otelcol;otelcols
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.scale.replicas,selectorpath=.status.scale.selector
// +kubebuilder:printcolumn:name="Mode",type="string",JSONPath=".spec.mode",description="Deployment Mode"
// +kubebuilder:printcolumn:name="Version",type="string",JSONPath=".status.version",description="OpenTelemetry Version"
// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.scale.statusReplicas"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// +kubebuilder:printcolumn:name="Image",type="string",JSONPath=".status.image"
// +kubebuilder:printcolumn:name="Management",type="string",JSONPath=".spec.managementState",description="Management State"
type OpenTelemetryCollector struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec   OpenTelemetryCollectorSpec   `json:"spec,omitempty"`
	Status OpenTelemetryCollectorStatus `json:"status,omitempty"`
}
3.1.2 A Complete Guide to Kubebuilder Annotations
Basic annotations
| Annotation | Description | Example |
| --- | --- | --- |
| +kubebuilder:object:root=true | Marks the CRD root object | Required |
| +kubebuilder:subresource:status | Enables the status subresource | Allows status to be updated independently |
| +kubebuilder:subresource:scale | Enables the scale subresource | Supports kubectl scale |
| +kubebuilder:resource:shortName | Defines short names | otelcol,otelcols |
| +kubebuilder:storageversion | Marks the storage version | Required when multiple versions exist |
Validation annotations
// Numeric validation
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=100
Replicas int32 `json:"replicas,omitempty"`

// String validation
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=255
// +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
Name string `json:"name,omitempty"`

// Enum validation
// +kubebuilder:validation:Enum=deployment;daemonset;statefulset;sidecar
Mode string `json:"mode,omitempty"`

// Default value
// +kubebuilder:default=1
Replicas int32 `json:"replicas,omitempty"`

// CEL validation (Kubernetes 1.25+)
// +kubebuilder:validation:XValidation:rule="self.mode != 'sidecar' || !has(self.replicas)",message="sidecar mode does not support replicas"
Spec OpenTelemetryCollectorSpec `json:"spec,omitempty"`
Display annotations
// Columns shown in kubectl get output
// +kubebuilder:printcolumn:name="Mode",type="string",JSONPath=".spec.mode",description="Deployment Mode"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
3.1.3 The Full Spec Structure
type OpenTelemetryCollectorSpec struct {
	// ===== Deployment configuration =====

	// Deployment mode
	// +optional
	// +kubebuilder:default=deployment
	Mode Mode `json:"mode,omitempty"`

	// Replica count (Deployment/StatefulSet modes only)
	// +optional
	Replicas *int32 `json:"replicas,omitempty"`

	// Container image
	// +optional
	Image string `json:"image,omitempty"`

	// ===== Collector configuration =====

	// Collector configuration (YAML)
	// +required
	// +kubebuilder:pruning:PreserveUnknownFields
	Config Config `json:"config"`

	// Number of ConfigMap versions to retain (for rollback)
	// +optional
	// +kubebuilder:default:=3
	// +kubebuilder:validation:Minimum:=1
	ConfigVersions int `json:"configVersions,omitempty"`

	// ===== Upgrade strategy =====

	// Upgrade strategy: automatic or none
	// +optional
	UpgradeStrategy UpgradeStrategy `json:"upgradeStrategy"`

	// Deployment update strategy
	// +optional
	DeploymentUpdateStrategy appsv1.DeploymentStrategy `json:"deploymentUpdateStrategy,omitempty"`

	// DaemonSet update strategy
	// +optional
	DaemonSetUpdateStrategy appsv1.DaemonSetUpdateStrategy `json:"daemonSetUpdateStrategy,omitempty"`

	// ===== Resources =====

	// Resource requirements
	// +optional
	Resources v1.ResourceRequirements `json:"resources,omitempty"`

	// Environment variables
	// +optional
	Env []v1.EnvVar `json:"env,omitempty"`

	// Volume mounts
	// +optional
	VolumeMounts []v1.VolumeMount `json:"volumeMounts,omitempty"`

	// Volumes
	// +optional
	Volumes []v1.Volume `json:"volumes,omitempty"`

	// ===== Scheduling =====

	// Node selector
	// +optional
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`

	// Tolerations
	// +optional
	Tolerations []v1.Toleration `json:"tolerations,omitempty"`

	// Affinity
	// +optional
	Affinity *v1.Affinity `json:"affinity,omitempty"`

	// ===== Networking =====

	// Ingress configuration
	// +optional
	Ingress Ingress `json:"ingress,omitempty"`

	// Service ports
	// +optional
	Ports []PortsSpec `json:"ports,omitempty"`

	// ===== Advanced features =====

	// Target Allocator
	// +optional
	TargetAllocator TargetAllocatorEmbedded `json:"targetAllocator,omitempty"`

	// HPA configuration
	// +optional
	Autoscaler *AutoscalerSpec `json:"autoscaler,omitempty"`

	// Probe configuration
	// +optional
	LivenessProbe  *Probe `json:"livenessProbe,omitempty"`
	ReadinessProbe *Probe `json:"readinessProbe,omitempty"`

	// ===== Observability =====

	// Observability configuration
	// +optional
	Observability ObservabilitySpec `json:"observability,omitempty"`
}
3.2 Hands-On Examples: From Simple to Complex
3.2.1 The Minimal Example
File: config/samples/core_v1beta1_opentelemetrycollector.yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
This CR creates:
# 1. Deployment (1 replica)
kubectl get deployment simplest-collector
# 2. Service (exposes the OTLP ports)
kubectl get service simplest-collector
# Ports: 4317 (gRPC), 4318 (HTTP)
# 3. Service (headless, used by StatefulSet mode)
kubectl get service simplest-collector-headless
# 4. ConfigMap (Collector config)
kubectl get configmap simplest-collector
# 5. ServiceAccount
kubectl get serviceaccount simplest-collector
3.2.2 A Production-Grade Example
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: production-collector
  labels:
    env: production
spec:
  # ===== Deployment configuration =====
  mode: deployment
  replicas: 3  # High availability
  image: otel/opentelemetry-collector-k8s:0.88.0
  # ===== Upgrade strategy =====
  upgradeStrategy: automatic
  deploymentUpdateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  # ===== ConfigMap versioning =====
  configVersions: 5  # Keep 5 versions for rollback
  # ===== Collector configuration =====
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      # Prometheus receiver
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 30s
              static_configs:
                - targets: ['0.0.0.0:8888']
    processors:
      # Memory limiter (prevents OOM)
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      # Batching (improves throughput)
      batch:
        send_batch_size: 10000
        timeout: 10s
      # Resource attributes
      resource:
        attributes:
          - key: cluster.name
            value: production-cluster
            action: upsert
    exporters:
      # OTLP export to the backend
      otlp:
        endpoint: backend.example.com:4317
        tls:
          insecure: false
          cert_file: /certs/tls.crt
          key_file: /certs/tls.key
      # Prometheus export
      prometheus:
        endpoint: 0.0.0.0:8889
    # Health check extension
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133
      pprof:
        endpoint: localhost:1777
    service:
      extensions: [health_check, pprof]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [otlp]
        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [otlp, prometheus]
  # ===== Resource limits =====
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi
  # ===== Environment variables =====
  env:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
  # ===== Scheduling =====
  nodeSelector:
    workload: telemetry
  tolerations:
    - key: telemetry
      operator: Equal
      value: "true"
      effect: NoSchedule
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - production-collector-collector
            topologyKey: kubernetes.io/hostname
  # ===== Ingress =====
  ingress:
    type: ingress
    hostname: collector.example.com
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    tls:
      - secretName: collector-tls
        hosts:
          - collector.example.com
  # ===== HPA autoscaling =====
  autoscaler:
    minReplicas: 3
    maxReplicas: 10
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
          - type: Percent
            value: 50
            periodSeconds: 60
      scaleUp:
        stabilizationWindowSeconds: 0
        policies:
          - type: Percent
            value: 100
            periodSeconds: 15
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 75
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80
  # ===== Probes =====
  livenessProbe:
    httpGet:
      path: /
      port: 13133
    initialDelaySeconds: 15
    periodSeconds: 10
  readinessProbe:
    httpGet:
      path: /
      port: 13133
    initialDelaySeconds: 10
    periodSeconds: 5
  # ===== Target Allocator =====
  targetAllocator:
    enabled: true
    replicas: 3
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true
      serviceMonitorSelector: {}
      podMonitorSelector: {}
  # ===== Observability =====
  observability:
    metrics:
      enableMetrics: true
3.2.3 A DaemonSet Mode Example
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-collector
spec:
  mode: daemonset  # One instance per node
  # DaemonSet update strategy
  daemonSetUpdateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  # Host network (access to node-level resources)
  hostNetwork: true
  config:
    receivers:
      # Host metrics
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          load: {}
          memory: {}
          disk: {}
          filesystem: {}
          network: {}
      # Kubernetes events
      k8s_events:
        namespaces: [default, kube-system]
    exporters:
      otlp:
        endpoint: central-collector:4317
    service:
      pipelines:
        metrics:
          receivers: [hostmetrics]
          exporters: [otlp]
4. Controller/Reconciler Implementation in Depth
4.1 The Reconciler Struct
File: internal/controllers/opentelemetrycollector_controller.go:58-67
type OpenTelemetryCollectorReconciler struct {
	// K8s client (cached)
	client.Client
	// Event recorder (emits K8s events)
	recorder record.EventRecorder
	// API scheme (registry of resource types)
	scheme *runtime.Scheme
	// Structured logger
	log logr.Logger
	// Operator configuration
	config config.Config
	// RBAC permission reviewer
	reviewer *internalRbac.Reviewer
	// Version upgrade handler
	upgrade *upgrade.VersionUpgrade
}
4.1.1 Dependency Breakdown
client.Client
// The smart client provided by controller-runtime
// Features:
// 1. Automatic caching (reduces API server load)
// 2. Type safety
// 3. FieldIndexer support (faster queries)
// 4. Automatic RESTMapper handling
// Usage example
err := r.Client.Get(ctx, types.NamespacedName{
	Name:      "my-collector",
	Namespace: "default",
}, &collector)
EventRecorder
// Records Kubernetes events,
// visible in kubectl describe output
r.recorder.Event(
	&instance,                                    // related object
	corev1.EventTypeNormal,                       // event type
	"Created",                                    // reason
	"Created OpenTelemetry Collector deployment", // message
)
Reviewer (RBAC checker)
// Checks the RBAC requirements implied by the Collector config.
// For example, a k8s_cluster receiver needs extra permissions.
if reviewer.NeedsClusterRole(collectorConfig) {
	// Create a ClusterRole and ClusterRoleBinding
}
4.2 Tracing a Complete Reconcile Pass
File: internal/controllers/opentelemetrycollector_controller.go:234-314
Let's walk through a full Reconcile pass step by step:
func (r *OpenTelemetryCollectorReconciler) Reconcile(
	ctx context.Context,
	req ctrl.Request,
) (ctrl.Result, error) {
	log := r.log.WithValues("opentelemetrycollector", req.NamespacedName)

	// ==================== Stage 1: fetch the CR ====================
	log.Info("Reconciling OpenTelemetryCollector")
	var instance v1beta1.OpenTelemetryCollector
	if err := r.Get(ctx, req.NamespacedName, &instance); err != nil {
		if !apierrors.IsNotFound(err) {
			log.Error(err, "unable to fetch OpenTelemetryCollector")
		}
		// The resource was deleted; ignore the error
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ==================== Stage 2: build the params ====================
	// Contains everything we need: config, Target Allocator, etc.
	params, err := r.GetParams(ctx, instance)
	if err != nil {
		log.Error(err, "Failed to create manifest.Params")
		return ctrl.Result{}, err
	}

	// ==================== Stage 3: handle deletion (finalizer pattern) ====================
	if deletionTimestamp := instance.GetDeletionTimestamp(); deletionTimestamp != nil {
		log.Info("Resource is being deleted")
		if controllerutil.ContainsFinalizer(&instance, collectorFinalizer) {
			log.Info("Running finalizer logic")
			// Clean up cluster-scoped resources: ClusterRole, ClusterRoleBinding, etc.
			if err = r.finalizeCollector(ctx, params); err != nil {
				log.Error(err, "Failed to finalize")
				return ctrl.Result{}, err
			}
			// Remove the finalizer
			if controllerutil.RemoveFinalizer(&instance, collectorFinalizer) {
				err = r.Update(ctx, &instance)
				if err != nil {
					return ctrl.Result{}, err
				}
			}
			log.Info("Finalizer removed, resource will be deleted")
		}
		return ctrl.Result{}, nil
	}

	// ==================== Stage 4: check the management state ====================
	// ManagementStateUnmanaged: the operator does not manage this resource
	if instance.Spec.ManagementState == v1beta1.ManagementStateUnmanaged {
		log.Info("Skipping reconciliation for unmanaged resource")
		return ctrl.Result{}, nil
	}

	// ==================== Stage 5: version upgrade ====================
	// Check whether the CR config needs upgrading to a newer format
	if r.upgrade.NeedsUpgrade(instance) {
		log.Info("Upgrading OpenTelemetryCollector configuration")
		err = r.upgrade.Upgrade(ctx, instance)
		if err != nil {
			log.Error(err, "Failed to upgrade")
			return ctrl.Result{}, err
		}
		// The CR was modified; requeue to trigger a fresh reconcile
		log.Info("Configuration upgraded, requeuing")
		return ctrl.Result{Requeue: true, RequeueAfter: 1 * time.Second}, nil
	}

	// ==================== Stage 6: add the finalizer ====================
	// The finalizer guarantees that cleanup logic runs on deletion
	if !controllerutil.ContainsFinalizer(&instance, collectorFinalizer) {
		log.Info("Adding finalizer")
		if controllerutil.AddFinalizer(&instance, collectorFinalizer) {
			err = r.Update(ctx, &instance)
			if err != nil {
				return ctrl.Result{}, err
			}
		}
	}

	// ==================== Stage 7: build the desired resources ====================
	log.Info("Building desired objects")
	desiredObjects, buildErr := BuildCollector(params)
	if buildErr != nil {
		log.Error(buildErr, "Failed to build desired objects")
		return ctrl.Result{}, buildErr
	}
	log.Info("Desired objects built", "count", len(desiredObjects))

	// ==================== Stage 8: find the existing resources ====================
	log.Info("Finding owned objects")
	ownedObjects, err := r.findOtelOwnedObjects(ctx, params)
	if err != nil {
		log.Error(err, "Failed to find owned objects")
		return ctrl.Result{}, err
	}
	log.Info("Found owned objects", "count", len(ownedObjects))

	// ==================== Stage 9: reconcile the resources ====================
	// Compare desired and current state; create/update/delete as needed
	log.Info("Reconciling desired objects")
	err = reconcileDesiredObjects(
		ctx,
		r.Client,
		log,
		&instance,
		params.Scheme,
		desiredObjects,
		ownedObjects,
	)

	// ==================== Stage 10: update the status ====================
	return collectorStatus.HandleReconcileStatus(ctx, log, params, instance, err)
}
4.2.1 A Closer Look at the BuildCollector Function
File: internal/controllers/opentelemetrycollector_controller.go
func BuildCollector(params manifests.Params) ([]client.Object, error) {
	var objects []client.Object
	// 1. ServiceAccount
	if sa := collector.ServiceAccount(params); sa != nil {
		objects = append(objects, sa)
	}
	// 2. ConfigMaps (Collector configuration)
	configMaps, err := collector.ConfigMaps(params)
	if err != nil {
		return nil, err
	}
	for _, cm := range configMaps {
		objects = append(objects, cm)
	}
	// 3. Pick the workload type based on the mode
	switch params.OtelCol.Spec.Mode {
	case v1beta1.ModeDeployment:
		deployment, err := collector.Deployment(params)
		if err != nil {
			return nil, err
		}
		objects = append(objects, deployment)
	case v1beta1.ModeDaemonSet:
		daemonset, err := collector.DaemonSet(params)
		if err != nil {
			return nil, err
		}
		objects = append(objects, daemonset)
	case v1beta1.ModeStatefulSet:
		statefulset, err := collector.StatefulSet(params)
		if err != nil {
			return nil, err
		}
		objects = append(objects, statefulset)
	}
	// 4. Services (port exposure)
	services := collector.Services(params)
	for _, svc := range services {
		objects = append(objects, svc)
	}
	// 5. Ingress (if configured)
	if params.OtelCol.Spec.Ingress.Type == v1beta1.IngressTypeIngress {
		if ingress := collector.Ingress(params); ingress != nil {
			objects = append(objects, ingress)
		}
	}
	// 6. HPA (if autoscaling is configured)
	if params.OtelCol.Spec.Autoscaler != nil {
		if hpa := collector.HorizontalPodAutoscaler(params); hpa != nil {
			objects = append(objects, hpa)
		}
	}
	// 7. RBAC (if needed)
	if params.Config.CreateRBACPermissions == rbac.Available {
		if clusterRole := collector.ClusterRole(params); clusterRole != nil {
			objects = append(objects, clusterRole)
		}
		if clusterRoleBinding := collector.ClusterRoleBinding(params); clusterRoleBinding != nil {
			objects = append(objects, clusterRoleBinding)
		}
	}
	// 8. ServiceMonitor/PodMonitor (observability)
	if params.OtelCol.Spec.Observability.Metrics.EnableMetrics {
		if sm := collector.ServiceMonitor(params); sm != nil {
			objects = append(objects, sm)
		}
	}
	// 9. Target Allocator (if enabled)
	if params.OtelCol.Spec.TargetAllocator.Enabled {
		taObjects, err := BuildTargetAllocator(params)
		if err != nil {
			return nil, err
		}
		objects = append(objects, taObjects...)
	}
	return objects, nil
}
4.2.2 The Reconciliation Logic: reconcileDesiredObjects
func reconcileDesiredObjects(
	ctx context.Context,
	client client.Client,
	log logr.Logger,
	owner client.Object,
	scheme *runtime.Scheme,
	desired []client.Object,
	existing map[types.UID]client.Object,
) error {
	// Set the OwnerReference on every desired resource
	for _, obj := range desired {
		if err := controllerutil.SetControllerReference(owner, obj, scheme); err != nil {
			return err
		}
	}
	// ========== Step 1: create or update the desired resources ==========
	for _, desiredObj := range desired {
		log := log.WithValues(
			"kind", desiredObj.GetObjectKind().GroupVersionKind().Kind,
			"name", desiredObj.GetName(),
		)
		// Try to fetch the existing resource
		existingObj := desiredObj.DeepCopyObject().(client.Object)
		err := client.Get(ctx, types.NamespacedName{
			Name:      desiredObj.GetName(),
			Namespace: desiredObj.GetNamespace(),
		}, existingObj)
		if err != nil && apierrors.IsNotFound(err) {
			// The resource does not exist: create it
			log.Info("Creating resource")
			if err := client.Create(ctx, desiredObj); err != nil {
				log.Error(err, "Failed to create resource")
				return err
			}
			log.Info("Resource created successfully")
		} else if err != nil {
			// Any other error
			return err
		} else {
			// The resource exists: update it
			log.Info("Updating resource")
			// Preserve fields that must not change
			desiredObj.SetResourceVersion(existingObj.GetResourceVersion())
			if err := client.Update(ctx, desiredObj); err != nil {
				log.Error(err, "Failed to update resource")
				return err
			}
			log.Info("Resource updated successfully")
			// Remove it from the to-delete set
			delete(existing, existingObj.GetUID())
		}
	}
	// ========== Step 2: delete the leftover resources ==========
	for uid, obj := range existing {
		log := log.WithValues(
			"kind", obj.GetObjectKind().GroupVersionKind().Kind,
			"name", obj.GetName(),
			"uid", uid,
		)
		log.Info("Deleting orphaned resource")
		if err := client.Delete(ctx, obj); err != nil {
			log.Error(err, "Failed to delete resource")
			return err
		}
		log.Info("Orphaned resource deleted")
	}
	return nil
}
4.3 Indexing and Cache Optimization
4.3.1 Setting Up the Field Indexer
File: internal/controllers/opentelemetrycollector_controller.go:334-354
func (r *OpenTelemetryCollectorReconciler) SetupCaches(cluster cluster.Cluster) error {
	ownedResources := r.GetOwnedResourceTypes()
	// Build an index for every owned resource type
	for _, resource := range ownedResources {
		if err := cluster.GetCache().IndexField(
			context.Background(),
			resource,
			resourceOwnerKey, // index key: ".metadata.owner"
			func(rawObj client.Object) []string {
				// Extract the owner reference
				owner := metav1.GetControllerOf(rawObj)
				if owner == nil {
					return nil
				}
				// Only index resources owned by an OpenTelemetryCollector
				if owner.Kind != "OpenTelemetryCollector" {
					return nil
				}
				// Use the owner's name as the index value
				return []string{owner.Name}
			},
		); err != nil {
			return err
		}
	}
	return nil
}
4.3.2 Using the Index to Speed Up Queries
func (r *OpenTelemetryCollectorReconciler) findOtelOwnedObjects(
	ctx context.Context,
	params manifests.Params,
) (map[types.UID]client.Object, error) {
	ownedObjects := map[types.UID]client.Object{}
	// Query through the index (instead of scanning every resource)
	listOpts := []client.ListOption{
		client.InNamespace(params.OtelCol.Namespace),
		client.MatchingFields{resourceOwnerKey: params.OtelCol.Name},
	}
	for _, objectType := range r.GetOwnedResourceTypes() {
		objs, err := getList(ctx, r, objectType, listOpts...)
		if err != nil {
			return nil, err
		}
		for uid, object := range objs {
			ownedObjects[uid] = object
		}
	}
	return ownedObjects, nil
}
Performance comparison:
| Approach | Time complexity | Notes |
| --- | --- | --- |
| Without the index | O(N) | N = number of Deployments in the whole cluster |
| With the index | O(M) | M = number of Deployments owned by this Collector |
In large clusters this can be a 10-100x speedup!
5. Manifest Builders in Detail
5.1 The Deployment Builder, Fully Explained
File: internal/manifests/collector/deployment.go
func Deployment(params manifests.Params) (*appsv1.Deployment, error) {
	name := naming.Collector(params.OtelCol.Name)

	// ========== 1. Build the labels ==========
	// Labels are used for:
	// - Service selectors
	// - HPA target selection
	// - Monitoring discovery
	labels := manifestutils.Labels(
		params.OtelCol.ObjectMeta,
		name,
		params.OtelCol.Spec.Image,
		ComponentOpenTelemetryCollector,
		params.Config.LabelsFilter,
	)
	// Example labels:
	// app.kubernetes.io/name: my-collector-collector
	// app.kubernetes.io/instance: my-collector
	// app.kubernetes.io/managed-by: opentelemetry-operator
	// app.kubernetes.io/component: opentelemetry-collector
	// app.kubernetes.io/version: 0.88.0

	// ========== 2. Build the annotations ==========
	annotations, err := manifestutils.Annotations(
		params.OtelCol,
		params.Config.AnnotationsFilter,
	)
	if err != nil {
		return nil, err
	}
	// Pod annotations (additionally carry the ConfigMap hash)
	podAnnotations, err := manifestutils.PodAnnotations(
		params.OtelCol,
		params.Config.AnnotationsFilter,
	)
	if err != nil {
		return nil, err
	}

	// ========== 3. Build the Deployment ==========
	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:        name,
			Namespace:   params.OtelCol.Namespace,
			Labels:      labels,
			Annotations: annotations,
		},
		Spec: appsv1.DeploymentSpec{
			// Replica count
			Replicas: manifestutils.GetInitialReplicas(params.OtelCol),
			// Selector (must match the pod labels)
			Selector: &metav1.LabelSelector{
				MatchLabels: manifestutils.SelectorLabels(
					params.OtelCol.ObjectMeta,
					ComponentOpenTelemetryCollector,
				),
			},
			// Update strategy
			Strategy: params.OtelCol.Spec.DeploymentUpdateStrategy,
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels:      labels,
					Annotations: podAnnotations,
				},
				Spec: corev1.PodSpec{
					// ServiceAccount
					ServiceAccountName: ServiceAccountName(params.OtelCol),
					// Init containers
					InitContainers: params.OtelCol.Spec.InitContainers,
					// Main container + extra containers
					Containers: append(
						params.OtelCol.Spec.AdditionalContainers,
						Container(params.Config, params.Log, params.OtelCol, true),
					),
					// Volumes
					Volumes: Volumes(params.Config, params.OtelCol),
					// Networking
					DNSPolicy:   manifestutils.GetDNSPolicy(...),
					DNSConfig:   &params.OtelCol.Spec.PodDNSConfig,
					HostNetwork: params.OtelCol.Spec.HostNetwork,
					// Process namespace sharing
					ShareProcessNamespace: &params.OtelCol.Spec.ShareProcessNamespace,
					// Scheduling
					Tolerations:               params.OtelCol.Spec.Tolerations,
					NodeSelector:              params.OtelCol.Spec.NodeSelector,
					Affinity:                  params.OtelCol.Spec.Affinity,
					TopologySpreadConstraints: params.OtelCol.Spec.TopologySpreadConstraints,
					// Security
					SecurityContext:   params.OtelCol.Spec.PodSecurityContext,
					PriorityClassName: params.OtelCol.Spec.PriorityClassName,
					// Graceful termination
					TerminationGracePeriodSeconds: params.OtelCol.Spec.TerminationGracePeriodSeconds,
				},
			},
		},
	}, nil
}
5.2 The Container Builder, Fully Explained
File: internal/manifests/collector/container.go:28-150
func Container(
	cfg config.Config,
	logger logr.Logger,
	otelcol v1beta1.OpenTelemetryCollector,
	addConfig bool,
) corev1.Container {
	// ========== 1. Pick the image ==========
	image := otelcol.Spec.Image
	if len(image) == 0 {
		image = cfg.CollectorImage // fall back to the default image
	}

	// ========== 2. Build the port list ==========
	ports := getContainerPorts(logger, otelcol)
	// Ports are parsed out of the Collector config,
	// e.g. otlp.protocols.grpc.endpoint: 0.0.0.0:4317
	// → ContainerPort{Name: "otlp-grpc", ContainerPort: 4317}

	// ========== 3. Build the startup arguments ==========
	var args []string
	var volumeMounts []corev1.VolumeMount
	// The config file is always the first argument
	if addConfig {
		args = append(args, fmt.Sprintf(
			"--config=/conf/%s",
			cfg.CollectorConfigMapEntry,
		))
		volumeMounts = append(volumeMounts, corev1.VolumeMount{
			Name:      naming.ConfigMapVolume(),
			MountPath: "/conf",
		})
	}
	// When Target Allocator mTLS is enabled
	if otelcol.Spec.TargetAllocator.Enabled &&
		cfg.CertManagerAvailability == certmanager.Available &&
		featuregate.EnableTargetAllocatorMTLS.IsEnabled() {
		volumeMounts = append(volumeMounts, corev1.VolumeMount{
			Name:      naming.TAClientCertificate(otelcol.Name),
			MountPath: constants.TACollectorTLSDirPath,
		})
	}
	// Extra user-supplied arguments (sorted for deterministic output)
	argsMap := otelcol.Spec.Args
	if argsMap == nil {
		argsMap = map[string]string{}
	}
	var sortedArgs []string
	for k, v := range argsMap {
		sortedArgs = append(sortedArgs, fmt.Sprintf("--%s=%s", k, v))
	}
	sort.Strings(sortedArgs)
	args = append(args, sortedArgs...)

	// ========== 4. Volume mounts ==========
	if len(otelcol.Spec.VolumeMounts) > 0 {
		volumeMounts = append(volumeMounts, otelcol.Spec.VolumeMounts...)
	}
	// Extra ConfigMap mounts
	for _, cm := range otelcol.Spec.ConfigMaps {
		volumeMounts = append(volumeMounts, corev1.VolumeMount{
			Name:      naming.ConfigMapExtra(cm.Name),
			MountPath: path.Join("/var/conf", cm.MountPath, naming.ConfigMapExtra(cm.Name)),
		})
	}

	// ========== 5. Health probes ==========
	// Liveness probe
	livenessProbe, err := otelcol.Spec.Config.GetLivenessProbe(logger)
	if err != nil {
		logger.Error(err, "cannot create liveness probe")
	} else {
		defaultProbeSettings(livenessProbe, otelcol.Spec.LivenessProbe)
	}
	// Readiness probe
	readinessProbe, err := otelcol.Spec.Config.GetReadinessProbe(logger)
	if err != nil {
		logger.Error(err, "cannot create readiness probe")
	} else {
		defaultProbeSettings(readinessProbe, otelcol.Spec.ReadinessProbe)
	}
	// Startup probe
	startupProbe, err := otelcol.Spec.Config.GetStartupProbe(logger)
	if err != nil {
		logger.Error(err, "cannot create startup probe")
	}

	// ========== 6. Environment variables ==========
	envVars := otelcol.Spec.Env
	// Append the default environment variables
	envVars = append(envVars, corev1.EnvVar{
		Name: "POD_NAME",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{
				FieldPath: "metadata.name",
			},
		},
	})

	// ========== 7. Return the finished container ==========
	return corev1.Container{
		Name:            naming.Container(),
		Image:           image,
		ImagePullPolicy: otelcol.Spec.ImagePullPolicy,
		Args:            args,
		Ports:           ports,
		VolumeMounts:    volumeMounts,
		Env:             envVars,
		EnvFrom:         otelcol.Spec.EnvFrom,
		Resources:       otelcol.Spec.Resources,
		LivenessProbe:   livenessProbe,
		ReadinessProbe:  readinessProbe,
		StartupProbe:    startupProbe,
		SecurityContext: otelcol.Spec.SecurityContext,
	}
}
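defaultProbeSettings is referenced above but not shown. A plausible sketch of its behavior (an assumption, not the actual source, including the pointer field types on the CR's Probe spec): overlay the user-specified timing fields onto the probe derived from the collector config.

// defaultProbeSettings (sketch, assumed behavior): copy user-specified
// timing fields from the CR's Probe spec onto the generated probe.
func defaultProbeSettings(probe *corev1.Probe, spec *v1beta1.Probe) {
	if probe == nil || spec == nil {
		return
	}
	if spec.InitialDelaySeconds != nil {
		probe.InitialDelaySeconds = *spec.InitialDelaySeconds
	}
	if spec.PeriodSeconds != nil {
		probe.PeriodSeconds = *spec.PeriodSeconds
	}
	if spec.FailureThreshold != nil {
		probe.FailureThreshold = *spec.FailureThreshold
	}
}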
5.2.1 Port Parsing Logic
func getContainerPorts(logger logr.Logger, otelcol v1beta1.OpenTelemetryCollector) []corev1.ContainerPort {
	var ports []corev1.ContainerPort
	// Parse the ports out of the Collector config
	cfg := otelcol.Spec.Config
	// Parse the receiver ports
	if receivers, ok := cfg.Receivers.Object["otlp"].(map[string]interface{}); ok {
		if protocols, ok := receivers["protocols"].(map[string]interface{}); ok {
			// gRPC port
			if grpc, ok := protocols["grpc"].(map[string]interface{}); ok {
				if endpoint, ok := grpc["endpoint"].(string); ok {
					port := extractPort(endpoint) // "0.0.0.0:4317" → 4317
					ports = append(ports, corev1.ContainerPort{
						Name:          "otlp-grpc",
						ContainerPort: port,
						Protocol:      corev1.ProtocolTCP,
					})
				}
			}
			// HTTP port
			if http, ok := protocols["http"].(map[string]interface{}); ok {
				if endpoint, ok := http["endpoint"].(string); ok {
					port := extractPort(endpoint) // "0.0.0.0:4318" → 4318
					ports = append(ports, corev1.ContainerPort{
						Name:          "otlp-http",
						ContainerPort: port,
						Protocol:      corev1.ProtocolTCP,
					})
				}
			}
		}
	}
	// User-defined ports
	for _, portSpec := range otelcol.Spec.Ports {
		ports = append(ports, corev1.ContainerPort{
			Name:          portSpec.Name,
			ContainerPort: portSpec.Port,
			Protocol:      portSpec.Protocol,
		})
	}
	return ports
}
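extractPort is referenced above but not shown. A minimal sketch of what it does (assumed behavior; errors simplified to a zero return, and the function needs "net" and "strconv" from the standard library):

import (
	"net"
	"strconv"
)

// extractPort (sketch): pull the numeric port out of a "host:port"
// endpoint string, e.g. "0.0.0.0:4317" → 4317.
func extractPort(endpoint string) int32 {
	_, portStr, err := net.SplitHostPort(endpoint)
	if err != nil {
		return 0
	}
	port, err := strconv.ParseInt(portStr, 10, 32)
	if err != nil {
		return 0
	}
	return int32(port)
}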
5.3 The ConfigMap Builder and Versioning
File: internal/manifests/collector/configmap.go
func ConfigMaps(params manifests.Params) ([]*corev1.ConfigMap, error) {
	var configMaps []*corev1.ConfigMap

	// ========== 1. Build the main ConfigMap ==========
	name := naming.ConfigMap(params.OtelCol.Name)
	// Serialize the Collector config to YAML
	configYAML, err := params.OtelCol.Spec.Config.Yaml()
	if err != nil {
		return nil, err
	}
	// Hash the config (used to trigger pod restarts; see the sketch below)
	configHash := hash(configYAML)
	configMap := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      fmt.Sprintf("%s-%s", name, configHash[:8]),
			Namespace: params.OtelCol.Namespace,
			Labels: manifestutils.Labels(
				params.OtelCol.ObjectMeta,
				name,
				params.OtelCol.Spec.Image,
				ComponentOpenTelemetryCollector,
				params.Config.LabelsFilter,
			),
			Annotations: map[string]string{
				"opentelemetry.io/config-hash": configHash,
				"opentelemetry.io/created-at":  time.Now().Format(time.RFC3339),
			},
		},
		Data: map[string]string{
			params.Config.CollectorConfigMapEntry: configYAML,
		},
	}
	configMaps = append(configMaps, configMap)

	// ========== 2. The Target Allocator ConfigMap ==========
	if params.OtelCol.Spec.TargetAllocator.Enabled {
		taConfigMap, err := BuildTargetAllocatorConfigMap(params)
		if err != nil {
			return nil, err
		}
		configMaps = append(configMaps, taConfigMap)
	}
	return configMaps, nil
}
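The hash helper used above is not shown either; a reasonable sketch (an assumption about the implementation): a hex-encoded SHA-256 of the rendered config YAML. Any config change produces a new hash, hence a new ConfigMap name, which in turn rolls the pods that mount it.

import (
	"crypto/sha256"
	"fmt"
)

// hash (sketch, assumed implementation): hex-encoded SHA-256 of the
// rendered collector config YAML.
func hash(configYAML string) string {
	sum := sha256.Sum256([]byte(configYAML))
	return fmt.Sprintf("%x", sum)
}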
5.3.1 The ConfigMap Versioning Implementation
File: internal/controllers/opentelemetrycollector_controller.go:146-158
func getCollectorConfigMapsToKeep(
	configVersionsToKeep int,
	configMaps []*corev1.ConfigMap,
) []*corev1.ConfigMap {
	configVersionsToKeep = max(1, configVersionsToKeep)
	// Sort by creation time (newest first)
	sort.Slice(configMaps, func(i, j int) bool {
		iTime := configMaps[i].GetCreationTimestamp().Time
		jTime := configMaps[j].GetCreationTimestamp().Time
		return iTime.After(jTime)
	})
	// Keep the newest N
	configMapsToKeep := min(configVersionsToKeep, len(configMaps))
	return configMaps[:configMapsToKeep]
}
How it works:
Config update → new ConfigMap is created → old ConfigMaps are kept (for rollback)
Example (with configVersions: 3; pruning is sketched below):
1. my-collector-abcd1234 (currently in use)
2. my-collector-efgh5678 (kept)
3. my-collector-ijkl9012 (kept)
4. my-collector-mnop3456 (will be deleted)
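Pruning the excess versions then follows directly from the keep list. A hedged sketch (pruneConfigMaps is our illustrative name, not the operator's actual function) of deleting everything outside it:

// pruneConfigMaps (illustrative sketch): delete every owned ConfigMap
// that is not in the keep list returned by getCollectorConfigMapsToKeep.
func pruneConfigMaps(ctx context.Context, c client.Client, all, keep []*corev1.ConfigMap) error {
	keepSet := make(map[string]struct{}, len(keep))
	for _, cm := range keep {
		keepSet[cm.Name] = struct{}{}
	}
	for _, cm := range all {
		if _, ok := keepSet[cm.Name]; ok {
			continue
		}
		if err := c.Delete(ctx, cm); err != nil {
			return err
		}
	}
	return nil
}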
6. Webhook Mechanisms in Depth
6.1 Webhook Overview
Kubernetes admission webhooks allow resources to be intercepted and modified before they are created or updated.
User: kubectl apply -f pod.yaml
│
▼
API Server
│
├──► Mutating webhook (modifies the resource)
│    - Sidecar injection
│    - Adding labels/annotations
│    - Adjusting container config
│
├──► Validating webhook (validates the resource)
│    - Checking config validity
│    - Enforcing business rules
│
└──► Stored in etcd
6.2 The Pod Mutation Webhook Implementation
File: internal/webhook/podmutation/webhookhandler.go
6.2.1 The Webhook Handler Struct
type podMutationWebhook struct {
	client      client.Client     // K8s client
	decoder     admission.Decoder // request decoder
	logger      logr.Logger       // logger
	podMutators []PodMutator      // chain of pod mutators
	config      config.Config     // configuration
}

// The PodMutator interface
type PodMutator interface {
	Mutate(ctx context.Context, ns corev1.Namespace, pod corev1.Pod) (corev1.Pod, error)
}
6.2.2 Webhook Registration
// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=ignore,groups="",resources=pods,verbs=create,versions=v1,name=mpod.kb.io,sideEffects=none,admissionReviewVersions=v1
Parameter breakdown (wiring into the manager is sketched after this list):
path: the webhook path
mutating=true: a mutating webhook
failurePolicy=ignore: let the request proceed on failure (high availability)
resources=pods: intercept Pods only
verbs=create: intercept create operations only
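Registration with controller-runtime's webhook server looks roughly like this (a sketch; the helper name is ours, but GetWebhookServer().Register is the real controller-runtime API, and the path must match the annotation above):

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

// registerPodMutationWebhook wires an admission.Handler (like the one
// below) into the manager's webhook server under /mutate-v1-pod.
func registerPodMutationWebhook(mgr ctrl.Manager, handler admission.Handler) {
	mgr.GetWebhookServer().Register("/mutate-v1-pod", &webhook.Admission{Handler: handler})
}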
6.2.3 The Full Handle Method
func (p *podMutationWebhook) Handle(
	ctx context.Context,
	req admission.Request,
) admission.Response {
	log := p.logger.WithValues("namespace", req.Namespace, "name", req.Name)
	log.Info("Webhook called")

	// ========== 1. Decode the Pod ==========
	pod := corev1.Pod{}
	err := p.decoder.Decode(req, &pod)
	if err != nil {
		log.Error(err, "Failed to decode pod")
		return admission.Errored(http.StatusBadRequest, err)
	}
	log.Info("Pod decoded", "podName", pod.Name)

	// ========== 2. Fetch the Namespace ==========
	ns := corev1.Namespace{}
	err = p.client.Get(ctx, types.NamespacedName{
		Name:      req.Namespace,
		Namespace: "",
	}, &ns)
	if err != nil {
		log.Error(err, "Failed to get namespace")
		res := admission.Errored(http.StatusInternalServerError, err)
		// Set Allowed = true: even on error, do not block pod creation
		res.Allowed = true
		return res
	}

	// ========== 3. Run every mutator ==========
	originalPod := pod.DeepCopy()
	for i, mutator := range p.podMutators {
		log := log.WithValues("mutatorIndex", i)
		log.Info("Applying mutator")
		pod, err = mutator.Mutate(ctx, ns, pod)
		if err != nil {
			log.Error(err, "Mutator failed")
			res := admission.Errored(http.StatusInternalServerError, err)
			res.Allowed = true // errors do not block creation
			return res
		}
	}

	// ========== 4. Check whether anything changed ==========
	if reflect.DeepEqual(originalPod, &pod) {
		log.Info("No changes made, allowing")
		return admission.Allowed("no changes")
	}

	// ========== 5. Produce the JSONPatch ==========
	marshaledPod, err := json.Marshal(pod)
	if err != nil {
		log.Error(err, "Failed to marshal pod")
		res := admission.Errored(http.StatusInternalServerError, err)
		res.Allowed = true
		return res
	}
	log.Info("Returning patch response")
	return admission.PatchResponseFromRaw(req.Object.Raw, marshaledPod)
}
6.3 Sidecar Injection
6.3.1 The Sidecar Mutator
File: internal/webhook/podmutation/sidecar_mutator.go
type sidecarMutator struct {
	client client.Client
	logger logr.Logger
	config config.Config
}

func (s *sidecarMutator) Mutate(
	ctx context.Context,
	ns corev1.Namespace,
	pod corev1.Pod,
) (corev1.Pod, error) {
	log := s.logger.WithValues("pod", pod.Name, "namespace", ns.Name)

	// ========== 1. Check whether injection is requested ==========
	// Annotation: sidecar.opentelemetry.io/inject
	injectAnnotation, exists := pod.Annotations["sidecar.opentelemetry.io/inject"]
	if !exists || injectAnnotation == "false" {
		log.V(1).Info("Sidecar injection not requested")
		return pod, nil
	}
	log.Info("Sidecar injection requested", "value", injectAnnotation)

	// ========== 2. Locate the OpenTelemetryCollector ==========
	var collectorName string
	var err error
	if injectAnnotation == "true" {
		// Auto-discover: look for a mode=sidecar Collector in the same
		// namespace (findSidecarCollector is sketched after this listing)
		collectorName, err = s.findSidecarCollector(ctx, ns.Name)
		if err != nil {
			return pod, err
		}
		log.Info("Auto-detected collector", "name", collectorName)
	} else {
		// Use the explicitly named Collector
		collectorName = injectAnnotation
		log.Info("Using specified collector", "name", collectorName)
	}
	// Fetch the Collector CR
	collector := &v1beta1.OpenTelemetryCollector{}
	err = s.client.Get(ctx, types.NamespacedName{
		Name:      collectorName,
		Namespace: ns.Name,
	}, collector)
	if err != nil {
		log.Error(err, "Failed to get collector")
		return pod, err
	}
	// Validate the mode
	if collector.Spec.Mode != v1beta1.ModeSidecar {
		return pod, fmt.Errorf("collector %s is not in sidecar mode", collectorName)
	}

	// ========== 3. Inject the sidecar container ==========
	sidecarContainer := s.buildSidecarContainer(collector)
	pod.Spec.Containers = append(pod.Spec.Containers, sidecarContainer)

	// ========== 4. Add the volumes ==========
	volumes := s.buildSidecarVolumes(collector)
	pod.Spec.Volumes = append(pod.Spec.Volumes, volumes...)

	// ========== 5. Adjust the app containers' environment ==========
	// Point the OTLP endpoint at localhost
	for i := range pod.Spec.Containers {
		if pod.Spec.Containers[i].Name == sidecarContainer.Name {
			continue // skip the sidecar container itself
		}
		pod.Spec.Containers[i].Env = append(
			pod.Spec.Containers[i].Env,
			corev1.EnvVar{
				Name:  "OTEL_EXPORTER_OTLP_ENDPOINT",
				Value: "http://localhost:4317",
			},
		)
	}
	log.Info("Sidecar injected successfully")
	return pod, nil
}

func (s *sidecarMutator) buildSidecarContainer(
	collector *v1beta1.OpenTelemetryCollector,
) corev1.Container {
	return corev1.Container{
		Name:  "otc-sidecar",
		Image: collector.Spec.Image,
		Args: []string{
			"--config=/conf/collector.yaml",
		},
		Ports: []corev1.ContainerPort{
			{Name: "otlp-grpc", ContainerPort: 4317},
			{Name: "otlp-http", ContainerPort: 4318},
		},
		VolumeMounts: []corev1.VolumeMount{
			{
				Name:      "otc-config",
				MountPath: "/conf",
			},
		},
		Resources: collector.Spec.Resources,
	}
}
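findSidecarCollector is referenced above but not shown. A hedged sketch of the lookup (assumed behavior: return the single sidecar-mode Collector in the pod's namespace, erroring otherwise):

// findSidecarCollector (sketch, assumed behavior): list the collectors
// in the namespace and return the one running in sidecar mode.
func (s *sidecarMutator) findSidecarCollector(ctx context.Context, namespace string) (string, error) {
	var collectors v1beta1.OpenTelemetryCollectorList
	if err := s.client.List(ctx, &collectors, client.InNamespace(namespace)); err != nil {
		return "", err
	}
	var candidates []string
	for _, col := range collectors.Items {
		if col.Spec.Mode == v1beta1.ModeSidecar {
			candidates = append(candidates, col.Name)
		}
	}
	if len(candidates) != 1 {
		return "", fmt.Errorf("expected exactly one sidecar-mode collector in %q, found %d", namespace, len(candidates))
	}
	return candidates[0], nil
}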
6.3.2 The Effect of Sidecar Injection
Original Pod:
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    sidecar.opentelemetry.io/inject: "true"
spec:
  containers:
    - name: app
      image: myapp:v1.0
The Pod after injection:
apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    sidecar.opentelemetry.io/inject: "true"
    sidecar.opentelemetry.io/injected: "true"
spec:
  containers:
    # The original app container (modified)
    - name: app
      image: myapp:v1.0
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://localhost:4317
    # The injected sidecar container
    - name: otc-sidecar
      image: otel/opentelemetry-collector:0.88.0
      args:
        - --config=/conf/collector.yaml
      ports:
        - name: otlp-grpc
          containerPort: 4317
        - name: otlp-http
          containerPort: 4318
      volumeMounts:
        - name: otc-config
          mountPath: /conf
  volumes:
    - name: otc-config
      configMap:
        name: my-sidecar-collector
7. Complete Development Environment Setup
7.1 Prerequisite Tools
7.1.1 The Go Toolchain
# ========== macOS ==========
brew install go@1.24
# Set environment variables
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
# Verify
go version  # should print go1.24 or newer
# ========== Linux ==========
# Download and install
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz
# Add to PATH (append to ~/.bashrc or ~/.zshrc)
export PATH=$PATH:/usr/local/go/bin
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin
# Reload the shell config
source ~/.bashrc  # or source ~/.zshrc
7.1.2 Docker
# ========== macOS ==========
brew install --cask docker
# Or download Docker Desktop: https://www.docker.com/products/docker-desktop
# ========== Linux (Ubuntu/Debian) ==========
# Install dependencies
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release
# Add the Docker GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
# Set up the repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
# Install Docker Engine
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin
# Allow non-root users to use Docker
sudo usermod -aG docker $USER
newgrp docker
# Verify
docker --version
docker run hello-world
7.1.3 kubectl
# ========== macOS ==========
brew install kubectl
# ========== Linux ==========
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
# Verify
kubectl version --client
7.1.4 Kind (Kubernetes in Docker)
# ========== macOS ==========
brew install kind
# ========== Linux ==========
go install sigs.k8s.io/kind@v0.20.0
# Or use the binary release
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
# Verify
kind version
7.1.5 Kustomize
# ========== macOS ==========
brew install kustomize
# ========== Linux ==========
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/
# Verify
kustomize version
7.1.6 controller-gen (Kubebuilder tooling)
# Install controller-gen (generates CRDs and deepcopy code)
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest
# Verify
controller-gen --version
7.1.7 Operator SDK (optional)
# ========== macOS ==========
brew install operator-sdk
# ========== Linux ==========
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)
export OS=$(uname | awk '{print tolower($0)}')
export OPERATOR_SDK_DL_URL=https://github.com/operator-framework/operator-sdk/releases/download/v1.29.0
curl -LO ${OPERATOR_SDK_DL_URL}/operator-sdk_${OS}_${ARCH}
chmod +x operator-sdk_${OS}_${ARCH}
sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk
# Verify
operator-sdk version
7.2 Project Setup
7.2.1 Cloning the Project
# Clone the OpenTelemetry Operator
git clone https://github.com/open-telemetry/opentelemetry-operator.git
cd opentelemetry-operator
# List the branches
git branch -a
# Switch to the latest stable branch (if needed)
git checkout main
7.2.2 Getting to Know the Makefile
The OpenTelemetry Operator's Makefile provides a rich set of targets:
File: Makefile
# ========== List all available targets ==========
make help
# Key targets:
make manifests     # Generate CRD YAML
make generate      # Generate Go code (deepcopy, etc.)
make fmt           # Format code
make vet           # Static analysis
make test          # Run unit tests
make docker-build  # Build the Docker image
make install       # Install the CRDs into the cluster
make deploy        # Deploy the operator to the cluster
make run           # Run the operator locally
make kind-cluster  # Create a Kind test cluster
make e2e           # Run the E2E tests
7.2.3 Creating a Kind Cluster
File: Makefile:101
# ========== Create via the Makefile (recommended) ==========
make kind-cluster
# This will:
# 1. Create a Kind cluster named "otel-operator"
# 2. Use the config file kind-1.33.yaml (or another version)
# 3. Install cert-manager automatically
# ========== Create manually ==========
# Create the cluster config file
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.33.0
  - role: worker
    image: kindest/node:v1.33.0
  - role: worker
    image: kindest/node:v1.33.0
EOF
# Create the cluster
kind create cluster \
  --name otel-operator \
  --config kind-config.yaml
# Verify the cluster
kubectl cluster-info --context kind-otel-operator
kubectl get nodes
7.2.4 Installing cert-manager
File: Makefile:106 (CERTMANAGER_VERSION)
# ========== Option 1: kubectl ==========
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# Wait for cert-manager to become ready
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager -n cert-manager
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-webhook -n cert-manager
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-cainjector -n cert-manager
# Verify
kubectl get pods -n cert-manager
# ========== Option 2: Helm ==========
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.0 \
  --set installCRDs=true
# Verify
kubectl get pods -n cert-manager
7.3 The Local Development Workflow
7.3.1 Generating CRDs and Code
# ========== 1. Generate the CRD YAML ==========
make manifests
# This:
# - Generates CRD YAML from the annotations in apis/**/types.go
# - Writes the output to config/crd/bases/
# - Uses the controller-gen tool
# Inspect the generated CRDs
ls -lh config/crd/bases/
# opentelemetry.io_instrumentations.yaml
# opentelemetry.io_opentelemetrycollectors.yaml
# opentelemetry.io_targetallocators.yaml
# opentelemetry.io_opampbridges.yaml
# ========== 2. Generate Go code ==========
make generate
# This:
# - Generates the DeepCopy methods (zz_generated.deepcopy.go)
# - Uses the controller-gen tool
# Inspect the generated code
find apis -name "zz_generated.*.go"
The underlying commands (from the Makefile):
# manifests target
controller-gen \
  crd:generateEmbeddedObjectMeta=true,maxDescLen=0 \
  rbac:roleName=manager-role \
  webhook \
  paths="./..." \
  output:crd:artifacts:config=config/crd/bases
# generate target
controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
7.3.2 Installing the CRDs into the Cluster
# ========== Install the CRDs ==========
make install
# Equivalent to:
# kustomize build config/crd | kubectl apply -f -
# Verify the CRDs are installed
kubectl get crd | grep opentelemetry
# You should see:
# instrumentations.opentelemetry.io
# opampbridges.opentelemetry.io
# opentelemetrycollectors.opentelemetry.io
# targetallocators.opentelemetry.io
# Inspect the CRD details
kubectl explain opentelemetrycollector
kubectl explain opentelemetrycollector.spec
kubectl explain opentelemetrycollector.spec.config
7.3.3 Running the Operator Locally
Option 1: via the Makefile (recommended)
File: Makefile:202-204
# Run locally (without deploying to the cluster)
make run
# This:
# 1. Runs make generate fmt vet manifests
# 2. Executes go run ./main.go --zap-devel
# 3. The operator runs locally and connects to the Kind cluster
# Webhooks are disabled by default when running locally.
# To enable them:
make run ENABLE_WEBHOOKS=true
Option 2: go run directly
# Set environment variables
export WATCH_NAMESPACE=default  # watch only the default namespace
# (leave unset to watch all namespaces)
# Disable webhooks (usually disabled for local runs)
export ENABLE_WEBHOOKS=false
# Run
go run ./main.go \
  --zap-devel \
  --metrics-addr=:8080 \
  --enable-leader-election=false \
  --health-probe-addr=:8081
# Flags:
# --zap-devel: development-mode logging (more verbose)
# --metrics-addr: Prometheus metrics address
# --enable-leader-election: disable leader election (single local instance)
# --health-probe-addr: health check address
Sample log output:
2025-01-15T10:00:00.000Z INFO setup Starting the OpenTelemetry Operator
2025-01-15T10:00:00.001Z INFO setup opentelemetry-operator version {"version": "0.92.0"}
2025-01-15T10:00:00.002Z INFO setup build-date {"date": "2025-01-15"}
2025-01-15T10:00:00.003Z INFO setup go-version {"version": "go1.24.0"}
2025-01-15T10:00:00.010Z INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
2025-01-15T10:00:00.011Z INFO setup starting manager
2025-01-15T10:00:00.011Z INFO controller Starting EventSource {"controller": "opentelemetrycollector", "source": "kind source: *v1beta1.OpenTelemetryCollector"}
2025-01-15T10:00:00.112Z INFO controller Starting Controller {"controller": "opentelemetrycollector"}
2025-01-15T10:00:00.112Z INFO controller Starting workers {"controller": "opentelemetrycollector", "worker count": 1}
7.3.4 Exercising the Operator
With the operator running locally, open another terminal:
# ========== Create a test CR ==========
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: test-local
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch: {}
    exporters:
      debug:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]
EOF
# ========== Watch the operator logs ==========
# In the terminal running make run you should see:
# INFO Reconciling OpenTelemetryCollector {"opentelemetrycollector": "default/test-local"}
# INFO Building desired objects
# INFO Creating resource {"kind": "ServiceAccount", "name": "test-local-collector"}
# INFO Creating resource {"kind": "ConfigMap", "name": "test-local-collector-abcd1234"}
# INFO Creating resource {"kind": "Deployment", "name": "test-local-collector"}
# INFO Creating resource {"kind": "Service", "name": "test-local-collector"}
# ========== Verify the resources ==========
kubectl get otelcol test-local
kubectl get deployment test-local-collector
kubectl get service test-local-collector
kubectl get configmap -l app.kubernetes.io/instance=test-local
# ========== Tail the Collector logs ==========
kubectl logs -f deployment/test-local-collector
# ========== Test a config update ==========
kubectl patch otelcol test-local --type=merge -p '
{
  "spec": {
    "config": {
      "exporters": {
        "debug": {
          "verbosity": "normal"
        }
      }
    }
  }
}'
# Watch the operator logs; you should see:
# - a new ConfigMap being created
# - the Deployment being updated (triggering a rolling update)
# ========== Clean up ==========
kubectl delete otelcol test-local
7.3.5 Debugging Tips
Debugging with Delve
# Install Delve
go install github.com/go-delve/delve/cmd/dlv@latest
# Run under Delve
dlv debug ./main.go -- \
  --zap-devel \
  --enable-leader-election=false
# Set a breakpoint
(dlv) break internal/controllers/opentelemetrycollector_controller.go:234
(dlv) continue
# The breakpoint fires once you create a CR
Debugging with VS Code
Create .vscode/launch.json:
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Operator",
      "type": "go",
      "request": "launch",
      "mode": "debug",
      "program": "${workspaceFolder}/main.go",
      "env": {
        "ENABLE_WEBHOOKS": "false"
      },
      "args": [
        "--zap-devel",
        "--enable-leader-election=false"
      ]
    }
  ]
}
Press F5 to start debugging; you can set breakpoints, inspect variables, and so on.
7.3.6 Deploying to the Cluster (Full Flow)
# ========== 1. Build the image ==========
# Set the image name
export IMG=localhost:5000/opentelemetry-operator:dev
make docker-build IMG=$IMG
# ========== 2. Push to a local registry (for Kind) ==========
# Create a local registry (if you don't have one)
docker run -d -p 5000:5000 --name kind-registry registry:2
# Push the image
docker push $IMG
# Or load it straight into Kind
kind load docker-image $IMG --name otel-operator
# ========== 3. Deploy the operator ==========
make deploy IMG=$IMG
# This:
# 1. Installs the CRDs
# 2. Creates the namespace (opentelemetry-operator-system)
# 3. Deploys the operator
# 4. Creates the RBAC resources
# 5. Sets up the webhooks
# ========== 4. Verify the deployment ==========
kubectl get pods -n opentelemetry-operator-system
# You should see:
# NAME READY STATUS RESTARTS AGE
# opentelemetry-operator-controller-manager-xxxxxxxxx-xxxxx 2/2 Running 0 1m
# View the logs
kubectl logs -f -n opentelemetry-operator-system \
  deployment/opentelemetry-operator-controller-manager \
  -c manager
# ========== 5. Test ==========
kubectl apply -f config/samples/core_v1beta1_opentelemetrycollector.yaml
kubectl get otelcol
kubectl get all -l app.kubernetes.io/managed-by=opentelemetry-operator
# ========== 6. Clean up ==========
make undeploy
7.4 Common Development Commands at a Glance
# ========== Code generation ==========
make manifests              # Generate the CRDs
make generate               # Generate deepcopy code
make bundle                 # Generate the operator bundle (OLM)
# ========== Code quality ==========
make fmt                    # Format code
make vet                    # Static analysis
make lint                   # Linting (requires golangci-lint)
make test                   # Run unit tests
make test-coverage          # Test coverage
# ========== Builds ==========
make manager                # Build the operator binary
make targetallocator        # Build the Target Allocator
make operator-opamp-bridge  # Build the OpAMP Bridge
make docker-build           # Build the Docker image
# ========== Deployment ==========
make install                # Install the CRDs
make uninstall              # Uninstall the CRDs
make deploy                 # Deploy the operator
make undeploy               # Remove the operator
make run                    # Run locally
# ========== Testing ==========
make kind-cluster           # Create the Kind cluster
make kind-delete-cluster    # Delete the Kind cluster
make e2e                    # E2E tests
make e2e-targetallocator    # Target Allocator E2E
make e2e-instrumentation    # Instrumentation E2E
# ========== Cleanup ==========
make clean                  # Remove generated files
7.5 Troubleshooting the Development Environment
7.5.1 Common Problems
Problem 1: controller-gen not found
# Error message
controller-gen: command not found
# Fix
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest
Problem 2: the Kind cluster is unreachable
# Check the clusters
kind get clusters
# Check the kubeconfig
kubectl config get-contexts
# Switch context
kubectl config use-context kind-otel-operator
Problem 3: cert-manager is not ready
# Check the cert-manager pods
kubectl get pods -n cert-manager
# View the logs
kubectl logs -n cert-manager deployment/cert-manager
# Reinstall
kubectl delete namespace cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
Problem 4: webhook errors
# Error message
Error from server (InternalError): error when creating "test.yaml": Internal error occurred: failed calling webhook
# Fix (when running locally)
export ENABLE_WEBHOOKS=false
make run
8. Testing Strategy and Practice
8.1 The Test Pyramid
        /\
       /  \
      / E2E \          (few, critical scenarios)
     /------\
    / Integr. \        (moderate, API interactions)
   /----------\
  / Unit tests \       (many, fast feedback)
 /--------------\
Test coverage in the OpenTelemetry Operator:
Unit tests: internal/**/*_test.go
Integration tests: tests/e2e*/**
Performance tests: load-testing scripts
8.2 Unit Testing in Depth
8.2.1 Test Structure
File: internal/controllers/reconcile_test.go
This file is 56 KB and contains a large number of test cases!
# Test statistics
wc -l internal/controllers/reconcile_test.go
# roughly 1500+ lines
# Run the unit tests
make test
# Run a specific package
go test ./internal/controllers -v
# Run a specific test
go test ./internal/controllers -v -run TestReconcile_Deployment
8.2.2 Writing a Unit Test
Create a test file: internal/controllers/my_feature_test.go
package controllers

import (
	"context"
	"testing"

	"github.com/go-logr/logr"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/tools/record"
	"k8s.io/utils/ptr"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/fake"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"

	"github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
	"github.com/open-telemetry/opentelemetry-operator/internal/config"
)

func TestReconcile_CustomFeature(t *testing.T) {
	// ========== Arrange ==========
	ctx := context.Background()
	// Create a test OpenTelemetryCollector
	otelCol := &v1beta1.OpenTelemetryCollector{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "test-collector",
			Namespace: "default",
		},
		Spec: v1beta1.OpenTelemetryCollectorSpec{
			Mode: v1beta1.ModeDeployment,
			Config: v1beta1.Config{
				Receivers: v1beta1.AnyConfig{
					Object: map[string]interface{}{
						"otlp": map[string]interface{}{
							"protocols": map[string]interface{}{
								"grpc": map[string]interface{}{
									"endpoint": "0.0.0.0:4317",
								},
							},
						},
					},
				},
				Exporters: v1beta1.AnyConfig{
					Object: map[string]interface{}{
						"debug": map[string]interface{}{},
					},
				},
				Service: v1beta1.Service{
					Pipelines: map[string]*v1beta1.Pipeline{
						"traces": {
							Receivers: []string{"otlp"},
							Exporters: []string{"debug"},
						},
					},
				},
			},
		},
	}
	// Create a fake client
	scheme := runtime.NewScheme()
	_ = v1beta1.AddToScheme(scheme)
	_ = appsv1.AddToScheme(scheme)
	_ = corev1.AddToScheme(scheme)
	k8sClient := fake.NewClientBuilder().
		WithScheme(scheme).
		WithObjects(otelCol).
		WithStatusSubresource(&v1beta1.OpenTelemetryCollector{}).
		Build()
	// Create the reconciler (same package, so the unexported fields are accessible)
	reconciler := &OpenTelemetryCollectorReconciler{
		Client:   k8sClient,
		scheme:   scheme,
		log:      logr.Discard(),
		recorder: record.NewFakeRecorder(10),
		config:   config.New(),
	}

	// ========== Act ==========
	req := reconcile.Request{
		NamespacedName: types.NamespacedName{
			Name:      "test-collector",
			Namespace: "default",
		},
	}
	result, err := reconciler.Reconcile(ctx, req)

	// ========== Assert ==========
	require.NoError(t, err)
	assert.False(t, result.Requeue)
	// Verify the Deployment
	deployment := &appsv1.Deployment{}
	err = k8sClient.Get(ctx, types.NamespacedName{
		Name:      "test-collector-collector",
		Namespace: "default",
	}, deployment)
	require.NoError(t, err)
	// Verify the replica count
	assert.Equal(t, int32(1), *deployment.Spec.Replicas)
	// Verify the container
	assert.Len(t, deployment.Spec.Template.Spec.Containers, 1)
	container := deployment.Spec.Template.Spec.Containers[0]
	assert.Equal(t, "otc-container", container.Name)
	// Verify the ports
	require.Len(t, container.Ports, 1)
	assert.Equal(t, "otlp-grpc", container.Ports[0].Name)
	assert.Equal(t, int32(4317), container.Ports[0].ContainerPort)
	// Verify the Service
	service := &corev1.Service{}
	err = k8sClient.Get(ctx, types.NamespacedName{
		Name:      "test-collector-collector",
		Namespace: "default",
	}, service)
	require.NoError(t, err)
	assert.Len(t, service.Spec.Ports, 1)
	// Verify the ConfigMap
	configMaps := &corev1.ConfigMapList{}
	err = k8sClient.List(ctx, configMaps,
		client.InNamespace("default"),
		client.MatchingLabels{"app.kubernetes.io/instance": "test-collector"},
	)
	require.NoError(t, err)
	assert.Len(t, configMaps.Items, 1)
	// Verify the status
	err = k8sClient.Get(ctx, types.NamespacedName{
		Name:      "test-collector",
		Namespace: "default",
	}, otelCol)
	require.NoError(t, err)
	assert.NotEmpty(t, otelCol.Status.Version)
}
// Table-driven 測試範例
func TestReconcile_DifferentModes(t *testing.T) {
tests := []struct {
name string
mode v1alpha1.Mode
replicas *int32
wantType string
}{
{
name: "Deployment mode",
mode: v1alpha1.ModeDeployment,
replicas: ptr.To(int32(3)),
wantType: "Deployment",
},
{
name: "DaemonSet mode",
mode: v1beta1.ModeDaemonSet,
replicas: nil,
wantType: "DaemonSet",
},
{
name: "StatefulSet mode",
mode: v1beta1.ModeStatefulSet,
replicas: ptr.To(int32(2)),
wantType: "StatefulSet",
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
// 測試邏輯(一種補全方式見本函式後的示意)...
})
}
}
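上表中每個 t.Run 省略的測試邏輯,一種補全方式如下(示意;newTestCollector、newTestReconciler 為假設的輔助函式,封裝前面 TestReconcile_CustomFeature 中建立 CR、scheme、fake client 與 reconciler 的步驟):
t.Run(tt.name, func(t *testing.T) {
	ctx := context.Background()
	// 依表格參數建立 CR 與 reconciler(輔助函式為假設)
	otelCol := newTestCollector(tt.name, tt.mode, tt.replicas)
	k8sClient, reconciler := newTestReconciler(otelCol)
	_, err := reconciler.Reconcile(ctx, reconcile.Request{
		NamespacedName: types.NamespacedName{Name: otelCol.Name, Namespace: otelCol.Namespace},
	})
	require.NoError(t, err)
	// 依模式斷言對應的工作負載被建立
	key := types.NamespacedName{Name: otelCol.Name + "-collector", Namespace: otelCol.Namespace}
	switch tt.wantType {
	case "Deployment":
		dep := &appsv1.Deployment{}
		require.NoError(t, k8sClient.Get(ctx, key, dep))
		assert.Equal(t, tt.replicas, dep.Spec.Replicas)
	case "DaemonSet":
		require.NoError(t, k8sClient.Get(ctx, key, &appsv1.DaemonSet{}))
	case "StatefulSet":
		sts := &appsv1.StatefulSet{}
		require.NoError(t, k8sClient.Get(ctx, key, sts))
		assert.Equal(t, tt.replicas, sts.Spec.Replicas)
	}
})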
8.2.3 運行測試
# ========== 運行所有測試 ==========
make test
# ========== 運行特定包 ==========
go test ./internal/controllers -v
# ========== 運行特定測試 ==========
go test ./internal/controllers -v -run TestReconcile_Deployment
# ========== 運行測試並查看覆蓋率 ==========
go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out
# ========== 並行運行測試 ==========
go test ./... -parallel=4
# ========== 運行測試並顯示詳細輸出 ==========
go test ./internal/controllers -v -count=1
# ========== 測試特定功能 ==========
go test ./internal/manifests/collector -v -run TestDeployment
8.3 E2E 測試深度實踐
8.3.1 E2E 測試架構
OpenTelemetry Operator 使用 Chainsaw 進行 E2E 測試。
測試目錄結構:
tests/
├── e2e/ # 基礎功能
│ ├── smoke/ # 煙霧測試
│ ├── smoke-targetallocator/ # TA 煙霧測試
│ └── smoke-sidecar/ # Sidecar 測試
├── e2e-instrumentation/ # 自動埋點測試
├── e2e-targetallocator/ # Target Allocator
├── e2e-opampbridge/ # OpAMP Bridge
└── e2e-upgrade/ # 升級測試
8.3.2 Chainsaw 測試範例
檔案: tests/e2e/smoke/00-install.yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: simplest
spec:
config:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
exporters:
debug:
verbosity: detailed
service:
pipelines:
traces:
receivers: [otlp]
exporters: [debug]
檔案: tests/e2e/smoke/01-assert.yaml
# 斷言 Deployment 被創建並就緒
apiVersion: apps/v1
kind: Deployment
metadata:
name: simplest-collector
spec:
replicas: 1
status:
readyReplicas: 1
availableReplicas: 1
---
# 斷言 Service 被創建
apiVersion: v1
kind: Service
metadata:
name: simplest-collector
spec:
ports:
- name: otlp-grpc
port: 4317
protocol: TCP
targetPort: 4317
檔案: tests/e2e/smoke/chainsaw-test.yaml
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
name: smoke
spec:
steps:
- name: Install collector
try:
- apply:
file: 00-install.yaml
- assert:
file: 01-assert.yaml
catch:
- describe:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
- describe:
apiVersion: apps/v1
kind: Deployment
- podLogs:
selector: app.kubernetes.io/name=simplest-collector
8.3.3 運行 E2E 測試
# ========== 1. 創建測試集群 ==========
make kind-cluster
# ========== 2. 構建並載入鏡像 ==========
make docker-build
make docker-build-targetallocator
make docker-build-operator-opamp-bridge
# 載入鏡像到 Kind
kind load docker-image \
ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:latest \
--name otel-operator
# ========== 3. 部署 Operator ==========
make deploy IMG=ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:latest
# ========== 4. 運行 E2E 測試 ==========
make e2e
# 運行特定測試
make e2e-instrumentation
make e2e-targetallocator
make e2e-upgrade
# ========== 5. 查看測試結果 ==========
# Chainsaw 會輸出詳細的測試報告
# ========== 6. 清理 ==========
make kind-delete-cluster
8.3.4 編寫自定義 E2E 測試
創建測試目錄:
mkdir -p tests/e2e-custom-feature
cd tests/e2e-custom-feature
創建測試清單 (00-install.yaml):
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: custom-test
spec:
mode: deployment
replicas: 2
config:
receivers:
otlp:
protocols:
grpc: {}
processors:
batch:
send_batch_size: 1000
timeout: 10s
exporters:
debug: {}
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [debug]
創建斷言 (01-assert.yaml):
apiVersion: apps/v1
kind: Deployment
metadata:
name: custom-test-collector
spec:
replicas: 2
status:
readyReplicas: 2
---
apiVersion: v1
kind: ConfigMap
metadata:
# ConfigMap 名稱帶有配置 hash 後綴,此處以萬用字元示意(實際斷言語法依 Chainsaw 版本而定)
name: (custom-test-collector-*)
data:
collector.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
創建 Chainsaw 測試 (chainsaw-test.yaml):
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
name: custom-feature
spec:
description: Test custom feature
timeouts:
apply: 30s
assert: 60s
steps:
- name: Install collector
try:
- apply:
file: 00-install.yaml
- assert:
file: 01-assert.yaml
timeout: 2m
catch:
- describe:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
- podLogs:
selector: app.kubernetes.io/name=custom-test-collector
- name: Test configuration update
try:
- patch:
resource:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
name: custom-test
merge:
spec:
replicas: 3
- sleep:
duration: 10s
- assert:
resource:
apiVersion: apps/v1
kind: Deployment
metadata:
name: custom-test-collector
spec:
replicas: 3
- name: Cleanup
try:
- delete:
ref:
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
name: custom-test
運行自定義測試:
# 安裝 Chainsaw
go install github.com/kyverno/chainsaw@latest
# 運行測試
chainsaw test --test-dir ./tests/e2e-custom-feature
8.4 測試覆蓋率
# ========== 生成覆蓋率報告 ==========
go test ./... -coverprofile=coverage.out
# ========== 查看覆蓋率摘要 ==========
go tool cover -func=coverage.out
# 輸出範例:
# github.com/open-telemetry/opentelemetry-operator/internal/controllers/opentelemetrycollector_controller.go:234: Reconcile 85.7%
# github.com/open-telemetry/opentelemetry-operator/internal/manifests/collector/deployment.go:17: Deployment 92.3%
# ========== 生成 HTML 報告 ==========
go tool cover -html=coverage.out -o coverage.html
# 在瀏覽器中打開
open coverage.html # macOS
xdg-open coverage.html # Linux
# ========== 按包查看覆蓋率 ==========
go test ./internal/controllers -coverprofile=controllers.out
go tool cover -func=controllers.out
8.5 性能測試與基準測試
8.5.1 Benchmark 測試
創建: internal/controllers/reconcile_bench_test.go
package controllers_test
import (
"context"
"testing"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
"sigs.k8s.io/controller-runtime/pkg/client/fake"
"sigs.k8s.io/controller-runtime/pkg/reconcile"
"github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
"github.com/open-telemetry/opentelemetry-operator/internal/controllers"
)
func BenchmarkReconcile(b *testing.B) {
// 準備測試數據
otelCol := &v1beta1.OpenTelemetryCollector{
ObjectMeta: metav1.ObjectMeta{
Name: "benchmark-test",
Namespace: "default",
},
Spec: v1beta1.OpenTelemetryCollectorSpec{
Mode: v1beta1.ModeDeployment,
Config: v1beta1.Config{
// ... 配置
},
},
}
// fake client 必須註冊 scheme,否則無法識別自定義資源
scheme := runtime.NewScheme()
_ = v1beta1.AddToScheme(scheme)
k8sClient := fake.NewClientBuilder().
WithScheme(scheme).
WithObjects(otelCol).
Build()
reconciler := &controllers.OpenTelemetryCollectorReconciler{
Client: k8sClient,
// ... 其他字段
}
req := reconcile.Request{
NamespacedName: types.NamespacedName{
Name: "benchmark-test",
Namespace: "default",
},
}
ctx := context.Background()
// 運行 benchmark
b.ResetTimer()
for i := 0; i < b.N; i++ {
_, err := reconciler.Reconcile(ctx, req)
if err != nil {
b.Fatal(err)
}
}
}
運行 Benchmark:
# 運行 benchmark
go test -bench=. ./internal/controllers
# 輸出範例:
# BenchmarkReconcile-8 1000 1234567 ns/op
# 帶內存分析
go test -bench=. -benchmem ./internal/controllers
# 輸出範例:
# BenchmarkReconcile-8 1000 1234567 ns/op 123456 B/op 1234 allocs/op
# 生成 CPU profile
go test -bench=. -cpuprofile=cpu.prof ./internal/controllers
# 分析 profile
go tool pprof cpu.prof
九、實戰專案:Nginx Operator
在本章中,我們將從零開始實作一個完整的 Nginx Operator,應用前面學到的所有概念。
9.1 專案目標與架構
9.1.1 功能需求
我們的 Nginx Operator 需要支援:
自動部署 Nginx:根據 CR 創建 Deployment
配置管理:支援自定義 nginx.conf
服務暴露:自動創建 Service
配置熱更新:ConfigMap 變更後自動觸發滾動更新
版本管理:支援 Nginx 版本升級
健康檢查:配置 liveness 和 readiness probe
9.1.2 CRD 設計
API 定義 (apis/v1alpha1/nginx_types.go):
package v1alpha1
import (
corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
// NginxSpec defines the desired state of Nginx
type NginxSpec struct {
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=10
// +kubebuilder:default=1
Replicas *int32 `json:"replicas,omitempty"`
// +kubebuilder:validation:Pattern=`^nginx:.*$`
// +kubebuilder:default="nginx:1.25"
Image string `json:"image,omitempty"`
// +optional
Resources corev1.ResourceRequirements `json:"resources,omitempty"`
// +optional
Config string `json:"config,omitempty"`
// +optional
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
// +kubebuilder:default=80
Port int32 `json:"port,omitempty"`
// +optional
// +kubebuilder:validation:Enum=ClusterIP;NodePort;LoadBalancer
// +kubebuilder:default=ClusterIP
ServiceType corev1.ServiceType `json:"serviceType,omitempty"`
}
// NginxStatus defines the observed state of Nginx
type NginxStatus struct {
// Conditions represent the latest available observations of the Nginx state
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty"`
// ReadyReplicas is the number of ready replicas
// +optional
ReadyReplicas int32 `json:"readyReplicas,omitempty"`
// ConfigVersion tracks the version of the ConfigMap
// +optional
ConfigVersion string `json:"configVersion,omitempty"`
// LastUpdateTime is the last time the status was updated
// +optional
LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.readyReplicas
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Image",type=string,JSONPath=`.spec.image`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`
// Nginx is the Schema for the nginxes API
type Nginx struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec NginxSpec `json:"spec,omitempty"`
Status NginxStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
// NginxList contains a list of Nginx
type NginxList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []Nginx `json:"items"`
}
func init() {
SchemeBuilder.Register(&Nginx{}, &NginxList{})
}
關鍵註解說明:
+kubebuilder:subresource:status:啟用 status subresource
+kubebuilder:subresource:scale:支援 kubectl scale 命令
+kubebuilder:printcolumn:自定義 kubectl get 輸出列
+kubebuilder:validation:欄位驗證規則
9.2 Controller 實作
9.2.1 Reconciler 主邏輯
Controller (internal/controller/nginx_controller.go):
package controller
import (
"context"
"crypto/sha256"
"fmt"
"time"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
apierrors "k8s.io/apimachinery/pkg/api/errors"
"k8s.io/apimachinery/pkg/api/meta"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
"sigs.k8s.io/controller-runtime/pkg/log"
webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)
const (
nginxFinalizer = "nginx.webapp.example.com/finalizer"
// Condition types
TypeAvailable = "Available"
TypeProgressing = "Progressing"
TypeDegraded = "Degraded"
)
// NginxReconciler reconciles a Nginx object
type NginxReconciler struct {
client.Client
Scheme *runtime.Scheme
}
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services;configmaps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
logger.Info("Reconciling Nginx", "namespace", req.Namespace, "name", req.Name)
// 1. 獲取 Nginx 資源
nginx := &webappv1alpha1.Nginx{}
if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
if apierrors.IsNotFound(err) {
logger.Info("Nginx resource not found, likely deleted")
return ctrl.Result{}, nil
}
logger.Error(err, "Failed to get Nginx")
return ctrl.Result{}, err
}
// 2. 處理刪除邏輯 (Finalizer)
if !nginx.ObjectMeta.DeletionTimestamp.IsZero() {
return r.reconcileDelete(ctx, nginx)
}
// 3. 添加 Finalizer
if !controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
controllerutil.AddFinalizer(nginx, nginxFinalizer)
if err := r.Update(ctx, nginx); err != nil {
logger.Error(err, "Failed to add finalizer")
return ctrl.Result{}, err
}
return ctrl.Result{Requeue: true}, nil
}
// 4. Reconcile ConfigMap
configMap, configVersion, err := r.reconcileConfigMap(ctx, nginx)
if err != nil {
r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "ConfigMapFailed", err.Error())
_ = r.Status().Update(ctx, nginx)
return ctrl.Result{}, err
}
// 5. Reconcile Deployment
if err := r.reconcileDeployment(ctx, nginx, configVersion); err != nil {
r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "DeploymentFailed", err.Error())
_ = r.Status().Update(ctx, nginx)
return ctrl.Result{}, err
}
// 6. Reconcile Service
if err := r.reconcileService(ctx, nginx); err != nil {
r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "ServiceFailed", err.Error())
_ = r.Status().Update(ctx, nginx)
return ctrl.Result{}, err
}
// 7. 更新 Status
if err := r.updateStatus(ctx, nginx, configVersion); err != nil {
logger.Error(err, "Failed to update status")
return ctrl.Result{}, err
}
// 8. 設置成功 Condition
r.setCondition(nginx, TypeAvailable, metav1.ConditionTrue, "ReconcileSuccess", "Nginx is available")
r.setCondition(nginx, TypeDegraded, metav1.ConditionFalse, "ReconcileSuccess", "")
if err := r.Status().Update(ctx, nginx); err != nil {
return ctrl.Result{}, err
}
logger.Info("Successfully reconciled Nginx")
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
// reconcileConfigMap 管理 ConfigMap
func (r *NginxReconciler) reconcileConfigMap(ctx context.Context, nginx *webappv1alpha1.Nginx) (*corev1.ConfigMap, string, error) {
logger := log.FromContext(ctx)
// 默認 nginx 配置
defaultConfig := `
events {
worker_connections 1024;
}
http {
server {
listen 80;
location / {
root /usr/share/nginx/html;
index index.html;
}
}
}`
config := nginx.Spec.Config
if config == "" {
config = defaultConfig
}
// 計算配置的版本(hash)
hash := sha256.Sum256([]byte(config))
configVersion := fmt.Sprintf("%x", hash[:8])
configMap := &corev1.ConfigMap{
ObjectMeta: metav1.ObjectMeta{
Name: nginx.Name + "-config",
Namespace: nginx.Namespace,
Labels: map[string]string{
"app": "nginx",
"nginx": nginx.Name,
"version": configVersion,
},
},
Data: map[string]string{
"nginx.conf": config,
},
}
// 設置 Owner Reference
if err := controllerutil.SetControllerReference(nginx, configMap, r.Scheme); err != nil {
return nil, "", err
}
// 創建或更新 ConfigMap
existing := &corev1.ConfigMap{}
err := r.Get(ctx, types.NamespacedName{
Name: configMap.Name,
Namespace: configMap.Namespace,
}, existing)
if err != nil {
if apierrors.IsNotFound(err) {
logger.Info("Creating ConfigMap", "name", configMap.Name)
if err := r.Create(ctx, configMap); err != nil {
return nil, "", err
}
return configMap, configVersion, nil
}
return nil, "", err
}
// 更新 ConfigMap
existing.Data = configMap.Data
existing.Labels = configMap.Labels
logger.Info("Updating ConfigMap", "name", configMap.Name)
if err := r.Update(ctx, existing); err != nil {
return nil, "", err
}
return existing, configVersion, nil
}
// reconcileDeployment 管理 Deployment
func (r *NginxReconciler) reconcileDeployment(ctx context.Context, nginx *webappv1alpha1.Nginx, configVersion string) error {
logger := log.FromContext(ctx)
replicas := int32(1)
if nginx.Spec.Replicas != nil {
replicas = *nginx.Spec.Replicas
}
image := "nginx:1.25"
if nginx.Spec.Image != "" {
image = nginx.Spec.Image
}
port := int32(80)
if nginx.Spec.Port != 0 {
port = nginx.Spec.Port
}
deployment := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: nginx.Name,
Namespace: nginx.Namespace,
Labels: map[string]string{
"app": "nginx",
"nginx": nginx.Name,
},
},
Spec: appsv1.DeploymentSpec{
Replicas: &replicas,
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"app": "nginx",
"nginx": nginx.Name,
},
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{
"app": "nginx",
"nginx": nginx.Name,
"config-version": configVersion, // 配置版本變更時觸發滾動更新
},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{
Name: "nginx",
Image: image,
Ports: []corev1.ContainerPort{
{
Name: "http",
ContainerPort: port,
Protocol: corev1.ProtocolTCP,
},
},
VolumeMounts: []corev1.VolumeMount{
{
Name: "config",
MountPath: "/etc/nginx/nginx.conf",
SubPath: "nginx.conf",
},
},
Resources: nginx.Spec.Resources,
LivenessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/",
Port: intstr.FromInt(int(port)),
},
},
InitialDelaySeconds: 10,
PeriodSeconds: 10,
},
ReadinessProbe: &corev1.Probe{
ProbeHandler: corev1.ProbeHandler{
HTTPGet: &corev1.HTTPGetAction{
Path: "/",
Port: intstr.FromInt(int(port)),
},
},
InitialDelaySeconds: 5,
PeriodSeconds: 5,
},
},
},
Volumes: []corev1.Volume{
{
Name: "config",
VolumeSource: corev1.VolumeSource{
ConfigMap: &corev1.ConfigMapVolumeSource{
LocalObjectReference: corev1.LocalObjectReference{
Name: nginx.Name + "-config",
},
},
},
},
},
},
},
},
}
// 設置 Owner Reference
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
return err
}
// 創建或更新 Deployment
existing := &appsv1.Deployment{}
err := r.Get(ctx, types.NamespacedName{
Name: deployment.Name,
Namespace: deployment.Namespace,
}, existing)
if err != nil {
if apierrors.IsNotFound(err) {
logger.Info("Creating Deployment", "name", deployment.Name)
return r.Create(ctx, deployment)
}
return err
}
// 更新 Deployment
existing.Spec = deployment.Spec
logger.Info("Updating Deployment", "name", deployment.Name)
return r.Update(ctx, existing)
}
// reconcileService 管理 Service
func (r *NginxReconciler) reconcileService(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
logger := log.FromContext(ctx)
port := int32(80)
if nginx.Spec.Port != 0 {
port = nginx.Spec.Port
}
serviceType := corev1.ServiceTypeClusterIP
if nginx.Spec.ServiceType != "" {
serviceType = nginx.Spec.ServiceType
}
service := &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: nginx.Name,
Namespace: nginx.Namespace,
Labels: map[string]string{
"app": "nginx",
"nginx": nginx.Name,
},
},
Spec: corev1.ServiceSpec{
Type: serviceType,
Selector: map[string]string{
"app": "nginx",
"nginx": nginx.Name,
},
Ports: []corev1.ServicePort{
{
Name: "http",
Protocol: corev1.ProtocolTCP,
Port: port,
TargetPort: intstr.FromInt(int(port)),
},
},
},
}
// 設置 Owner Reference
if err := controllerutil.SetControllerReference(nginx, service, r.Scheme); err != nil {
return err
}
// 創建或更新 Service
existing := &corev1.Service{}
err := r.Get(ctx, types.NamespacedName{
Name: service.Name,
Namespace: service.Namespace,
}, existing)
if err != nil {
if apierrors.IsNotFound(err) {
logger.Info("Creating Service", "name", service.Name)
return r.Create(ctx, service)
}
return err
}
// 更新 Service(保留 ClusterIP)
existing.Spec.Ports = service.Spec.Ports
existing.Spec.Selector = service.Spec.Selector
existing.Spec.Type = service.Spec.Type
logger.Info("Updating Service", "name", service.Name)
return r.Update(ctx, existing)
}
// updateStatus 更新 Nginx 狀態
func (r *NginxReconciler) updateStatus(ctx context.Context, nginx *webappv1alpha1.Nginx, configVersion string) error {
// 獲取 Deployment 狀態
deployment := &appsv1.Deployment{}
err := r.Get(ctx, types.NamespacedName{
Name: nginx.Name,
Namespace: nginx.Namespace,
}, deployment)
if err != nil {
return err
}
// 更新狀態
nginx.Status.ReadyReplicas = deployment.Status.ReadyReplicas
nginx.Status.ConfigVersion = configVersion
now := metav1.Now()
nginx.Status.LastUpdateTime = &now
// 設置 Progressing condition
if deployment.Status.UpdatedReplicas < *deployment.Spec.Replicas {
r.setCondition(nginx, TypeProgressing, metav1.ConditionTrue, "Updating", "Deployment is being updated")
} else {
r.setCondition(nginx, TypeProgressing, metav1.ConditionFalse, "Updated", "All replicas are updated")
}
return r.Status().Update(ctx, nginx)
}
// reconcileDelete 處理刪除邏輯
func (r *NginxReconciler) reconcileDelete(ctx context.Context, nginx *webappv1alpha1.Nginx) (ctrl.Result, error) {
logger := log.FromContext(ctx)
logger.Info("Deleting Nginx", "name", nginx.Name)
if controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
// 執行清理邏輯(例如清理外部資源)
logger.Info("Performing cleanup for Nginx", "name", nginx.Name)
// 移除 finalizer
controllerutil.RemoveFinalizer(nginx, nginxFinalizer)
if err := r.Update(ctx, nginx); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
// setCondition 設置 Condition
func (r *NginxReconciler) setCondition(nginx *webappv1alpha1.Nginx, condType string, status metav1.ConditionStatus, reason, message string) {
condition := metav1.Condition{
Type: condType,
Status: status,
Reason: reason,
Message: message,
LastTransitionTime: metav1.Now(),
ObservedGeneration: nginx.Generation,
}
meta.SetStatusCondition(&nginx.Status.Conditions, condition)
}
// SetupWithManager sets up the controller with the Manager.
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&webappv1alpha1.Nginx{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Owns(&corev1.ConfigMap{}).
Complete(r)
}
9.2.2 核心概念詳解
1. ConfigMap 版本管理
// 計算配置的 SHA256 hash 作為版本號
hash := sha256.Sum256([]byte(config))
configVersion := fmt.Sprintf("%x", hash[:8])
// 將版本號添加到 Pod labels
Labels: map[string]string{
"config-version": configVersion, // 版本變更時自動觸發 rolling update
}
2. Owner Reference 自動清理
// 設置 Owner Reference 後,刪除 Nginx 時會自動刪除子資源
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
return err
}
3. Finalizer 清理邏輯
// 添加 finalizer 防止資源被立即刪除
if !controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
controllerutil.AddFinalizer(nginx, nginxFinalizer)
if err := r.Update(ctx, nginx); err != nil {
return ctrl.Result{}, err
}
}
// 刪除時執行清理
if !nginx.ObjectMeta.DeletionTimestamp.IsZero() {
// 執行清理邏輯
// ...
// 移除 finalizer 允許刪除
controllerutil.RemoveFinalizer(nginx, nginxFinalizer)
if err := r.Update(ctx, nginx); err != nil {
return ctrl.Result{}, err
}
}
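順帶一提,前面反覆出現的「Get → 不存在則 Create、存在則 Update」模式,也可以改用 controller-runtime 內建的 controllerutil.CreateOrUpdate 簡化。以下為最小示意(以 ConfigMap 為例,ensureConfigMap 為假設的函式名):
// 以 controllerutil.CreateOrUpdate 簡化建立/更新邏輯(示意)
func (r *NginxReconciler) ensureConfigMap(ctx context.Context, nginx *webappv1alpha1.Nginx, config string) error {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      nginx.Name + "-config",
			Namespace: nginx.Namespace,
		},
	}
	// mutate 函式只描述期望狀態,CreateOrUpdate 會自行判斷該 Create 還是 Update
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, cm, func() error {
		cm.Data = map[string]string{"nginx.conf": config}
		return controllerutil.SetControllerReference(nginx, cm, r.Scheme)
	})
	return err
}
這種寫法天然冪等,也省去了先 Get 再分支的樣板代碼。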
9.3 使用範例
9.3.1 部署 Operator
# 1. 生成 CRD 和 RBAC manifests
make manifests
# 2. 安裝 CRD
make install
# 3. 運行 operator(開發模式)
make run
# 或者部署到集群
make docker-build docker-push IMG=your-registry/nginx-operator:v1.0.0
make deploy IMG=your-registry/nginx-operator:v1.0.0
9.3.2 創建 Nginx 實例
基本範例 (config/samples/nginx_basic.yaml):
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
name: nginx-sample
namespace: default
spec:
replicas: 3
image: nginx:1.25
port: 80
serviceType: ClusterIP
自定義配置範例 (config/samples/nginx_custom.yaml):
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
name: nginx-custom
namespace: default
spec:
replicas: 2
image: nginx:1.25
port: 8080
serviceType: LoadBalancer
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
config: |
events {
worker_connections 2048;
}
http {
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for"';
access_log /var/log/nginx/access.log main;
upstream backend {
server backend-1.example.com;
server backend-2.example.com;
}
server {
listen 8080;
location / {
proxy_pass http://backend;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
location /health {
access_log off;
return 200 "healthy\n";
}
}
}
9.3.3 測試和驗證
# 1. 創建 Nginx 實例
kubectl apply -f config/samples/nginx_basic.yaml
# 2. 查看 Nginx 狀態
kubectl get nginx nginx-sample
# 輸出:
# NAME REPLICAS READY IMAGE AGE
# nginx-sample 3 3 nginx:1.25 2m
# 3. 查看詳細狀態
kubectl describe nginx nginx-sample
# 4. 查看生成的資源
kubectl get deployment,svc,cm -l nginx=nginx-sample
# 5. 測試服務
kubectl port-forward svc/nginx-sample 8080:80
curl http://localhost:8080
# 6. 更新配置(觸發滾動更新)
kubectl edit nginx nginx-sample
# 修改 spec.config 或 spec.replicas
# 7. 觀察滾動更新
kubectl rollout status deployment/nginx-sample
# 8. 查看 Conditions
kubectl get nginx nginx-sample -o jsonpath='{.status.conditions}' | jq
# 9. Scale 測試
kubectl scale nginx nginx-sample --replicas=5
kubectl get nginx nginx-sample
# 10. 刪除測試
kubectl delete nginx nginx-sample
# 驗證子資源被自動清理
kubectl get deployment,svc,cm -l nginx=nginx-sample
9.4 進階功能實作
9.4.1 支援 Ingress 自動創建
擴展 CRD:
type NginxSpec struct {
// ... 現有字段
// +optional
Ingress *IngressSpec `json:"ingress,omitempty"`
}
type IngressSpec struct {
// +kubebuilder:validation:Required
Host string `json:"host"`
// +optional
TLS bool `json:"tls,omitempty"`
// +optional
Annotations map[string]string `json:"annotations,omitempty"`
}
Controller 添加 Ingress reconcile:
func (r *NginxReconciler) reconcileIngress(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
if nginx.Spec.Ingress == nil {
return nil
}
pathType := networkingv1.PathTypePrefix
ingress := &networkingv1.Ingress{
ObjectMeta: metav1.ObjectMeta{
Name: nginx.Name,
Namespace: nginx.Namespace,
Annotations: nginx.Spec.Ingress.Annotations,
},
Spec: networkingv1.IngressSpec{
Rules: []networkingv1.IngressRule{
{
Host: nginx.Spec.Ingress.Host,
IngressRuleValue: networkingv1.IngressRuleValue{
HTTP: &networkingv1.HTTPIngressRuleValue{
Paths: []networkingv1.HTTPIngressPath{
{
Path: "/",
PathType: &pathType,
Backend: networkingv1.IngressBackend{
Service: &networkingv1.IngressServiceBackend{
Name: nginx.Name,
Port: networkingv1.ServiceBackendPort{
Number: nginx.Spec.Port,
},
},
},
},
},
},
},
},
},
},
}
if nginx.Spec.Ingress.TLS {
ingress.Spec.TLS = []networkingv1.IngressTLS{
{
Hosts: []string{nginx.Spec.Ingress.Host},
SecretName: nginx.Name + "-tls",
},
}
}
if err := controllerutil.SetControllerReference(nginx, ingress, r.Scheme); err != nil {
return err
}
// 創建或更新邏輯...
return nil
}
9.4.2 支援多端口暴露
擴展 CRD:
type PortSpec struct {
// +kubebuilder:validation:Required
Name string `json:"name"`
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=65535
Port int32 `json:"port"`
// +optional
// +kubebuilder:default="TCP"
Protocol corev1.Protocol `json:"protocol,omitempty"`
}
type NginxSpec struct {
// ... 現有字段
// +optional
AdditionalPorts []PortSpec `json:"additionalPorts,omitempty"`
}
Controller 處理多端口:
func (r *NginxReconciler) buildServicePorts(nginx *webappv1alpha1.Nginx) []corev1.ServicePort {
ports := []corev1.ServicePort{
{
Name: "http",
Port: nginx.Spec.Port,
TargetPort: intstr.FromInt(int(nginx.Spec.Port)),
Protocol: corev1.ProtocolTCP,
},
}
for _, p := range nginx.Spec.AdditionalPorts {
ports = append(ports, corev1.ServicePort{
Name: p.Name,
Port: p.Port,
TargetPort: intstr.FromInt(int(p.Port)),
Protocol: p.Protocol,
})
}
return ports
}
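容器端口可以用同樣方式構建,讓 Deployment 與 Service 的端口清單保持一致(示意,buildContainerPorts 為假設的輔助函式):
// buildContainerPorts 與 buildServicePorts 一一對應(示意)
func (r *NginxReconciler) buildContainerPorts(nginx *webappv1alpha1.Nginx) []corev1.ContainerPort {
	ports := []corev1.ContainerPort{
		{Name: "http", ContainerPort: nginx.Spec.Port, Protocol: corev1.ProtocolTCP},
	}
	for _, p := range nginx.Spec.AdditionalPorts {
		ports = append(ports, corev1.ContainerPort{
			Name:          p.Name,
			ContainerPort: p.Port,
			Protocol:      p.Protocol,
		})
	}
	return ports
}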
9.5 完整測試範例
9.5.1 單元測試
Controller 測試 (internal/controller/nginx_controller_test.go):
package controller
import (
"context"
"time"
. "github.com/onsi/ginkgo/v2"
. "github.com/onsi/gomega"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
"k8s.io/utils/pointer"
webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)
var _ = Describe("Nginx Controller", func() {
const (
NginxName = "test-nginx"
NginxNamespace = "default"
timeout = time.Second * 10
interval = time.Millisecond * 250
)
Context("When creating a Nginx resource", func() {
It("Should create Deployment, Service, and ConfigMap", func() {
ctx := context.Background()
nginx := &webappv1alpha1.Nginx{
ObjectMeta: metav1.ObjectMeta{
Name: NginxName,
Namespace: NginxNamespace,
},
Spec: webappv1alpha1.NginxSpec{
Replicas: pointer.Int32(2),
Image: "nginx:1.25",
Port: 80,
},
}
Expect(k8sClient.Create(ctx, nginx)).Should(Succeed())
// 驗證 Deployment 被創建
deployment := &appsv1.Deployment{}
Eventually(func() bool {
err := k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName,
Namespace: NginxNamespace,
}, deployment)
return err == nil
}, timeout, interval).Should(BeTrue())
Expect(*deployment.Spec.Replicas).Should(Equal(int32(2)))
Expect(deployment.Spec.Template.Spec.Containers[0].Image).Should(Equal("nginx:1.25"))
// 驗證 Service 被創建
service := &corev1.Service{}
Eventually(func() bool {
err := k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName,
Namespace: NginxNamespace,
}, service)
return err == nil
}, timeout, interval).Should(BeTrue())
Expect(service.Spec.Ports[0].Port).Should(Equal(int32(80)))
// 驗證 ConfigMap 被創建
configMap := &corev1.ConfigMap{}
Eventually(func() bool {
err := k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName + "-config",
Namespace: NginxNamespace,
}, configMap)
return err == nil
}, timeout, interval).Should(BeTrue())
Expect(configMap.Data).Should(HaveKey("nginx.conf"))
})
})
Context("When updating Nginx config", func() {
It("Should trigger rolling update", func() {
ctx := context.Background()
// 獲取當前 Nginx
nginx := &webappv1alpha1.Nginx{}
Expect(k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName,
Namespace: NginxNamespace,
}, nginx)).Should(Succeed())
// 更新配置
nginx.Spec.Config = "# updated config"
Expect(k8sClient.Update(ctx, nginx)).Should(Succeed())
// 驗證 ConfigMap 被更新
configMap := &corev1.ConfigMap{}
Eventually(func() string {
k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName + "-config",
Namespace: NginxNamespace,
}, configMap)
return configMap.Data["nginx.conf"]
}, timeout, interval).Should(ContainSubstring("updated config"))
// 驗證 Deployment Pod template 的 config-version label 改變
deployment := &appsv1.Deployment{}
oldVersion := ""
Eventually(func() bool {
k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName,
Namespace: NginxNamespace,
}, deployment)
newVersion := deployment.Spec.Template.Labels["config-version"]
if oldVersion == "" {
oldVersion = newVersion
return false
}
return newVersion != oldVersion
}, timeout, interval).Should(BeTrue())
})
})
Context("When deleting Nginx", func() {
It("Should cleanup all resources", func() {
ctx := context.Background()
nginx := &webappv1alpha1.Nginx{}
Expect(k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName,
Namespace: NginxNamespace,
}, nginx)).Should(Succeed())
Expect(k8sClient.Delete(ctx, nginx)).Should(Succeed())
// 驗證所有子資源被刪除
Eventually(func() bool {
deployment := &appsv1.Deployment{}
err := k8sClient.Get(ctx, types.NamespacedName{
Name: NginxName,
Namespace: NginxNamespace,
}, deployment)
return errors.IsNotFound(err)
}, timeout, interval).Should(BeTrue())
})
})
})
9.5.2 E2E 測試(Chainsaw)
測試套件 (tests/e2e/nginx/chainsaw-test.yaml):
apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
name: nginx-e2e
spec:
timeouts:
apply: 30s
assert: 1m
cleanup: 30s
steps:
- name: Create Nginx instance
try:
- apply:
file: 00-nginx-sample.yaml
- assert:
file: 01-assert-nginx-ready.yaml
- name: Test service connectivity
try:
- script:
content: |
kubectl run curl-test --image=curlimages/curl:latest --rm -i --restart=Never -- \
curl -s http://nginx-sample.default.svc.cluster.local
check:
($error == null): true
- name: Update configuration
try:
- apply:
file: 02-update-config.yaml
- assert:
file: 03-assert-rolling-update.yaml
- name: Scale up
try:
- script:
content: |
kubectl scale nginx nginx-sample --replicas=5
- assert:
file: 04-assert-scaled.yaml
- name: Cleanup
try:
- delete:
ref:
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
name: nginx-sample
- assert:
file: 05-assert-deleted.yaml
00-nginx-sample.yaml:
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
name: nginx-sample
namespace: default
spec:
replicas: 3
image: nginx:1.25
port: 80
serviceType: ClusterIP
01-assert-nginx-ready.yaml:
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
name: nginx-sample
namespace: default
status:
readyReplicas: 3
conditions:
- type: Available
status: "True"
reason: ReconcileSuccess
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-sample
namespace: default
spec:
replicas: 3
status:
readyReplicas: 3
updatedReplicas: 3
---
apiVersion: v1
kind: Service
metadata:
name: nginx-sample
namespace: default
spec:
type: ClusterIP
ports:
- port: 80
9.6 部署到生產環境
9.6.1 Kustomize 配置
Base (config/default/kustomization.yaml):
namePrefix: nginx-operator-
namespace: nginx-operator-system
resources:
- ../crd
- ../rbac
- ../manager
images:
- name: controller
newName: your-registry/nginx-operator
newTag: v1.0.0
Overlay for Production (config/overlays/production/kustomization.yaml):
bases:
- ../../default
patchesStrategicMerge:
- manager_resources.yaml
configMapGenerator:
- name: manager-config
literals:
- LOG_LEVEL=info
- ENABLE_WEBHOOKS=true
manager_resources.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: controller-manager
namespace: system
spec:
replicas: 3
template:
spec:
containers:
- name: manager
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
9.6.2 部署命令
# 構建鏡像
make docker-build IMG=your-registry/nginx-operator:v1.0.0
# 推送鏡像
make docker-push IMG=your-registry/nginx-operator:v1.0.0
# 部署到生產環境
kubectl apply -k config/overlays/production
# 驗證部署
kubectl get deployment -n nginx-operator-system
kubectl get pods -n nginx-operator-system
# 查看日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager -f
十、進階主題與最佳實踐
10.1 性能優化
10.1.1 使用 Field Indexer 加速查詢
當需要頻繁根據特定字段查詢資源時,Field Indexer 可以大幅提升性能。
範例:根據 Owner 快速查詢 Pod
參考 OpenTelemetry Operator 的實作 (main.go:126-139):
// 為 Pod 添加 owner.name 索引
if err := mgr.GetFieldIndexer().IndexField(
context.Background(),
&corev1.Pod{},
"metadata.ownerReferences.name",
func(rawObj client.Object) []string {
pod := rawObj.(*corev1.Pod)
var owners []string
for _, ref := range pod.GetOwnerReferences() {
owners = append(owners, ref.Name)
}
return owners
},
); err != nil {
setupLog.Error(err, "failed to create pod index")
os.Exit(1)
}
// 使用索引快速查詢
pods := &corev1.PodList{}
err := r.List(ctx, pods, client.MatchingFields{
"metadata.ownerReferences.name": collectorName,
})
更多索引範例:
// 1. 索引 ConfigMap 的標籤
mgr.GetFieldIndexer().IndexField(
context.Background(),
&corev1.ConfigMap{},
"metadata.labels.app",
func(obj client.Object) []string {
cm := obj.(*corev1.ConfigMap)
if app, ok := cm.Labels["app"]; ok {
return []string{app}
}
return nil
},
)
// 查詢
configMaps := &corev1.ConfigMapList{}
r.List(ctx, configMaps, client.MatchingFields{
"metadata.labels.app": "nginx",
})
// 2. 索引 Pod 的 Node
mgr.GetFieldIndexer().IndexField(
context.Background(),
&corev1.Pod{},
"spec.nodeName",
func(obj client.Object) []string {
pod := obj.(*corev1.Pod)
if pod.Spec.NodeName != "" {
return []string{pod.Spec.NodeName}
}
return nil
},
)
// 查詢特定 Node 上的 Pod
pods := &corev1.PodList{}
r.List(ctx, pods, client.MatchingFields{
"spec.nodeName": "node-1",
})
10.1.2 使用 Cache 減少 API 調用
Controller Runtime 自動提供 cache,但需要正確配置:
// main.go - 配置 Cache
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
Cache: cache.Options{
// 限制 cache 的 namespace(多租戶場景;較新版 controller-runtime 已改用 DefaultNamespaces 欄位)
Namespaces: []string{"namespace1", "namespace2"},
// 配置同步週期
SyncPeriod: pointer.Duration(10 * time.Minute),
},
// 配置 metrics 和 health 端點
Metrics: server.Options{
BindAddress: ":8080",
},
HealthProbeBindAddress: ":8081",
})
// Controller 中使用 cache(自動)
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// r.Get 自動從 cache 讀取
nginx := &v1alpha1.Nginx{}
if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
return ctrl.Result{}, err
}
// r.List 也從 cache 讀取
deployments := &appsv1.DeploymentList{}
if err := r.List(ctx, deployments, client.InNamespace(req.Namespace)); err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
}
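需要注意,cache 讀到的是最終一致的快照。少數必須讀到最新狀態的場景(例如剛創建的資源),可以用 mgr.GetAPIReader() 取得不經 cache 的 reader。以下為示意,其中 APIReader 是假設新增到 reconciler 的欄位:
// 使用不經 cache 的 APIReader 直接讀取 API Server(示意)
type NginxReconcilerWithReader struct {
	client.Client
	APIReader client.Reader // 在 main.go 中以 mgr.GetAPIReader() 注入
}

func (r *NginxReconcilerWithReader) getFresh(ctx context.Context, key types.NamespacedName) (*webappv1alpha1.Nginx, error) {
	nginx := &webappv1alpha1.Nginx{}
	if err := r.APIReader.Get(ctx, key, nginx); err != nil {
		return nil, err
	}
	return nginx, nil
}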
10.1.3 優化 Reconcile 邏輯
1. 使用 Predicate 過濾不必要的事件
import (
"sigs.k8s.io/controller-runtime/pkg/predicate"
)
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&webappv1alpha1.Nginx{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
WithEventFilter(predicate.Funcs{
// 忽略 status-only 更新
UpdateFunc: func(e event.UpdateEvent) bool {
oldObj := e.ObjectOld.(*webappv1alpha1.Nginx)
newObj := e.ObjectNew.(*webappv1alpha1.Nginx)
// 只在 spec 或 metadata 改變時 reconcile
return oldObj.Generation != newObj.Generation ||
!reflect.DeepEqual(oldObj.Labels, newObj.Labels) ||
!reflect.DeepEqual(oldObj.Annotations, newObj.Annotations)
},
// 忽略刪除事件(由 finalizer 處理)
DeleteFunc: func(e event.DeleteEvent) bool {
return false
},
}).
Complete(r)
}
2. 批量處理更新
func (r *NginxReconciler) reconcileMultipleResources(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
// 並行處理多個資源
var wg sync.WaitGroup
errCh := make(chan error, 3)
wg.Add(3)
// ConfigMap
go func() {
defer wg.Done()
if _, _, err := r.reconcileConfigMap(ctx, nginx); err != nil {
errCh <- fmt.Errorf("configmap: %w", err)
}
}()
// Deployment
go func() {
defer wg.Done()
if err := r.reconcileDeployment(ctx, nginx, ""); err != nil {
errCh <- fmt.Errorf("deployment: %w", err)
}
}()
// Service
go func() {
defer wg.Done()
if err := r.reconcileService(ctx, nginx); err != nil {
errCh <- fmt.Errorf("service: %w", err)
}
}()
wg.Wait()
close(errCh)
// 收集錯誤
for err := range errCh {
if err != nil {
return err
}
}
return nil
}
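上面的 WaitGroup + error channel 寫法也可以用 golang.org/x/sync/errgroup 改寫得更精簡(示意;注意 Deployment 依賴 ConfigMap 計算出的 configVersion,兩者實際上有先後順序,並不適合完全並行):
import "golang.org/x/sync/errgroup"

// 使用 errgroup 並行 reconcile 彼此獨立的資源(示意)
func (r *NginxReconciler) reconcileIndependentResources(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
	g, gctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		_, _, err := r.reconcileConfigMap(gctx, nginx)
		return err
	})
	g.Go(func() error {
		return r.reconcileService(gctx, nginx)
	})
	// Wait 回傳第一個發生的錯誤
	return g.Wait()
}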
3. 使用 Generation 判斷資源變更
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
nginx := &webappv1alpha1.Nginx{}
if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// 如果 ObservedGeneration 和 Generation 相同,說明 spec 沒變
// (ObservedGeneration 需自行加入 NginxStatus;reconcileStatus、fullReconcile 為示意函式)
if nginx.Status.ObservedGeneration == nginx.Generation {
// 只檢查子資源狀態,不重新創建
return r.reconcileStatus(ctx, nginx)
}
// spec 改變了,執行完整 reconcile
return r.fullReconcile(ctx, nginx)
}
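要讓這個判斷成立,還需在每次完整 reconcile 成功後把 Generation 寫回 status(示意,假設 NginxStatus 已新增 ObservedGeneration 欄位):
// 完整 reconcile 成功後,記錄已處理的 generation
nginx.Status.ObservedGeneration = nginx.Generation
if err := r.Status().Update(ctx, nginx); err != nil {
	return ctrl.Result{}, err
}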
10.1.4 限制並發和 Rate Limiting
// main.go - 配置 Controller 選項
func main() {
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
// ...
})
if err = (&controller.NginxReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
}).SetupWithManager(mgr); err != nil {
setupLog.Error(err, "unable to create controller")
os.Exit(1)
}
}
// SetupWithManager - 配置並發和限流
// 需 import "sigs.k8s.io/controller-runtime/pkg/controller" 與 "k8s.io/client-go/util/workqueue"
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&webappv1alpha1.Nginx{}).
WithOptions(controller.Options{
// 最大並發 reconcile 數
MaxConcurrentReconciles: 3,
// Rate Limiter
RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
5*time.Millisecond, // base delay
1000*time.Second, // max delay
),
}).
Complete(r)
}
10.2 安全最佳實踐
10.2.1 RBAC 最小權限原則
僅授予必要權限:
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services;configmaps,verbs=get;list;watch;create;update;patch;delete
// 不要使用通配符
// 錯誤示範:
// +kubebuilder:rbac:groups=*,resources=*,verbs=*
使用 ServiceAccount 隔離:
# config/rbac/service_account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
name: nginx-operator
namespace: nginx-operator-system
---
# config/rbac/role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: nginx-operator-manager-rolebinding
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: nginx-operator-manager-role
subjects:
- kind: ServiceAccount
name: nginx-operator
namespace: nginx-operator-system
10.2.2 Webhook 安全
配置 TLS 和證書管理:
# config/certmanager/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
name: nginx-operator-serving-cert
namespace: nginx-operator-system
spec:
dnsNames:
- nginx-operator-webhook-service.nginx-operator-system.svc
- nginx-operator-webhook-service.nginx-operator-system.svc.cluster.local
issuerRef:
kind: Issuer
name: nginx-operator-selfsigned-issuer
secretName: webhook-server-cert
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
name: nginx-operator-selfsigned-issuer
namespace: nginx-operator-system
spec:
selfSigned: {}
Validating Webhook 範例:
// +kubebuilder:webhook:path=/validate-webapp-example-com-v1alpha1-nginx,mutating=false,failurePolicy=fail,groups=webapp.example.com,resources=nginxes,verbs=create;update,versions=v1alpha1,name=vnginx.kb.io,sideEffects=None,admissionReviewVersions=v1
func (r *Nginx) ValidateCreate() (admission.Warnings, error) {
// 驗證 replicas 範圍
if r.Spec.Replicas != nil && (*r.Spec.Replicas < 1 || *r.Spec.Replicas > 10) {
return nil, fmt.Errorf("replicas must be between 1 and 10")
}
// 驗證 image 格式
if !strings.HasPrefix(r.Spec.Image, "nginx:") {
return nil, fmt.Errorf("image must start with 'nginx:'")
}
// 驗證 config 語法(可選)
if r.Spec.Config != "" {
if err := validateNginxConfig(r.Spec.Config); err != nil {
return nil, fmt.Errorf("invalid nginx config: %w", err)
}
}
return nil, nil
}
func (r *Nginx) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
oldNginx := old.(*Nginx)
// 防止降級(可選規則)
if r.Spec.Image != "" && oldNginx.Spec.Image != "" {
oldVersion := extractVersion(oldNginx.Spec.Image)
newVersion := extractVersion(r.Spec.Image)
if newVersion < oldVersion {
return admission.Warnings{"Downgrading nginx version may cause issues"}, nil
}
}
return r.ValidateCreate()
}
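其中 validateNginxConfig 與 extractVersion 均為假設的輔助函式。extractVersion 的一種簡單實作如下(示意,需 import "strings",僅處理 nginx:<tag> 形式):
// extractVersion 從 "nginx:1.25" 這類 image 字串取出 tag(示意)
func extractVersion(image string) string {
	parts := strings.SplitN(image, ":", 2)
	if len(parts) != 2 {
		return ""
	}
	return parts[1]
}
注意:直接以字串比較版本,在 1.9 與 1.10 這類情況會得出錯誤結果,生產環境建議改用語義化版本庫(如 Masterminds/semver)比較。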
10.2.3 Secret 管理
從 Secret 讀取敏感配置:
func (r *NginxReconciler) reconcileDeployment(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
// 從 Secret 掛載 TLS 證書
volumes := []corev1.Volume{
{
Name: "config",
VolumeSource: corev1.VolumeSource{
ConfigMap: &corev1.ConfigMapVolumeSource{
LocalObjectReference: corev1.LocalObjectReference{
Name: nginx.Name + "-config",
},
},
},
},
}
volumeMounts := []corev1.VolumeMount{
{
Name: "config",
MountPath: "/etc/nginx/nginx.conf",
SubPath: "nginx.conf",
},
}
// 如果配置了 TLS(假設 NginxSpec 已擴充對應的 TLS 欄位,內含 SecretName)
if nginx.Spec.TLS != nil {
volumes = append(volumes, corev1.Volume{
Name: "tls",
VolumeSource: corev1.VolumeSource{
Secret: &corev1.SecretVolumeSource{
SecretName: nginx.Spec.TLS.SecretName,
},
},
})
volumeMounts = append(volumeMounts, corev1.VolumeMount{
Name: "tls",
MountPath: "/etc/nginx/tls",
ReadOnly: true,
})
}
// 使用在 Deployment spec 中...
}
加密 etcd 中的 Secret(集群級配置):
# /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
name: kube-apiserver
spec:
containers:
- name: kube-apiserver
command:
- kube-apiserver
- --encryption-provider-config=/etc/kubernetes/encryption-config.yaml
volumeMounts:
- name: encryption-config
mountPath: /etc/kubernetes/encryption-config.yaml
readOnly: true
volumes:
- name: encryption-config
hostPath:
path: /etc/kubernetes/encryption-config.yaml
10.3 可觀測性
10.3.1 結構化日誌
使用 logr 進行結構化日誌記錄:
import (
"sigs.k8s.io/controller-runtime/pkg/log"
)
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
// 基本日誌
logger.Info("Reconciling Nginx",
"namespace", req.Namespace,
"name", req.Name,
)
// 帶錯誤的日誌
if err := r.Get(ctx, req.NamespacedName, &nginx); err != nil {
logger.Error(err, "Failed to get Nginx",
"namespace", req.Namespace,
"name", req.Name,
)
return ctrl.Result{}, err
}
// Debug 級別日誌
logger.V(1).Info("Detailed debug info",
"spec", nginx.Spec,
"status", nginx.Status,
)
// 使用 WithValues 添加上下文
logger = logger.WithValues(
"nginx-version", nginx.Spec.Image,
"replicas", nginx.Spec.Replicas,
)
logger.Info("Starting reconciliation")
return ctrl.Result{}, nil
}
配置日誌級別 (main.go):
import (
"flag"
"go.uber.org/zap/zapcore"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
)
func main() {
var logLevel int
flag.IntVar(&logLevel, "log-level", 0, "Log level (0=info, 1=debug, 2=trace)")
opts := zap.Options{
Development: true,
Level: zapcore.Level(-logLevel),
}
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
// ...
}
10.3.2 Metrics 暴露
使用 Prometheus metrics:
import (
"context"
"time"
"github.com/prometheus/client_golang/prometheus"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/metrics"
webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)
var (
nginxReconcileTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "nginx_operator_reconcile_total",
Help: "Total number of reconciliations",
},
[]string{"namespace", "name", "result"},
)
nginxReconcileDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "nginx_operator_reconcile_duration_seconds",
Help: "Duration of reconciliations",
Buckets: prometheus.DefBuckets,
},
[]string{"namespace", "name"},
)
nginxCount = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "nginx_operator_nginx_count",
Help: "Number of Nginx instances",
},
[]string{"namespace"},
)
)
func init() {
// 註冊 metrics
metrics.Registry.MustRegister(
nginxReconcileTotal,
nginxReconcileDuration,
nginxCount,
)
}
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
start := time.Now()
defer func() {
duration := time.Since(start).Seconds()
nginxReconcileDuration.WithLabelValues(
req.Namespace,
req.Name,
).Observe(duration)
}()
nginx := &webappv1alpha1.Nginx{}
if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
nginxReconcileTotal.WithLabelValues(
req.Namespace,
req.Name,
"error",
).Inc()
return ctrl.Result{}, client.IgnoreNotFound(err)
}
// ... reconcile 邏輯
nginxReconcileTotal.WithLabelValues(
req.Namespace,
req.Name,
"success",
).Inc()
nginxCount.WithLabelValues(req.Namespace).Set(1)
return ctrl.Result{}, nil
}
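10.4.3 的告警規則還引用了 nginx_operator_nginx_ready,可以用同樣方式補上定義(示意):
// 與 nginx_operator_nginx_count 配對的 ready 指標,供告警規則計算差值(示意)
var nginxReady = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nginx_operator_nginx_ready",
		Help: "Number of ready Nginx instances",
	},
	[]string{"namespace"},
)

// 同樣需註冊:metrics.Registry.MustRegister(nginxReady)
// 並在 reconcile 結尾依 status 更新:
// nginxReady.WithLabelValues(req.Namespace).Set(float64(nginx.Status.ReadyReplicas))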
配置 ServiceMonitor (Prometheus Operator):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: nginx-operator-metrics
namespace: nginx-operator-system
spec:
selector:
matchLabels:
control-plane: controller-manager
endpoints:
- port: metrics
interval: 30s
path: /metrics
10.3.3 健康檢查
配置 Health 和 Readiness Probes (main.go):
import (
"sigs.k8s.io/controller-runtime/pkg/healthz"
)
func main() {
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
HealthProbeBindAddress: ":8081",
// ...
})
// 添加 health check
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
setupLog.Error(err, "unable to set up health check")
os.Exit(1)
}
// 添加 readiness check
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
setupLog.Error(err, "unable to set up ready check")
os.Exit(1)
}
// ...
}
Deployment 配置:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-operator-controller-manager
spec:
template:
spec:
containers:
- name: manager
image: nginx-operator:latest
ports:
- containerPort: 8080
name: metrics
- containerPort: 8081
name: health
livenessProbe:
httpGet:
path: /healthz
port: health
initialDelaySeconds: 15
periodSeconds: 20
readinessProbe:
httpGet:
path: /readyz
port: health
initialDelaySeconds: 5
periodSeconds: 10
10.4 生產環境部署
10.4.1 高可用性配置
多副本 + Leader Election:
// main.go
func main() {
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
// 啟用 Leader Election
LeaderElection: true,
LeaderElectionID: "nginx-operator-lock",
LeaderElectionNamespace: "nginx-operator-system",
// 配置 lease 參數
LeaseDuration: pointer.Duration(15 * time.Second),
RenewDeadline: pointer.Duration(10 * time.Second),
RetryPeriod: pointer.Duration(2 * time.Second),
})
// ...
}
Deployment 高可用配置:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-operator-controller-manager
spec:
replicas: 3
selector:
matchLabels:
control-plane: controller-manager
template:
metadata:
labels:
control-plane: controller-manager
spec:
# Pod 反親和性 - 分散到不同節點
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchLabels:
control-plane: controller-manager
topologyKey: kubernetes.io/hostname
# 優先級
priorityClassName: system-cluster-critical
containers:
- name: manager
image: nginx-operator:latest
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
# 安全上下文
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
runAsNonRoot: true
runAsUser: 65532
seccompProfile:
type: RuntimeDefault
10.4.2 資源限制
配置 Resource Quotas:
apiVersion: v1
kind: ResourceQuota
metadata:
name: nginx-operator-quota
namespace: nginx-operator-system
spec:
hard:
requests.cpu: "2"
requests.memory: "2Gi"
limits.cpu: "4"
limits.memory: "4Gi"
persistentvolumeclaims: "10"
配置 LimitRange:
apiVersion: v1
kind: LimitRange
metadata:
name: nginx-operator-limitrange
namespace: nginx-operator-system
spec:
limits:
- max:
cpu: "1"
memory: "1Gi"
min:
cpu: "50m"
memory: "64Mi"
default:
cpu: "200m"
memory: "256Mi"
defaultRequest:
cpu: "100m"
memory: "128Mi"
type: Container
10.4.3 監控告警
Prometheus 告警規則:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: nginx-operator-alerts
namespace: nginx-operator-system
spec:
groups:
- name: nginx-operator
interval: 30s
rules:
- alert: NginxOperatorDown
expr: up{job="nginx-operator-metrics"} == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Nginx Operator is down"
description: "Nginx Operator has been down for more than 5 minutes"
- alert: NginxOperatorHighReconcileErrors
expr: rate(nginx_operator_reconcile_total{result="error"}[5m]) > 0.1
for: 10m
labels:
severity: warning
annotations:
summary: "High reconcile error rate"
description: "Nginx Operator has high reconcile error rate: {{ $value }}"
- alert: NginxInstanceNotReady
expr: nginx_operator_nginx_count{} - nginx_operator_nginx_ready{} > 0
for: 15m
labels:
severity: warning
annotations:
summary: "Nginx instance not ready"
description: "Nginx instance in {{ $labels.namespace }} is not ready for 15 minutes"
10.4.4 備份和恢復
備份 CRD 和 CR:
#!/bin/bash
# backup-nginx-operator.sh
BACKUP_DIR="/backup/nginx-operator/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
# 備份 CRD
kubectl get crd nginxes.webapp.example.com -o yaml > "$BACKUP_DIR/crd.yaml"
# 備份所有 Nginx CR
kubectl get nginx --all-namespaces -o yaml > "$BACKUP_DIR/nginx-crs.yaml"
# 備份 Operator 配置
kubectl get deployment,service,configmap,secret -n nginx-operator-system -o yaml > "$BACKUP_DIR/operator-resources.yaml"
echo "Backup completed: $BACKUP_DIR"
恢復流程:
#!/bin/bash
# restore-nginx-operator.sh
BACKUP_DIR=$1
if [ -z "$BACKUP_DIR" ]; then
echo "Usage: $0 <backup-directory>"
exit 1
fi
# 1. 恢復 CRD
kubectl apply -f "$BACKUP_DIR/crd.yaml"
# 2. 恢復 Operator
kubectl apply -f "$BACKUP_DIR/operator-resources.yaml"
# 3. 等待 Operator ready
kubectl wait --for=condition=available --timeout=300s \
deployment/nginx-operator-controller-manager -n nginx-operator-system
# 4. 恢復 Nginx CR
kubectl apply -f "$BACKUP_DIR/nginx-crs.yaml"
echo "Restore completed from: $BACKUP_DIR"
10.5 多租戶支援
10.5.1 Namespace 隔離
限制 Operator 監聽特定 Namespace:
// main.go
func main() {
// 從環境變數讀取允許的 namespaces
watchNamespaces := os.Getenv("WATCH_NAMESPACES")
var namespaces []string
if watchNamespaces != "" {
namespaces = strings.Split(watchNamespaces, ",")
}
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
Cache: cache.Options{
Namespaces: namespaces, // 只 watch 這些 namespaces(較新版 controller-runtime 改用 DefaultNamespaces)
},
})
// ...
}
Deployment 配置:
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx-operator-controller-manager
spec:
template:
spec:
containers:
- name: manager
env:
- name: WATCH_NAMESPACES
value: "tenant-a,tenant-b,tenant-c"
10.5.2 資源配額
為每個租戶配置 ResourceQuota:
apiVersion: v1
kind: ResourceQuota
metadata:
name: tenant-a-nginx-quota
namespace: tenant-a
spec:
hard:
count/nginxes.webapp.example.com: "10" # 最多 10 個 Nginx 實例
scopeSelector:
matchExpressions:
- operator: In
scopeName: PriorityClass
values: ["tenant-a"]
10.5.3 驗證租戶權限
Validating Webhook 檢查權限:
func (r *Nginx) ValidateCreate() (admission.Warnings, error) {
// 檢查租戶配額(getTenantQuota、getNginxCount 為假設的輔助函式;quota 內含 MaxInstances、MaxReplicas)
quota, err := getTenantQuota(r.Namespace)
if err != nil {
return nil, err
}
currentCount, err := getNginxCount(r.Namespace)
if err != nil {
return nil, err
}
if currentCount >= quota.MaxInstances {
return nil, fmt.Errorf(
"tenant %s has reached quota limit: %d/%d",
r.Namespace, currentCount, quota.MaxInstances,
)
}
// 檢查資源限制
if r.Spec.Replicas != nil && *r.Spec.Replicas > quota.MaxReplicas {
return nil, fmt.Errorf(
"replicas %d exceeds tenant limit %d",
*r.Spec.Replicas, quota.MaxReplicas,
)
}
return nil, nil
}
十一、常見問題與調試技巧
11.1 常見錯誤及解決方案
11.1.1 CRD 相關錯誤
錯誤 1:CRD 未安裝
Error: the server could not find the requested resource (get nginxes.webapp.example.com)
解決方案:
# 檢查 CRD 是否存在
kubectl get crd | grep nginx
# 安裝 CRD
make install
# 或手動安裝
kubectl apply -f config/crd/bases/webapp.example.com_nginxes.yaml
# 驗證
kubectl get crd nginxes.webapp.example.com -o yaml
錯誤 2:CRD 版本不匹配
error: error validating "nginx.yaml": error validating data: ValidationError(Nginx.spec): unknown field "newField"
解決方案:
# 1. 更新 CRD 定義
make manifests
# 2. 重新安裝 CRD
kubectl replace -f config/crd/bases/webapp.example.com_nginxes.yaml
# 3. 如果 replace 失敗,需要先刪除(危險操作!會刪除所有 CR)
kubectl delete crd nginxes.webapp.example.com
make install
錯誤 3:Validation 規則不生效
# CRD 中定義了 validation,但創建時沒有被驗證
spec:
replicas: 100 # 超過 maximum: 10 但沒報錯
解決方案:
# 1. 確認 CRD 中包含 validation rules
kubectl get crd nginxes.webapp.example.com -o yaml | grep -A 10 validation
# 2. 重新生成並安裝 CRD
make manifests
kubectl apply -f config/crd/bases/webapp.example.com_nginxes.yaml
# 3. 驗證 validation 生效
kubectl apply -f - <<EOF
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
name: test
spec:
replicas: 100 # 應該報錯
EOF
11.1.2 Controller 錯誤
錯誤 1:Reconcile 死循環
# 日誌顯示不停 reconcile
INFO Reconciling Nginx
INFO Reconciling Nginx
INFO Reconciling Nginx
...
原因分析:
Status 更新觸發了 reconcile
沒有正確設置 Predicate 過濾
Requeue 邏輯錯誤
解決方案:
// 1. 使用 Predicate 過濾 status-only 更新
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&webappv1alpha1.Nginx{}).
WithEventFilter(predicate.Funcs{
UpdateFunc: func(e event.UpdateEvent) bool {
oldObj := e.ObjectOld.(*webappv1alpha1.Nginx)
newObj := e.ObjectNew.(*webappv1alpha1.Nginx)
// 只在 spec 改變時 reconcile
return oldObj.Generation != newObj.Generation
},
}).
Complete(r)
}
// 2. 正確使用 Status().Update()
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
// ... reconcile logic
// 使用 Status() subresource 更新,不會觸發新的 reconcile
if err := r.Status().Update(ctx, nginx); err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{}, nil // 不要返回 Requeue: true
}
錯誤 2:無法更新 Status
ERROR Failed to update status error="the server could not find the requested resource"
解決方案:
// 確保 CRD 定義包含 status subresource
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status // <-- 這一行很重要!
type Nginx struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec NginxSpec `json:"spec,omitempty"`
Status NginxStatus `json:"status,omitempty"`
}
# 重新生成並安裝 CRD
make manifests
make install
錯誤 3:Owner Reference 錯誤導致資源無法刪除
ERROR Failed to delete Deployment error="cannot delete resource, owner references are set"
解決方案:
// 正確設置 Owner Reference
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
return err
}
// 刪除時會自動清理子資源(Garbage Collection)
// 如果需要手動清理:
func (r *NginxReconciler) cleanupResources(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
// 列出所有關聯資源
deployments := &appsv1.DeploymentList{}
if err := r.List(ctx, deployments, client.InNamespace(nginx.Namespace), client.MatchingLabels{
"nginx": nginx.Name,
}); err != nil {
return err
}
// 刪除資源
for _, deployment := range deployments.Items {
if err := r.Delete(ctx, &deployment); err != nil {
return err
}
}
return nil
}
11.1.3 RBAC 權限錯誤
錯誤:403 Forbidden
ERROR Failed to list Deployments error="deployments.apps is forbidden: User \"system:serviceaccount:nginx-operator-system:default\" cannot list resource \"deployments\" in API group \"apps\" in the namespace \"default\""
解決方案:
# 1. 檢查 RBAC 註解是否正確
# Controller 代碼中:
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
# 2. 重新生成 RBAC manifests
make manifests
# 3. 檢查生成的 Role/ClusterRole
cat config/rbac/role.yaml
# 4. 重新部署 Operator
kubectl apply -k config/default
# 5. 驗證 ServiceAccount 權限
kubectl auth can-i list deployments \
--as=system:serviceaccount:nginx-operator-system:nginx-operator-controller-manager \
-n default
跨 Namespace 權限問題:
# 如果 Operator 需要操作多個 namespace,需要使用 ClusterRole
# 修改 config/default/kustomization.yaml:
# 註釋掉這一行(使用 Role)
# - ../rbac/role.yaml
# - ../rbac/role_binding.yaml
# 使用 ClusterRole
- ../rbac/cluster_role.yaml
- ../rbac/cluster_role_binding.yaml
11.1.4 Webhook 錯誤
錯誤:Webhook 連接超時
Error from server (InternalError): Internal error occurred: failed calling webhook "vnginx.kb.io": Post "https://nginx-operator-webhook-service.nginx-operator-system.svc:443/validate-webapp-example-com-v1alpha1-nginx?timeout=10s": context deadline exceeded
解決方案:
# 1. 檢查 webhook service 是否存在
kubectl get svc -n nginx-operator-system
# 2. 檢查 webhook pod 是否運行
kubectl get pods -n nginx-operator-system
# 3. 檢查證書是否正確配置
kubectl get secret -n nginx-operator-system | grep webhook
# 4. 檢查 ValidatingWebhookConfiguration
kubectl get validatingwebhookconfiguration
# 5. 查看 webhook 配置詳情
kubectl get validatingwebhookconfiguration nginx-operator-validating-webhook-configuration -o yaml
# 6. 測試 webhook service 連接
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never -- \
curl -k https://nginx-operator-webhook-service.nginx-operator-system.svc:443
# 7. 如果開發環境不需要 webhook,可以禁用
export ENABLE_WEBHOOKS=false
make run
Error: certificate verification failed
Error: x509: certificate signed by unknown authority
Solution:
# 1. Make sure cert-manager is installed
kubectl get pods -n cert-manager
# If it is not installed:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml
# 2. Check the Certificate resource
kubectl get certificate -n nginx-operator-system
# 3. Check the certificate's status
kubectl describe certificate nginx-operator-serving-cert -n nginx-operator-system
# 4. Manually trigger certificate regeneration
kubectl delete certificate nginx-operator-serving-cert -n nginx-operator-system
kubectl apply -f config/certmanager/certificate.yaml
# 5. Wait for the certificate to become ready
kubectl wait --for=condition=ready certificate/nginx-operator-serving-cert \
  -n nginx-operator-system --timeout=300s
11.2 Debugging techniques
11.2.1 Local debugging
Debugging with Delve:
# 1. Install Delve
go install github.com/go-delve/delve/cmd/dlv@latest
# 2. Start a debug session (run from the directory containing main.go;
#    dlv debug takes a package path, not a file)
dlv debug . -- --zap-devel
# 3. Set a breakpoint
(dlv) break internal/controller/nginx_controller.go:60
(dlv) continue
# 4. Inspect variables
(dlv) print nginx
(dlv) print err
# 5. Inspect the call stack
(dlv) stack
VS Code debug configuration (.vscode/launch.json):
{
"version": "0.2.0",
"configurations": [
{
"name": "Debug Operator",
"type": "go",
"request": "launch",
"mode": "debug",
"program": "${workspaceFolder}/main.go",
"args": ["--zap-devel"],
"env": {
"ENABLE_WEBHOOKS": "false",
"KUBECONFIG": "${env:HOME}/.kube/config"
}
},
{
"name": "Attach to Process",
"type": "go",
"request": "attach",
"mode": "local",
"processId": "${command:pickProcess}"
}
]
}
11.2.2 Log analysis
Increasing log verbosity:
# Specify the log level at run time (note: plain `make run` does not forward
# extra flags, so invoke go run directly)
go run ./main.go --zap-log-level=debug
# Or in code:
logger.V(1).Info("Debug message", "key", value) // verbosity 1 ≈ debug
logger.V(2).Info("Trace message", "key", value) // verbosity 2 ≈ trace
Structured log queries:
# Filter logs with jq
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | jq 'select(.level == "error")'
# Find logs for a specific resource
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | \
  jq 'select(.nginx == "nginx-sample")'
# Count error messages by type
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | \
  jq -r 'select(.level == "error") | .msg' | sort | uniq -c
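Both --zap-devel and --zap-log-level come from controller-runtime's zap flag set; in a kubebuilder project they are bound in main.go roughly like this:
import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
	// BindFlags registers --zap-log-level, --zap-devel, --zap-encoder, etc.
	opts := zap.Options{Development: true}
	opts.BindFlags(flag.CommandLine)
	flag.Parse()

	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
	// ... manager setup continues here
}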
11.2.3 Performance profiling
CPU profiling:
// main.go
import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on DefaultServeMux
)
func main() {
	// Start the pprof server in the background (errors are logged, not fatal)
	go func() {
		if err := http.ListenAndServe("localhost:6060", nil); err != nil {
			log.Println("pprof server:", err)
		}
	}()
	// ... rest of main
}
# Collect a 30-second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Interactive analysis
(pprof) top10
(pprof) list reconcileDeployment
(pprof) web # generate a visual graph (requires graphviz)
Memory profiling:
# Collect a heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
# Analyze memory allocations
(pprof) top10
(pprof) list NginxReconciler.Reconcile
# Look for leaks via cumulative allocations
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap
Goroutine analysis:
# Dump all goroutines
curl http://localhost:6060/debug/pprof/goroutine?debug=2
# Analyze goroutine leaks
go tool pprof http://localhost:6060/debug/pprof/goroutine
11.2.4 Checking resource state
A complete resource inspection script:
#!/bin/bash
# debug-nginx.sh - diagnostic script for the Nginx Operator
NAMESPACE=${1:-default}
NGINX_NAME=${2:-nginx-sample}
echo "=== Nginx Resource ==="
kubectl get nginx $NGINX_NAME -n $NAMESPACE -o yaml
echo -e "\n=== Nginx Status ==="
kubectl get nginx $NGINX_NAME -n $NAMESPACE -o jsonpath='{.status}' | jq
echo -e "\n=== Nginx Events ==="
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$NGINX_NAME
echo -e "\n=== Deployment ==="
kubectl get deployment $NGINX_NAME -n $NAMESPACE -o yaml
echo -e "\n=== Deployment Events ==="
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$NGINX_NAME,involvedObject.kind=Deployment
echo -e "\n=== Pods ==="
kubectl get pods -n $NAMESPACE -l nginx=$NGINX_NAME
echo -e "\n=== Pod Events ==="
for pod in $(kubectl get pods -n $NAMESPACE -l nginx=$NGINX_NAME -o name); do
echo "Events for $pod:"
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$(basename $pod)
done
echo -e "\n=== Service ==="
kubectl get svc $NGINX_NAME -n $NAMESPACE -o yaml
echo -e "\n=== ConfigMap ==="
kubectl get cm ${NGINX_NAME}-config -n $NAMESPACE -o yaml
echo -e "\n=== Operator Logs (last 50 lines) ==="
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager --tail=50
Usage:
chmod +x debug-nginx.sh
./debug-nginx.sh default nginx-sample
11.3 Troubleshooting workflow
11.3.1 Diagnostic flow
1. Can the CR be created?
├─ No → check CRD installation and the validation webhook
└─ Yes → continue
2. Does the Operator receive the event?
├─ No → check the Operator's health, RBAC permissions, namespace configuration
└─ Yes → continue
3. Does Reconcile run?
├─ No → check predicate filtering and event trigger conditions
└─ Yes → continue
4. Are the child resources created?
├─ No → check RBAC, resource definitions, error logs
└─ Yes → continue
5. Are the child resources healthy?
├─ No → check resource config, images, probes, resource limits
└─ Yes → problem solved
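Steps 2-4 are much easier to confirm when the reconciler emits Kubernetes Events, since they appear in kubectl describe on the CR. A hedged sketch; the Recorder field is wired up from mgr.GetEventRecorderFor in main.go, and the reason strings are illustrative:
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type NginxReconciler struct {
	client.Client
	Scheme   *runtime.Scheme
	Recorder record.EventRecorder // set to mgr.GetEventRecorderFor("nginx-controller")
}

// recordOutcome attaches the reconcile result to the CR as an Event,
// visible via `kubectl describe nginx <name>`.
func (r *NginxReconciler) recordOutcome(nginx *webappv1alpha1.Nginx, err error) {
	if err != nil {
		r.Recorder.Event(nginx, corev1.EventTypeWarning, "ReconcileError", err.Error())
		return
	}
	r.Recorder.Event(nginx, corev1.EventTypeNormal, "Reconciled", "child resources are up to date")
}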
11.3.2 Triage for common scenarios
Scenario 1: Pods fail to start
# 1. Check pod status
kubectl get pods -l nginx=nginx-sample
# 2. Inspect the detailed error
kubectl describe pod <pod-name>
# 3. Check the logs
kubectl logs <pod-name>
# Common causes:
# - ImagePullBackOff → image missing or no pull permission
# - CrashLoopBackOff → container fails on startup; check config and logs
# - Pending → insufficient resources or scheduling failure
Scenario 2: Config updates do not take effect
# 1. Check whether the ConfigMap was updated
kubectl get cm nginx-sample-config -o yaml
# 2. Check the Deployment's config-version label (see the hash sketch below)
kubectl get deployment nginx-sample -o jsonpath='{.spec.template.metadata.labels.config-version}'
# 3. If the label has not changed, check the reconcile logs
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | grep nginx-sample
# 4. Manually trigger a reconcile
kubectl annotate nginx nginx-sample force-reconcile="$(date +%s)"
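The config-version label in step 2 is typically a hash of the rendered configuration stamped onto the pod template, so any ConfigMap content change alters the template and triggers a rolling update. A minimal sketch; the function and label names follow this tutorial's conventions but are illustrative:
import (
	"crypto/sha256"
	"encoding/hex"
)

// configVersion returns a short, deterministic hash of the rendered nginx
// configuration.
func configVersion(renderedConf string) string {
	sum := sha256.Sum256([]byte(renderedConf))
	return hex.EncodeToString(sum[:])[:16]
}

// In the Deployment builder, stamp the hash onto the pod template so that
// a config change rolls the pods:
//   deployment.Spec.Template.Labels["config-version"] = configVersion(renderedConf)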
Scenario 3: Resource leaks
# 1. List orphaned resources (no owner reference)
kubectl get deployment,svc,cm --all-namespaces -o json | \
  jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.kind)/\(.metadata.name)"'
# 2. Clean up the orphans
kubectl delete deployment <name> -n <namespace>
# 3. Make sure the controller sets an owner reference
# In the code:
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
	return err
}
11.4 Best-practices summary
11.4.1 During development
Test locally with Kind:
make kind-cluster
make install
make run
Run tests frequently:
make test
make test-e2e
Use a linter:
golangci-lint run ./...
11.4.2 During deployment
Manage environments with Kustomize:
config/
├── base/
└── overlays/
    ├── development/
    ├── staging/
    └── production/
Set resource limits:
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
Enable leader election (sketch below):
LeaderElection: true
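In code, leader election is a manager option; the election ID must be unique per operator (the value below is illustrative, kubebuilder generates a random prefix):
import ctrl "sigs.k8s.io/controller-runtime"

func newManagerWithLeaderElection() (ctrl.Manager, error) {
	return ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// Only the elected leader reconciles; standby replicas wait,
		// which makes rolling updates of the Operator itself safe.
		LeaderElection:   true,
		LeaderElectionID: "a1b2c3d4.webapp.example.com", // illustrative ID
	})
}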
11.4.3 During operations
Monitor key metrics:
- Reconcile success rate
- Reconcile latency
- Resource usage
Configure alerts for:
- Operator down
- High error rate
- Resource leaks
Back up regularly:
kubectl get nginx --all-namespaces -o yaml > backup.yaml
Summary
This tutorial covers Kubernetes Operator development end to end, from core concepts to advanced practice:
Chapters 1-3: Operator fundamentals and architecture
Chapters 4-6: deep dive into the OpenTelemetry Operator codebase
Chapters 7-8: development environment and testing practice
Chapter 9: a complete hands-on Nginx Operator
Chapter 10: advanced topics such as performance, security, and observability
Chapter 11: troubleshooting and best practices
Suggested next steps:
Study other high-quality open-source Operator projects
Apply what you have learned in real projects
Become a Kubernetes Operator development expert! Go 🚀