Skip to main content

Command Palette

Search for a command to run...

📚 透過 OpenTelemetry Operator 學習 Kubernetes Operator 開發指南

Published
62 min readView as Markdown

📚 透過 OpenTelemetry Operator 深度學習 Kubernetes Operator 開發

本教程基於生產級專案 OpenTelemetry Operator 的實際代碼

涵蓋從基礎概念到高級實戰的完整學習路徑


目錄

  1. Kubernetes Operator 核心概念

  2. OpenTelemetry Operator 架構深度解析

  3. CRD 完整剖析與實戰

  4. Controller/Reconciler 深度實現

  5. Manifest 構建器詳解

  6. Webhook 機制深度解析

  7. 開發環境完整設置

  8. 測試策略與實踐

  9. 實戰專案:Nginx Operator

  10. 進階主題與最佳實踐

  11. 常見問題與調試技巧


一、Kubernetes Operator 核心概念

1.1 什麼是 Operator?

Operator 是 Kubernetes 的一種擴展模式,用於自動化複雜應用的部署和管理。它將人類運維知識編碼到軟體中。

核心組成部分

┌─────────────────────────────────────────────────────────┐
│                     Kubernetes Operator                 │
│                                                          │
│  ┌────────────────┐  ┌──────────────┐  ┌─────────────┐ │
│  │  Custom        │  │  Controller  │  │   Domain    │ │
│  │  Resource      │◄─┤  (Reconciler)├─►│  Knowledge  │ │
│  │  Definition    │  │              │  │             │ │
│  └────────────────┘  └──────────────┘  └─────────────┘ │
│         │                    │                 │        │
│         │                    │                 │        │
│         ▼                    ▼                 ▼        │
│    擴展 K8s API        監控&調和狀態      運維邏輯編碼  │
└─────────────────────────────────────────────────────────┘

1.2 工作原理

User (kubectl apply)
    │
    ▼
┌─────────────────────────────────────┐
│  Custom Resource (CR)                │
│  例如: OpenTelemetryCollector        │
│                                      │
│  apiVersion: opentelemetry.io/v1beta1│
│  kind: OpenTelemetryCollector        │
│  spec:                               │
│    mode: deployment                  │
│    config: {...}                     │
└─────────────────────────────────────┘
    │ (儲存到 etcd)
    ▼
┌─────────────────────────────────────┐
│  Kubernetes API Server               │
│  (發送 Watch 事件)                   │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Controller (Reconciler)             │
│                                      │
│  1. 接收事件                         │
│  2. 讀取 CR                          │
│  3. 計算期望狀態                     │
│  4. 調和當前狀態 → 期望狀態          │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Kubernetes Resources                │
│  - Deployment                        │
│  - Service                           │
│  - ConfigMap                         │
│  - ServiceAccount                    │
│  - ...                               │
└─────────────────────────────────────┘

1.3 Operator 能力等級

等級能力OpenTelemetry Operator 支持
1基本安裝
2無縫升級✅ (自動版本升級)
3完整生命週期✅ (Finalizer、備份配置)
4深度洞察✅ (Metrics、Events)
5自動調優✅ (HPA、Target Allocator)

二、OpenTelemetry Operator 架構深度解析

2.1 專案完整結構

opentelemetry-operator/
├── main.go                         # 程式入口點 (200+ 行)
│
├── apis/                           # CRD API 定義
│   ├── v1alpha1/                   # Alpha API
│   │   ├── instrumentation_types.go       # 自動埋點 CRD
│   │   ├── opampbridge_types.go          # OpAMP Bridge CRD
│   │   └── targetallocator_types.go      # Target Allocator CRD (deprecated)
│   └── v1beta1/                    # Beta API (穩定版)
│       ├── opentelemetrycollector_types.go  # Collector CRD (主要)
│       ├── targetallocator_types.go         # Target Allocator CRD
│       └── config.go                        # 配置解析
│
├── internal/                       # 內部實現
│   ├── controllers/                # 控制器實現
│   │   ├── opentelemetrycollector_controller.go  # 主控制器 (400+ 行)
│   │   ├── targetallocator_controller.go
│   │   ├── opampbridge_controller.go
│   │   └── reconcile_test.go              # 測試 (56KB!)
│   │
│   ├── manifests/                  # K8s 資源構建器
│   │   ├── collector/              # Collector 資源構建
│   │   │   ├── deployment.go      # Deployment 構建
│   │   │   ├── daemonset.go       # DaemonSet 構建
│   │   │   ├── statefulset.go     # StatefulSet 構建
│   │   │   ├── container.go       # Container 構建 (核心邏輯)
│   │   │   ├── service.go         # Service 構建
│   │   │   ├── configmap.go       # ConfigMap 構建
│   │   │   └── ...
│   │   ├── targetallocator/        # Target Allocator 資源構建
│   │   └── manifestutils/          # 工具函數
│   │
│   ├── webhook/                    # Admission Webhooks
│   │   ├── podmutation/            # Pod Mutation Webhook
│   │   │   └── webhookhandler.go  # Sidecar/Instrumentation 注入
│   │   └── validation/             # Validation Webhook
│   │
│   ├── config/                     # Operator 配置管理
│   ├── rbac/                       # RBAC 工具
│   ├── autodetect/                 # 環境自動檢測
│   │   ├── prometheus/             # Prometheus Operator 檢測
│   │   ├── openshift/              # OpenShift 檢測
│   │   └── certmanager/            # cert-manager 檢測
│   └── status/                     # Status 更新邏輯
│
├── pkg/                            # 公開包
│   ├── collector/                  # Collector 工具
│   │   └── upgrade/                # 版本升級邏輯
│   ├── sidecar/                    # Sidecar 注入邏輯
│   ├── featuregate/                # Feature Gate 管理
│   └── constants/                  # 常量定義
│
├── config/                         # Kustomize 部署配置
│   ├── crd/                        # CRD YAML 文件
│   │   └── bases/                  # 基礎 CRD
│   ├── rbac/                       # RBAC 規則
│   ├── manager/                    # Operator Deployment
│   ├── webhook/                    # Webhook 配置
│   └── samples/                    # CR 範例
│
├── tests/                          # 測試套件
│   ├── e2e/                        # 基礎 E2E 測試
│   ├── e2e-instrumentation/        # 自動埋點測試
│   ├── e2e-targetallocator/        # Target Allocator 測試
│   ├── e2e-upgrade/                # 升級測試
│   └── test-e2e-apps/              # 測試應用
│
├── cmd/                            # 額外的可執行程序
│   ├── otel-allocator/             # Target Allocator 服務
│   ├── operator-opamp-bridge/      # OpAMP Bridge 服務
│   └── gather/                     # 故障排查工具
│
├── Makefile                        # 構建腳本
├── go.mod                          # Go 依賴管理
└── versions.txt                    # 版本管理文件

2.2 四大核心 CRD 詳解

2.2.1 OpenTelemetryCollector (主要 CRD)

檔案: apis/v1beta1/opentelemetrycollector_types.go

用途: 管理 OpenTelemetry Collector 的部署

支持的部署模式:

const (
    ModeDeployment  Mode = "deployment"   // 標準部署
    ModeDaemonSet   Mode = "daemonset"    // 每個節點一個實例
    ModeStatefulSet Mode = "statefulset"  // 有狀態部署
    ModeSidecar     Mode = "sidecar"      // 注入到其他 Pod
)

功能特性:

  • ✅ 多種部署模式

  • ✅ 自動配置管理(ConfigMap 版本控制)

  • ✅ Target Allocator 集成

  • ✅ HPA 自動擴縮容

  • ✅ 版本自動升級

  • ✅ Ingress/Route 支持

  • ✅ PodMonitor/ServiceMonitor 支持

2.2.2 Instrumentation

檔案: apis/v1alpha1/instrumentation_types.go

用途: 配置自動埋點(Auto-instrumentation)

支持的語言:

  • Java (OpenTelemetry Java Agent)

  • Node.js (OpenTelemetry Node.js)

  • Python (OpenTelemetry Python)

  • .NET (OpenTelemetry .NET)

  • Go (eBPF-based)

  • Apache HTTPD

  • Nginx

2.2.3 TargetAllocator

檔案: apis/v1beta1/targetallocator_types.go

用途: 分配 Prometheus 抓取目標到多個 Collector 實例

分配策略:

  • least-weighted: 最少加權(基於目標數量)

  • consistent-hashing: 一致性哈希

  • per-node: 每個節點一個 Allocator

2.2.4 OpAMPBridge

檔案: apis/v1alpha1/opampbridge_types.go

用途: 實現 OpAMP 協議,實現遠程管理 Collector


三、CRD 完整剖析與實戰

3.1 CRD 結構深度解析

3.1.1 OpenTelemetryCollector CRD 完整定義

檔案: apis/v1beta1/opentelemetrycollector_types.go:32-145

// +kubebuilder:object:root=true
// +kubebuilder:resource:shortName=otelcol;otelcols
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.scale.replicas,selectorpath=.status.scale.selector
// +kubebuilder:printcolumn:name="Mode",type="string",JSONPath=".spec.mode",description="Deployment Mode"
// +kubebuilder:printcolumn:name="Version",type="string",JSONPath=".status.version",description="OpenTelemetry Version"
// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.scale.statusReplicas"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// +kubebuilder:printcolumn:name="Image",type="string",JSONPath=".status.image"
// +kubebuilder:printcolumn:name="Management",type="string",JSONPath=".spec.managementState",description="Management State"

type OpenTelemetryCollector struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   OpenTelemetryCollectorSpec   `json:"spec,omitempty"`
    Status OpenTelemetryCollectorStatus `json:"status,omitempty"`
}

3.1.2 Kubebuilder 註解完全指南

基礎註解

註解說明範例
+kubebuilder:object:root=true標記為 CRD 根物件必須添加
+kubebuilder:subresource:status啟用 status 子資源允許獨立更新 status
+kubebuilder:subresource:scale啟用 scale 子資源支持 kubectl scale
+kubebuilder:resource:shortName定義簡稱otelcol,otelcols
+kubebuilder:storageversion標記為存儲版本多版本時指定

驗證註解

// 數值驗證
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=100
Replicas int32 `json:"replicas,omitempty"`

// 字符串驗證
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=255
// +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
Name string `json:"name,omitempty"`

// 枚舉驗證
// +kubebuilder:validation:Enum=deployment;daemonset;statefulset;sidecar
Mode string `json:"mode,omitempty"`

// 默認值
// +kubebuilder:default=1
Replicas int32 `json:"replicas,omitempty"`

// CEL 驗證 (Kubernetes 1.25+)
// +kubebuilder:validation:XValidation:rule="self.mode != 'sidecar' || !has(self.replicas)",message="sidecar mode does not support replicas"
Spec OpenTelemetryCollectorSpec `json:"spec,omitempty"`

顯示註解

// kubectl get 輸出欄位
// +kubebuilder:printcolumn:name="Mode",type="string",JSONPath=".spec.mode",description="Deployment Mode"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

3.1.3 完整 Spec 結構

type OpenTelemetryCollectorSpec struct {
    // ===== 部署配置 =====

    // 部署模式
    // +optional
    // +kubebuilder:default=deployment
    Mode Mode `json:"mode,omitempty"`

    // 副本數(僅 Deployment/StatefulSet 模式)
    // +optional
    Replicas *int32 `json:"replicas,omitempty"`

    // 容器鏡像
    // +optional
    Image string `json:"image,omitempty"`

    // ===== Collector 配置 =====

    // Collector 配置(YAML 格式)
    // +required
    // +kubebuilder:pruning:PreserveUnknownFields
    Config Config `json:"config"`

    // ConfigMap 版本保留數量(用於回滾)
    // +optional
    // +kubebuilder:default:=3
    // +kubebuilder:validation:Minimum:=1
    ConfigVersions int `json:"configVersions,omitempty"`

    // ===== 升級策略 =====

    // 升級策略:automatic 或 none
    // +optional
    UpgradeStrategy UpgradeStrategy `json:"upgradeStrategy"`

    // Deployment 更新策略
    // +optional
    DeploymentUpdateStrategy appsv1.DeploymentStrategy `json:"deploymentUpdateStrategy,omitempty"`

    // DaemonSet 更新策略
    // +optional
    DaemonSetUpdateStrategy appsv1.DaemonSetUpdateStrategy `json:"daemonSetUpdateStrategy,omitempty"`

    // ===== 資源配置 =====

    // 資源限制
    // +optional
    Resources v1.ResourceRequirements `json:"resources,omitempty"`

    // 環境變數
    // +optional
    Env []v1.EnvVar `json:"env,omitempty"`

    // Volume Mounts
    // +optional
    VolumeMounts []v1.VolumeMount `json:"volumeMounts,omitempty"`

    // Volumes
    // +optional
    Volumes []v1.Volume `json:"volumes,omitempty"`

    // ===== 調度配置 =====

    // Node Selector
    // +optional
    NodeSelector map[string]string `json:"nodeSelector,omitempty"`

    // Tolerations
    // +optional
    Tolerations []v1.Toleration `json:"tolerations,omitempty"`

    // Affinity
    // +optional
    Affinity *v1.Affinity `json:"affinity,omitempty"`

    // ===== 網路配置 =====

    // Ingress 配置
    // +optional
    Ingress Ingress `json:"ingress,omitempty"`

    // Service 端口
    // +optional
    Ports []PortsSpec `json:"ports,omitempty"`

    // ===== 高級功能 =====

    // Target Allocator
    // +optional
    TargetAllocator TargetAllocatorEmbedded `json:"targetAllocator,omitempty"`

    // HPA 配置
    // +optional
    Autoscaler *AutoscalerSpec `json:"autoscaler,omitempty"`

    // 探針配置
    // +optional
    LivenessProbe *Probe `json:"livenessProbe,omitempty"`
    ReadinessProbe *Probe `json:"readinessProbe,omitempty"`

    // ===== 可觀測性 =====

    // Observability 配置
    // +optional
    Observability ObservabilitySpec `json:"observability,omitempty"`
}

3.2 實戰範例:從簡單到複雜

3.2.1 最簡範例

檔案: config/samples/core_v1beta1_opentelemetrycollector.yaml

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]

這個 CR 會創建:

# 1. Deployment (1 replica)
kubectl get deployment simplest-collector

# 2. Service (暴露 OTLP 端口)
kubectl get service simplest-collector
# Ports: 4317 (gRPC), 4318 (HTTP)

# 3. Service (headless, 用於 StatefulSet)
kubectl get service simplest-collector-headless

# 4. ConfigMap (Collector 配置)
kubectl get configmap simplest-collector

# 5. ServiceAccount
kubectl get serviceaccount simplest-collector

3.2.2 生產級範例

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: production-collector
  labels:
    env: production
spec:
  # ===== 部署配置 =====
  mode: deployment
  replicas: 3  # 高可用
  image: otel/opentelemetry-collector-k8s:0.88.0

  # ===== 升級策略 =====
  upgradeStrategy: automatic
  deploymentUpdateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

  # ===== ConfigMap 版本控制 =====
  configVersions: 5  # 保留 5 個版本用於回滾

  # ===== Collector 配置 =====
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

      # Prometheus receiver
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 30s
              static_configs:
                - targets: ['0.0.0.0:8888']

    processors:
      # 內存限制器(防止 OOM)
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15

      # 批處理(提升性能)
      batch:
        send_batch_size: 10000
        timeout: 10s

      # 資源屬性
      resource:
        attributes:
          - key: cluster.name
            value: production-cluster
            action: upsert

    exporters:
      # OTLP 導出到後端
      otlp:
        endpoint: backend.example.com:4317
        tls:
          insecure: false
          cert_file: /certs/tls.crt
          key_file: /certs/tls.key

      # Prometheus 導出
      prometheus:
        endpoint: 0.0.0.0:8889

    # Health Check Extension
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133

      pprof:
        endpoint: localhost:1777

    service:
      extensions: [health_check, pprof]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [otlp]

        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [otlp, prometheus]

  # ===== 資源限制 =====
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

  # ===== 環境變數 =====
  env:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP

  # ===== 調度配置 =====
  nodeSelector:
    workload: telemetry

  tolerations:
    - key: telemetry
      operator: Equal
      value: "true"
      effect: NoSchedule

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - production-collector-collector
            topologyKey: kubernetes.io/hostname

  # ===== Ingress 配置 =====
  ingress:
    type: ingress
    hostname: collector.example.com
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    tls:
      - secretName: collector-tls
        hosts:
          - collector.example.com

  # ===== HPA 自動擴縮容 =====
  autoscaler:
    minReplicas: 3
    maxReplicas: 10
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
          - type: Percent
            value: 50
            periodSeconds: 60
      scaleUp:
        stabilizationWindowSeconds: 0
        policies:
          - type: Percent
            value: 100
            periodSeconds: 15
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 75
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80

  # ===== 探針配置 =====
  livenessProbe:
    httpGet:
      path: /
      port: 13133
    initialDelaySeconds: 15
    periodSeconds: 10

  readinessProbe:
    httpGet:
      path: /
      port: 13133
    initialDelaySeconds: 10
    periodSeconds: 5

  # ===== Target Allocator =====
  targetAllocator:
    enabled: true
    replicas: 3
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true
      serviceMonitorSelector: {}
      podMonitorSelector: {}

  # ===== 可觀測性 =====
  observability:
    metrics:
      enableMetrics: true

3.2.3 DaemonSet 模式範例

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-collector
spec:
  mode: daemonset  # 每個節點一個實例

  # DaemonSet 更新策略
  daemonSetUpdateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

  # Host Network(訪問節點級資源)
  hostNetwork: true

  config:
    receivers:
      # 主機指標
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          load: {}
          memory: {}
          disk: {}
          filesystem: {}
          network: {}

      # Kubernetes Events
      k8s_events:
        namespaces: [default, kube-system]

    exporters:
      otlp:
        endpoint: central-collector:4317

    service:
      pipelines:
        metrics:
          receivers: [hostmetrics]
          exporters: [otlp]

四、Controller/Reconciler 深度實現

4.1 Reconciler 結構體詳解

檔案: internal/controllers/opentelemetrycollector_controller.go:58-67

type OpenTelemetryCollectorReconciler struct {
    // K8s 客戶端(帶緩存)
    client.Client

    // 事件記錄器(發送 K8s 事件)
    recorder record.EventRecorder

    // API Scheme(資源類型註冊表)
    scheme *runtime.Scheme

    // 結構化日誌器
    log logr.Logger

    // Operator 配置
    config config.Config

    // RBAC 權限審查器
    reviewer *internalRbac.Reviewer

    // 版本升級處理器
    upgrade *upgrade.VersionUpgrade
}

4.1.1 依賴組件解析

Client.Client

// controller-runtime 提供的智能客戶端
// 特性:
// 1. 自動緩存(減少 API Server 壓力)
// 2. 類型安全
// 3. 支持 FieldIndexer(加速查詢)
// 4. 自動處理 RESTMapper

// 使用範例
err := r.Client.Get(ctx, types.NamespacedName{
    Name:      "my-collector",
    Namespace: "default",
}, &collector)

EventRecorder

// 記錄 Kubernetes 事件
// kubectl describe 時可以看到

r.recorder.Event(
    &instance,                    // 相關物件
    corev1.EventTypeNormal,       // 事件類型
    "Created",                    // 原因
    "Created OpenTelemetry Collector deployment",  // 消息
)

Reviewer (RBAC 檢查器)

// 檢查 Collector 配置中的 RBAC 需求
// 例如:如果配置中有 k8s_cluster receiver,需要額外權限

if reviewer.NeedsClusterRole(collectorConfig) {
    // 創建 ClusterRole 和 ClusterRoleBinding
}

4.2 Reconcile 完整流程追蹤

檔案: internal/controllers/opentelemetrycollector_controller.go:234-314

讓我們逐步追蹤一個完整的 Reconcile 流程:

func (r *OpenTelemetryCollectorReconciler) Reconcile(
    ctx context.Context,
    req ctrl.Request,
) (ctrl.Result, error) {
    log := r.log.WithValues("opentelemetrycollector", req.NamespacedName)

    // ==================== 階段 1: 獲取 CR ====================
    log.Info("Reconciling OpenTelemetryCollector")

    var instance v1beta1.OpenTelemetryCollector
    if err := r.Get(ctx, req.NamespacedName, &instance); err != nil {
        if !apierrors.IsNotFound(err) {
            log.Error(err, "unable to fetch OpenTelemetryCollector")
        }
        // 資源已刪除,忽略錯誤
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // ==================== 階段 2: 構建參數 ====================
    // 包含所有需要的配置、Target Allocator 等
    params, err := r.GetParams(ctx, instance)
    if err != nil {
        log.Error(err, "Failed to create manifest.Params")
        return ctrl.Result{}, err
    }

    // ==================== 階段 3: 處理刪除(Finalizer 模式)====================
    if deletionTimestamp := instance.GetDeletionTimestamp(); deletionTimestamp != nil {
        log.Info("Resource is being deleted")

        if controllerutil.ContainsFinalizer(&instance, collectorFinalizer) {
            log.Info("Running finalizer logic")

            // 執行清理:刪除 ClusterRole、ClusterRoleBinding 等集群級資源
            if err = r.finalizeCollector(ctx, params); err != nil {
                log.Error(err, "Failed to finalize")
                return ctrl.Result{}, err
            }

            // 移除 Finalizer
            if controllerutil.RemoveFinalizer(&instance, collectorFinalizer) {
                err = r.Update(ctx, &instance)
                if err != nil {
                    return ctrl.Result{}, err
                }
            }
            log.Info("Finalizer removed, resource will be deleted")
        }
        return ctrl.Result{}, nil
    }

    // ==================== 階段 4: 檢查管理狀態 ====================
    // ManagementStateUnmanaged: Operator 不管理此資源
    if instance.Spec.ManagementState == v1beta1.ManagementStateUnmanaged {
        log.Info("Skipping reconciliation for unmanaged resource")
        return ctrl.Result{}, nil
    }

    // ==================== 階段 5: 版本升級 ====================
    // 檢查 CR 配置是否需要升級到新格式
    if r.upgrade.NeedsUpgrade(instance) {
        log.Info("Upgrading OpenTelemetryCollector configuration")

        err = r.upgrade.Upgrade(ctx, instance)
        if err != nil {
            log.Error(err, "Failed to upgrade")
            return ctrl.Result{}, err
        }

        // CR 被修改,重新排隊觸發新的 reconcile
        log.Info("Configuration upgraded, requeuing")
        return ctrl.Result{Requeue: true, RequeueAfter: 1 * time.Second}, nil
    }

    // ==================== 階段 6: 添加 Finalizer ====================
    // Finalizer 確保在刪除時執行清理邏輯
    if !controllerutil.ContainsFinalizer(&instance, collectorFinalizer) {
        log.Info("Adding finalizer")

        if controllerutil.AddFinalizer(&instance, collectorFinalizer) {
            err = r.Update(ctx, &instance)
            if err != nil {
                return ctrl.Result{}, err
            }
        }
    }

    // ==================== 階段 7: 構建期望資源 ====================
    log.Info("Building desired objects")

    desiredObjects, buildErr := BuildCollector(params)
    if buildErr != nil {
        log.Error(buildErr, "Failed to build desired objects")
        return ctrl.Result{}, buildErr
    }

    log.Info("Desired objects built", "count", len(desiredObjects))

    // ==================== 階段 8: 查找現有資源 ====================
    log.Info("Finding owned objects")

    ownedObjects, err := r.findOtelOwnedObjects(ctx, params)
    if err != nil {
        log.Error(err, "Failed to find owned objects")
        return ctrl.Result{}, err
    }

    log.Info("Found owned objects", "count", len(ownedObjects))

    // ==================== 階段 9: 調和資源 ====================
    // 比較期望狀態和當前狀態,執行創建/更新/刪除
    log.Info("Reconciling desired objects")

    err = reconcileDesiredObjects(
        ctx,
        r.Client,
        log,
        &instance,
        params.Scheme,
        desiredObjects,
        ownedObjects,
    )

    // ==================== 階段 10: 更新 Status ====================
    return collectorStatus.HandleReconcileStatus(ctx, log, params, instance, err)
}

4.2.1 詳細追蹤:BuildCollector 函數

檔案: internal/controllers/opentelemetrycollector_controller.go

func BuildCollector(params manifests.Params) ([]client.Object, error) {
    var objects []client.Object

    // 1. ServiceAccount
    if sa := collector.ServiceAccount(params); sa != nil {
        objects = append(objects, sa)
    }

    // 2. ConfigMap(Collector 配置)
    configMaps, err := collector.ConfigMaps(params)
    if err != nil {
        return nil, err
    }
    for _, cm := range configMaps {
        objects = append(objects, cm)
    }

    // 3. 根據模式選擇工作負載類型
    switch params.OtelCol.Spec.Mode {
    case v1alpha1.ModeDeployment:
        deployment, err := collector.Deployment(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, deployment)

    case v1alpha1.ModeDaemonSet:
        daemonset, err := collector.DaemonSet(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, daemonset)

    case v1alpha1.ModeStatefulSet:
        statefulset, err := collector.StatefulSet(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, statefulset)
    }

    // 4. Service(暴露端口)
    services := collector.Services(params)
    for _, svc := range services {
        objects = append(objects, svc)
    }

    // 5. Ingress(如果配置)
    if params.OtelCol.Spec.Ingress.Type == v1beta1.IngressTypeIngress {
        if ingress := collector.Ingress(params); ingress != nil {
            objects = append(objects, ingress)
        }
    }

    // 6. HPA(如果配置自動擴縮容)
    if params.OtelCol.Spec.Autoscaler != nil {
        if hpa := collector.HorizontalPodAutoscaler(params); hpa != nil {
            objects = append(objects, hpa)
        }
    }

    // 7. RBAC(如果需要)
    if params.Config.CreateRBACPermissions == rbac.Available {
        if clusterRole := collector.ClusterRole(params); clusterRole != nil {
            objects = append(objects, clusterRole)
        }
        if clusterRoleBinding := collector.ClusterRoleBinding(params); clusterRoleBinding != nil {
            objects = append(objects, clusterRoleBinding)
        }
    }

    // 8. ServiceMonitor/PodMonitor(可觀測性)
    if params.OtelCol.Spec.Observability.Metrics.EnableMetrics {
        if sm := collector.ServiceMonitor(params); sm != nil {
            objects = append(objects, sm)
        }
    }

    // 9. Target Allocator(如果啟用)
    if params.OtelCol.Spec.TargetAllocator.Enabled {
        taObjects, err := BuildTargetAllocator(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, taObjects...)
    }

    return objects, nil
}

4.2.2 調和邏輯:reconcileDesiredObjects

func reconcileDesiredObjects(
    ctx context.Context,
    client client.Client,
    log logr.Logger,
    owner client.Object,
    scheme *runtime.Scheme,
    desired []client.Object,
    existing map[types.UID]client.Object,
) error {
    // 為期望的資源設置 OwnerReference
    for _, obj := range desired {
        if err := controllerutil.SetControllerReference(owner, obj, scheme); err != nil {
            return err
        }
    }

    // ========== 步驟 1: 創建或更新期望的資源 ==========
    for _, desiredObj := range desired {
        log := log.WithValues(
            "kind", desiredObj.GetObjectKind().GroupVersionKind().Kind,
            "name", desiredObj.GetName(),
        )

        // 嘗試獲取現有資源
        existingObj := desiredObj.DeepCopyObject().(client.Object)
        err := client.Get(ctx, types.NamespacedName{
            Name:      desiredObj.GetName(),
            Namespace: desiredObj.GetNamespace(),
        }, existingObj)

        if err != nil && apierrors.IsNotFound(err) {
            // 資源不存在,創建
            log.Info("Creating resource")
            if err := client.Create(ctx, desiredObj); err != nil {
                log.Error(err, "Failed to create resource")
                return err
            }
            log.Info("Resource created successfully")

        } else if err != nil {
            // 其他錯誤
            return err

        } else {
            // 資源存在,更新
            log.Info("Updating resource")

            // 保留不應該變更的字段
            desiredObj.SetResourceVersion(existingObj.GetResourceVersion())

            if err := client.Update(ctx, desiredObj); err != nil {
                log.Error(err, "Failed to update resource")
                return err
            }
            log.Info("Resource updated successfully")

            // 從待刪除列表中移除
            delete(existing, existingObj.GetUID())
        }
    }

    // ========== 步驟 2: 刪除多餘的資源 ==========
    for uid, obj := range existing {
        log := log.WithValues(
            "kind", obj.GetObjectKind().GroupVersionKind().Kind,
            "name", obj.GetName(),
            "uid", uid,
        )

        log.Info("Deleting orphaned resource")
        if err := client.Delete(ctx, obj); err != nil {
            log.Error(err, "Failed to delete resource")
            return err
        }
        log.Info("Orphaned resource deleted")
    }

    return nil
}

4.3 索引與緩存優化

4.3.1 設置 Field Indexer

檔案: internal/controllers/opentelemetrycollector_controller.go:334-354

func (r *OpenTelemetryCollectorReconciler) SetupCaches(cluster cluster.Cluster) error {
    ownedResources := r.GetOwnedResourceTypes()

    // 為每種資源類型建立索引
    for _, resource := range ownedResources {
        if err := cluster.GetCache().IndexField(
            context.Background(),
            resource,
            resourceOwnerKey,  // 索引鍵:".metadata.owner"
            func(rawObj client.Object) []string {
                // 提取 Owner Reference
                owner := metav1.GetControllerOf(rawObj)
                if owner == nil {
                    return nil
                }

                // 只索引 OpenTelemetryCollector 擁有的資源
                if owner.Kind != "OpenTelemetryCollector" {
                    return nil
                }

                // 返回 Owner 名稱作為索引值
                return []string{owner.Name}
            },
        ); err != nil {
            return err
        }
    }
    return nil
}

4.3.2 使用索引加速查詢

func (r *OpenTelemetryCollectorReconciler) findOtelOwnedObjects(
    ctx context.Context,
    params manifests.Params,
) (map[types.UID]client.Object, error) {
    ownedObjects := map[types.UID]client.Object{}

    // 使用索引快速查詢(而不是掃描所有資源)
    listOpts := []client.ListOption{
        client.InNamespace(params.OtelCol.Namespace),
        client.MatchingFields{resourceOwnerKey: params.OtelCol.Name},
    }

    for _, objectType := range r.GetOwnedResourceTypes() {
        objs, err := getList(ctx, r, objectType, listOpts...)
        if err != nil {
            return nil, err
        }

        for uid, object := range objs {
            ownedObjects[uid] = object
        }
    }

    return ownedObjects, nil
}

性能對比

方法時間複雜度說明
不使用索引O(N)N = 集群中所有 Deployment 數量
使用索引O(M)M = 被這個 Collector 擁有的 Deployment 數量

在大型集群中,性能提升可達 10-100 倍!


五、Manifest 構建器詳解

5.1 Deployment 構建完整解析

檔案: internal/manifests/collector/deployment.go

func Deployment(params manifests.Params) (*appsv1.Deployment, error) {
    name := naming.Collector(params.OtelCol.Name)

    // ========== 1. 構建標籤 ==========
    // 標籤用於:
    // - Service 選擇器
    // - HPA 目標選擇
    // - 監控發現
    labels := manifestutils.Labels(
        params.OtelCol.ObjectMeta,
        name,
        params.OtelCol.Spec.Image,
        ComponentOpenTelemetryCollector,
        params.Config.LabelsFilter,
    )

    // 標籤範例:
    // app.kubernetes.io/name: my-collector-collector
    // app.kubernetes.io/instance: my-collector
    // app.kubernetes.io/managed-by: opentelemetry-operator
    // app.kubernetes.io/component: opentelemetry-collector
    // app.kubernetes.io/version: 0.88.0

    // ========== 2. 構建註解 ==========
    annotations, err := manifestutils.Annotations(
        params.OtelCol,
        params.Config.AnnotationsFilter,
    )
    if err != nil {
        return nil, err
    }

    // Pod 註解(額外添加 ConfigMap 哈希)
    podAnnotations, err := manifestutils.PodAnnotations(
        params.OtelCol,
        params.Config.AnnotationsFilter,
    )
    if err != nil {
        return nil, err
    }

    // ========== 3. 構建 Deployment ==========
    return &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:        name,
            Namespace:   params.OtelCol.Namespace,
            Labels:      labels,
            Annotations: annotations,
        },
        Spec: appsv1.DeploymentSpec{
            // 副本數
            Replicas: manifestutils.GetInitialReplicas(params.OtelCol),

            // 選擇器(必須與 Pod 標籤匹配)
            Selector: &metav1.LabelSelector{
                MatchLabels: manifestutils.SelectorLabels(
                    params.OtelCol.ObjectMeta,
                    ComponentOpenTelemetryCollector,
                ),
            },

            // 更新策略
            Strategy: params.OtelCol.Spec.DeploymentUpdateStrategy,

            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels:      labels,
                    Annotations: podAnnotations,
                },
                Spec: corev1.PodSpec{
                    // ServiceAccount
                    ServiceAccountName: ServiceAccountName(params.OtelCol),

                    // Init Containers
                    InitContainers: params.OtelCol.Spec.InitContainers,

                    // 主容器 + 額外容器
                    Containers: append(
                        params.OtelCol.Spec.AdditionalContainers,
                        Container(params.Config, params.Log, params.OtelCol, true),
                    ),

                    // Volumes
                    Volumes: Volumes(params.Config, params.OtelCol),

                    // 網路配置
                    DNSPolicy:   manifestutils.GetDNSPolicy(...),
                    DNSConfig:   &params.OtelCol.Spec.PodDNSConfig,
                    HostNetwork: params.OtelCol.Spec.HostNetwork,

                    // 進程命名空間共享
                    ShareProcessNamespace: &params.OtelCol.Spec.ShareProcessNamespace,

                    // 調度配置
                    Tolerations:               params.OtelCol.Spec.Tolerations,
                    NodeSelector:              params.OtelCol.Spec.NodeSelector,
                    Affinity:                  params.OtelCol.Spec.Affinity,
                    TopologySpreadConstraints: params.OtelCol.Spec.TopologySpreadConstraints,

                    // 安全配置
                    SecurityContext:   params.OtelCol.Spec.PodSecurityContext,
                    PriorityClassName: params.OtelCol.Spec.PriorityClassName,

                    // 優雅終止
                    TerminationGracePeriodSeconds: params.OtelCol.Spec.TerminationGracePeriodSeconds,
                },
            },
        },
    }, nil
}

5.2 Container 構建深度解析

檔案: internal/manifests/collector/container.go:28-150

func Container(
    cfg config.Config,
    logger logr.Logger,
    otelcol v1beta1.OpenTelemetryCollector,
    addConfig bool,
) corev1.Container {
    // ========== 1. 決定鏡像 ==========
    image := otelcol.Spec.Image
    if len(image) == 0 {
        image = cfg.CollectorImage  // 使用預設鏡像
    }

    // ========== 2. 構建端口列表 ==========
    ports := getContainerPorts(logger, otelcol)
    // 從 Collector 配置中解析端口
    // 例如:otlp.protocols.grpc.endpoint: 0.0.0.0:4317
    //      → ContainerPort{Name: "otlp-grpc", ContainerPort: 4317}

    // ========== 3. 構建啟動參數 ==========
    var args []string
    var volumeMounts []corev1.VolumeMount

    // 配置文件始終是第一個參數
    if addConfig {
        args = append(args, fmt.Sprintf(
            "--config=/conf/%s",
            cfg.CollectorConfigMapEntry,
        ))

        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      naming.ConfigMapVolume(),
            MountPath: "/conf",
        })
    }

    // 如果啟用 Target Allocator mTLS
    if otelcol.Spec.TargetAllocator.Enabled &&
       cfg.CertManagerAvailability == certmanager.Available &&
       featuregate.EnableTargetAllocatorMTLS.IsEnabled() {
        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      naming.TAClientCertificate(otelcol.Name),
            MountPath: constants.TACollectorTLSDirPath,
        })
    }

    // 額外的用戶自定義參數(排序保證一致性)
    argsMap := otelcol.Spec.Args
    if argsMap == nil {
        argsMap = map[string]string{}
    }

    var sortedArgs []string
    for k, v := range argsMap {
        sortedArgs = append(sortedArgs, fmt.Sprintf("--%s=%s", k, v))
    }
    sort.Strings(sortedArgs)
    args = append(args, sortedArgs...)

    // ========== 4. Volume Mounts ==========
    if len(otelcol.Spec.VolumeMounts) > 0 {
        volumeMounts = append(volumeMounts, otelcol.Spec.VolumeMounts...)
    }

    // 額外 ConfigMap 掛載
    for _, cm := range otelcol.Spec.ConfigMaps {
        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      naming.ConfigMapExtra(cm.Name),
            MountPath: path.Join("/var/conf", cm.MountPath, naming.ConfigMapExtra(cm.Name)),
        })
    }

    // ========== 5. 健康檢查探針 ==========
    // Liveness Probe
    livenessProbe, err := otelcol.Spec.Config.GetLivenessProbe(logger)
    if err != nil {
        logger.Error(err, "cannot create liveness probe")
    } else {
        defaultProbeSettings(livenessProbe, otelcol.Spec.LivenessProbe)
    }

    // Readiness Probe
    readinessProbe, err := otelcol.Spec.Config.GetReadinessProbe(logger)
    if err != nil {
        logger.Error(err, "cannot create readiness probe")
    } else {
        defaultProbeSettings(readinessProbe, otelcol.Spec.ReadinessProbe)
    }

    // Startup Probe
    startupProbe, err := otelcol.Spec.Config.GetStartupProbe(logger)
    if err != nil {
        logger.Error(err, "cannot create startup probe")
    }

    // ========== 6. 環境變數 ==========
    envVars := otelcol.Spec.Env

    // 添加預設環境變數
    envVars = append(envVars, corev1.EnvVar{
        Name: "POD_NAME",
        ValueFrom: &corev1.EnvVarSource{
            FieldRef: &corev1.ObjectFieldSelector{
                FieldPath: "metadata.name",
            },
        },
    })

    // ========== 7. 返回完整 Container ==========
    return corev1.Container{
        Name:            naming.Container(),
        Image:           image,
        ImagePullPolicy: otelcol.Spec.ImagePullPolicy,
        Args:            args,
        Ports:           ports,
        VolumeMounts:    volumeMounts,
        Env:             envVars,
        EnvFrom:         otelcol.Spec.EnvFrom,
        Resources:       otelcol.Spec.Resources,
        LivenessProbe:   livenessProbe,
        ReadinessProbe:  readinessProbe,
        StartupProbe:    startupProbe,
        SecurityContext: otelcol.Spec.SecurityContext,
    }
}

5.2.1 端口解析邏輯

func getContainerPorts(logger logr.Logger, otelcol v1beta1.OpenTelemetryCollector) []corev1.ContainerPort {
    var ports []corev1.ContainerPort

    // 從 Collector 配置中解析端口
    cfg := otelcol.Spec.Config

    // 解析 receivers 中的端口
    if receivers, ok := cfg.Receivers.Object["otlp"].(map[string]interface{}); ok {
        if protocols, ok := receivers["protocols"].(map[string]interface{}); ok {
            // gRPC 端口
            if grpc, ok := protocols["grpc"].(map[string]interface{}); ok {
                if endpoint, ok := grpc["endpoint"].(string); ok {
                    port := extractPort(endpoint)  // "0.0.0.0:4317" → 4317
                    ports = append(ports, corev1.ContainerPort{
                        Name:          "otlp-grpc",
                        ContainerPort: port,
                        Protocol:      corev1.ProtocolTCP,
                    })
                }
            }

            // HTTP 端口
            if http, ok := protocols["http"].(map[string]interface{}); ok {
                if endpoint, ok := http["endpoint"].(string); ok {
                    port := extractPort(endpoint)  // "0.0.0.0:4318" → 4318
                    ports = append(ports, corev1.ContainerPort{
                        Name:          "otlp-http",
                        ContainerPort: port,
                        Protocol:      corev1.ProtocolTCP,
                    })
                }
            }
        }
    }

    // 用戶自定義端口
    for _, portSpec := range otelcol.Spec.Ports {
        ports = append(ports, corev1.ContainerPort{
            Name:          portSpec.Name,
            ContainerPort: portSpec.Port,
            Protocol:      portSpec.Protocol,
        })
    }

    return ports
}

5.3 ConfigMap 構建與版本控制

檔案: internal/manifests/collector/configmap.go

func ConfigMaps(params manifests.Params) ([]*corev1.ConfigMap, error) {
    var configMaps []*corev1.ConfigMap

    // ========== 1. 構建主 ConfigMap ==========
    name := naming.ConfigMap(params.OtelCol.Name)

    // 序列化 Collector 配置為 YAML
    configYAML, err := params.OtelCol.Spec.Config.Yaml()
    if err != nil {
        return nil, err
    }

    // 計算配置哈希(用於觸發 Pod 重啟)
    configHash := hash(configYAML)

    configMap := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      fmt.Sprintf("%s-%s", name, configHash[:8]),
            Namespace: params.OtelCol.Namespace,
            Labels: manifestutils.Labels(
                params.OtelCol.ObjectMeta,
                name,
                params.OtelCol.Spec.Image,
                ComponentOpenTelemetryCollector,
                params.Config.LabelsFilter,
            ),
            Annotations: map[string]string{
                "opentelemetry.io/config-hash": configHash,
                "opentelemetry.io/created-at":  time.Now().Format(time.RFC3339),
            },
        },
        Data: map[string]string{
            cfg.CollectorConfigMapEntry: configYAML,
        },
    }

    configMaps = append(configMaps, configMap)

    // ========== 2. Target Allocator ConfigMap ==========
    if params.OtelCol.Spec.TargetAllocator.Enabled {
        taConfigMap, err := BuildTargetAllocatorConfigMap(params)
        if err != nil {
            return nil, err
        }
        configMaps = append(configMaps, taConfigMap)
    }

    return configMaps, nil
}

5.3.1 ConfigMap 版本控制實現

檔案: internal/controllers/opentelemetrycollector_controller.go:146-158

func getCollectorConfigMapsToKeep(
    configVersionsToKeep int,
    configMaps []*corev1.ConfigMap,
) []*corev1.ConfigMap {
    configVersionsToKeep = max(1, configVersionsToKeep)

    // 按創建時間排序(最新到最舊)
    sort.Slice(configMaps, func(i, j int) bool {
        iTime := configMaps[i].GetCreationTimestamp().Time
        jTime := configMaps[j].GetCreationTimestamp().Time
        return iTime.After(jTime)
    })

    // 保留最新的 N 個
    configMapsToKeep := min(configVersionsToKeep, len(configMaps))
    return configMaps[:configMapsToKeep]
}

工作流程:

配置更新 → 創建新 ConfigMap → 舊 ConfigMap 保留(用於回滾)

範例:
1. my-collector-abcd1234  (當前使用)
2. my-collector-efgh5678  (保留)
3. my-collector-ijkl9012  (保留)
4. my-collector-mnop3456  (將被刪除)

六、Webhook 機制深度解析

6.1 Webhook 概述

Kubernetes Admission Webhooks 允許在資源創建/更新前進行攔截和修改。

User: kubectl apply -f pod.yaml
    │
    ▼
API Server
    │
    ├──► Mutating Webhook  (修改資源)
    │    - Sidecar 注入
    │    - 添加標籤/註解
    │    - 修改容器配置
    │
    ├──► Validating Webhook (驗證資源)
    │    - 檢查配置合法性
    │    - 執行業務規則
    │
    └──► 存儲到 etcd

6.2 Pod Mutation Webhook 實現

檔案: internal/webhook/podmutation/webhookhandler.go

6.2.1 Webhook Handler 結構

type podMutationWebhook struct {
    client      client.Client     // K8s 客戶端
    decoder     admission.Decoder // 請求解碼器
    logger      logr.Logger       // 日誌
    podMutators []PodMutator      // Pod 變更器鏈
    config      config.Config     // 配置
}

// PodMutator 介面
type PodMutator interface {
    Mutate(ctx context.Context, ns corev1.Namespace, pod corev1.Pod) (corev1.Pod, error)
}

6.2.2 Webhook 註冊

// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=ignore,groups="",resources=pods,verbs=create,versions=v1,name=mpod.kb.io,sideEffects=none,admissionReviewVersions=v1

參數解析:

  • path: Webhook 路徑

  • mutating=true: 變更型 Webhook

  • failurePolicy=ignore: 失敗時允許請求繼續(高可用)

  • resources=pods: 只攔截 Pod

  • verbs=create: 只攔截創建操作

6.2.3 Handle 方法完整實現

func (p *podMutationWebhook) Handle(
    ctx context.Context,
    req admission.Request,
) admission.Response {
    log := p.logger.WithValues("namespace", req.Namespace, "name", req.Name)
    log.Info("Webhook called")

    // ========== 1. 解碼 Pod ==========
    pod := corev1.Pod{}
    err := p.decoder.Decode(req, &pod)
    if err != nil {
        log.Error(err, "Failed to decode pod")
        return admission.Errored(http.StatusBadRequest, err)
    }

    log.Info("Pod decoded", "podName", pod.Name)

    // ========== 2. 獲取 Namespace ==========
    ns := corev1.Namespace{}
    err = p.client.Get(ctx, types.NamespacedName{
        Name:      req.Namespace,
        Namespace: "",
    }, &ns)
    if err != nil {
        log.Error(err, "Failed to get namespace")
        res := admission.Errored(http.StatusInternalServerError, err)
        // 設置 Allowed = true,即使錯誤也不阻止 Pod 創建
        res.Allowed = true
        return res
    }

    // ========== 3. 執行所有 Mutator ==========
    originalPod := pod.DeepCopy()

    for i, mutator := range p.podMutators {
        log := log.WithValues("mutatorIndex", i)
        log.Info("Applying mutator")

        pod, err = mutator.Mutate(ctx, ns, pod)
        if err != nil {
            log.Error(err, "Mutator failed")
            res := admission.Errored(http.StatusInternalServerError, err)
            res.Allowed = true  // 錯誤不阻止創建
            return res
        }
    }

    // ========== 4. 檢查是否有修改 ==========
    if reflect.DeepEqual(originalPod, &pod) {
        log.Info("No changes made, allowing")
        return admission.Allowed("no changes")
    }

    // ========== 5. 生成 JSONPatch ==========
    marshaledPod, err := json.Marshal(pod)
    if err != nil {
        log.Error(err, "Failed to marshal pod")
        res := admission.Errored(http.StatusInternalServerError, err)
        res.Allowed = true
        return res
    }

    log.Info("Returning patch response")
    return admission.PatchResponseFromRaw(req.Object.Raw, marshaledPod)
}

6.3 Sidecar 注入實現

6.3.1 Sidecar Mutator

檔案: internal/webhook/podmutation/sidecar_mutator.go

type sidecarMutator struct {
    client client.Client
    logger logr.Logger
    config config.Config
}

func (s *sidecarMutator) Mutate(
    ctx context.Context,
    ns corev1.Namespace,
    pod corev1.Pod,
) (corev1.Pod, error) {
    log := s.logger.WithValues("pod", pod.Name, "namespace", ns.Name)

    // ========== 1. 檢查是否需要注入 ==========
    // 註解:sidecar.opentelemetry.io/inject
    injectAnnotation, exists := pod.Annotations["sidecar.opentelemetry.io/inject"]
    if !exists || injectAnnotation == "false" {
        log.V(1).Info("Sidecar injection not requested")
        return pod, nil
    }

    log.Info("Sidecar injection requested", "value", injectAnnotation)

    // ========== 2. 查找 OpenTelemetryCollector ==========
    var collectorName string
    if injectAnnotation == "true" {
        // 自動查找(尋找同 namespace 下 mode=sidecar 的 Collector)
        collectorName, err := s.findSidecarCollector(ctx, ns.Name)
        if err != nil {
            return pod, err
        }
        log.Info("Auto-detected collector", "name", collectorName)
    } else {
        // 使用指定的 Collector
        collectorName = injectAnnotation
        log.Info("Using specified collector", "name", collectorName)
    }

    // 獲取 Collector CR
    collector := &v1beta1.OpenTelemetryCollector{}
    err := s.client.Get(ctx, types.NamespacedName{
        Name:      collectorName,
        Namespace: ns.Name,
    }, collector)
    if err != nil {
        log.Error(err, "Failed to get collector")
        return pod, err
    }

    // 驗證模式
    if collector.Spec.Mode != v1alpha1.ModeSidecar {
        return pod, fmt.Errorf("collector %s is not in sidecar mode", collectorName)
    }

    // ========== 3. 注入 Sidecar 容器 ==========
    sidecarContainer := s.buildSidecarContainer(collector)
    pod.Spec.Containers = append(pod.Spec.Containers, sidecarContainer)

    // ========== 4. 添加 Volumes ==========
    volumes := s.buildSidecarVolumes(collector)
    pod.Spec.Volumes = append(pod.Spec.Volumes, volumes...)

    // ========== 5. 修改應用容器環境變數 ==========
    // 將 OTLP endpoint 設置為 localhost
    for i := range pod.Spec.Containers {
        if pod.Spec.Containers[i].Name == sidecarContainer.Name {
            continue  // 跳過 sidecar 容器本身
        }

        pod.Spec.Containers[i].Env = append(
            pod.Spec.Containers[i].Env,
            corev1.EnvVar{
                Name:  "OTEL_EXPORTER_OTLP_ENDPOINT",
                Value: "http://localhost:4317",
            },
        )
    }

    log.Info("Sidecar injected successfully")
    return pod, nil
}

func (s *sidecarMutator) buildSidecarContainer(
    collector *v1beta1.OpenTelemetryCollector,
) corev1.Container {
    return corev1.Container{
        Name:  "otc-sidecar",
        Image: collector.Spec.Image,
        Args: []string{
            "--config=/conf/collector.yaml",
        },
        Ports: []corev1.ContainerPort{
            {Name: "otlp-grpc", ContainerPort: 4317},
            {Name: "otlp-http", ContainerPort: 4318},
        },
        VolumeMounts: []corev1.VolumeMount{
            {
                Name:      "otc-config",
                MountPath: "/conf",
            },
        },
        Resources: collector.Spec.Resources,
    }
}

6.3.2 Sidecar 注入效果

原始 Pod:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    sidecar.opentelemetry.io/inject: "true"
spec:
  containers:
    - name: app
      image: myapp:v1.0

注入後的 Pod:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    sidecar.opentelemetry.io/inject: "true"
    sidecar.opentelemetry.io/injected: "true"
spec:
  containers:
    # 原始應用容器(已修改)
    - name: app
      image: myapp:v1.0
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://localhost:4317

    # 注入的 Sidecar 容器
    - name: otc-sidecar
      image: otel/opentelemetry-collector:0.88.0
      args:
        - --config=/conf/collector.yaml
      ports:
        - name: otlp-grpc
          containerPort: 4317
        - name: otlp-http
          containerPort: 4318
      volumeMounts:
        - name: otc-config
          mountPath: /conf

  volumes:
    - name: otc-config
      configMap:
        name: my-sidecar-collector

七、開發環境完整設置

7.1 前置工具安裝

7.1.1 Go 語言環境

# ========== macOS ==========
brew install go@1.24

# 設置環境變數
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin

# 驗證
go version  # 應該顯示 go1.24 或更高

# ========== Linux ==========
# 下載並安裝
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz

# 添加到 PATH(添加到 ~/.bashrc 或 ~/.zshrc)
export PATH=$PATH:/usr/local/go/bin
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin

# 重新加載配置
source ~/.bashrc  # 或 source ~/.zshrc

7.1.2 Docker

# ========== macOS ==========
brew install --cask docker
# 或下載 Docker Desktop: https://www.docker.com/products/docker-desktop

# ========== Linux (Ubuntu/Debian) ==========
# 安裝依賴
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release

# 添加 Docker GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# 設置倉庫
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# 安裝 Docker Engine
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin

# 允許非 root 用戶使用 Docker
sudo usermod -aG docker $USER
newgrp docker

# 驗證
docker --version
docker run hello-world

7.1.3 kubectl

# ========== macOS ==========
brew install kubectl

# ========== Linux ==========
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# 驗證
kubectl version --client

7.1.4 Kind (Kubernetes in Docker)

# ========== macOS ==========
brew install kind

# ========== Linux ==========
go install sigs.k8s.io/kind@v0.20.0

# 或使用二進制
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# 驗證
kind version

7.1.5 Kustomize

# ========== macOS ==========
brew install kustomize

# ========== Linux ==========
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/

# 驗證
kustomize version

7.1.6 controller-gen (Kubebuilder 工具)

# 安裝 controller-gen(生成 CRD 和 deepcopy 代碼)
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

# 驗證
controller-gen --version

7.1.7 Operator SDK (可選)

# ========== macOS ==========
brew install operator-sdk

# ========== Linux ==========
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)
export OS=$(uname | awk '{print tolower($0)}')
export OPERATOR_SDK_DL_URL=https://github.com/operator-framework/operator-sdk/releases/download/v1.29.0

curl -LO ${OPERATOR_SDK_DL_URL}/operator-sdk_${OS}_${ARCH}
chmod +x operator-sdk_${OS}_${ARCH}
sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk

# 驗證
operator-sdk version

7.2 專案設置

7.2.1 Clone 專案

# Clone OpenTelemetry Operator
git clone https://github.com/open-telemetry/opentelemetry-operator.git
cd opentelemetry-operator

# 查看分支
git branch -a

# 切換到最新的穩定分支(如果需要)
git checkout main

7.2.2 了解 Makefile

OpenTelemetry Operator 的 Makefile 提供了豐富的命令:

檔案: Makefile

# ========== 查看所有可用命令 ==========
make help

# 主要命令:
make manifests         # 生成 CRD YAML
make generate          # 生成 Go 代碼(deepcopy 等)
make fmt               # 格式化代碼
make vet               # 靜態分析
make test              # 運行單元測試
make docker-build      # 構建 Docker 鏡像
make install           # 安裝 CRD 到集群
make deploy            # 部署 Operator 到集群
make run               # 本地運行 Operator
make kind-cluster      # 創建 Kind 測試集群
make e2e               # 運行 E2E 測試

7.2.3 創建 Kind 集群

檔案: Makefile:101

# ========== 使用 Makefile 創建(推薦)==========
make kind-cluster

# 這個命令會:
# 1. 創建一個名為 "otel-operator" 的 Kind 集群
# 2. 使用配置文件 kind-1.33.yaml(或其他版本)
# 3. 自動安裝 cert-manager

# ========== 手動創建 ==========
# 創建集群配置文件
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.33.0
  - role: worker
    image: kindest/node:v1.33.0
  - role: worker
    image: kindest/node:v1.33.0
EOF

# 創建集群
kind create cluster \
  --name otel-operator \
  --config kind-config.yaml

# 驗證集群
kubectl cluster-info --context kind-otel-operator
kubectl get nodes

7.2.4 安裝 cert-manager

檔案: Makefile:106 (CERTMANAGER_VERSION)

# ========== 方式 1: 使用 kubectl ==========
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# 等待 cert-manager 就緒
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager -n cert-manager

kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-webhook -n cert-manager

kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-cainjector -n cert-manager

# 驗證
kubectl get pods -n cert-manager

# ========== 方式 2: 使用 Helm ==========
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.0 \
  --set installCRDs=true

# 驗證
kubectl get pods -n cert-manager

7.3 本地開發工作流

7.3.1 生成 CRD 和代碼

# ========== 1. 生成 CRD YAML ==========
make manifests

# 這會:
# - 從 apis/**/types.go 的註解生成 CRD YAML
# - 輸出到 config/crd/bases/
# - 使用 controller-gen 工具

# 查看生成的 CRD
ls -lh config/crd/bases/
# opentelemetry.io_instrumentations.yaml
# opentelemetry.io_opentelemetrycollectors.yaml
# opentelemetry.io_targetallocators.yaml
# opentelemetry.io_opampbridges.yaml

# ========== 2. 生成 Go 代碼 ==========
make generate

# 這會:
# - 生成 DeepCopy 方法(zz_generated.deepcopy.go)
# - 使用 controller-gen 工具

# 查看生成的代碼
find apis -name "zz_generated.*.go"

背後的命令(來自 Makefile):

# manifests 目標
controller-gen \
  crd:generateEmbeddedObjectMeta=true,maxDescLen=0 \
  rbac:roleName=manager-role \
  webhook \
  paths="./..." \
  output:crd:artifacts:config=config/crd/bases

# generate 目標
controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."

7.3.2 安裝 CRD 到集群

# ========== 安裝 CRD ==========
make install

# 等價於:
# kustomize build config/crd | kubectl apply -f -

# 驗證 CRD 已安裝
kubectl get crd | grep opentelemetry

# 應該看到:
# instrumentations.opentelemetry.io
# opampbridges.opentelemetry.io
# opentelemetrycollectors.opentelemetry.io
# targetallocators.opentelemetry.io

# 查看 CRD 詳情
kubectl explain opentelemetrycollector
kubectl explain opentelemetrycollector.spec
kubectl explain opentelemetrycollector.spec.config

7.3.3 本地運行 Operator

方式 1: 使用 Makefile(推薦)

檔案: Makefile:202-204

# 本地運行(不部署到集群)
make run

# 這會:
# 1. 執行 make generate fmt vet manifests
# 2. 運行 go run ./main.go --zap-devel
# 3. Operator 運行在本地,連接到 Kind 集群

# 默認禁用 Webhooks(本地運行時)
# 如果要啟用 Webhooks:
make run ENABLE_WEBHOOKS=true

方式 2: 直接使用 go run

# 設置環境變數
export WATCH_NAMESPACE=default  # 只監控 default namespace
# 或不設置,監控所有 namespace

# 禁用 Webhooks(本地運行通常禁用)
export ENABLE_WEBHOOKS=false

# 運行
go run ./main.go \
  --zap-devel \
  --metrics-addr=:8080 \
  --enable-leader-election=false \
  --health-probe-addr=:8081

# 參數說明:
# --zap-devel: 開發模式日誌(更詳細)
# --metrics-addr: Prometheus metrics 地址
# --enable-leader-election: 禁用 leader election(本地單實例)
# --health-probe-addr: 健康檢查地址

查看日誌輸出

2025-01-15T10:00:00.000Z    INFO    setup    Starting the OpenTelemetry Operator
2025-01-15T10:00:00.001Z    INFO    setup    opentelemetry-operator version    {"version": "0.92.0"}
2025-01-15T10:00:00.002Z    INFO    setup    build-date    {"date": "2025-01-15"}
2025-01-15T10:00:00.003Z    INFO    setup    go-version    {"version": "go1.24.0"}
2025-01-15T10:00:00.010Z    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": ":8080"}
2025-01-15T10:00:00.011Z    INFO    setup    starting manager
2025-01-15T10:00:00.011Z    INFO    controller    Starting EventSource    {"controller": "opentelemetrycollector", "source": "kind source: *v1beta1.OpenTelemetryCollector"}
2025-01-15T10:00:00.112Z    INFO    controller    Starting Controller    {"controller": "opentelemetrycollector"}
2025-01-15T10:00:00.112Z    INFO    controller    Starting workers    {"controller": "opentelemetrycollector", "worker count": 1}

7.3.4 測試 Operator

在本地 Operator 運行時,打開另一個終端

# ========== 創建測試 CR ==========
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: test-local
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch: {}
    exporters:
      debug:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]
EOF

# ========== 觀察 Operator 日誌 ==========
# 在運行 make run 的終端,你會看到:
# INFO    Reconciling OpenTelemetryCollector    {"opentelemetrycollector": "default/test-local"}
# INFO    Building desired objects
# INFO    Creating resource    {"kind": "ServiceAccount", "name": "test-local-collector"}
# INFO    Creating resource    {"kind": "ConfigMap", "name": "test-local-collector-abcd1234"}
# INFO    Creating resource    {"kind": "Deployment", "name": "test-local-collector"}
# INFO    Creating resource    {"kind": "Service", "name": "test-local-collector"}

# ========== 驗證資源 ==========
kubectl get otelcol test-local
kubectl get deployment test-local-collector
kubectl get service test-local-collector
kubectl get configmap -l app.kubernetes.io/instance=test-local

# ========== 查看 Collector 日誌 ==========
kubectl logs -f deployment/test-local-collector

# ========== 測試配置更新 ==========
kubectl patch otelcol test-local --type=merge -p '
{
  "spec": {
    "config": {
      "exporters": {
        "debug": {
          "verbosity": "normal"
        }
      }
    }
  }
}'

# 觀察 Operator 日誌,會看到:
# - 新 ConfigMap 被創建
# - Deployment 被更新(觸發滾動更新)

# ========== 清理 ==========
kubectl delete otelcol test-local

7.3.5 調試技巧

使用 Delve 調試器

# 安裝 Delve
go install github.com/go-delve/delve/cmd/dlv@latest

# 使用 Delve 運行
dlv debug ./main.go -- \
  --zap-devel \
  --enable-leader-election=false

# 設置斷點
(dlv) break internal/controllers/opentelemetrycollector_controller.go:234
(dlv) continue

# 創建 CR 後,斷點會被觸發

使用 VS Code 調試

創建 .vscode/launch.json

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Operator",
      "type": "go",
      "request": "launch",
      "mode": "debug",
      "program": "${workspaceFolder}/main.go",
      "env": {
        "ENABLE_WEBHOOKS": "false"
      },
      "args": [
        "--zap-devel",
        "--enable-leader-election=false"
      ]
    }
  ]
}

按 F5 開始調試,可以設置斷點、查看變數等。

7.3.6 部署到集群(完整流程)

# ========== 1. 構建鏡像 ==========
# 設置鏡像名稱
export IMG=localhost:5000/opentelemetry-operator:dev

make docker-build IMG=$IMG

# ========== 2. 推送到本地 registry(Kind 使用)==========
# 創建本地 registry(如果沒有)
docker run -d -p 5000:5000 --name kind-registry registry:2

# 推送鏡像
docker push $IMG

# 或直接加載到 Kind
kind load docker-image $IMG --name otel-operator

# ========== 3. 部署 Operator ==========
make deploy IMG=$IMG

# 這會:
# 1. 安裝 CRD
# 2. 創建 Namespace (opentelemetry-operator-system)
# 3. 部署 Operator
# 4. 創建 RBAC 資源
# 5. 設置 Webhooks

# ========== 4. 驗證部署 ==========
kubectl get pods -n opentelemetry-operator-system

# 應該看到:
# NAME                                                        READY   STATUS    RESTARTS   AGE
# opentelemetry-operator-controller-manager-xxxxxxxxx-xxxxx   2/2     Running   0          1m

# 查看日誌
kubectl logs -f -n opentelemetry-operator-system \
  deployment/opentelemetry-operator-controller-manager \
  -c manager

# ========== 5. 測試 ==========
kubectl apply -f config/samples/core_v1beta1_opentelemetrycollector.yaml

kubectl get otelcol
kubectl get all -l app.kubernetes.io/managed-by=opentelemetry-operator

# ========== 6. 清理 ==========
make undeploy

7.4 常用開發命令速查

# ========== 代碼生成 ==========
make manifests              # 生成 CRD
make generate               # 生成 deepcopy 代碼
make bundle                 # 生成 Operator bundle(OLM)

# ========== 代碼質量 ==========
make fmt                    # 格式化代碼
make vet                    # 靜態分析
make lint                   # Linting(需要 golangci-lint)
make test                   # 運行單元測試
make test-coverage          # 測試覆蓋率

# ========== 構建 ==========
make manager                # 構建 Operator 二進制
make targetallocator        # 構建 Target Allocator
make operator-opamp-bridge  # 構建 OpAMP Bridge
make docker-build           # 構建 Docker 鏡像

# ========== 部署 ==========
make install                # 安裝 CRD
make uninstall              # 卸載 CRD
make deploy                 # 部署 Operator
make undeploy               # 卸載 Operator
make run                    # 本地運行

# ========== 測試 ==========
make kind-cluster           # 創建 Kind 集群
make kind-delete-cluster    # 刪除 Kind 集群
make e2e                    # E2E 測試
make e2e-targetallocator    # Target Allocator E2E
make e2e-instrumentation    # Instrumentation E2E

# ========== 清理 ==========
make clean                  # 清理生成的文件

7.5 開發環境故障排查

7.5.1 常見問題

問題 1: controller-gen 未找到

# 錯誤信息
controller-gen: command not found

# 解決方案
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

問題 2: Kind 集群無法訪問

# 檢查集群
kind get clusters

# 檢查 kubeconfig
kubectl config get-contexts

# 切換 context
kubectl config use-context kind-otel-operator

問題 3: cert-manager 未就緒

# 檢查 cert-manager pods
kubectl get pods -n cert-manager

# 查看日誌
kubectl logs -n cert-manager deployment/cert-manager

# 重新安裝
kubectl delete namespace cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

問題 4: Webhook 錯誤

# 錯誤信息
Error from server (InternalError): error when creating "test.yaml": Internal error occurred: failed calling webhook

# 解決方案(本地運行時)
export ENABLE_WEBHOOKS=false
make run

八、測試策略與實踐

8.1 測試金字塔

           /\
          /  \
         / E2E \ (少量,關鍵場景)
        /------\
       /  集成  \ (中等,API 交互)
      /----------\
     /  單元測試   \ (大量,快速反饋)
    /--------------\

OpenTelemetry Operator 的測試覆蓋:

  • 單元測試: internal/**/*_test.go

  • 集成測試: tests/e2e*/**

  • 性能測試: 負載測試腳本

8.2 單元測試深度實踐

8.2.1 測試結構

檔案: internal/controllers/reconcile_test.go

這個文件有 56KB,包含大量測試用例!

# 查看測試統計
wc -l internal/controllers/reconcile_test.go
# 約 1500+ 行

# 運行單元測試
make test

# 運行特定包
go test ./internal/controllers -v

# 運行特定測試
go test ./internal/controllers -v -run TestReconcile_Deployment

8.2.2 編寫單元測試範例

創建測試文件: internal/controllers/my_feature_test.go

package controllers_test

import (
    "context"
    "testing"

    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client/fake"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    "github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
    "github.com/open-telemetry/opentelemetry-operator/internal/controllers"
)

func TestReconcile_CustomFeature(t *testing.T) {
    // ========== 準備階段 ==========
    ctx := context.Background()

    // 創建測試用的 OpenTelemetryCollector
    otelCol := &v1beta1.OpenTelemetryCollector{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "test-collector",
            Namespace: "default",
        },
        Spec: v1beta1.OpenTelemetryCollectorSpec{
            Mode: v1alpha1.ModeDeployment,
            Config: v1beta1.Config{
                Receivers: v1beta1.AnyConfig{
                    Object: map[string]interface{}{
                        "otlp": map[string]interface{}{
                            "protocols": map[string]interface{}{
                                "grpc": map[string]interface{}{
                                    "endpoint": "0.0.0.0:4317",
                                },
                            },
                        },
                    },
                },
                Exporters: v1beta1.AnyConfig{
                    Object: map[string]interface{}{
                        "debug": map[string]interface{}{},
                    },
                },
                Service: v1beta1.Service{
                    Pipelines: map[string]*v1beta1.Pipeline{
                        "traces": {
                            Receivers: []string{"otlp"},
                            Exporters: []string{"debug"},
                        },
                    },
                },
            },
        },
    }

    // 創建 fake client
    scheme := runtime.NewScheme()
    _ = v1beta1.AddToScheme(scheme)
    _ = appsv1.AddToScheme(scheme)
    _ = corev1.AddToScheme(scheme)

    k8sClient := fake.NewClientBuilder().
        WithScheme(scheme).
        WithObjects(otelCol).
        WithStatusSubresource(&v1beta1.OpenTelemetryCollector{}).
        Build()

    // 創建 Reconciler
    reconciler := &controllers.OpenTelemetryCollectorReconciler{
        Client:   k8sClient,
        Scheme:   scheme,
        log:      logr.Discard(),
        recorder: record.NewFakeRecorder(10),
        config:   config.New(),
    }

    // ========== 執行階段 ==========
    req := reconcile.Request{
        NamespacedName: types.NamespacedName{
            Name:      "test-collector",
            Namespace: "default",
        },
    }

    result, err := reconciler.Reconcile(ctx, req)

    // ========== 驗證階段 ==========
    require.NoError(t, err)
    assert.False(t, result.Requeue)

    // 驗證 Deployment
    deployment := &appsv1.Deployment{}
    err = k8sClient.Get(ctx, types.NamespacedName{
        Name:      "test-collector-collector",
        Namespace: "default",
    }, deployment)
    require.NoError(t, err)

    // 驗證副本數
    assert.Equal(t, int32(1), *deployment.Spec.Replicas)

    // 驗證容器
    assert.Len(t, deployment.Spec.Template.Spec.Containers, 1)
    container := deployment.Spec.Template.Spec.Containers[0]
    assert.Equal(t, "otc-container", container.Name)

    // 驗證端口
    require.Len(t, container.Ports, 1)
    assert.Equal(t, "otlp-grpc", container.Ports[0].Name)
    assert.Equal(t, int32(4317), container.Ports[0].ContainerPort)

    // 驗證 Service
    service := &corev1.Service{}
    err = k8sClient.Get(ctx, types.NamespacedName{
        Name:      "test-collector-collector",
        Namespace: "default",
    }, service)
    require.NoError(t, err)
    assert.Len(t, service.Spec.Ports, 1)

    // 驗證 ConfigMap
    configMap := &corev1.ConfigMap{}
    configMaps := &corev1.ConfigMapList{}
    err = k8sClient.List(ctx, configMaps,
        client.InNamespace("default"),
        client.MatchingLabels{"app.kubernetes.io/instance": "test-collector"},
    )
    require.NoError(t, err)
    assert.Len(t, configMaps.Items, 1)

    // 驗證 Status
    err = k8sClient.Get(ctx, types.NamespacedName{
        Name:      "test-collector",
        Namespace: "default",
    }, otelCol)
    require.NoError(t, err)
    assert.NotEmpty(t, otelCol.Status.Version)
}

// Table-driven 測試範例
func TestReconcile_DifferentModes(t *testing.T) {
    tests := []struct {
        name     string
        mode     v1alpha1.Mode
        replicas *int32
        wantType string
    }{
        {
            name:     "Deployment mode",
            mode:     v1alpha1.ModeDeployment,
            replicas: ptr.To(int32(3)),
            wantType: "Deployment",
        },
        {
            name:     "DaemonSet mode",
            mode:     v1alpha1.ModeDaemonSet,
            replicas: nil,
            wantType: "DaemonSet",
        },
        {
            name:     "StatefulSet mode",
            mode:     v1alpha1.ModeStatefulSet,
            replicas: ptr.To(int32(2)),
            wantType: "StatefulSet",
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // 測試邏輯...
        })
    }
}

8.2.3 運行測試

# ========== 運行所有測試 ==========
make test

# ========== 運行特定包 ==========
go test ./internal/controllers -v

# ========== 運行特定測試 ==========
go test ./internal/controllers -v -run TestReconcile_Deployment

# ========== 運行測試並查看覆蓋率 ==========
go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out

# ========== 並行運行測試 ==========
go test ./... -parallel=4

# ========== 運行測試並顯示詳細輸出 ==========
go test ./internal/controllers -v -count=1

# ========== 測試特定功能 ==========
go test ./internal/manifests/collector -v -run TestDeployment

8.3 E2E 測試深度實踐

8.3.1 E2E 測試架構

OpenTelemetry Operator 使用 Chainsaw 進行 E2E 測試。

測試目錄結構

tests/
├── e2e/                       # 基礎功能
│   ├── smoke/                 # 煙霧測試
│   ├── smoke-targetallocator/ # TA 煙霧測試
│   └── smoke-sidecar/         # Sidecar 測試
├── e2e-instrumentation/       # 自動埋點測試
├── e2e-targetallocator/       # Target Allocator
├── e2e-opampbridge/          # OpAMP Bridge
└── e2e-upgrade/              # 升級測試

8.3.2 Chainsaw 測試範例

檔案: tests/e2e/smoke/00-install.yaml

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]

檔案: tests/e2e/smoke/01-assert.yaml

# 斷言 Deployment 被創建並就緒
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simplest-collector
spec:
  replicas: 1
status:
  readyReplicas: 1
  availableReplicas: 1
---
# 斷言 Service 被創建
apiVersion: v1
kind: Service
metadata:
  name: simplest-collector
spec:
  ports:
    - name: otlp-grpc
      port: 4317
      protocol: TCP
      targetPort: 4317

檔案: tests/e2e/smoke/chainsaw-test.yaml

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: smoke
spec:
  steps:
    - name: Install collector
      try:
        - apply:
            file: 00-install.yaml
        - assert:
            file: 01-assert.yaml
      catch:
        - describe:
            apiVersion: opentelemetry.io/v1beta1
            kind: OpenTelemetryCollector
        - describe:
            apiVersion: apps/v1
            kind: Deployment
        - podLogs:
            selector: app.kubernetes.io/name=simplest-collector

8.3.3 運行 E2E 測試

# ========== 1. 創建測試集群 ==========
make kind-cluster

# ========== 2. 構建並載入鏡像 ==========
make docker-build
make docker-build-targetallocator
make docker-build-operator-opamp-bridge

# 載入鏡像到 Kind
kind load docker-image \
  ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:latest \
  --name otel-operator

# ========== 3. 部署 Operator ==========
make deploy IMG=ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:latest

# ========== 4. 運行 E2E 測試 ==========
make e2e

# 運行特定測試
make e2e-instrumentation
make e2e-targetallocator
make e2e-upgrade

# ========== 5. 查看測試結果 ==========
# Chainsaw 會輸出詳細的測試報告

# ========== 6. 清理 ==========
make kind-delete-cluster

8.3.4 編寫自定義 E2E 測試

創建測試目錄

mkdir -p tests/e2e-custom-feature
cd tests/e2e-custom-feature

創建測試清單 (00-install.yaml):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: custom-test
spec:
  mode: deployment
  replicas: 2
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    processors:
      batch:
        send_batch_size: 1000
        timeout: 10s
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]

創建斷言 (01-assert.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-test-collector
spec:
  replicas: 2
status:
  readyReplicas: 2
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: (custom-test-collector-*)
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

創建 Chainsaw 測試 (chainsaw-test.yaml):

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: custom-feature
spec:
  description: Test custom feature
  timeouts:
    apply: 30s
    assert: 60s
  steps:
    - name: Install collector
      try:
        - apply:
            file: 00-install.yaml
        - assert:
            file: 01-assert.yaml
            timeout: 2m
      catch:
        - describe:
            apiVersion: opentelemetry.io/v1beta1
            kind: OpenTelemetryCollector
        - podLogs:
            selector: app.kubernetes.io/name=custom-test-collector

    - name: Test configuration update
      try:
        - patch:
            resource:
              apiVersion: opentelemetry.io/v1beta1
              kind: OpenTelemetryCollector
              metadata:
                name: custom-test
            merge:
              spec:
                replicas: 3
        - sleep:
            duration: 10s
        - assert:
            resource:
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: custom-test-collector
              spec:
                replicas: 3

    - name: Cleanup
      try:
        - delete:
            ref:
              apiVersion: opentelemetry.io/v1beta1
              kind: OpenTelemetryCollector
              name: custom-test

運行自定義測試

# 安裝 Chainsaw
go install github.com/kyverno/chainsaw@latest

# 運行測試
chainsaw test --test-dir ./tests/e2e-custom-feature

8.4 測試覆蓋率

# ========== 生成覆蓋率報告 ==========
go test ./... -coverprofile=coverage.out

# ========== 查看覆蓋率摘要 ==========
go tool cover -func=coverage.out

# 輸出範例:
# github.com/open-telemetry/opentelemetry-operator/internal/controllers/opentelemetrycollector_controller.go:234:    Reconcile        85.7%
# github.com/open-telemetry/opentelemetry-operator/internal/manifests/collector/deployment.go:17:        Deployment        92.3%

# ========== 生成 HTML 報告 ==========
go tool cover -html=coverage.out -o coverage.html

# 在瀏覽器中打開
open coverage.html  # macOS
xdg-open coverage.html  # Linux

# ========== 按包查看覆蓋率 ==========
go test ./internal/controllers -coverprofile=controllers.out
go tool cover -func=controllers.out

8.5 性能測試與基準測試

8.5.1 Benchmark 測試

創建: internal/controllers/reconcile_bench_test.go

package controllers_test

import (
    "context"
    "testing"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client/fake"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    "github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
)

func BenchmarkReconcile(b *testing.B) {
    // 準備測試數據
    otelCol := &v1beta1.OpenTelemetryCollector{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "benchmark-test",
            Namespace: "default",
        },
        Spec: v1beta1.OpenTelemetryCollectorSpec{
            Mode: v1alpha1.ModeDeployment,
            Config: v1beta1.Config{
                // ... 配置
            },
        },
    }

    k8sClient := fake.NewClientBuilder().
        WithObjects(otelCol).
        Build()

    reconciler := &controllers.OpenTelemetryCollectorReconciler{
        Client: k8sClient,
        // ... 其他字段
    }

    req := reconcile.Request{
        NamespacedName: types.NamespacedName{
            Name:      "benchmark-test",
            Namespace: "default",
        },
    }

    ctx := context.Background()

    // 運行 benchmark
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _, err := reconciler.Reconcile(ctx, req)
        if err != nil {
            b.Fatal(err)
        }
    }
}

運行 Benchmark

# 運行 benchmark
go test -bench=. ./internal/controllers

# 輸出範例:
# BenchmarkReconcile-8           1000       1234567 ns/op

# 帶內存分析
go test -bench=. -benchmem ./internal/controllers

# 輸出範例:
# BenchmarkReconcile-8           1000       1234567 ns/op      123456 B/op        1234 allocs/op

# 生成 CPU profile
go test -bench=. -cpuprofile=cpu.prof ./internal/controllers

# 分析 profile
go tool pprof cpu.prof

九、實戰專案:Nginx Operator

在本章中,我們將從零開始實作一個完整的 Nginx Operator,應用前面學到的所有概念。

9.1 專案目標與架構

9.1.1 功能需求

我們的 Nginx Operator 需要支援:

  1. 自動部署 Nginx:根據 CR 創建 Deployment

  2. 配置管理:支援自定義 nginx.conf

  3. 服務暴露:自動創建 Service

  4. 配置熱更新:ConfigMap 變更後自動觸發滾動更新

  5. 版本管理:支援 Nginx 版本升級

  6. 健康檢查:配置 liveness 和 readiness probe

9.1.2 CRD 設計

API 定義 (apis/v1alpha1/nginx_types.go):

package v1alpha1

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NginxSpec defines the desired state of Nginx
type NginxSpec struct {
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=10
    // +kubebuilder:default=1
    Replicas *int32 `json:"replicas,omitempty"`

    // +kubebuilder:validation:Pattern=`^nginx:.*$`
    // +kubebuilder:default="nginx:1.25"
    Image string `json:"image,omitempty"`

    // +optional
    Resources corev1.ResourceRequirements `json:"resources,omitempty"`

    // +optional
    Config string `json:"config,omitempty"`

    // +optional
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=65535
    // +kubebuilder:default=80
    Port int32 `json:"port,omitempty"`

    // +optional
    // +kubebuilder:validation:Enum=ClusterIP;NodePort;LoadBalancer
    // +kubebuilder:default=ClusterIP
    ServiceType corev1.ServiceType `json:"serviceType,omitempty"`
}

// NginxStatus defines the observed state of Nginx
type NginxStatus struct {
    // Conditions represent the latest available observations of the Nginx state
    // +optional
    Conditions []metav1.Condition `json:"conditions,omitempty"`

    // ReadyReplicas is the number of ready replicas
    // +optional
    ReadyReplicas int32 `json:"readyReplicas,omitempty"`

    // ConfigVersion tracks the version of the ConfigMap
    // +optional
    ConfigVersion string `json:"configVersion,omitempty"`

    // LastUpdateTime is the last time the status was updated
    // +optional
    LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.readyReplicas
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Image",type=string,JSONPath=`.spec.image`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`

// Nginx is the Schema for the nginxes API
type Nginx struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   NginxSpec   `json:"spec,omitempty"`
    Status NginxStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// NginxList contains a list of Nginx
type NginxList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []Nginx `json:"items"`
}

func init() {
    SchemeBuilder.Register(&Nginx{}, &NginxList{})
}

關鍵註解說明

  • +kubebuilder:subresource:status:啟用 status subresource

  • +kubebuilder:subresource:scale:支援 kubectl scale 命令

  • +kubebuilder:printcolumn:自定義 kubectl get 輸出列

  • +kubebuilder:validation:欄位驗證規則

9.2 Controller 實作

9.2.1 Reconciler 主邏輯

Controller (internal/controller/nginx_controller.go):

package controller

import (
    "context"
    "crypto/sha256"
    "fmt"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    "sigs.k8s.io/controller-runtime/pkg/log"

    webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)

const (
    nginxFinalizer = "nginx.webapp.example.com/finalizer"

    // Condition types
    TypeAvailable   = "Available"
    TypeProgressing = "Progressing"
    TypeDegraded    = "Degraded"
)

// NginxReconciler reconciles a Nginx object
type NginxReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services;configmaps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)
    logger.Info("Reconciling Nginx", "namespace", req.Namespace, "name", req.Name)

    // 1. 獲取 Nginx 資源
    nginx := &webappv1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Nginx resource not found, likely deleted")
            return ctrl.Result{}, nil
        }
        logger.Error(err, "Failed to get Nginx")
        return ctrl.Result{}, err
    }

    // 2. 處理刪除邏輯 (Finalizer)
    if !nginx.ObjectMeta.DeletionTimestamp.IsZero() {
        return r.reconcileDelete(ctx, nginx)
    }

    // 3. 添加 Finalizer
    if !controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
        controllerutil.AddFinalizer(nginx, nginxFinalizer)
        if err := r.Update(ctx, nginx); err != nil {
            logger.Error(err, "Failed to add finalizer")
            return ctrl.Result{}, err
        }
        return ctrl.Result{Requeue: true}, nil
    }

    // 4. Reconcile ConfigMap
    configMap, configVersion, err := r.reconcileConfigMap(ctx, nginx)
    if err != nil {
        r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "ConfigMapFailed", err.Error())
        _ = r.Status().Update(ctx, nginx)
        return ctrl.Result{}, err
    }

    // 5. Reconcile Deployment
    if err := r.reconcileDeployment(ctx, nginx, configVersion); err != nil {
        r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "DeploymentFailed", err.Error())
        _ = r.Status().Update(ctx, nginx)
        return ctrl.Result{}, err
    }

    // 6. Reconcile Service
    if err := r.reconcileService(ctx, nginx); err != nil {
        r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "ServiceFailed", err.Error())
        _ = r.Status().Update(ctx, nginx)
        return ctrl.Result{}, err
    }

    // 7. 更新 Status
    if err := r.updateStatus(ctx, nginx, configVersion); err != nil {
        logger.Error(err, "Failed to update status")
        return ctrl.Result{}, err
    }

    // 8. 設置成功 Condition
    r.setCondition(nginx, TypeAvailable, metav1.ConditionTrue, "ReconcileSuccess", "Nginx is available")
    r.setCondition(nginx, TypeDegraded, metav1.ConditionFalse, "ReconcileSuccess", "")
    if err := r.Status().Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }

    logger.Info("Successfully reconciled Nginx")
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

// reconcileConfigMap 管理 ConfigMap
func (r *NginxReconciler) reconcileConfigMap(ctx context.Context, nginx *webappv1alpha1.Nginx) (*corev1.ConfigMap, string, error) {
    logger := log.FromContext(ctx)

    // 默認 nginx 配置
    defaultConfig := `
events {
    worker_connections 1024;
}

http {
    server {
        listen 80;
        location / {
            root /usr/share/nginx/html;
            index index.html;
        }
    }
}`

    config := nginx.Spec.Config
    if config == "" {
        config = defaultConfig
    }

    // 計算配置的版本(hash)
    hash := sha256.Sum256([]byte(config))
    configVersion := fmt.Sprintf("%x", hash[:8])

    configMap := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      nginx.Name + "-config",
            Namespace: nginx.Namespace,
            Labels: map[string]string{
                "app":     "nginx",
                "nginx":   nginx.Name,
                "version": configVersion,
            },
        },
        Data: map[string]string{
            "nginx.conf": config,
        },
    }

    // 設置 Owner Reference
    if err := controllerutil.SetControllerReference(nginx, configMap, r.Scheme); err != nil {
        return nil, "", err
    }

    // 創建或更新 ConfigMap
    existing := &corev1.ConfigMap{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      configMap.Name,
        Namespace: configMap.Namespace,
    }, existing)

    if err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Creating ConfigMap", "name", configMap.Name)
            if err := r.Create(ctx, configMap); err != nil {
                return nil, "", err
            }
            return configMap, configVersion, nil
        }
        return nil, "", err
    }

    // 更新 ConfigMap
    existing.Data = configMap.Data
    existing.Labels = configMap.Labels
    logger.Info("Updating ConfigMap", "name", configMap.Name)
    if err := r.Update(ctx, existing); err != nil {
        return nil, "", err
    }

    return existing, configVersion, nil
}

// reconcileDeployment 管理 Deployment
func (r *NginxReconciler) reconcileDeployment(ctx context.Context, nginx *webappv1alpha1.Nginx, configVersion string) error {
    logger := log.FromContext(ctx)

    replicas := int32(1)
    if nginx.Spec.Replicas != nil {
        replicas = *nginx.Spec.Replicas
    }

    image := "nginx:1.25"
    if nginx.Spec.Image != "" {
        image = nginx.Spec.Image
    }

    port := int32(80)
    if nginx.Spec.Port != 0 {
        port = nginx.Spec.Port
    }

    deployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      nginx.Name,
            Namespace: nginx.Namespace,
            Labels: map[string]string{
                "app":   "nginx",
                "nginx": nginx.Name,
            },
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{
                    "app":   "nginx",
                    "nginx": nginx.Name,
                },
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{
                        "app":           "nginx",
                        "nginx":         nginx.Name,
                        "config-version": configVersion,  // 配置版本變更時觸發滾動更新
                    },
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{
                        {
                            Name:  "nginx",
                            Image: image,
                            Ports: []corev1.ContainerPort{
                                {
                                    Name:          "http",
                                    ContainerPort: port,
                                    Protocol:      corev1.ProtocolTCP,
                                },
                            },
                            VolumeMounts: []corev1.VolumeMount{
                                {
                                    Name:      "config",
                                    MountPath: "/etc/nginx/nginx.conf",
                                    SubPath:   "nginx.conf",
                                },
                            },
                            Resources: nginx.Spec.Resources,
                            LivenessProbe: &corev1.Probe{
                                ProbeHandler: corev1.ProbeHandler{
                                    HTTPGet: &corev1.HTTPGetAction{
                                        Path: "/",
                                        Port: intstr.FromInt(int(port)),
                                    },
                                },
                                InitialDelaySeconds: 10,
                                PeriodSeconds:       10,
                            },
                            ReadinessProbe: &corev1.Probe{
                                ProbeHandler: corev1.ProbeHandler{
                                    HTTPGet: &corev1.HTTPGetAction{
                                        Path: "/",
                                        Port: intstr.FromInt(int(port)),
                                    },
                                },
                                InitialDelaySeconds: 5,
                                PeriodSeconds:       5,
                            },
                        },
                    },
                    Volumes: []corev1.Volume{
                        {
                            Name: "config",
                            VolumeSource: corev1.VolumeSource{
                                ConfigMap: &corev1.ConfigMapVolumeSource{
                                    LocalObjectReference: corev1.LocalObjectReference{
                                        Name: nginx.Name + "-config",
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
    }

    // 設置 Owner Reference
    if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
        return err
    }

    // 創建或更新 Deployment
    existing := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      deployment.Name,
        Namespace: deployment.Namespace,
    }, existing)

    if err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Creating Deployment", "name", deployment.Name)
            return r.Create(ctx, deployment)
        }
        return err
    }

    // 更新 Deployment
    existing.Spec = deployment.Spec
    logger.Info("Updating Deployment", "name", deployment.Name)
    return r.Update(ctx, existing)
}

// reconcileService 管理 Service
func (r *NginxReconciler) reconcileService(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    logger := log.FromContext(ctx)

    port := int32(80)
    if nginx.Spec.Port != 0 {
        port = nginx.Spec.Port
    }

    serviceType := corev1.ServiceTypeClusterIP
    if nginx.Spec.ServiceType != "" {
        serviceType = nginx.Spec.ServiceType
    }

    service := &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      nginx.Name,
            Namespace: nginx.Namespace,
            Labels: map[string]string{
                "app":   "nginx",
                "nginx": nginx.Name,
            },
        },
        Spec: corev1.ServiceSpec{
            Type: serviceType,
            Selector: map[string]string{
                "app":   "nginx",
                "nginx": nginx.Name,
            },
            Ports: []corev1.ServicePort{
                {
                    Name:       "http",
                    Protocol:   corev1.ProtocolTCP,
                    Port:       port,
                    TargetPort: intstr.FromInt(int(port)),
                },
            },
        },
    }

    // 設置 Owner Reference
    if err := controllerutil.SetControllerReference(nginx, service, r.Scheme); err != nil {
        return err
    }

    // 創建或更新 Service
    existing := &corev1.Service{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      service.Name,
        Namespace: service.Namespace,
    }, existing)

    if err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Creating Service", "name", service.Name)
            return r.Create(ctx, service)
        }
        return err
    }

    // 更新 Service(保留 ClusterIP)
    existing.Spec.Ports = service.Spec.Ports
    existing.Spec.Selector = service.Spec.Selector
    existing.Spec.Type = service.Spec.Type
    logger.Info("Updating Service", "name", service.Name)
    return r.Update(ctx, existing)
}

// updateStatus 更新 Nginx 狀態
func (r *NginxReconciler) updateStatus(ctx context.Context, nginx *webappv1alpha1.Nginx, configVersion string) error {
    // 獲取 Deployment 狀態
    deployment := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      nginx.Name,
        Namespace: nginx.Namespace,
    }, deployment)
    if err != nil {
        return err
    }

    // 更新狀態
    nginx.Status.ReadyReplicas = deployment.Status.ReadyReplicas
    nginx.Status.ConfigVersion = configVersion
    now := metav1.Now()
    nginx.Status.LastUpdateTime = &now

    // 設置 Progressing condition
    if deployment.Status.UpdatedReplicas < *deployment.Spec.Replicas {
        r.setCondition(nginx, TypeProgressing, metav1.ConditionTrue, "Updating", "Deployment is being updated")
    } else {
        r.setCondition(nginx, TypeProgressing, metav1.ConditionFalse, "Updated", "All replicas are updated")
    }

    return r.Status().Update(ctx, nginx)
}

// reconcileDelete 處理刪除邏輯
func (r *NginxReconciler) reconcileDelete(ctx context.Context, nginx *webappv1alpha1.Nginx) (ctrl.Result, error) {
    logger := log.FromContext(ctx)
    logger.Info("Deleting Nginx", "name", nginx.Name)

    if controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
        // 執行清理邏輯(例如清理外部資源)
        logger.Info("Performing cleanup for Nginx", "name", nginx.Name)

        // 移除 finalizer
        controllerutil.RemoveFinalizer(nginx, nginxFinalizer)
        if err := r.Update(ctx, nginx); err != nil {
            return ctrl.Result{}, err
        }
    }

    return ctrl.Result{}, nil
}

// setCondition 設置 Condition
func (r *NginxReconciler) setCondition(nginx *webappv1alpha1.Nginx, condType string, status metav1.ConditionStatus, reason, message string) {
    condition := metav1.Condition{
        Type:               condType,
        Status:             status,
        Reason:             reason,
        Message:            message,
        LastTransitionTime: metav1.Now(),
        ObservedGeneration: nginx.Generation,
    }
    meta.SetStatusCondition(&nginx.Status.Conditions, condition)
}

// SetupWithManager sets up the controller with the Manager.
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        Owns(&appsv1.Deployment{}).
        Owns(&corev1.Service{}).
        Owns(&corev1.ConfigMap{}).
        Complete(r)
}

9.2.2 核心概念詳解

1. ConfigMap 版本管理

// 計算配置的 SHA256 hash 作為版本號
hash := sha256.Sum256([]byte(config))
configVersion := fmt.Sprintf("%x", hash[:8])

// 將版本號添加到 Pod labels
Labels: map[string]string{
    "config-version": configVersion,  // 版本變更時自動觸發 rolling update
}

2. Owner Reference 自動清理

// 設置 Owner Reference 後,刪除 Nginx 時會自動刪除子資源
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
    return err
}

3. Finalizer 清理邏輯

// 添加 finalizer 防止資源被立即刪除
if !controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
    controllerutil.AddFinalizer(nginx, nginxFinalizer)
    if err := r.Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }
}

// 刪除時執行清理
if !nginx.ObjectMeta.DeletionTimestamp.IsZero() {
    // 執行清理邏輯
    // ...

    // 移除 finalizer 允許刪除
    controllerutil.RemoveFinalizer(nginx, nginxFinalizer)
    if err := r.Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }
}

9.3 使用範例

9.3.1 部署 Operator

# 1. 生成 CRD 和 RBAC manifests
make manifests

# 2. 安裝 CRD
make install

# 3. 運行 operator(開發模式)
make run

# 或者部署到集群
make docker-build docker-push IMG=your-registry/nginx-operator:v1.0.0
make deploy IMG=your-registry/nginx-operator:v1.0.0

9.3.2 創建 Nginx 實例

基本範例 (config/samples/nginx_basic.yaml):

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-sample
  namespace: default
spec:
  replicas: 3
  image: nginx:1.25
  port: 80
  serviceType: ClusterIP

自定義配置範例 (config/samples/nginx_custom.yaml):

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-custom
  namespace: default
spec:
  replicas: 2
  image: nginx:1.25
  port: 8080
  serviceType: LoadBalancer
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "256Mi"
      cpu: "200m"
  config: |
    events {
        worker_connections 2048;
    }

    http {
        log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for"';

        access_log /var/log/nginx/access.log main;

        upstream backend {
            server backend-1.example.com;
            server backend-2.example.com;
        }

        server {
            listen 8080;

            location / {
                proxy_pass http://backend;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
            }

            location /health {
                access_log off;
                return 200 "healthy\n";
            }
        }
    }

9.3.3 測試和驗證

# 1. 創建 Nginx 實例
kubectl apply -f config/samples/nginx_basic.yaml

# 2. 查看 Nginx 狀態
kubectl get nginx nginx-sample

# 輸出:
# NAME           REPLICAS   READY   IMAGE         AGE
# nginx-sample   3          3       nginx:1.25    2m

# 3. 查看詳細狀態
kubectl describe nginx nginx-sample

# 4. 查看生成的資源
kubectl get deployment,svc,cm -l nginx=nginx-sample

# 5. 測試服務
kubectl port-forward svc/nginx-sample 8080:80
curl http://localhost:8080

# 6. 更新配置(觸發滾動更新)
kubectl edit nginx nginx-sample
# 修改 spec.config 或 spec.replicas

# 7. 觀察滾動更新
kubectl rollout status deployment/nginx-sample

# 8. 查看 Conditions
kubectl get nginx nginx-sample -o jsonpath='{.status.conditions}' | jq

# 9. Scale 測試
kubectl scale nginx nginx-sample --replicas=5
kubectl get nginx nginx-sample

# 10. 刪除測試
kubectl delete nginx nginx-sample
# 驗證子資源被自動清理
kubectl get deployment,svc,cm -l nginx=nginx-sample

9.4 進階功能實作

9.4.1 支援 Ingress 自動創建

擴展 CRD

type NginxSpec struct {
    // ... 現有字段

    // +optional
    Ingress *IngressSpec `json:"ingress,omitempty"`
}

type IngressSpec struct {
    // +kubebuilder:validation:Required
    Host string `json:"host"`

    // +optional
    TLS bool `json:"tls,omitempty"`

    // +optional
    Annotations map[string]string `json:"annotations,omitempty"`
}

Controller 添加 Ingress reconcile

func (r *NginxReconciler) reconcileIngress(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    if nginx.Spec.Ingress == nil {
        return nil
    }

    pathType := networkingv1.PathTypePrefix
    ingress := &networkingv1.Ingress{
        ObjectMeta: metav1.ObjectMeta{
            Name:        nginx.Name,
            Namespace:   nginx.Namespace,
            Annotations: nginx.Spec.Ingress.Annotations,
        },
        Spec: networkingv1.IngressSpec{
            Rules: []networkingv1.IngressRule{
                {
                    Host: nginx.Spec.Ingress.Host,
                    IngressRuleValue: networkingv1.IngressRuleValue{
                        HTTP: &networkingv1.HTTPIngressRuleValue{
                            Paths: []networkingv1.HTTPIngressPath{
                                {
                                    Path:     "/",
                                    PathType: &pathType,
                                    Backend: networkingv1.IngressBackend{
                                        Service: &networkingv1.IngressServiceBackend{
                                            Name: nginx.Name,
                                            Port: networkingv1.ServiceBackendPort{
                                                Number: nginx.Spec.Port,
                                            },
                                        },
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
    }

    if nginx.Spec.Ingress.TLS {
        ingress.Spec.TLS = []networkingv1.IngressTLS{
            {
                Hosts:      []string{nginx.Spec.Ingress.Host},
                SecretName: nginx.Name + "-tls",
            },
        }
    }

    if err := controllerutil.SetControllerReference(nginx, ingress, r.Scheme); err != nil {
        return err
    }

    // 創建或更新邏輯...
    return nil
}

9.4.2 支援多端口暴露

擴展 CRD

type PortSpec struct {
    // +kubebuilder:validation:Required
    Name string `json:"name"`

    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=65535
    Port int32 `json:"port"`

    // +optional
    // +kubebuilder:default="TCP"
    Protocol corev1.Protocol `json:"protocol,omitempty"`
}

type NginxSpec struct {
    // ... 現有字段

    // +optional
    AdditionalPorts []PortSpec `json:"additionalPorts,omitempty"`
}

Controller 處理多端口

func (r *NginxReconciler) buildServicePorts(nginx *webappv1alpha1.Nginx) []corev1.ServicePort {
    ports := []corev1.ServicePort{
        {
            Name:       "http",
            Port:       nginx.Spec.Port,
            TargetPort: intstr.FromInt(int(nginx.Spec.Port)),
            Protocol:   corev1.ProtocolTCP,
        },
    }

    for _, p := range nginx.Spec.AdditionalPorts {
        ports = append(ports, corev1.ServicePort{
            Name:       p.Name,
            Port:       p.Port,
            TargetPort: intstr.FromInt(int(p.Port)),
            Protocol:   p.Protocol,
        })
    }

    return ports
}

9.5 完整測試範例

9.5.1 單元測試

Controller 測試 (internal/controller/nginx_controller_test.go):

package controller

import (
    "context"
    "time"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"

    webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)

var _ = Describe("Nginx Controller", func() {
    const (
        NginxName      = "test-nginx"
        NginxNamespace = "default"
        timeout        = time.Second * 10
        interval       = time.Millisecond * 250
    )

    Context("When creating a Nginx resource", func() {
        It("Should create Deployment, Service, and ConfigMap", func() {
            ctx := context.Background()

            nginx := &webappv1alpha1.Nginx{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                },
                Spec: webappv1alpha1.NginxSpec{
                    Replicas: pointer.Int32(2),
                    Image:    "nginx:1.25",
                    Port:     80,
                },
            }

            Expect(k8sClient.Create(ctx, nginx)).Should(Succeed())

            // 驗證 Deployment 被創建
            deployment := &appsv1.Deployment{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, deployment)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            Expect(*deployment.Spec.Replicas).Should(Equal(int32(2)))
            Expect(deployment.Spec.Template.Spec.Containers[0].Image).Should(Equal("nginx:1.25"))

            // 驗證 Service 被創建
            service := &corev1.Service{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, service)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            Expect(service.Spec.Ports[0].Port).Should(Equal(int32(80)))

            // 驗證 ConfigMap 被創建
            configMap := &corev1.ConfigMap{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName + "-config",
                    Namespace: NginxNamespace,
                }, configMap)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            Expect(configMap.Data).Should(HaveKey("nginx.conf"))
        })
    })

    Context("When updating Nginx config", func() {
        It("Should trigger rolling update", func() {
            ctx := context.Background()

            // 獲取當前 Nginx
            nginx := &webappv1alpha1.Nginx{}
            Expect(k8sClient.Get(ctx, types.NamespacedName{
                Name:      NginxName,
                Namespace: NginxNamespace,
            }, nginx)).Should(Succeed())

            // 更新配置
            nginx.Spec.Config = "# updated config"
            Expect(k8sClient.Update(ctx, nginx)).Should(Succeed())

            // 驗證 ConfigMap 被更新
            configMap := &corev1.ConfigMap{}
            Eventually(func() string {
                k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName + "-config",
                    Namespace: NginxNamespace,
                }, configMap)
                return configMap.Data["nginx.conf"]
            }, timeout, interval).Should(ContainSubstring("updated config"))

            // 驗證 Deployment Pod template 的 config-version label 改變
            deployment := &appsv1.Deployment{}
            oldVersion := ""
            Eventually(func() bool {
                k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, deployment)
                newVersion := deployment.Spec.Template.Labels["config-version"]
                if oldVersion == "" {
                    oldVersion = newVersion
                    return false
                }
                return newVersion != oldVersion
            }, timeout, interval).Should(BeTrue())
        })
    })

    Context("When deleting Nginx", func() {
        It("Should cleanup all resources", func() {
            ctx := context.Background()

            nginx := &webappv1alpha1.Nginx{}
            Expect(k8sClient.Get(ctx, types.NamespacedName{
                Name:      NginxName,
                Namespace: NginxNamespace,
            }, nginx)).Should(Succeed())

            Expect(k8sClient.Delete(ctx, nginx)).Should(Succeed())

            // 驗證所有子資源被刪除
            Eventually(func() bool {
                deployment := &appsv1.Deployment{}
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, deployment)
                return errors.IsNotFound(err)
            }, timeout, interval).Should(BeTrue())
        })
    })
})

9.5.2 E2E 測試(Chainsaw)

測試套件 (tests/e2e/nginx/chainsaw-test.yaml):

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: nginx-e2e
spec:
  timeouts:
    apply: 30s
    assert: 1m
    cleanup: 30s
  steps:
    - name: Create Nginx instance
      try:
        - apply:
            file: 00-nginx-sample.yaml
        - assert:
            file: 01-assert-nginx-ready.yaml

    - name: Test service connectivity
      try:
        - script:
            content: |
              kubectl run curl-test --image=curlimages/curl:latest --rm -i --restart=Never -- \
                curl -s http://nginx-sample.default.svc.cluster.local
            check:
              ($error == null): true

    - name: Update configuration
      try:
        - apply:
            file: 02-update-config.yaml
        - assert:
            file: 03-assert-rolling-update.yaml

    - name: Scale up
      try:
        - script:
            content: |
              kubectl scale nginx nginx-sample --replicas=5
        - assert:
            file: 04-assert-scaled.yaml

    - name: Cleanup
      try:
        - delete:
            ref:
              apiVersion: webapp.example.com/v1alpha1
              kind: Nginx
              name: nginx-sample
        - assert:
            file: 05-assert-deleted.yaml

00-nginx-sample.yaml

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-sample
  namespace: default
spec:
  replicas: 3
  image: nginx:1.25
  port: 80
  serviceType: ClusterIP

01-assert-nginx-ready.yaml

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-sample
  namespace: default
status:
  readyReplicas: 3
  conditions:
    - type: Available
      status: "True"
      reason: ReconcileSuccess
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-sample
  namespace: default
spec:
  replicas: 3
status:
  readyReplicas: 3
  updatedReplicas: 3
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-sample
  namespace: default
spec:
  type: ClusterIP
  ports:
    - port: 80

9.6 部署到生產環境

9.6.1 Kustomize 配置

Base (config/default/kustomization.yaml):

namePrefix: nginx-operator-
namespace: nginx-operator-system

resources:
  - ../crd
  - ../rbac
  - ../manager

images:
  - name: controller
    newName: your-registry/nginx-operator
    newTag: v1.0.0

Overlay for Production (config/overlays/production/kustomization.yaml):

bases:
  - ../../default

patchesStrategicMerge:
  - manager_resources.yaml

configMapGenerator:
  - name: manager-config
    literals:
      - LOG_LEVEL=info
      - ENABLE_WEBHOOKS=true

manager_resources.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: manager
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

9.6.2 部署命令

# 構建鏡像
make docker-build IMG=your-registry/nginx-operator:v1.0.0

# 推送鏡像
make docker-push IMG=your-registry/nginx-operator:v1.0.0

# 部署到生產環境
kubectl apply -k config/overlays/production

# 驗證部署
kubectl get deployment -n nginx-operator-system
kubectl get pods -n nginx-operator-system

# 查看日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager -f

十、進階主題與最佳實踐

10.1 性能優化

10.1.1 使用 Field Indexer 加速查詢

當需要頻繁根據特定字段查詢資源時,Field Indexer 可以大幅提升性能。

範例:根據 Owner 快速查詢 Pod

參考 OpenTelemetry Operator 的實作 (main.go:126-139):

// 為 Pod 添加 owner.name 索引
if err := mgr.GetFieldIndexer().IndexField(
    context.Background(),
    &corev1.Pod{},
    "metadata.ownerReferences.name",
    func(rawObj client.Object) []string {
        pod := rawObj.(*corev1.Pod)
        var owners []string
        for _, ref := range pod.GetOwnerReferences() {
            owners = append(owners, ref.Name)
        }
        return owners
    },
); err != nil {
    setupLog.Error(err, "failed to create pod index")
    os.Exit(1)
}

// 使用索引快速查詢
pods := &corev1.PodList{}
err := r.List(ctx, pods, client.MatchingFields{
    "metadata.ownerReferences.name": collectorName,
})

更多索引範例

// 1. 索引 ConfigMap 的標籤
mgr.GetFieldIndexer().IndexField(
    context.Background(),
    &corev1.ConfigMap{},
    "metadata.labels.app",
    func(obj client.Object) []string {
        cm := obj.(*corev1.ConfigMap)
        if app, ok := cm.Labels["app"]; ok {
            return []string{app}
        }
        return nil
    },
)

// 查詢
configMaps := &corev1.ConfigMapList{}
r.List(ctx, configMaps, client.MatchingFields{
    "metadata.labels.app": "nginx",
})

// 2. 索引 Pod 的 Node
mgr.GetFieldIndexer().IndexField(
    context.Background(),
    &corev1.Pod{},
    "spec.nodeName",
    func(obj client.Object) []string {
        pod := obj.(*corev1.Pod)
        if pod.Spec.NodeName != "" {
            return []string{pod.Spec.NodeName}
        }
        return nil
    },
)

// 查詢特定 Node 上的 Pod
pods := &corev1.PodList{}
r.List(ctx, pods, client.MatchingFields{
    "spec.nodeName": "node-1",
})

10.1.2 使用 Cache 減少 API 調用

Controller Runtime 自動提供 cache,但需要正確配置:

// main.go - 配置 Cache
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,
    Cache: cache.Options{
        // 限制 cache 的 namespace(多租戶場景)
        Namespaces: []string{"namespace1", "namespace2"},

        // 配置同步週期
        SyncPeriod: pointer.Duration(10 * time.Minute),
    },
    // 配置 metrics 和 health 端點
    Metrics: server.Options{
        BindAddress: ":8080",
    },
    HealthProbeBindAddress: ":8081",
})

// Controller 中使用 cache(自動)
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // r.Get 自動從 cache 讀取
    nginx := &v1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        return ctrl.Result{}, err
    }

    // r.List 也從 cache 讀取
    deployments := &appsv1.DeploymentList{}
    if err := r.List(ctx, deployments, client.InNamespace(req.Namespace)); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}

10.1.3 優化 Reconcile 邏輯

1. 使用 Predicate 過濾不必要的事件

import (
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        Owns(&appsv1.Deployment{}).
        Owns(&corev1.Service{}).
        WithEventFilter(predicate.Funcs{
            // 忽略 status-only 更新
            UpdateFunc: func(e event.UpdateEvent) bool {
                oldObj := e.ObjectOld.(*webappv1alpha1.Nginx)
                newObj := e.ObjectNew.(*webappv1alpha1.Nginx)

                // 只在 spec 或 metadata 改變時 reconcile
                return oldObj.Generation != newObj.Generation ||
                       !reflect.DeepEqual(oldObj.Labels, newObj.Labels) ||
                       !reflect.DeepEqual(oldObj.Annotations, newObj.Annotations)
            },
            // 忽略刪除事件(由 finalizer 處理)
            DeleteFunc: func(e event.DeleteEvent) bool {
                return false
            },
        }).
        Complete(r)
}

2. 批量處理更新

func (r *NginxReconciler) reconcileMultipleResources(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    // 並行處理多個資源
    var wg sync.WaitGroup
    errCh := make(chan error, 3)

    wg.Add(3)

    // ConfigMap
    go func() {
        defer wg.Done()
        if _, _, err := r.reconcileConfigMap(ctx, nginx); err != nil {
            errCh <- fmt.Errorf("configmap: %w", err)
        }
    }()

    // Deployment
    go func() {
        defer wg.Done()
        if err := r.reconcileDeployment(ctx, nginx, ""); err != nil {
            errCh <- fmt.Errorf("deployment: %w", err)
        }
    }()

    // Service
    go func() {
        defer wg.Done()
        if err := r.reconcileService(ctx, nginx); err != nil {
            errCh <- fmt.Errorf("service: %w", err)
        }
    }()

    wg.Wait()
    close(errCh)

    // 收集錯誤
    for err := range errCh {
        if err != nil {
            return err
        }
    }

    return nil
}

3. 使用 Generation 判斷資源變更

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    nginx := &webappv1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 如果 ObservedGeneration 和 Generation 相同,說明 spec 沒變
    if nginx.Status.ObservedGeneration == nginx.Generation {
        // 只檢查子資源狀態,不重新創建
        return r.reconcileStatus(ctx, nginx)
    }

    // spec 改變了,執行完整 reconcile
    return r.fullReconcile(ctx, nginx)
}

10.1.4 限制並發和 Rate Limiting

// main.go - 配置 Controller 選項
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        // ...
    })

    if err = (&controller.NginxReconciler{
        Client: mgr.GetClient(),
        Scheme: mgr.GetScheme(),
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller")
        os.Exit(1)
    }
}

// SetupWithManager - 配置並發和限流
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        WithOptions(controller.Options{
            // 最大並發 reconcile 數
            MaxConcurrentReconciles: 3,

            // Rate Limiter
            RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
                5*time.Millisecond,  // base delay
                1000*time.Second,    // max delay
            ),
        }).
        Complete(r)
}

10.2 安全最佳實踐

10.2.1 RBAC 最小權限原則

僅授予必要權限

// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services;configmaps,verbs=get;list;watch;create;update;patch;delete

// 不要使用通配符
// 錯誤示範:
// +kubebuilder:rbac:groups=*,resources=*,verbs=*

使用 ServiceAccount 隔離

# config/rbac/service_account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nginx-operator
  namespace: nginx-operator-system
---
# config/rbac/role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nginx-operator-manager-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nginx-operator-manager-role
subjects:
  - kind: ServiceAccount
    name: nginx-operator
    namespace: nginx-operator-system

10.2.2 Webhook 安全

配置 TLS 和證書管理

# config/certmanager/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nginx-operator-serving-cert
  namespace: nginx-operator-system
spec:
  dnsNames:
    - nginx-operator-webhook-service.nginx-operator-system.svc
    - nginx-operator-webhook-service.nginx-operator-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: nginx-operator-selfsigned-issuer
  secretName: webhook-server-cert
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: nginx-operator-selfsigned-issuer
  namespace: nginx-operator-system
spec:
  selfSigned: {}

Validating Webhook 範例

// +kubebuilder:webhook:path=/validate-webapp-example-com-v1alpha1-nginx,mutating=false,failurePolicy=fail,groups=webapp.example.com,resources=nginxes,verbs=create;update,versions=v1alpha1,name=vnginx.kb.io,sideEffects=None,admissionReviewVersions=v1

func (r *Nginx) ValidateCreate() (admission.Warnings, error) {
    // 驗證 replicas 範圍
    if r.Spec.Replicas != nil && (*r.Spec.Replicas < 1 || *r.Spec.Replicas > 10) {
        return nil, fmt.Errorf("replicas must be between 1 and 10")
    }

    // 驗證 image 格式
    if !strings.HasPrefix(r.Spec.Image, "nginx:") {
        return nil, fmt.Errorf("image must start with 'nginx:'")
    }

    // 驗證 config 語法(可選)
    if r.Spec.Config != "" {
        if err := validateNginxConfig(r.Spec.Config); err != nil {
            return nil, fmt.Errorf("invalid nginx config: %w", err)
        }
    }

    return nil, nil
}

func (r *Nginx) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
    oldNginx := old.(*Nginx)

    // 防止降級(可選規則)
    if r.Spec.Image != "" && oldNginx.Spec.Image != "" {
        oldVersion := extractVersion(oldNginx.Spec.Image)
        newVersion := extractVersion(r.Spec.Image)
        if newVersion < oldVersion {
            return admission.Warnings{"Downgrading nginx version may cause issues"}, nil
        }
    }

    return r.ValidateCreate()
}

10.2.3 Secret 管理

從 Secret 讀取敏感配置

func (r *NginxReconciler) reconcileDeployment(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    // 從 Secret 掛載 TLS 證書
    volumes := []corev1.Volume{
        {
            Name: "config",
            VolumeSource: corev1.VolumeSource{
                ConfigMap: &corev1.ConfigMapVolumeSource{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: nginx.Name + "-config",
                    },
                },
            },
        },
    }

    volumeMounts := []corev1.VolumeMount{
        {
            Name:      "config",
            MountPath: "/etc/nginx/nginx.conf",
            SubPath:   "nginx.conf",
        },
    }

    // 如果配置了 TLS
    if nginx.Spec.TLS != nil {
        volumes = append(volumes, corev1.Volume{
            Name: "tls",
            VolumeSource: corev1.VolumeSource{
                Secret: &corev1.SecretVolumeSource{
                    SecretName: nginx.Spec.TLS.SecretName,
                },
            },
        })

        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      "tls",
            MountPath: "/etc/nginx/tls",
            ReadOnly:  true,
        })
    }

    // 使用在 Deployment spec 中...
}

加密 etcd 中的 Secret(集群級配置):

# /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --encryption-provider-config=/etc/kubernetes/encryption-config.yaml
      volumeMounts:
        - name: encryption-config
          mountPath: /etc/kubernetes/encryption-config.yaml
          readOnly: true
  volumes:
    - name: encryption-config
      hostPath:
        path: /etc/kubernetes/encryption-config.yaml

10.3 可觀測性

10.3.1 結構化日誌

使用 logr 進行結構化日誌記錄

import (
    "sigs.k8s.io/controller-runtime/pkg/log"
)

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // 基本日誌
    logger.Info("Reconciling Nginx",
        "namespace", req.Namespace,
        "name", req.Name,
    )

    // 帶錯誤的日誌
    if err := r.Get(ctx, req.NamespacedName, &nginx); err != nil {
        logger.Error(err, "Failed to get Nginx",
            "namespace", req.Namespace,
            "name", req.Name,
        )
        return ctrl.Result{}, err
    }

    // Debug 級別日誌
    logger.V(1).Info("Detailed debug info",
        "spec", nginx.Spec,
        "status", nginx.Status,
    )

    // 使用 WithValues 添加上下文
    logger = logger.WithValues(
        "nginx-version", nginx.Spec.Image,
        "replicas", nginx.Spec.Replicas,
    )

    logger.Info("Starting reconciliation")

    return ctrl.Result{}, nil
}

配置日誌級別 (main.go):

import (
    "flag"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
    var logLevel int
    flag.IntVar(&logLevel, "log-level", 0, "Log level (0=info, 1=debug, 2=trace)")

    opts := zap.Options{
        Development: true,
        Level:       zapcore.Level(-logLevel),
    }

    ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

    // ...
}

10.3.2 Metrics 暴露

使用 Prometheus metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
    nginxReconcileTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "nginx_operator_reconcile_total",
            Help: "Total number of reconciliations",
        },
        []string{"namespace", "name", "result"},
    )

    nginxReconcileDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "nginx_operator_reconcile_duration_seconds",
            Help:    "Duration of reconciliations",
            Buckets: prometheus.DefBuckets,
        },
        []string{"namespace", "name"},
    )

    nginxCount = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "nginx_operator_nginx_count",
            Help: "Number of Nginx instances",
        },
        []string{"namespace"},
    )
)

func init() {
    // 註冊 metrics
    metrics.Registry.MustRegister(
        nginxReconcileTotal,
        nginxReconcileDuration,
        nginxCount,
    )
}

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    start := time.Now()
    defer func() {
        duration := time.Since(start).Seconds()
        nginxReconcileDuration.WithLabelValues(
            req.Namespace,
            req.Name,
        ).Observe(duration)
    }()

    nginx := &webappv1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        nginxReconcileTotal.WithLabelValues(
            req.Namespace,
            req.Name,
            "error",
        ).Inc()
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // ... reconcile 邏輯

    nginxReconcileTotal.WithLabelValues(
        req.Namespace,
        req.Name,
        "success",
    ).Inc()

    nginxCount.WithLabelValues(req.Namespace).Set(1)

    return ctrl.Result{}, nil
}

配置 ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-operator-metrics
  namespace: nginx-operator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics

10.3.3 健康檢查

配置 Health 和 Readiness Probes (main.go):

import (
    "sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        HealthProbeBindAddress: ":8081",
        // ...
    })

    // 添加 health check
    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up health check")
        os.Exit(1)
    }

    // 添加 readiness check
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up ready check")
        os.Exit(1)
    }

    // ...
}

Deployment 配置

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-operator-controller-manager
spec:
  template:
    spec:
      containers:
        - name: manager
          image: nginx-operator:latest
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
          livenessProbe:
            httpGet:
              path: /healthz
              port: health
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10

10.4 生產環境部署

10.4.1 高可用性配置

多副本 + Leader Election

// main.go
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        // 啟用 Leader Election
        LeaderElection:          true,
        LeaderElectionID:        "nginx-operator-lock",
        LeaderElectionNamespace: "nginx-operator-system",

        // 配置 lease 參數
        LeaseDuration: pointer.Duration(15 * time.Second),
        RenewDeadline: pointer.Duration(10 * time.Second),
        RetryPeriod:   pointer.Duration(2 * time.Second),
    })

    // ...
}

Deployment 高可用配置

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-operator-controller-manager
spec:
  replicas: 3
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      # Pod 反親和性 - 分散到不同節點
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    control-plane: controller-manager
                topologyKey: kubernetes.io/hostname

      # 優先級
      priorityClassName: system-cluster-critical

      containers:
        - name: manager
          image: nginx-operator:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

          # 安全上下文
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsNonRoot: true
            runAsUser: 65532
            seccompProfile:
              type: RuntimeDefault

10.4.2 資源限制

配置 Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: nginx-operator-quota
  namespace: nginx-operator-system
spec:
  hard:
    requests.cpu: "2"
    requests.memory: "2Gi"
    limits.cpu: "4"
    limits.memory: "4Gi"
    persistentvolumeclaims: "10"

配置 LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: nginx-operator-limitrange
  namespace: nginx-operator-system
spec:
  limits:
    - max:
        cpu: "1"
        memory: "1Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
      default:
        cpu: "200m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container

10.4.3 監控告警

Prometheus 告警規則

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nginx-operator-alerts
  namespace: nginx-operator-system
spec:
  groups:
    - name: nginx-operator
      interval: 30s
      rules:
        - alert: NginxOperatorDown
          expr: up{job="nginx-operator-metrics"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Nginx Operator is down"
            description: "Nginx Operator has been down for more than 5 minutes"

        - alert: NginxOperatorHighReconcileErrors
          expr: rate(nginx_operator_reconcile_total{result="error"}[5m]) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High reconcile error rate"
            description: "Nginx Operator has high reconcile error rate: {{ $value }}"

        - alert: NginxInstanceNotReady
          expr: nginx_operator_nginx_count{} - nginx_operator_nginx_ready{} > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Nginx instance not ready"
            description: "Nginx instance in {{ $labels.namespace }} is not ready for 15 minutes"

10.4.4 備份和恢復

備份 CRD 和 CR

#!/bin/bash
# backup-nginx-operator.sh

BACKUP_DIR="/backup/nginx-operator/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# 備份 CRD
kubectl get crd nginxes.webapp.example.com -o yaml > "$BACKUP_DIR/crd.yaml"

# 備份所有 Nginx CR
kubectl get nginx --all-namespaces -o yaml > "$BACKUP_DIR/nginx-crs.yaml"

# 備份 Operator 配置
kubectl get deployment,service,configmap,secret -n nginx-operator-system -o yaml > "$BACKUP_DIR/operator-resources.yaml"

echo "Backup completed: $BACKUP_DIR"

恢復流程

#!/bin/bash
# restore-nginx-operator.sh

BACKUP_DIR=$1

if [ -z "$BACKUP_DIR" ]; then
    echo "Usage: $0 <backup-directory>"
    exit 1
fi

# 1. 恢復 CRD
kubectl apply -f "$BACKUP_DIR/crd.yaml"

# 2. 恢復 Operator
kubectl apply -f "$BACKUP_DIR/operator-resources.yaml"

# 3. 等待 Operator ready
kubectl wait --for=condition=available --timeout=300s \
    deployment/nginx-operator-controller-manager -n nginx-operator-system

# 4. 恢復 Nginx CR
kubectl apply -f "$BACKUP_DIR/nginx-crs.yaml"

echo "Restore completed from: $BACKUP_DIR"

10.5 多租戶支援

10.5.1 Namespace 隔離

限制 Operator 監聽特定 Namespace

// main.go
func main() {
    // 從環境變數讀取允許的 namespaces
    watchNamespaces := os.Getenv("WATCH_NAMESPACES")
    var namespaces []string
    if watchNamespaces != "" {
        namespaces = strings.Split(watchNamespaces, ",")
    }

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme: scheme,
        Cache: cache.Options{
            Namespaces: namespaces,  // 只 watch 這些 namespaces
        },
    })

    // ...
}

Deployment 配置

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-operator-controller-manager
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: WATCH_NAMESPACES
              value: "tenant-a,tenant-b,tenant-c"

10.5.2 資源配額

為每個租戶配置 ResourceQuota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-nginx-quota
  namespace: tenant-a
spec:
  hard:
    count/nginxes.webapp.example.com: "10"  # 最多 10 個 Nginx 實例
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["tenant-a"]

10.5.3 驗證租戶權限

Validating Webhook 檢查權限

func (r *Nginx) ValidateCreate() (admission.Warnings, error) {
    // 檢查租戶配額
    quota, err := getTenantQuota(r.Namespace)
    if err != nil {
        return nil, err
    }

    currentCount, err := getNginxCount(r.Namespace)
    if err != nil {
        return nil, err
    }

    if currentCount >= quota {
        return nil, fmt.Errorf(
            "tenant %s has reached quota limit: %d/%d",
            r.Namespace, currentCount, quota,
        )
    }

    // 檢查資源限制
    if r.Spec.Replicas != nil && *r.Spec.Replicas > quota.MaxReplicas {
        return nil, fmt.Errorf(
            "replicas %d exceeds tenant limit %d",
            *r.Spec.Replicas, quota.MaxReplicas,
        )
    }

    return nil, nil
}

十一、常見問題與調試技巧

11.1 常見錯誤及解決方案

11.1.1 CRD 相關錯誤

錯誤 1:CRD 未安裝

Error: the server could not find the requested resource (get nginxes.webapp.example.com)

解決方案

# 檢查 CRD 是否存在
kubectl get crd | grep nginx

# 安裝 CRD
make install

# 或手動安裝
kubectl apply -f config/crd/bases/webapp.example.com_nginxes.yaml

# 驗證
kubectl get crd nginxes.webapp.example.com -o yaml

錯誤 2:CRD 版本不匹配

error: error validating "nginx.yaml": error validating data: ValidationError(Nginx.spec): unknown field "newField"

解決方案

# 1. 更新 CRD 定義
make manifests

# 2. 重新安裝 CRD
kubectl replace -f config/crd/bases/webapp.example.com_nginxes.yaml

# 3. 如果 replace 失敗,需要先刪除(危險操作!會刪除所有 CR)
kubectl delete crd nginxes.webapp.example.com
make install

錯誤 3:Validation 規則不生效

# CRD 中定義了 validation,但創建時沒有被驗證
spec:
  replicas: 100  # 超過 maximum: 10 但沒報錯

解決方案

# 1. 確認 CRD 中包含 validation rules
kubectl get crd nginxes.webapp.example.com -o yaml | grep -A 10 validation

# 2. 重新生成並安裝 CRD
make manifests
kubectl apply -f config/crd/bases/webapp.example.com_nginxes.yaml

# 3. 驗證 validation 生效
kubectl apply -f - <<EOF
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: test
spec:
  replicas: 100  # 應該報錯
EOF

11.1.2 Controller 錯誤

錯誤 1:Reconcile 死循環

# 日誌顯示不停 reconcile
INFO    Reconciling Nginx
INFO    Reconciling Nginx
INFO    Reconciling Nginx
...

原因分析

  1. Status 更新觸發了 reconcile

  2. 沒有正確設置 Predicate 過濾

  3. Requeue 邏輯錯誤

解決方案

// 1. 使用 Predicate 過濾 status-only 更新
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        WithEventFilter(predicate.Funcs{
            UpdateFunc: func(e event.UpdateEvent) bool {
                oldObj := e.ObjectOld.(*webappv1alpha1.Nginx)
                newObj := e.ObjectNew.(*webappv1alpha1.Nginx)

                // 只在 spec 改變時 reconcile
                return oldObj.Generation != newObj.Generation
            },
        }).
        Complete(r)
}

// 2. 正確使用 Status().Update()
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // ... reconcile logic

    // 使用 Status() subresource 更新,不會觸發新的 reconcile
    if err := r.Status().Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil  // 不要返回 Requeue: true
}

錯誤 2:無法更新 Status

ERROR   Failed to update status    error="the server could not find the requested resource"

解決方案

// 確保 CRD 定義包含 status subresource
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status  // <-- 這一行很重要!

type Nginx struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec   NginxSpec   `json:"spec,omitempty"`
    Status NginxStatus `json:"status,omitempty"`
}
# 重新生成並安裝 CRD
make manifests
make install

錯誤 3:Owner Reference 錯誤導致資源無法刪除

ERROR   Failed to delete Deployment    error="cannot delete resource, owner references are set"

解決方案

// 正確設置 Owner Reference
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
    return err
}

// 刪除時會自動清理子資源(Garbage Collection)
// 如果需要手動清理:
func (r *NginxReconciler) cleanupResources(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    // 列出所有關聯資源
    deployments := &appsv1.DeploymentList{}
    if err := r.List(ctx, deployments, client.InNamespace(nginx.Namespace), client.MatchingLabels{
        "nginx": nginx.Name,
    }); err != nil {
        return err
    }

    // 刪除資源
    for _, deployment := range deployments.Items {
        if err := r.Delete(ctx, &deployment); err != nil {
            return err
        }
    }

    return nil
}

11.1.3 RBAC 權限錯誤

錯誤:403 Forbidden

ERROR   Failed to list Deployments    error="deployments.apps is forbidden: User \"system:serviceaccount:nginx-operator-system:default\" cannot list resource \"deployments\" in API group \"apps\" in the namespace \"default\""

解決方案

# 1. 檢查 RBAC 註解是否正確
# Controller 代碼中:
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete

# 2. 重新生成 RBAC manifests
make manifests

# 3. 檢查生成的 Role/ClusterRole
cat config/rbac/role.yaml

# 4. 重新部署 Operator
kubectl apply -k config/default

# 5. 驗證 ServiceAccount 權限
kubectl auth can-i list deployments \
    --as=system:serviceaccount:nginx-operator-system:nginx-operator-controller-manager \
    -n default

跨 Namespace 權限問題

# 如果 Operator 需要操作多個 namespace,需要使用 ClusterRole
# 修改 config/default/kustomization.yaml:

# 註釋掉這一行(使用 Role)
# - ../rbac/role.yaml
# - ../rbac/role_binding.yaml

# 使用 ClusterRole
- ../rbac/cluster_role.yaml
- ../rbac/cluster_role_binding.yaml

11.1.4 Webhook 錯誤

錯誤:Webhook 連接超時

Error from server (InternalError): Internal error occurred: failed calling webhook "vnginx.kb.io": Post "https://nginx-operator-webhook-service.nginx-operator-system.svc:443/validate-webapp-example-com-v1alpha1-nginx?timeout=10s": context deadline exceeded

解決方案

# 1. 檢查 webhook service 是否存在
kubectl get svc -n nginx-operator-system

# 2. 檢查 webhook pod 是否運行
kubectl get pods -n nginx-operator-system

# 3. 檢查證書是否正確配置
kubectl get secret -n nginx-operator-system | grep webhook

# 4. 檢查 ValidatingWebhookConfiguration
kubectl get validatingwebhookconfiguration

# 5. 查看 webhook 配置詳情
kubectl get validatingwebhookconfiguration nginx-operator-validating-webhook-configuration -o yaml

# 6. 測試 webhook service 連接
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never -- \
    curl -k https://nginx-operator-webhook-service.nginx-operator-system.svc:443

# 7. 如果開發環境不需要 webhook,可以禁用
export ENABLE_WEBHOOKS=false
make run

錯誤:證書驗證失敗

Error: x509: certificate signed by unknown authority

解決方案

# 1. 確保安裝了 cert-manager
kubectl get pods -n cert-manager

# 如果沒有安裝:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# 2. 檢查 Certificate 資源
kubectl get certificate -n nginx-operator-system

# 3. 檢查證書狀態
kubectl describe certificate nginx-operator-serving-cert -n nginx-operator-system

# 4. 手動觸發證書重新生成
kubectl delete certificate nginx-operator-serving-cert -n nginx-operator-system
kubectl apply -f config/certmanager/certificate.yaml

# 5. 等待證書 ready
kubectl wait --for=condition=ready certificate/nginx-operator-serving-cert \
    -n nginx-operator-system --timeout=300s

11.2 調試技巧

11.2.1 本地調試

使用 Delve 調試

# 1. 安裝 Delve
go install github.com/go-delve/delve/cmd/dlv@latest

# 2. 啟動調試模式
dlv debug ./main.go -- --zap-devel

# 3. 設置斷點
(dlv) break internal/controller/nginx_controller.go:60
(dlv) continue

# 4. 查看變量
(dlv) print nginx
(dlv) print err

# 5. 查看調用棧
(dlv) stack

VS Code 調試配置 (.vscode/launch.json):

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug Operator",
            "type": "go",
            "request": "launch",
            "mode": "debug",
            "program": "${workspaceFolder}/main.go",
            "args": ["--zap-devel"],
            "env": {
                "ENABLE_WEBHOOKS": "false",
                "KUBECONFIG": "${env:HOME}/.kube/config"
            }
        },
        {
            "name": "Attach to Process",
            "type": "go",
            "request": "attach",
            "mode": "local",
            "processId": "${command:pickProcess}"
        }
    ]
}

11.2.2 日誌分析

增加日誌詳細度

# 運行時指定 log level
make run -- --zap-log-level=debug

# 或在代碼中:
logger.V(1).Info("Debug message", "key", value)  # debug
logger.V(2).Info("Trace message", "key", value)  # trace

結構化日誌查詢

# 使用 jq 過濾日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | jq 'select(.level == "error")'

# 查找特定資源的日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | \
    jq 'select(.nginx == "nginx-sample")'

# 統計錯誤類型
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | \
    jq -r 'select(.level == "error") | .msg' | sort | uniq -c

11.2.3 性能分析

CPU Profiling

// main.go
import (
    "net/http"
    _ "net/http/pprof"
)

func main() {
    // 啟動 pprof server
    go func() {
        http.ListenAndServe("localhost:6060", nil)
    }()

    // ... 其他代碼
}
# 收集 30 秒的 CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# 交互式分析
(pprof) top10
(pprof) list reconcileDeployment
(pprof) web  # 生成可視化圖表(需要 graphviz)

Memory Profiling

# 收集 heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# 分析內存分配
(pprof) top10
(pprof) list NginxReconciler.Reconcile

# 查看內存洩漏
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap

Goroutine 分析

# 查看所有 goroutine
curl http://localhost:6060/debug/pprof/goroutine?debug=2

# 分析 goroutine 洩漏
go tool pprof http://localhost:6060/debug/pprof/goroutine

11.2.4 資源狀態檢查

完整的資源檢查腳本

#!/bin/bash
# debug-nginx.sh - Nginx Operator 診斷腳本

NAMESPACE=${1:-default}
NGINX_NAME=${2:-nginx-sample}

echo "=== Nginx Resource ==="
kubectl get nginx $NGINX_NAME -n $NAMESPACE -o yaml

echo -e "\n=== Nginx Status ==="
kubectl get nginx $NGINX_NAME -n $NAMESPACE -o jsonpath='{.status}' | jq

echo -e "\n=== Nginx Events ==="
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$NGINX_NAME

echo -e "\n=== Deployment ==="
kubectl get deployment $NGINX_NAME -n $NAMESPACE -o yaml

echo -e "\n=== Deployment Events ==="
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$NGINX_NAME,involvedObject.kind=Deployment

echo -e "\n=== Pods ==="
kubectl get pods -n $NAMESPACE -l nginx=$NGINX_NAME

echo -e "\n=== Pod Events ==="
for pod in $(kubectl get pods -n $NAMESPACE -l nginx=$NGINX_NAME -o name); do
    echo "Events for $pod:"
    kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$(basename $pod)
done

echo -e "\n=== Service ==="
kubectl get svc $NGINX_NAME -n $NAMESPACE -o yaml

echo -e "\n=== ConfigMap ==="
kubectl get cm ${NGINX_NAME}-config -n $NAMESPACE -o yaml

echo -e "\n=== Operator Logs (last 50 lines) ==="
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager --tail=50

使用:

chmod +x debug-nginx.sh
./debug-nginx.sh default nginx-sample

11.3 故障排除流程

11.3.1 問題診斷流程圖

1. CR 能否創建成功?
   ├─ No  → 檢查 CRD 安裝、Validation Webhook
   └─ Yes → 繼續

2. Operator 能否收到事件?
   ├─ No  → 檢查 Operator 運行狀態、RBAC 權限、Namespace 配置
   └─ Yes → 繼續

3. Reconcile 是否執行?
   ├─ No  → 檢查 Predicate 過濾、事件觸發條件
   └─ Yes → 繼續

4. 子資源是否創建?
   ├─ No  → 檢查 RBAC、資源定義、錯誤日誌
   └─ Yes → 繼續

5. 子資源狀態是否正常?
   ├─ No  → 檢查資源配置、鏡像、探針、資源限制
   └─ Yes → 問題解決

11.3.2 常見場景排查

場景 1:Pod 無法啟動

# 1. 查看 Pod 狀態
kubectl get pods -l nginx=nginx-sample

# 2. 查看詳細錯誤
kubectl describe pod <pod-name>

# 3. 查看日誌
kubectl logs <pod-name>

# 常見原因:
# - ImagePullBackOff → 鏡像不存在或無權限
# - CrashLoopBackOff → 容器啟動失敗,檢查配置和日誌
# - Pending → 資源不足或調度失敗

場景 2:配置更新不生效

# 1. 檢查 ConfigMap 是否更新
kubectl get cm nginx-sample-config -o yaml

# 2. 檢查 Deployment 的 config-version label
kubectl get deployment nginx-sample -o jsonpath='{.spec.template.metadata.labels.config-version}'

# 3. 如果 label 沒變,檢查 reconcile 日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | grep nginx-sample

# 4. 手動觸發 reconcile
kubectl annotate nginx nginx-sample force-reconcile="$(date +%s)"

場景 3:資源洩漏

# 1. 列出所有孤立資源(沒有 owner reference)
kubectl get deployment,svc,cm --all-namespaces -o json | \
    jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.kind)/\(.metadata.name)"'

# 2. 清理孤立資源
kubectl delete deployment <name> -n <namespace>

# 3. 確保 Controller 設置了 Owner Reference
# 代碼檢查:
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
    return err
}

11.4 最佳實踐總結

11.4.1 開發階段

  1. 使用 Kind 本地測試

     make kind-cluster
     make install
     make run
    
  2. 頻繁運行測試

     make test
     make test-e2e
    
  3. 使用 Linter

     golangci-lint run ./...
    

11.4.2 部署階段

  1. 使用 Kustomize 管理環境

     config/
     ├── base/
     └── overlays/
         ├── development/
         ├── staging/
         └── production/
    
  2. 配置資源限制

     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 500m
         memory: 512Mi
    
  3. 啟用 Leader Election

     LeaderElection: true
    

11.4.3 運維階段

  1. 監控關鍵指標

    • Reconcile 成功率

    • Reconcile 延遲

    • 資源使用率

  2. 配置告警

    • Operator Down

    • 高錯誤率

    • 資源洩漏

  3. 定期備份

     kubectl get nginx --all-namespaces -o yaml > backup.yaml
    

總結

本教學從基礎概念到高級實踐,完整覆蓋了 Kubernetes Operator 開發的各個方面:

  1. 第一~三章:Operator 基礎概念、架構設計

  2. 第四~六章:深入 OpenTelemetry Operator 代碼分析

  3. 第七~八章:開發環境、測試實踐

  4. 第九章:完整的 Nginx Operator 實戰

  5. 第十章:性能、安全、可觀測性等進階主題

  6. 第十一章:故障排查和最佳實踐

下一步建議

  1. 閱讀 Operator SDK 文檔

  2. 研究優秀的開源 Operator 項目

  3. 在實際項目中實踐所學知識

成為 Kubernetes Operator 開發專家! Go 🚀

More from this blog

Claude Code 監控秘錄:OpenTelemetry(OTel/OTLP)實戰指南

稟告主公:此乃司馬懿進呈之兵書,詳解如何以 OpenTelemetry 陣法,令臥龍神算之一舉一動盡在掌握,知糧草消耗、察兵器效能、辨戰報異常,使主公運籌帷幄於大帳之中。 為何需要斥候情報? 司馬懿稟告主公: 臥龍神算(Claude Code)乃當世利器,然若無斥候回報,主公便如蒙眼行軍——兵器耗損幾何、糧草消費幾許、哪路斥候出了差錯,一概不知。臣以為,此乃兵家大忌。 無情報之弊,有四: 軍

Feb 19, 202610 min read241
Claude Code 監控秘錄:OpenTelemetry(OTel/OTLP)實戰指南

工程師的 Claude Code 實戰指南:從零開始到高效開發

工程師的 Claude Code 實戰指南:從零開始到高效開發 本文整合 Anthropic 官方 Best Practices 與社群實戰 Tips,帶你由淺入深掌握 Claude Code。 什麼是 Claude Code?為什麼值得學? 如果你還在用「複製程式碼貼到 ChatGPT,再複製答案貼回去」的工作流程,Claude Code 會讓你大開眼界。 Claude Code 是 Anthropic 推出的命令列工具,它直接活在你的 terminal 裡,能夠讀懂你的整個 codeb...

Feb 18, 20265 min read109
工程師的 Claude Code 實戰指南:從零開始到高效開發
M

MicroFIRE

73 posts