
📚 Learning Kubernetes Operator Development in Depth with the OpenTelemetry Operator

This tutorial is based on the actual code of a production-grade project, the OpenTelemetry Operator

It covers a complete learning path, from core concepts to advanced hands-on practice


Table of Contents

  1. Kubernetes Operator Core Concepts

  2. OpenTelemetry Operator Architecture Deep Dive

  3. CRD Anatomy and Hands-On Practice

  4. Controller/Reconciler Implementation in Depth

  5. Manifest Builders Explained

  6. Webhook Mechanics Deep Dive

  7. Complete Development Environment Setup

  8. Testing Strategy and Practice

  9. Hands-On Project: an Nginx Operator

  10. Advanced Topics and Best Practices

  11. Common Issues and Debugging Tips


1. Kubernetes Operator Core Concepts

1.1 What Is an Operator?

An Operator is an extension pattern for Kubernetes that automates the deployment and management of complex applications: it encodes human operational knowledge into software.

Core components

┌─────────────────────────────────────────────────────────┐
│                     Kubernetes Operator                 │
│                                                          │
│  ┌────────────────┐  ┌──────────────┐  ┌─────────────┐ │
│  │  Custom        │  │  Controller  │  │   Domain    │ │
│  │  Resource      │◄─┤  (Reconciler)├─►│  Knowledge  │ │
│  │  Definition    │  │              │  │             │ │
│  └────────────────┘  └──────────────┘  └─────────────┘ │
│         │                    │                 │        │
│         │                    │                 │        │
│         ▼                    ▼                 ▼        │
│   Extends the K8s API   Watch & reconcile state   Encode ops logic  │
└─────────────────────────────────────────────────────────┘

1.2 How It Works

User (kubectl apply)
    │
    ▼
┌─────────────────────────────────────┐
│  Custom Resource (CR)                │
│  e.g. OpenTelemetryCollector         │
│                                      │
│  apiVersion: opentelemetry.io/v1beta1│
│  kind: OpenTelemetryCollector        │
│  spec:                               │
│    mode: deployment                  │
│    config: {...}                     │
└─────────────────────────────────────┘
    │ (stored in etcd)
    ▼
┌─────────────────────────────────────┐
│  Kubernetes API Server               │
│  (sends Watch events)                │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Controller (Reconciler)             │
│                                      │
│  1. Receive the event                │
│  2. Read the CR                      │
│  3. Compute the desired state        │
│  4. Reconcile current → desired      │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Kubernetes Resources                │
│  - Deployment                        │
│  - Service                           │
│  - ConfigMap                         │
│  - ServiceAccount                    │
│  - ...                               │
└─────────────────────────────────────┘
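The flow above is the classic observe–diff–act loop. As a rough, simplified sketch (pseudocode for illustration only, not the Operator's actual controller; the myv1.MyApp type and buildDesiredDeployment helper are placeholders), a controller-runtime reconciler follows the same shape:

func (r *Reconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // 1. Receive the event and read the CR
    var cr myv1.MyApp
    if err := r.Get(ctx, req.NamespacedName, &cr); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err) // CR was deleted: nothing to do
    }

    // 2. Compute the desired state from the spec (placeholder helper)
    desired := buildDesiredDeployment(&cr)

    // 3. Reconcile the current state toward the desired state
    var current appsv1.Deployment
    err := r.Get(ctx, client.ObjectKeyFromObject(desired), &current)
    switch {
    case apierrors.IsNotFound(err):
        return ctrl.Result{}, r.Create(ctx, desired) // not there yet: create it
    case err != nil:
        return ctrl.Result{}, err
    default:
        current.Spec = desired.Spec // converge toward the desired spec
        return ctrl.Result{}, r.Update(ctx, &current)
    }
}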

1.3 Operator Capability Levels

Level | Capability        | OpenTelemetry Operator support
1     | Basic Install     | ✅
2     | Seamless Upgrades | ✅ (automatic version upgrades)
3     | Full Lifecycle    | ✅ (finalizers, config backups)
4     | Deep Insights     | ✅ (metrics, events)
5     | Auto Pilot        | ✅ (HPA, Target Allocator)

2. OpenTelemetry Operator Architecture Deep Dive

2.1 Full Project Structure

opentelemetry-operator/
├── main.go                         # Program entry point (200+ lines)
│
├── apis/                           # CRD API definitions
│   ├── v1alpha1/                   # Alpha APIs
│   │   ├── instrumentation_types.go       # Auto-instrumentation CRD
│   │   ├── opampbridge_types.go          # OpAMP Bridge CRD
│   │   └── targetallocator_types.go      # Target Allocator CRD (deprecated)
│   └── v1beta1/                    # Beta APIs (stable)
│       ├── opentelemetrycollector_types.go  # Collector CRD (primary)
│       ├── targetallocator_types.go         # Target Allocator CRD
│       └── config.go                        # Config parsing
│
├── internal/                       # Internal implementation
│   ├── controllers/                # Controller implementations
│   │   ├── opentelemetrycollector_controller.go  # Main controller (400+ lines)
│   │   ├── targetallocator_controller.go
│   │   ├── opampbridge_controller.go
│   │   └── reconcile_test.go              # Tests (56 KB!)
│   │
│   ├── manifests/                  # K8s resource builders
│   │   ├── collector/              # Collector resource builders
│   │   │   ├── deployment.go      # Deployment builder
│   │   │   ├── daemonset.go       # DaemonSet builder
│   │   │   ├── statefulset.go     # StatefulSet builder
│   │   │   ├── container.go       # Container builder (core logic)
│   │   │   ├── service.go         # Service builder
│   │   │   ├── configmap.go       # ConfigMap builder
│   │   │   └── ...
│   │   ├── targetallocator/        # Target Allocator resource builders
│   │   └── manifestutils/          # Shared helpers
│   │
│   ├── webhook/                    # Admission webhooks
│   │   ├── podmutation/            # Pod mutation webhook
│   │   │   └── webhookhandler.go  # Sidecar/instrumentation injection
│   │   └── validation/             # Validation webhook
│   │
│   ├── config/                     # Operator configuration management
│   ├── rbac/                       # RBAC helpers
│   ├── autodetect/                 # Environment auto-detection
│   │   ├── prometheus/             # Prometheus Operator detection
│   │   ├── openshift/              # OpenShift detection
│   │   └── certmanager/            # cert-manager detection
│   └── status/                     # Status update logic
│
├── pkg/                            # Public packages
│   ├── collector/                  # Collector helpers
│   │   └── upgrade/                # Version upgrade logic
│   ├── sidecar/                    # Sidecar injection logic
│   ├── featuregate/                # Feature gate management
│   └── constants/                  # Constant definitions
│
├── config/                         # Kustomize deployment configs
│   ├── crd/                        # CRD YAML files
│   │   └── bases/                  # Base CRDs
│   ├── rbac/                       # RBAC rules
│   ├── manager/                    # Operator Deployment
│   ├── webhook/                    # Webhook configuration
│   └── samples/                    # Sample CRs
│
├── tests/                          # Test suites
│   ├── e2e/                        # Basic E2E tests
│   ├── e2e-instrumentation/        # Auto-instrumentation tests
│   ├── e2e-targetallocator/        # Target Allocator tests
│   ├── e2e-upgrade/                # Upgrade tests
│   └── test-e2e-apps/              # Test applications
│
├── cmd/                            # Additional executables
│   ├── otel-allocator/             # Target Allocator service
│   ├── operator-opamp-bridge/      # OpAMP Bridge service
│   └── gather/                     # Troubleshooting tool
│
├── Makefile                        # Build targets
├── go.mod                          # Go dependency management
└── versions.txt                    # Version pinning

2.2 The Four Core CRDs in Detail

2.2.1 OpenTelemetryCollector (the primary CRD)

File: apis/v1beta1/opentelemetrycollector_types.go

Purpose: manages the deployment of OpenTelemetry Collectors

Supported deployment modes:

const (
    ModeDeployment  Mode = "deployment"   // standard Deployment
    ModeDaemonSet   Mode = "daemonset"    // one instance per node
    ModeStatefulSet Mode = "statefulset"  // stateful deployment
    ModeSidecar     Mode = "sidecar"      // injected into other Pods
)

Features:

  • ✅ Multiple deployment modes

  • ✅ Automatic configuration management (ConfigMap versioning)

  • ✅ Target Allocator integration

  • ✅ HPA autoscaling

  • ✅ Automatic version upgrades

  • ✅ Ingress/Route support

  • ✅ PodMonitor/ServiceMonitor support

2.2.2 Instrumentation

File: apis/v1alpha1/instrumentation_types.go

Purpose: configures auto-instrumentation (a minimal example follows the language list below)

Supported languages:

  • Java (OpenTelemetry Java Agent)

  • Node.js (OpenTelemetry Node.js)

  • Python (OpenTelemetry Python)

  • .NET (OpenTelemetry .NET)

  • Go (eBPF-based)

  • Apache HTTPD

  • Nginx
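
To make this concrete, a minimal Instrumentation CR plus the pod annotation that opts a workload into Java auto-instrumentation look roughly like this (field names assembled from the upstream project documentation, not from the walkthrough in this guide):

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: my-instrumentation
spec:
  exporter:
    endpoint: http://my-collector-collector:4317
  propagators:
    - tracecontext
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"

Pods then opt in with an annotation such as instrumentation.opentelemetry.io/inject-java: "true".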

2.2.3 TargetAllocator

File: apis/v1beta1/targetallocator_types.go

Purpose: distributes Prometheus scrape targets across multiple Collector instances

Allocation strategies (selected on the CR, as shown in the snippet below):

  • least-weighted: assigns each target to the Collector currently holding the fewest targets

  • consistent-hashing: consistent hashing across Collector instances

  • per-node: assigns each target to the Collector instance running on that target's node (for daemonset deployments)
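
For example, the strategy of the Target Allocator embedded in a Collector CR is selected like this (the same fields appear in the production example in section 3.2.2):

spec:
  targetAllocator:
    enabled: true
    replicas: 3
    allocationStrategy: consistent-hashing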

2.2.4 OpAMPBridge

File: apis/v1alpha1/opampbridge_types.go

Purpose: implements the OpAMP protocol so that Collectors can be managed remotely


3. CRD Anatomy and Hands-On Practice

3.1 CRD Structure Deep Dive

3.1.1 The Full OpenTelemetryCollector CRD Definition

File: apis/v1beta1/opentelemetrycollector_types.go:32-145

// +kubebuilder:object:root=true
// +kubebuilder:resource:shortName=otelcol;otelcols
// +kubebuilder:storageversion
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.scale.replicas,selectorpath=.status.scale.selector
// +kubebuilder:printcolumn:name="Mode",type="string",JSONPath=".spec.mode",description="Deployment Mode"
// +kubebuilder:printcolumn:name="Version",type="string",JSONPath=".status.version",description="OpenTelemetry Version"
// +kubebuilder:printcolumn:name="Ready",type="string",JSONPath=".status.scale.statusReplicas"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"
// +kubebuilder:printcolumn:name="Image",type="string",JSONPath=".status.image"
// +kubebuilder:printcolumn:name="Management",type="string",JSONPath=".spec.managementState",description="Management State"

type OpenTelemetryCollector struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   OpenTelemetryCollectorSpec   `json:"spec,omitempty"`
    Status OpenTelemetryCollectorStatus `json:"status,omitempty"`
}

3.1.2 A Complete Guide to Kubebuilder Markers

Basic markers

Marker                          | Purpose                              | Notes / example
+kubebuilder:object:root=true   | Marks the type as a CRD root object  | Required
+kubebuilder:subresource:status | Enables the status subresource       | Status can be updated independently
+kubebuilder:subresource:scale  | Enables the scale subresource        | Supports kubectl scale
+kubebuilder:resource:shortName | Defines short names                  | otelcol, otelcols
+kubebuilder:storageversion     | Marks the storage version            | Set when multiple API versions exist

Validation markers

// Numeric validation
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:Maximum=100
Replicas int32 `json:"replicas,omitempty"`

// String validation
// +kubebuilder:validation:MinLength=1
// +kubebuilder:validation:MaxLength=255
// +kubebuilder:validation:Pattern=`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`
Name string `json:"name,omitempty"`

// Enum validation
// +kubebuilder:validation:Enum=deployment;daemonset;statefulset;sidecar
Mode string `json:"mode,omitempty"`

// Default value
// +kubebuilder:default=1
Replicas int32 `json:"replicas,omitempty"`

// CEL validation (Kubernetes 1.25+)
// +kubebuilder:validation:XValidation:rule="self.mode != 'sidecar' || !has(self.replicas)",message="sidecar mode does not support replicas"
Spec OpenTelemetryCollectorSpec `json:"spec,omitempty"`

Display markers

// Columns shown in kubectl get output
// +kubebuilder:printcolumn:name="Mode",type="string",JSONPath=".spec.mode",description="Deployment Mode"
// +kubebuilder:printcolumn:name="Age",type="date",JSONPath=".metadata.creationTimestamp"

3.1.3 The Full Spec Structure

type OpenTelemetryCollectorSpec struct {
    // ===== Deployment settings =====

    // Deployment mode
    // +optional
    // +kubebuilder:default=deployment
    Mode Mode `json:"mode,omitempty"`

    // Replica count (Deployment/StatefulSet modes only)
    // +optional
    Replicas *int32 `json:"replicas,omitempty"`

    // Container image
    // +optional
    Image string `json:"image,omitempty"`

    // ===== Collector configuration =====

    // The Collector configuration (YAML)
    // +required
    // +kubebuilder:pruning:PreserveUnknownFields
    Config Config `json:"config"`

    // Number of ConfigMap versions to keep (for rollback)
    // +optional
    // +kubebuilder:default:=3
    // +kubebuilder:validation:Minimum:=1
    ConfigVersions int `json:"configVersions,omitempty"`

    // ===== Upgrade strategy =====

    // Upgrade strategy: automatic or none
    // +optional
    UpgradeStrategy UpgradeStrategy `json:"upgradeStrategy"`

    // Deployment update strategy
    // +optional
    DeploymentUpdateStrategy appsv1.DeploymentStrategy `json:"deploymentUpdateStrategy,omitempty"`

    // DaemonSet update strategy
    // +optional
    DaemonSetUpdateStrategy appsv1.DaemonSetUpdateStrategy `json:"daemonSetUpdateStrategy,omitempty"`

    // ===== Resources =====

    // Resource requests and limits
    // +optional
    Resources v1.ResourceRequirements `json:"resources,omitempty"`

    // Environment variables
    // +optional
    Env []v1.EnvVar `json:"env,omitempty"`

    // Volume Mounts
    // +optional
    VolumeMounts []v1.VolumeMount `json:"volumeMounts,omitempty"`

    // Volumes
    // +optional
    Volumes []v1.Volume `json:"volumes,omitempty"`

    // ===== Scheduling =====

    // Node Selector
    // +optional
    NodeSelector map[string]string `json:"nodeSelector,omitempty"`

    // Tolerations
    // +optional
    Tolerations []v1.Toleration `json:"tolerations,omitempty"`

    // Affinity
    // +optional
    Affinity *v1.Affinity `json:"affinity,omitempty"`

    // ===== Networking =====

    // Ingress configuration
    // +optional
    Ingress Ingress `json:"ingress,omitempty"`

    // Service ports
    // +optional
    Ports []PortsSpec `json:"ports,omitempty"`

    // ===== Advanced features =====

    // Target Allocator
    // +optional
    TargetAllocator TargetAllocatorEmbedded `json:"targetAllocator,omitempty"`

    // HPA configuration
    // +optional
    Autoscaler *AutoscalerSpec `json:"autoscaler,omitempty"`

    // Probe configuration
    // +optional
    LivenessProbe *Probe `json:"livenessProbe,omitempty"`
    ReadinessProbe *Probe `json:"readinessProbe,omitempty"`

    // ===== Observability =====

    // Observability configuration
    // +optional
    Observability ObservabilitySpec `json:"observability,omitempty"`
}

3.2 Hands-On Examples: From Simple to Complex

3.2.1 The Simplest Example

File: config/samples/core_v1beta1_opentelemetrycollector.yaml

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]

This CR creates:

# 1. Deployment (1 replica)
kubectl get deployment simplest-collector

# 2. Service (exposes the OTLP ports)
kubectl get service simplest-collector
# Ports: 4317 (gRPC), 4318 (HTTP)

# 3. Service (headless, used for StatefulSet mode)
kubectl get service simplest-collector-headless

# 4. ConfigMap (the Collector configuration)
kubectl get configmap simplest-collector

# 5. ServiceAccount
kubectl get serviceaccount simplest-collector

3.2.2 A Production-Grade Example

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: production-collector
  labels:
    env: production
spec:
  # ===== Deployment settings =====
  mode: deployment
  replicas: 3  # high availability
  image: otel/opentelemetry-collector-k8s:0.88.0

  # ===== Upgrade strategy =====
  upgradeStrategy: automatic
  deploymentUpdateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

  # ===== ConfigMap version control =====
  configVersions: 5  # keep 5 versions for rollback

  # ===== Collector configuration =====
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

      # Prometheus receiver
      prometheus:
        config:
          scrape_configs:
            - job_name: 'otel-collector'
              scrape_interval: 30s
              static_configs:
                - targets: ['0.0.0.0:8888']

    processors:
      # Memory limiter (prevents OOM)
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15

      # Batching (improves throughput)
      batch:
        send_batch_size: 10000
        timeout: 10s

      # Resource attributes
      resource:
        attributes:
          - key: cluster.name
            value: production-cluster
            action: upsert

    exporters:
      # OTLP 導出到後端
      otlp:
        endpoint: backend.example.com:4317
        tls:
          insecure: false
          cert_file: /certs/tls.crt
          key_file: /certs/tls.key

      # Prometheus exporter
      prometheus:
        endpoint: 0.0.0.0:8889

    # Health Check Extension
    extensions:
      health_check:
        endpoint: 0.0.0.0:13133

      pprof:
        endpoint: localhost:1777

    service:
      extensions: [health_check, pprof]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, resource]
          exporters: [otlp]

        metrics:
          receivers: [otlp, prometheus]
          processors: [memory_limiter, batch]
          exporters: [otlp, prometheus]

  # ===== Resource limits =====
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

  # ===== Environment variables =====
  env:
    - name: MY_POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP

  # ===== Scheduling =====
  nodeSelector:
    workload: telemetry

  tolerations:
    - key: telemetry
      operator: Equal
      value: "true"
      effect: NoSchedule

  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - production-collector-collector
            topologyKey: kubernetes.io/hostname

  # ===== Ingress =====
  ingress:
    type: ingress
    hostname: collector.example.com
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
    tls:
      - secretName: collector-tls
        hosts:
          - collector.example.com

  # ===== HPA autoscaling =====
  autoscaler:
    minReplicas: 3
    maxReplicas: 10
    behavior:
      scaleDown:
        stabilizationWindowSeconds: 300
        policies:
          - type: Percent
            value: 50
            periodSeconds: 60
      scaleUp:
        stabilizationWindowSeconds: 0
        policies:
          - type: Percent
            value: 100
            periodSeconds: 15
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 75
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80

  # ===== Probes =====
  livenessProbe:
    httpGet:
      path: /
      port: 13133
    initialDelaySeconds: 15
    periodSeconds: 10

  readinessProbe:
    httpGet:
      path: /
      port: 13133
    initialDelaySeconds: 10
    periodSeconds: 5

  # ===== Target Allocator =====
  targetAllocator:
    enabled: true
    replicas: 3
    allocationStrategy: consistent-hashing
    prometheusCR:
      enabled: true
      serviceMonitorSelector: {}
      podMonitorSelector: {}

  # ===== Observability =====
  observability:
    metrics:
      enableMetrics: true

3.2.3 DaemonSet Mode Example

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-collector
spec:
  mode: daemonset  # one instance per node

  # DaemonSet update strategy
  daemonSetUpdateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1

  # Host network (access to node-level resources)
  hostNetwork: true

  config:
    receivers:
      # Host metrics
      hostmetrics:
        collection_interval: 30s
        scrapers:
          cpu: {}
          load: {}
          memory: {}
          disk: {}
          filesystem: {}
          network: {}

      # Kubernetes Events
      k8s_events:
        namespaces: [default, kube-system]

    exporters:
      otlp:
        endpoint: central-collector:4317

    service:
      pipelines:
        metrics:
          receivers: [hostmetrics]
          exporters: [otlp]

4. Controller/Reconciler Implementation in Depth

4.1 The Reconciler Struct in Detail

File: internal/controllers/opentelemetrycollector_controller.go:58-67

type OpenTelemetryCollectorReconciler struct {
    // K8s client (with caching)
    client.Client

    // Event recorder (emits K8s events)
    recorder record.EventRecorder

    // API scheme (registry of resource types)
    scheme *runtime.Scheme

    // Structured logger
    log logr.Logger

    // Operator configuration
    config config.Config

    // RBAC permission reviewer
    reviewer *internalRbac.Reviewer

    // Version upgrade handler
    upgrade *upgrade.VersionUpgrade
}

4.1.1 The Dependencies Explained

client.Client

// The smart client provided by controller-runtime
// Features:
// 1. Automatic caching (reduces pressure on the API server)
// 2. Type safety
// 3. FieldIndexer support (speeds up queries)
// 4. Handles the RESTMapper automatically

// Usage example
err := r.Client.Get(ctx, types.NamespacedName{
    Name:      "my-collector",
    Namespace: "default",
}, &collector)

EventRecorder

// Records Kubernetes events,
// visible in kubectl describe output

r.recorder.Event(
    &instance,                    // the related object
    corev1.EventTypeNormal,       // event type
    "Created",                    // reason
    "Created OpenTelemetry Collector deployment",  // message
)

Reviewer (the RBAC checker)

// Checks the RBAC requirements implied by the Collector configuration.
// For example, a k8s_cluster receiver in the config needs extra permissions.

if reviewer.NeedsClusterRole(collectorConfig) {
    // create a ClusterRole and ClusterRoleBinding
}

4.2 Tracing the Full Reconcile Flow

File: internal/controllers/opentelemetrycollector_controller.go:234-314

Let's trace one complete Reconcile pass step by step:

func (r *OpenTelemetryCollectorReconciler) Reconcile(
    ctx context.Context,
    req ctrl.Request,
) (ctrl.Result, error) {
    log := r.log.WithValues("opentelemetrycollector", req.NamespacedName)

    // ==================== Phase 1: fetch the CR ====================
    log.Info("Reconciling OpenTelemetryCollector")

    var instance v1beta1.OpenTelemetryCollector
    if err := r.Get(ctx, req.NamespacedName, &instance); err != nil {
        if !apierrors.IsNotFound(err) {
            log.Error(err, "unable to fetch OpenTelemetryCollector")
        }
        // The resource was deleted; ignore the error
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // ==================== Phase 2: build the params ====================
    // Includes everything needed downstream: configuration, Target Allocator, etc.
    params, err := r.GetParams(ctx, instance)
    if err != nil {
        log.Error(err, "Failed to create manifest.Params")
        return ctrl.Result{}, err
    }

    // ==================== Phase 3: handle deletion (finalizer pattern) ====================
    if deletionTimestamp := instance.GetDeletionTimestamp(); deletionTimestamp != nil {
        log.Info("Resource is being deleted")

        if controllerutil.ContainsFinalizer(&instance, collectorFinalizer) {
            log.Info("Running finalizer logic")

            // Clean up: delete cluster-scoped resources such as the ClusterRole and ClusterRoleBinding
            if err = r.finalizeCollector(ctx, params); err != nil {
                log.Error(err, "Failed to finalize")
                return ctrl.Result{}, err
            }

            // Remove the finalizer
            if controllerutil.RemoveFinalizer(&instance, collectorFinalizer) {
                err = r.Update(ctx, &instance)
                if err != nil {
                    return ctrl.Result{}, err
                }
            }
            log.Info("Finalizer removed, resource will be deleted")
        }
        return ctrl.Result{}, nil
    }

    // ==================== Phase 4: check the management state ====================
    // ManagementStateUnmanaged: the Operator does not manage this resource
    if instance.Spec.ManagementState == v1beta1.ManagementStateUnmanaged {
        log.Info("Skipping reconciliation for unmanaged resource")
        return ctrl.Result{}, nil
    }

    // ==================== Phase 5: version upgrade ====================
    // Check whether the CR configuration needs upgrading to a newer format
    if r.upgrade.NeedsUpgrade(instance) {
        log.Info("Upgrading OpenTelemetryCollector configuration")

        err = r.upgrade.Upgrade(ctx, instance)
        if err != nil {
            log.Error(err, "Failed to upgrade")
            return ctrl.Result{}, err
        }

        // The CR was modified; requeue to trigger a fresh reconcile
        log.Info("Configuration upgraded, requeuing")
        return ctrl.Result{Requeue: true, RequeueAfter: 1 * time.Second}, nil
    }

    // ==================== Phase 6: add the finalizer ====================
    // The finalizer guarantees the cleanup logic runs on deletion
    if !controllerutil.ContainsFinalizer(&instance, collectorFinalizer) {
        log.Info("Adding finalizer")

        if controllerutil.AddFinalizer(&instance, collectorFinalizer) {
            err = r.Update(ctx, &instance)
            if err != nil {
                return ctrl.Result{}, err
            }
        }
    }

    // ==================== Phase 7: build the desired resources ====================
    log.Info("Building desired objects")

    desiredObjects, buildErr := BuildCollector(params)
    if buildErr != nil {
        log.Error(buildErr, "Failed to build desired objects")
        return ctrl.Result{}, buildErr
    }

    log.Info("Desired objects built", "count", len(desiredObjects))

    // ==================== Phase 8: find existing owned resources ====================
    log.Info("Finding owned objects")

    ownedObjects, err := r.findOtelOwnedObjects(ctx, params)
    if err != nil {
        log.Error(err, "Failed to find owned objects")
        return ctrl.Result{}, err
    }

    log.Info("Found owned objects", "count", len(ownedObjects))

    // ==================== Phase 9: reconcile the resources ====================
    // Compare desired state with current state and create/update/delete accordingly
    log.Info("Reconciling desired objects")

    err = reconcileDesiredObjects(
        ctx,
        r.Client,
        log,
        &instance,
        params.Scheme,
        desiredObjects,
        ownedObjects,
    )

    // ==================== Phase 10: update the Status ====================
    return collectorStatus.HandleReconcileStatus(ctx, log, params, instance, err)
}
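
The Reconcile method is wired into controller-runtime through a SetupWithManager function; a minimal sketch (simplified — the real controller also registers the other owned resource types and event filters) looks like this:

func (r *OpenTelemetryCollectorReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&v1beta1.OpenTelemetryCollector{}). // reconcile when the CR changes
        Owns(&appsv1.Deployment{}).             // ...and when owned resources change
        Owns(&corev1.ConfigMap{}).
        Owns(&corev1.Service{}).
        Complete(r)
}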

4.2.1 A Closer Look: the BuildCollector Function

File: internal/controllers/opentelemetrycollector_controller.go

func BuildCollector(params manifests.Params) ([]client.Object, error) {
    var objects []client.Object

    // 1. ServiceAccount
    if sa := collector.ServiceAccount(params); sa != nil {
        objects = append(objects, sa)
    }

    // 2. ConfigMaps (the Collector configuration)
    configMaps, err := collector.ConfigMaps(params)
    if err != nil {
        return nil, err
    }
    for _, cm := range configMaps {
        objects = append(objects, cm)
    }

    // 3. Pick the workload type based on the mode
    switch params.OtelCol.Spec.Mode {
    case v1alpha1.ModeDeployment:
        deployment, err := collector.Deployment(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, deployment)

    case v1alpha1.ModeDaemonSet:
        daemonset, err := collector.DaemonSet(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, daemonset)

    case v1alpha1.ModeStatefulSet:
        statefulset, err := collector.StatefulSet(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, statefulset)
    }

    // 4. Services (expose the ports)
    services := collector.Services(params)
    for _, svc := range services {
        objects = append(objects, svc)
    }

    // 5. Ingress (if configured)
    if params.OtelCol.Spec.Ingress.Type == v1beta1.IngressTypeIngress {
        if ingress := collector.Ingress(params); ingress != nil {
            objects = append(objects, ingress)
        }
    }

    // 6. HPA (if autoscaling is configured)
    if params.OtelCol.Spec.Autoscaler != nil {
        if hpa := collector.HorizontalPodAutoscaler(params); hpa != nil {
            objects = append(objects, hpa)
        }
    }

    // 7. RBAC (if required)
    if params.Config.CreateRBACPermissions == rbac.Available {
        if clusterRole := collector.ClusterRole(params); clusterRole != nil {
            objects = append(objects, clusterRole)
        }
        if clusterRoleBinding := collector.ClusterRoleBinding(params); clusterRoleBinding != nil {
            objects = append(objects, clusterRoleBinding)
        }
    }

    // 8. ServiceMonitor/PodMonitor (observability)
    if params.OtelCol.Spec.Observability.Metrics.EnableMetrics {
        if sm := collector.ServiceMonitor(params); sm != nil {
            objects = append(objects, sm)
        }
    }

    // 9. Target Allocator (if enabled)
    if params.OtelCol.Spec.TargetAllocator.Enabled {
        taObjects, err := BuildTargetAllocator(params)
        if err != nil {
            return nil, err
        }
        objects = append(objects, taObjects...)
    }

    return objects, nil
}

4.2.2 The Reconciliation Logic: reconcileDesiredObjects

func reconcileDesiredObjects(
    ctx context.Context,
    client client.Client,
    log logr.Logger,
    owner client.Object,
    scheme *runtime.Scheme,
    desired []client.Object,
    existing map[types.UID]client.Object,
) error {
    // Set the OwnerReference on every desired resource
    for _, obj := range desired {
        if err := controllerutil.SetControllerReference(owner, obj, scheme); err != nil {
            return err
        }
    }

    // ========== Step 1: create or update the desired resources ==========
    for _, desiredObj := range desired {
        log := log.WithValues(
            "kind", desiredObj.GetObjectKind().GroupVersionKind().Kind,
            "name", desiredObj.GetName(),
        )

        // Try to fetch the existing resource
        existingObj := desiredObj.DeepCopyObject().(client.Object)
        err := client.Get(ctx, types.NamespacedName{
            Name:      desiredObj.GetName(),
            Namespace: desiredObj.GetNamespace(),
        }, existingObj)

        if err != nil && apierrors.IsNotFound(err) {
            // The resource does not exist yet: create it
            log.Info("Creating resource")
            if err := client.Create(ctx, desiredObj); err != nil {
                log.Error(err, "Failed to create resource")
                return err
            }
            log.Info("Resource created successfully")

        } else if err != nil {
            // Some other error
            return err

        } else {
            // The resource exists: update it
            log.Info("Updating resource")

            // Preserve fields that must not change
            desiredObj.SetResourceVersion(existingObj.GetResourceVersion())

            if err := client.Update(ctx, desiredObj); err != nil {
                log.Error(err, "Failed to update resource")
                return err
            }
            log.Info("Resource updated successfully")

            // Remove it from the set of resources pending deletion
            delete(existing, existingObj.GetUID())
        }
    }

    // ========== Step 2: delete leftover resources ==========
    for uid, obj := range existing {
        log := log.WithValues(
            "kind", obj.GetObjectKind().GroupVersionKind().Kind,
            "name", obj.GetName(),
            "uid", uid,
        )

        log.Info("Deleting orphaned resource")
        if err := client.Delete(ctx, obj); err != nil {
            log.Error(err, "Failed to delete resource")
            return err
        }
        log.Info("Orphaned resource deleted")
    }

    return nil
}

4.3 Index and Cache Optimizations

4.3.1 Setting Up a Field Indexer

File: internal/controllers/opentelemetrycollector_controller.go:334-354

func (r *OpenTelemetryCollectorReconciler) SetupCaches(cluster cluster.Cluster) error {
    ownedResources := r.GetOwnedResourceTypes()

    // Build an index for every owned resource type
    for _, resource := range ownedResources {
        if err := cluster.GetCache().IndexField(
            context.Background(),
            resource,
            resourceOwnerKey,  // index key: ".metadata.owner"
            func(rawObj client.Object) []string {
                // Extract the owner reference
                owner := metav1.GetControllerOf(rawObj)
                if owner == nil {
                    return nil
                }

                // Only index resources owned by an OpenTelemetryCollector
                if owner.Kind != "OpenTelemetryCollector" {
                    return nil
                }

                // Return the owner's name as the index value
                return []string{owner.Name}
            },
        ); err != nil {
            return err
        }
    }
    return nil
}

4.3.2 Using the Index to Speed Up Queries

func (r *OpenTelemetryCollectorReconciler) findOtelOwnedObjects(
    ctx context.Context,
    params manifests.Params,
) (map[types.UID]client.Object, error) {
    ownedObjects := map[types.UID]client.Object{}

    // Query via the index (instead of scanning every resource)
    listOpts := []client.ListOption{
        client.InNamespace(params.OtelCol.Namespace),
        client.MatchingFields{resourceOwnerKey: params.OtelCol.Name},
    }

    for _, objectType := range r.GetOwnedResourceTypes() {
        objs, err := getList(ctx, r, objectType, listOpts...)
        if err != nil {
            return nil, err
        }

        for uid, object := range objs {
            ownedObjects[uid] = object
        }
    }

    return ownedObjects, nil
}

Performance comparison

Approach           | Time complexity | Notes
Without the index  | O(N)            | N = every Deployment in the cluster
With the index     | O(M)            | M = only the Deployments owned by this Collector

In a large cluster this can be a 10–100x speedup!


5. Manifest Builders Explained

5.1 Building the Deployment, End to End

File: internal/manifests/collector/deployment.go

func Deployment(params manifests.Params) (*appsv1.Deployment, error) {
    name := naming.Collector(params.OtelCol.Name)

    // ========== 1. Build the labels ==========
    // Labels are used for:
    // - the Service selector
    // - HPA target selection
    // - monitoring discovery
    labels := manifestutils.Labels(
        params.OtelCol.ObjectMeta,
        name,
        params.OtelCol.Spec.Image,
        ComponentOpenTelemetryCollector,
        params.Config.LabelsFilter,
    )

    // Example labels:
    // app.kubernetes.io/name: my-collector-collector
    // app.kubernetes.io/instance: my-collector
    // app.kubernetes.io/managed-by: opentelemetry-operator
    // app.kubernetes.io/component: opentelemetry-collector
    // app.kubernetes.io/version: 0.88.0

    // ========== 2. Build the annotations ==========
    annotations, err := manifestutils.Annotations(
        params.OtelCol,
        params.Config.AnnotationsFilter,
    )
    if err != nil {
        return nil, err
    }

    // Pod annotations (additionally carry the ConfigMap hash)
    podAnnotations, err := manifestutils.PodAnnotations(
        params.OtelCol,
        params.Config.AnnotationsFilter,
    )
    if err != nil {
        return nil, err
    }

    // ========== 3. Assemble the Deployment ==========
    return &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:        name,
            Namespace:   params.OtelCol.Namespace,
            Labels:      labels,
            Annotations: annotations,
        },
        Spec: appsv1.DeploymentSpec{
            // Replica count
            Replicas: manifestutils.GetInitialReplicas(params.OtelCol),

            // Selector (must match the Pod labels)
            Selector: &metav1.LabelSelector{
                MatchLabels: manifestutils.SelectorLabels(
                    params.OtelCol.ObjectMeta,
                    ComponentOpenTelemetryCollector,
                ),
            },

            // Update strategy
            Strategy: params.OtelCol.Spec.DeploymentUpdateStrategy,

            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels:      labels,
                    Annotations: podAnnotations,
                },
                Spec: corev1.PodSpec{
                    // ServiceAccount
                    ServiceAccountName: ServiceAccountName(params.OtelCol),

                    // Init Containers
                    InitContainers: params.OtelCol.Spec.InitContainers,

                    // Main container plus any additional containers
                    Containers: append(
                        params.OtelCol.Spec.AdditionalContainers,
                        Container(params.Config, params.Log, params.OtelCol, true),
                    ),

                    // Volumes
                    Volumes: Volumes(params.Config, params.OtelCol),

                    // Networking
                    DNSPolicy:   manifestutils.GetDNSPolicy(...),
                    DNSConfig:   &params.OtelCol.Spec.PodDNSConfig,
                    HostNetwork: params.OtelCol.Spec.HostNetwork,

                    // Process namespace sharing
                    ShareProcessNamespace: &params.OtelCol.Spec.ShareProcessNamespace,

                    // Scheduling
                    Tolerations:               params.OtelCol.Spec.Tolerations,
                    NodeSelector:              params.OtelCol.Spec.NodeSelector,
                    Affinity:                  params.OtelCol.Spec.Affinity,
                    TopologySpreadConstraints: params.OtelCol.Spec.TopologySpreadConstraints,

                    // Security
                    SecurityContext:   params.OtelCol.Spec.PodSecurityContext,
                    PriorityClassName: params.OtelCol.Spec.PriorityClassName,

                    // Graceful termination
                    TerminationGracePeriodSeconds: params.OtelCol.Spec.TerminationGracePeriodSeconds,
                },
            },
        },
    }, nil
}

5.2 Building the Container, in Depth

File: internal/manifests/collector/container.go:28-150

func Container(
    cfg config.Config,
    logger logr.Logger,
    otelcol v1beta1.OpenTelemetryCollector,
    addConfig bool,
) corev1.Container {
    // ========== 1. Decide the image ==========
    image := otelcol.Spec.Image
    if len(image) == 0 {
        image = cfg.CollectorImage  // fall back to the default image
    }

    // ========== 2. Build the port list ==========
    ports := getContainerPorts(logger, otelcol)
    // Ports are parsed out of the Collector config,
    // e.g. otlp.protocols.grpc.endpoint: 0.0.0.0:4317
    //      → ContainerPort{Name: "otlp-grpc", ContainerPort: 4317}

    // ========== 3. Build the startup arguments ==========
    var args []string
    var volumeMounts []corev1.VolumeMount

    // The config file is always the first argument
    if addConfig {
        args = append(args, fmt.Sprintf(
            "--config=/conf/%s",
            cfg.CollectorConfigMapEntry,
        ))

        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      naming.ConfigMapVolume(),
            MountPath: "/conf",
        })
    }

    // If Target Allocator mTLS is enabled
    if otelcol.Spec.TargetAllocator.Enabled &&
       cfg.CertManagerAvailability == certmanager.Available &&
       featuregate.EnableTargetAllocatorMTLS.IsEnabled() {
        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      naming.TAClientCertificate(otelcol.Name),
            MountPath: constants.TACollectorTLSDirPath,
        })
    }

    // Extra user-defined arguments (sorted for determinism)
    argsMap := otelcol.Spec.Args
    if argsMap == nil {
        argsMap = map[string]string{}
    }

    var sortedArgs []string
    for k, v := range argsMap {
        sortedArgs = append(sortedArgs, fmt.Sprintf("--%s=%s", k, v))
    }
    sort.Strings(sortedArgs)
    args = append(args, sortedArgs...)

    // ========== 4. Volume Mounts ==========
    if len(otelcol.Spec.VolumeMounts) > 0 {
        volumeMounts = append(volumeMounts, otelcol.Spec.VolumeMounts...)
    }

    // Extra ConfigMap mounts
    for _, cm := range otelcol.Spec.ConfigMaps {
        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      naming.ConfigMapExtra(cm.Name),
            MountPath: path.Join("/var/conf", cm.MountPath, naming.ConfigMapExtra(cm.Name)),
        })
    }

    // ========== 5. Health probes ==========
    // Liveness Probe
    livenessProbe, err := otelcol.Spec.Config.GetLivenessProbe(logger)
    if err != nil {
        logger.Error(err, "cannot create liveness probe")
    } else {
        defaultProbeSettings(livenessProbe, otelcol.Spec.LivenessProbe)
    }

    // Readiness Probe
    readinessProbe, err := otelcol.Spec.Config.GetReadinessProbe(logger)
    if err != nil {
        logger.Error(err, "cannot create readiness probe")
    } else {
        defaultProbeSettings(readinessProbe, otelcol.Spec.ReadinessProbe)
    }

    // Startup Probe
    startupProbe, err := otelcol.Spec.Config.GetStartupProbe(logger)
    if err != nil {
        logger.Error(err, "cannot create startup probe")
    }

    // ========== 6. Environment variables ==========
    envVars := otelcol.Spec.Env

    // Add default environment variables
    envVars = append(envVars, corev1.EnvVar{
        Name: "POD_NAME",
        ValueFrom: &corev1.EnvVarSource{
            FieldRef: &corev1.ObjectFieldSelector{
                FieldPath: "metadata.name",
            },
        },
    })

    // ========== 7. Return the assembled Container ==========
    return corev1.Container{
        Name:            naming.Container(),
        Image:           image,
        ImagePullPolicy: otelcol.Spec.ImagePullPolicy,
        Args:            args,
        Ports:           ports,
        VolumeMounts:    volumeMounts,
        Env:             envVars,
        EnvFrom:         otelcol.Spec.EnvFrom,
        Resources:       otelcol.Spec.Resources,
        LivenessProbe:   livenessProbe,
        ReadinessProbe:  readinessProbe,
        StartupProbe:    startupProbe,
        SecurityContext: otelcol.Spec.SecurityContext,
    }
}

5.2.1 Port Parsing Logic

func getContainerPorts(logger logr.Logger, otelcol v1beta1.OpenTelemetryCollector) []corev1.ContainerPort {
    var ports []corev1.ContainerPort

    // Parse ports out of the Collector config
    cfg := otelcol.Spec.Config

    // Parse the ports declared under receivers
    if receivers, ok := cfg.Receivers.Object["otlp"].(map[string]interface{}); ok {
        if protocols, ok := receivers["protocols"].(map[string]interface{}); ok {
            // gRPC port
            if grpc, ok := protocols["grpc"].(map[string]interface{}); ok {
                if endpoint, ok := grpc["endpoint"].(string); ok {
                    port := extractPort(endpoint)  // "0.0.0.0:4317" → 4317
                    ports = append(ports, corev1.ContainerPort{
                        Name:          "otlp-grpc",
                        ContainerPort: port,
                        Protocol:      corev1.ProtocolTCP,
                    })
                }
            }

            // HTTP port
            if http, ok := protocols["http"].(map[string]interface{}); ok {
                if endpoint, ok := http["endpoint"].(string); ok {
                    port := extractPort(endpoint)  // "0.0.0.0:4318" → 4318
                    ports = append(ports, corev1.ContainerPort{
                        Name:          "otlp-http",
                        ContainerPort: port,
                        Protocol:      corev1.ProtocolTCP,
                    })
                }
            }
        }
    }

    // User-defined ports
    for _, portSpec := range otelcol.Spec.Ports {
        ports = append(ports, corev1.ContainerPort{
            Name:          portSpec.Name,
            ContainerPort: portSpec.Port,
            Protocol:      portSpec.Protocol,
        })
    }

    return ports
}
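
The extractPort helper used above is not shown in this walkthrough; a minimal sketch of what it has to do (split host:port and parse the port; the error handling here is an assumption) could be:

// extractPort pulls the numeric port out of an endpoint string such as
// "0.0.0.0:4317". Sketch only; the real helper also applies defaults and
// logs parse failures.
func extractPort(endpoint string) int32 {
    _, portStr, err := net.SplitHostPort(endpoint)
    if err != nil {
        return 0
    }
    port, err := strconv.ParseInt(portStr, 10, 32)
    if err != nil {
        return 0
    }
    return int32(port)
}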

5.3 ConfigMap Construction and Version Control

File: internal/manifests/collector/configmap.go

func ConfigMaps(params manifests.Params) ([]*corev1.ConfigMap, error) {
    var configMaps []*corev1.ConfigMap

    // ========== 1. Build the main ConfigMap ==========
    name := naming.ConfigMap(params.OtelCol.Name)

    // Serialize the Collector configuration to YAML
    configYAML, err := params.OtelCol.Spec.Config.Yaml()
    if err != nil {
        return nil, err
    }

    // Compute the config hash (used to trigger Pod restarts)
    configHash := hash(configYAML)

    configMap := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      fmt.Sprintf("%s-%s", name, configHash[:8]),
            Namespace: params.OtelCol.Namespace,
            Labels: manifestutils.Labels(
                params.OtelCol.ObjectMeta,
                name,
                params.OtelCol.Spec.Image,
                ComponentOpenTelemetryCollector,
                params.Config.LabelsFilter,
            ),
            Annotations: map[string]string{
                "opentelemetry.io/config-hash": configHash,
                "opentelemetry.io/created-at":  time.Now().Format(time.RFC3339),
            },
        },
        Data: map[string]string{
            params.Config.CollectorConfigMapEntry: configYAML,
        },
    }

    configMaps = append(configMaps, configMap)

    // ========== 2. Target Allocator ConfigMap ==========
    if params.OtelCol.Spec.TargetAllocator.Enabled {
        taConfigMap, err := BuildTargetAllocatorConfigMap(params)
        if err != nil {
            return nil, err
        }
        configMaps = append(configMaps, taConfigMap)
    }

    return configMaps, nil
}
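
The hash call above (which feeds both the ConfigMap name suffix and the config-hash annotation) is simply a content hash of the rendered YAML; a minimal sketch, assuming SHA-256:

// hash returns a hex digest of the rendered Collector config. Sketch only;
// the algorithm and encoding in the actual operator may differ.
func hash(configYAML string) string {
    sum := sha256.Sum256([]byte(configYAML))
    return hex.EncodeToString(sum[:])
}

Because the hash is part of the ConfigMap's name, any configuration change produces a new ConfigMap and a new Pod template, which is what triggers the restart mentioned above.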

5.3.1 Implementing ConfigMap Version Control

File: internal/controllers/opentelemetrycollector_controller.go:146-158

func getCollectorConfigMapsToKeep(
    configVersionsToKeep int,
    configMaps []*corev1.ConfigMap,
) []*corev1.ConfigMap {
    configVersionsToKeep = max(1, configVersionsToKeep)

    // Sort by creation time (newest first)
    sort.Slice(configMaps, func(i, j int) bool {
        iTime := configMaps[i].GetCreationTimestamp().Time
        jTime := configMaps[j].GetCreationTimestamp().Time
        return iTime.After(jTime)
    })

    // Keep the newest N
    configMapsToKeep := min(configVersionsToKeep, len(configMaps))
    return configMaps[:configMapsToKeep]
}

How it plays out:

Config update → a new ConfigMap is created → older ConfigMaps are kept (for rollback)

Example:
1. my-collector-abcd1234  (currently in use)
2. my-collector-efgh5678  (kept)
3. my-collector-ijkl9012  (kept)
4. my-collector-mnop3456  (will be deleted)
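
To see which config versions currently exist for a given Collector, you can list its ConfigMaps ordered by creation time (label selector as used elsewhere in this guide):

kubectl get configmap \
  -l app.kubernetes.io/instance=my-collector \
  --sort-by=.metadata.creationTimestamp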

6. Webhook Mechanics Deep Dive

6.1 Webhook Overview

Kubernetes admission webhooks let you intercept and modify resources before they are created or updated.

User: kubectl apply -f pod.yaml
    │
    ▼
API Server
    │
    ├──► Mutating Webhook  (modifies the resource)
    │    - Sidecar injection
    │    - Add labels/annotations
    │    - Adjust container settings
    │
    ├──► Validating Webhook (validates the resource)
    │    - Check the configuration is valid
    │    - Enforce business rules
    │
    └──► Store in etcd

6.2 The Pod Mutation Webhook Implementation

File: internal/webhook/podmutation/webhookhandler.go

6.2.1 The Webhook Handler Struct

type podMutationWebhook struct {
    client      client.Client     // K8s client
    decoder     admission.Decoder // request decoder
    logger      logr.Logger       // logging
    podMutators []PodMutator      // chain of Pod mutators
    config      config.Config     // configuration
}

// The PodMutator interface
type PodMutator interface {
    Mutate(ctx context.Context, ns corev1.Namespace, pod corev1.Pod) (corev1.Pod, error)
}
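
Anything that satisfies this interface can be added to the chain. A trivial, hypothetical mutator that only stamps a label shows the shape:

// labelMutator is a hypothetical PodMutator used purely to illustrate the
// interface: it adds one label and otherwise hands the Pod back unchanged.
type labelMutator struct{}

func (m *labelMutator) Mutate(ctx context.Context, ns corev1.Namespace, pod corev1.Pod) (corev1.Pod, error) {
    if pod.Labels == nil {
        pod.Labels = map[string]string{}
    }
    pod.Labels["example.com/mutated"] = "true"
    return pod, nil
}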

6.2.2 Registering the Webhook

// +kubebuilder:webhook:path=/mutate-v1-pod,mutating=true,failurePolicy=ignore,groups="",resources=pods,verbs=create,versions=v1,name=mpod.kb.io,sideEffects=none,admissionReviewVersions=v1

The parameters, decoded:

  • path: the webhook endpoint path

  • mutating=true: a mutating webhook

  • failurePolicy=ignore: if the webhook fails, the request is allowed through (favors availability)

  • resources=pods: only intercepts Pods

  • verbs=create: only intercepts create operations

6.2.3 The Full Handle Method

func (p *podMutationWebhook) Handle(
    ctx context.Context,
    req admission.Request,
) admission.Response {
    log := p.logger.WithValues("namespace", req.Namespace, "name", req.Name)
    log.Info("Webhook called")

    // ========== 1. Decode the Pod ==========
    pod := corev1.Pod{}
    err := p.decoder.Decode(req, &pod)
    if err != nil {
        log.Error(err, "Failed to decode pod")
        return admission.Errored(http.StatusBadRequest, err)
    }

    log.Info("Pod decoded", "podName", pod.Name)

    // ========== 2. Fetch the Namespace ==========
    ns := corev1.Namespace{}
    err = p.client.Get(ctx, types.NamespacedName{
        Name:      req.Namespace,
        Namespace: "",
    }, &ns)
    if err != nil {
        log.Error(err, "Failed to get namespace")
        res := admission.Errored(http.StatusInternalServerError, err)
        // Set Allowed = true so that even on error the Pod is not blocked
        res.Allowed = true
        return res
    }

    // ========== 3. Run every mutator ==========
    originalPod := pod.DeepCopy()

    for i, mutator := range p.podMutators {
        log := log.WithValues("mutatorIndex", i)
        log.Info("Applying mutator")

        pod, err = mutator.Mutate(ctx, ns, pod)
        if err != nil {
            log.Error(err, "Mutator failed")
            res := admission.Errored(http.StatusInternalServerError, err)
            res.Allowed = true  // errors must not block creation
            return res
        }
    }

    // ========== 4. Check whether anything changed ==========
    if reflect.DeepEqual(originalPod, &pod) {
        log.Info("No changes made, allowing")
        return admission.Allowed("no changes")
    }

    // ========== 5. Produce the JSONPatch ==========
    marshaledPod, err := json.Marshal(pod)
    if err != nil {
        log.Error(err, "Failed to marshal pod")
        res := admission.Errored(http.StatusInternalServerError, err)
        res.Allowed = true
        return res
    }

    log.Info("Returning patch response")
    return admission.PatchResponseFromRaw(req.Object.Raw, marshaledPod)
}

6.3 The Sidecar Injection Implementation

6.3.1 The Sidecar Mutator

File: internal/webhook/podmutation/sidecar_mutator.go

type sidecarMutator struct {
    client client.Client
    logger logr.Logger
    config config.Config
}

func (s *sidecarMutator) Mutate(
    ctx context.Context,
    ns corev1.Namespace,
    pod corev1.Pod,
) (corev1.Pod, error) {
    log := s.logger.WithValues("pod", pod.Name, "namespace", ns.Name)

    // ========== 1. Check whether injection was requested ==========
    // Annotation: sidecar.opentelemetry.io/inject
    injectAnnotation, exists := pod.Annotations["sidecar.opentelemetry.io/inject"]
    if !exists || injectAnnotation == "false" {
        log.V(1).Info("Sidecar injection not requested")
        return pod, nil
    }

    log.Info("Sidecar injection requested", "value", injectAnnotation)

    // ========== 2. Look up the OpenTelemetryCollector ==========
    var collectorName string
    if injectAnnotation == "true" {
        // Auto-detect: look for a mode=sidecar Collector in the same namespace
        var err error
        collectorName, err = s.findSidecarCollector(ctx, ns.Name)
        if err != nil {
            return pod, err
        }
        log.Info("Auto-detected collector", "name", collectorName)
    } else {
        // Use the Collector named in the annotation
        collectorName = injectAnnotation
        log.Info("Using specified collector", "name", collectorName)
    }

    // Fetch the Collector CR
    collector := &v1beta1.OpenTelemetryCollector{}
    err := s.client.Get(ctx, types.NamespacedName{
        Name:      collectorName,
        Namespace: ns.Name,
    }, collector)
    if err != nil {
        log.Error(err, "Failed to get collector")
        return pod, err
    }

    // Verify the mode
    if collector.Spec.Mode != v1alpha1.ModeSidecar {
        return pod, fmt.Errorf("collector %s is not in sidecar mode", collectorName)
    }

    // ========== 3. Inject the sidecar container ==========
    sidecarContainer := s.buildSidecarContainer(collector)
    pod.Spec.Containers = append(pod.Spec.Containers, sidecarContainer)

    // ========== 4. Add the volumes ==========
    volumes := s.buildSidecarVolumes(collector)
    pod.Spec.Volumes = append(pod.Spec.Volumes, volumes...)

    // ========== 5. Adjust the app containers' environment ==========
    // Point the OTLP endpoint at localhost
    for i := range pod.Spec.Containers {
        if pod.Spec.Containers[i].Name == sidecarContainer.Name {
            continue  // skip the sidecar container itself
        }

        pod.Spec.Containers[i].Env = append(
            pod.Spec.Containers[i].Env,
            corev1.EnvVar{
                Name:  "OTEL_EXPORTER_OTLP_ENDPOINT",
                Value: "http://localhost:4317",
            },
        )
    }

    log.Info("Sidecar injected successfully")
    return pod, nil
}

func (s *sidecarMutator) buildSidecarContainer(
    collector *v1beta1.OpenTelemetryCollector,
) corev1.Container {
    return corev1.Container{
        Name:  "otc-sidecar",
        Image: collector.Spec.Image,
        Args: []string{
            "--config=/conf/collector.yaml",
        },
        Ports: []corev1.ContainerPort{
            {Name: "otlp-grpc", ContainerPort: 4317},
            {Name: "otlp-http", ContainerPort: 4318},
        },
        VolumeMounts: []corev1.VolumeMount{
            {
                Name:      "otc-config",
                MountPath: "/conf",
            },
        },
        Resources: collector.Spec.Resources,
    }
}
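
The buildSidecarVolumes counterpart called in step 4 is not shown; under the assumption that it simply exposes the sidecar Collector's ConfigMap, a sketch looks like this:

// Sketch of the volume side of the injection: mount the sidecar Collector's
// ConfigMap so the container above can read /conf/collector.yaml.
func (s *sidecarMutator) buildSidecarVolumes(collector *v1beta1.OpenTelemetryCollector) []corev1.Volume {
    return []corev1.Volume{{
        Name: "otc-config",
        VolumeSource: corev1.VolumeSource{
            ConfigMap: &corev1.ConfigMapVolumeSource{
                LocalObjectReference: corev1.LocalObjectReference{
                    Name: naming.ConfigMap(collector.Name),
                },
            },
        },
    }}
}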

6.3.2 What Sidecar Injection Produces

The original Pod:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    sidecar.opentelemetry.io/inject: "true"
spec:
  containers:
    - name: app
      image: myapp:v1.0

The Pod after injection:

apiVersion: v1
kind: Pod
metadata:
  name: myapp
  annotations:
    sidecar.opentelemetry.io/inject: "true"
    sidecar.opentelemetry.io/injected: "true"
spec:
  containers:
    # The original app container (modified)
    - name: app
      image: myapp:v1.0
      env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: http://localhost:4317

    # The injected sidecar container
    - name: otc-sidecar
      image: otel/opentelemetry-collector:0.88.0
      args:
        - --config=/conf/collector.yaml
      ports:
        - name: otlp-grpc
          containerPort: 4317
        - name: otlp-http
          containerPort: 4318
      volumeMounts:
        - name: otc-config
          mountPath: /conf

  volumes:
    - name: otc-config
      configMap:
        name: my-sidecar-collector

7. Complete Development Environment Setup

7.1 Installing the Prerequisites

7.1.1 The Go Toolchain

# ========== macOS ==========
brew install go@1.24

# Set environment variables
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin

# Verify
go version  # should print go1.24 or newer

# ========== Linux ==========
# Download and install
wget https://go.dev/dl/go1.24.0.linux-amd64.tar.gz
sudo rm -rf /usr/local/go
sudo tar -C /usr/local -xzf go1.24.0.linux-amd64.tar.gz

# Add to PATH (put this in ~/.bashrc or ~/.zshrc)
export PATH=$PATH:/usr/local/go/bin
export GOPATH=$HOME/go
export PATH=$PATH:$GOPATH/bin

# Reload the shell configuration
source ~/.bashrc  # or source ~/.zshrc

7.1.2 Docker

# ========== macOS ==========
brew install --cask docker
# or download Docker Desktop: https://www.docker.com/products/docker-desktop

# ========== Linux (Ubuntu/Debian) ==========
# Install dependencies
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release

# Add the Docker GPG key
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | \
  sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

# Set up the repository
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] \
  https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Install Docker Engine
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-compose-plugin

# Allow non-root users to run Docker
sudo usermod -aG docker $USER
newgrp docker

# Verify
docker --version
docker run hello-world

7.1.3 kubectl

# ========== macOS ==========
brew install kubectl

# ========== Linux ==========
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl

# Verify
kubectl version --client

7.1.4 Kind (Kubernetes in Docker)

# ========== macOS ==========
brew install kind

# ========== Linux ==========
go install sigs.k8s.io/kind@v0.20.0

# Or use the release binary
curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.20.0/kind-linux-amd64
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind

# Verify
kind version

7.1.5 Kustomize

# ========== macOS ==========
brew install kustomize

# ========== Linux ==========
curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
sudo mv kustomize /usr/local/bin/

# Verify
kustomize version

7.1.6 controller-gen (Kubebuilder 工具)

# Install controller-gen (generates CRDs and deepcopy code)
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

# Verify
controller-gen --version

7.1.7 Operator SDK (optional)

# ========== macOS ==========
brew install operator-sdk

# ========== Linux ==========
export ARCH=$(case $(uname -m) in x86_64) echo -n amd64 ;; aarch64) echo -n arm64 ;; *) echo -n $(uname -m) ;; esac)
export OS=$(uname | awk '{print tolower($0)}')
export OPERATOR_SDK_DL_URL=https://github.com/operator-framework/operator-sdk/releases/download/v1.29.0

curl -LO ${OPERATOR_SDK_DL_URL}/operator-sdk_${OS}_${ARCH}
chmod +x operator-sdk_${OS}_${ARCH}
sudo mv operator-sdk_${OS}_${ARCH} /usr/local/bin/operator-sdk

# Verify
operator-sdk version

7.2 Project Setup

7.2.1 Clone the Project

# Clone OpenTelemetry Operator
git clone https://github.com/open-telemetry/opentelemetry-operator.git
cd opentelemetry-operator

# List branches
git branch -a

# Switch to the latest stable branch (if needed)
git checkout main

7.2.2 Getting to Know the Makefile

The OpenTelemetry Operator's Makefile provides a rich set of targets:

File: Makefile

# ========== List all available targets ==========
make help

# Main targets:
make manifests         # generate the CRD YAML
make generate          # generate Go code (deepcopy, etc.)
make fmt               # format the code
make vet               # run static analysis
make test              # run unit tests
make docker-build      # build the Docker image
make install           # install the CRDs into the cluster
make deploy            # deploy the Operator to the cluster
make run               # run the Operator locally
make kind-cluster      # create a Kind test cluster
make e2e               # run the E2E tests

7.2.3 Creating a Kind Cluster

File: Makefile:101

# ========== Create via the Makefile (recommended) ==========
make kind-cluster

# This target will:
# 1. create a Kind cluster named "otel-operator"
# 2. use the config file kind-1.33.yaml (or another version)
# 3. install cert-manager automatically

# ========== Create manually ==========
# Create the cluster config file
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    image: kindest/node:v1.33.0
  - role: worker
    image: kindest/node:v1.33.0
  - role: worker
    image: kindest/node:v1.33.0
EOF

# Create the cluster
kind create cluster \
  --name otel-operator \
  --config kind-config.yaml

# Verify the cluster
kubectl cluster-info --context kind-otel-operator
kubectl get nodes

7.2.4 Installing cert-manager

File: Makefile:106 (CERTMANAGER_VERSION)

# ========== Option 1: kubectl ==========
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# Wait for cert-manager to become ready
kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager -n cert-manager

kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-webhook -n cert-manager

kubectl wait --for=condition=Available --timeout=300s \
  deployment/cert-manager-cainjector -n cert-manager

# Verify
kubectl get pods -n cert-manager

# ========== Option 2: Helm ==========
helm repo add jetstack https://charts.jetstack.io
helm repo update

helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --create-namespace \
  --version v1.13.0 \
  --set installCRDs=true

# Verify
kubectl get pods -n cert-manager

7.3 The Local Development Workflow

7.3.1 Generating CRDs and Code

# ========== 1. Generate the CRD YAML ==========
make manifests

# This will:
# - generate CRD YAML from the markers in apis/**/types.go
# - write the output to config/crd/bases/
# - use the controller-gen tool

# Inspect the generated CRDs
ls -lh config/crd/bases/
# opentelemetry.io_instrumentations.yaml
# opentelemetry.io_opentelemetrycollectors.yaml
# opentelemetry.io_targetallocators.yaml
# opentelemetry.io_opampbridges.yaml

# ========== 2. Generate Go code ==========
make generate

# This will:
# - generate DeepCopy methods (zz_generated.deepcopy.go)
# - use the controller-gen tool

# Inspect the generated code
find apis -name "zz_generated.*.go"

The underlying commands (from the Makefile):

# the manifests target
controller-gen \
  crd:generateEmbeddedObjectMeta=true,maxDescLen=0 \
  rbac:roleName=manager-role \
  webhook \
  paths="./..." \
  output:crd:artifacts:config=config/crd/bases

# the generate target
controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."

7.3.2 Installing the CRDs into the Cluster

# ========== Install the CRDs ==========
make install

# 等價於:
# kustomize build config/crd | kubectl apply -f -

# 驗證 CRD 已安裝
kubectl get crd | grep opentelemetry

# 應該看到:
# instrumentations.opentelemetry.io
# opampbridges.opentelemetry.io
# opentelemetrycollectors.opentelemetry.io
# targetallocators.opentelemetry.io

# 查看 CRD 詳情
kubectl explain opentelemetrycollector
kubectl explain opentelemetrycollector.spec
kubectl explain opentelemetrycollector.spec.config

7.3.3 本地運行 Operator

方式 1: 使用 Makefile(推薦)

檔案: Makefile:202-204

# 本地運行(不部署到集群)
make run

# 這會:
# 1. 執行 make generate fmt vet manifests
# 2. 運行 go run ./main.go --zap-devel
# 3. Operator 運行在本地,連接到 Kind 集群

# 默認禁用 Webhooks(本地運行時)
# 如果要啟用 Webhooks:
make run ENABLE_WEBHOOKS=true
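
ENABLE_WEBHOOKS 之所以能關閉 Webhook,是因為 kubebuilder 風格的 main.go 通常會依這個環境變數決定是否註冊 webhook。以下是常見寫法的簡化示意(MyCRD、webappv1 為假設的名稱,非本專案原始碼):

// main.go 片段(簡化示意):本地開發時跳過 webhook 註冊
if os.Getenv("ENABLE_WEBHOOKS") != "false" {
    if err := (&webappv1.MyCRD{}).SetupWebhookWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create webhook")
        os.Exit(1)
    }
}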

方式 2: 直接使用 go run

# 設置環境變數
export WATCH_NAMESPACE=default  # 只監控 default namespace
# 或不設置,監控所有 namespace

# 禁用 Webhooks(本地運行通常禁用)
export ENABLE_WEBHOOKS=false

# 運行
go run ./main.go \
  --zap-devel \
  --metrics-addr=:8080 \
  --enable-leader-election=false \
  --health-probe-addr=:8081

# 參數說明:
# --zap-devel: 開發模式日誌(更詳細)
# --metrics-addr: Prometheus metrics 地址
# --enable-leader-election: 禁用 leader election(本地單實例)
# --health-probe-addr: 健康檢查地址

查看日誌輸出

2025-01-15T10:00:00.000Z    INFO    setup    Starting the OpenTelemetry Operator
2025-01-15T10:00:00.001Z    INFO    setup    opentelemetry-operator version    {"version": "0.92.0"}
2025-01-15T10:00:00.002Z    INFO    setup    build-date    {"date": "2025-01-15"}
2025-01-15T10:00:00.003Z    INFO    setup    go-version    {"version": "go1.24.0"}
2025-01-15T10:00:00.010Z    INFO    controller-runtime.metrics    Metrics server is starting to listen    {"addr": ":8080"}
2025-01-15T10:00:00.011Z    INFO    setup    starting manager
2025-01-15T10:00:00.011Z    INFO    controller    Starting EventSource    {"controller": "opentelemetrycollector", "source": "kind source: *v1beta1.OpenTelemetryCollector"}
2025-01-15T10:00:00.112Z    INFO    controller    Starting Controller    {"controller": "opentelemetrycollector"}
2025-01-15T10:00:00.112Z    INFO    controller    Starting workers    {"controller": "opentelemetrycollector", "worker count": 1}

7.3.4 測試 Operator

在本地 Operator 運行時,打開另一個終端

# ========== 創建測試 CR ==========
kubectl apply -f - <<EOF
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: test-local
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch: {}
    exporters:
      debug:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]
EOF

# ========== 觀察 Operator 日誌 ==========
# 在運行 make run 的終端,你會看到:
# INFO    Reconciling OpenTelemetryCollector    {"opentelemetrycollector": "default/test-local"}
# INFO    Building desired objects
# INFO    Creating resource    {"kind": "ServiceAccount", "name": "test-local-collector"}
# INFO    Creating resource    {"kind": "ConfigMap", "name": "test-local-collector-abcd1234"}
# INFO    Creating resource    {"kind": "Deployment", "name": "test-local-collector"}
# INFO    Creating resource    {"kind": "Service", "name": "test-local-collector"}

# ========== 驗證資源 ==========
kubectl get otelcol test-local
kubectl get deployment test-local-collector
kubectl get service test-local-collector
kubectl get configmap -l app.kubernetes.io/instance=test-local

# ========== 查看 Collector 日誌 ==========
kubectl logs -f deployment/test-local-collector

# ========== 測試配置更新 ==========
kubectl patch otelcol test-local --type=merge -p '
{
  "spec": {
    "config": {
      "exporters": {
        "debug": {
          "verbosity": "normal"
        }
      }
    }
  }
}'

# 觀察 Operator 日誌,會看到:
# - 新 ConfigMap 被創建
# - Deployment 被更新(觸發滾動更新)

# ========== 清理 ==========
kubectl delete otelcol test-local

7.3.5 調試技巧

使用 Delve 調試器

# 安裝 Delve
go install github.com/go-delve/delve/cmd/dlv@latest

# 使用 Delve 運行
dlv debug ./main.go -- \
  --zap-devel \
  --enable-leader-election=false

# 設置斷點
(dlv) break internal/controllers/opentelemetrycollector_controller.go:234
(dlv) continue

# 創建 CR 後,斷點會被觸發

使用 VS Code 調試

創建 .vscode/launch.json

{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Debug Operator",
      "type": "go",
      "request": "launch",
      "mode": "debug",
      "program": "${workspaceFolder}/main.go",
      "env": {
        "ENABLE_WEBHOOKS": "false"
      },
      "args": [
        "--zap-devel",
        "--enable-leader-election=false"
      ]
    }
  ]
}

按 F5 開始調試,可以設置斷點、查看變數等。

7.3.6 部署到集群(完整流程)

# ========== 1. 構建鏡像 ==========
# 設置鏡像名稱
export IMG=localhost:5000/opentelemetry-operator:dev

make docker-build IMG=$IMG

# ========== 2. 推送到本地 registry(Kind 使用)==========
# 創建本地 registry(如果沒有)
docker run -d -p 5000:5000 --name kind-registry registry:2

# 推送鏡像
docker push $IMG

# 或直接加載到 Kind
kind load docker-image $IMG --name otel-operator

# ========== 3. 部署 Operator ==========
make deploy IMG=$IMG

# 這會:
# 1. 安裝 CRD
# 2. 創建 Namespace (opentelemetry-operator-system)
# 3. 部署 Operator
# 4. 創建 RBAC 資源
# 5. 設置 Webhooks

# ========== 4. 驗證部署 ==========
kubectl get pods -n opentelemetry-operator-system

# 應該看到:
# NAME                                                        READY   STATUS    RESTARTS   AGE
# opentelemetry-operator-controller-manager-xxxxxxxxx-xxxxx   2/2     Running   0          1m

# 查看日誌
kubectl logs -f -n opentelemetry-operator-system \
  deployment/opentelemetry-operator-controller-manager \
  -c manager

# ========== 5. 測試 ==========
kubectl apply -f config/samples/core_v1beta1_opentelemetrycollector.yaml

kubectl get otelcol
kubectl get all -l app.kubernetes.io/managed-by=opentelemetry-operator

# ========== 6. 清理 ==========
make undeploy

7.4 常用開發命令速查

# ========== 代碼生成 ==========
make manifests              # 生成 CRD
make generate               # 生成 deepcopy 代碼
make bundle                 # 生成 Operator bundle(OLM)

# ========== 代碼質量 ==========
make fmt                    # 格式化代碼
make vet                    # 靜態分析
make lint                   # Linting(需要 golangci-lint)
make test                   # 運行單元測試
make test-coverage          # 測試覆蓋率

# ========== 構建 ==========
make manager                # 構建 Operator 二進制
make targetallocator        # 構建 Target Allocator
make operator-opamp-bridge  # 構建 OpAMP Bridge
make docker-build           # 構建 Docker 鏡像

# ========== 部署 ==========
make install                # 安裝 CRD
make uninstall              # 卸載 CRD
make deploy                 # 部署 Operator
make undeploy               # 卸載 Operator
make run                    # 本地運行

# ========== 測試 ==========
make kind-cluster           # 創建 Kind 集群
make kind-delete-cluster    # 刪除 Kind 集群
make e2e                    # E2E 測試
make e2e-targetallocator    # Target Allocator E2E
make e2e-instrumentation    # Instrumentation E2E

# ========== 清理 ==========
make clean                  # 清理生成的文件

7.5 開發環境故障排查

7.5.1 常見問題

問題 1: controller-gen 未找到

# 錯誤信息
controller-gen: command not found

# 解決方案
go install sigs.k8s.io/controller-tools/cmd/controller-gen@latest

問題 2: Kind 集群無法訪問

# 檢查集群
kind get clusters

# 檢查 kubeconfig
kubectl config get-contexts

# 切換 context
kubectl config use-context kind-otel-operator

問題 3: cert-manager 未就緒

# 檢查 cert-manager pods
kubectl get pods -n cert-manager

# 查看日誌
kubectl logs -n cert-manager deployment/cert-manager

# 重新安裝
kubectl delete namespace cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

問題 4: Webhook 錯誤

# 錯誤信息
Error from server (InternalError): error when creating "test.yaml": Internal error occurred: failed calling webhook

# 解決方案(本地運行時)
export ENABLE_WEBHOOKS=false
make run

八、測試策略與實踐

8.1 測試金字塔

           /\
          /  \
         / E2E \ (少量,關鍵場景)
        /------\
       /  集成  \ (中等,API 交互)
      /----------\
     /  單元測試   \ (大量,快速反饋)
    /--------------\

OpenTelemetry Operator 的測試覆蓋:

  • 單元測試: internal/**/*_test.go

  • 集成測試: tests/e2e*/**

  • 性能測試: 負載測試腳本

8.2 單元測試深度實踐

8.2.1 測試結構

檔案: internal/controllers/reconcile_test.go

這個文件有 56KB,包含大量測試用例!

# 查看測試統計
wc -l internal/controllers/reconcile_test.go
# 約 1500+ 行

# 運行單元測試
make test

# 運行特定包
go test ./internal/controllers -v

# 運行特定測試
go test ./internal/controllers -v -run TestReconcile_Deployment

8.2.2 編寫單元測試範例

創建測試文件: internal/controllers/my_feature_test.go

package controllers_test

import (
    "context"
    "testing"

    "github.com/go-logr/logr"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/tools/record"
    "k8s.io/utils/ptr"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/client/fake"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    "github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
    "github.com/open-telemetry/opentelemetry-operator/internal/config"
    "github.com/open-telemetry/opentelemetry-operator/internal/controllers"
)

func TestReconcile_CustomFeature(t *testing.T) {
    // ========== 準備階段 ==========
    ctx := context.Background()

    // 創建測試用的 OpenTelemetryCollector
    otelCol := &v1beta1.OpenTelemetryCollector{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "test-collector",
            Namespace: "default",
        },
        Spec: v1beta1.OpenTelemetryCollectorSpec{
            Mode: v1beta1.ModeDeployment,
            Config: v1beta1.Config{
                Receivers: v1beta1.AnyConfig{
                    Object: map[string]interface{}{
                        "otlp": map[string]interface{}{
                            "protocols": map[string]interface{}{
                                "grpc": map[string]interface{}{
                                    "endpoint": "0.0.0.0:4317",
                                },
                            },
                        },
                    },
                },
                Exporters: v1beta1.AnyConfig{
                    Object: map[string]interface{}{
                        "debug": map[string]interface{}{},
                    },
                },
                Service: v1beta1.Service{
                    Pipelines: map[string]*v1beta1.Pipeline{
                        "traces": {
                            Receivers: []string{"otlp"},
                            Exporters: []string{"debug"},
                        },
                    },
                },
            },
        },
    }

    // 創建 fake client
    scheme := runtime.NewScheme()
    _ = v1beta1.AddToScheme(scheme)
    _ = appsv1.AddToScheme(scheme)
    _ = corev1.AddToScheme(scheme)

    k8sClient := fake.NewClientBuilder().
        WithScheme(scheme).
        WithObjects(otelCol).
        WithStatusSubresource(&v1beta1.OpenTelemetryCollector{}).
        Build()

    // 創建 Reconciler
    reconciler := &controllers.OpenTelemetryCollectorReconciler{
        Client:   k8sClient,
        Scheme:   scheme,
        log:      logr.Discard(),
        recorder: record.NewFakeRecorder(10),
        config:   config.New(),
    }

    // ========== 執行階段 ==========
    req := reconcile.Request{
        NamespacedName: types.NamespacedName{
            Name:      "test-collector",
            Namespace: "default",
        },
    }

    result, err := reconciler.Reconcile(ctx, req)

    // ========== 驗證階段 ==========
    require.NoError(t, err)
    assert.False(t, result.Requeue)

    // 驗證 Deployment
    deployment := &appsv1.Deployment{}
    err = k8sClient.Get(ctx, types.NamespacedName{
        Name:      "test-collector-collector",
        Namespace: "default",
    }, deployment)
    require.NoError(t, err)

    // 驗證副本數
    assert.Equal(t, int32(1), *deployment.Spec.Replicas)

    // 驗證容器
    assert.Len(t, deployment.Spec.Template.Spec.Containers, 1)
    container := deployment.Spec.Template.Spec.Containers[0]
    assert.Equal(t, "otc-container", container.Name)

    // 驗證端口
    require.Len(t, container.Ports, 1)
    assert.Equal(t, "otlp-grpc", container.Ports[0].Name)
    assert.Equal(t, int32(4317), container.Ports[0].ContainerPort)

    // 驗證 Service
    service := &corev1.Service{}
    err = k8sClient.Get(ctx, types.NamespacedName{
        Name:      "test-collector-collector",
        Namespace: "default",
    }, service)
    require.NoError(t, err)
    assert.Len(t, service.Spec.Ports, 1)

    // 驗證 ConfigMap
    configMap := &corev1.ConfigMap{}
    configMaps := &corev1.ConfigMapList{}
    err = k8sClient.List(ctx, configMaps,
        client.InNamespace("default"),
        client.MatchingLabels{"app.kubernetes.io/instance": "test-collector"},
    )
    require.NoError(t, err)
    assert.Len(t, configMaps.Items, 1)

    // 驗證 Status
    err = k8sClient.Get(ctx, types.NamespacedName{
        Name:      "test-collector",
        Namespace: "default",
    }, otelCol)
    require.NoError(t, err)
    assert.NotEmpty(t, otelCol.Status.Version)
}

// Table-driven 測試範例
func TestReconcile_DifferentModes(t *testing.T) {
    tests := []struct {
        name     string
        mode     v1beta1.Mode
        replicas *int32
        wantType string
    }{
        {
            name:     "Deployment mode",
            mode:     v1beta1.ModeDeployment,
            replicas: ptr.To(int32(3)),
            wantType: "Deployment",
        },
        {
            name:     "DaemonSet mode",
            mode:     v1beta1.ModeDaemonSet,
            replicas: nil,
            wantType: "DaemonSet",
        },
        {
            name:     "StatefulSet mode",
            mode:     v1beta1.ModeStatefulSet,
            replicas: ptr.To(int32(2)),
            wantType: "StatefulSet",
        },
    }

    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // 測試邏輯...
        })
    }
}

8.2.3 運行測試

# ========== 運行所有測試 ==========
make test

# ========== 運行特定包 ==========
go test ./internal/controllers -v

# ========== 運行特定測試 ==========
go test ./internal/controllers -v -run TestReconcile_Deployment

# ========== 運行測試並查看覆蓋率 ==========
go test ./... -coverprofile=coverage.out
go tool cover -html=coverage.out

# ========== 並行運行測試 ==========
go test ./... -parallel=4

# ========== 運行測試並顯示詳細輸出 ==========
go test ./internal/controllers -v -count=1

# ========== 測試特定功能 ==========
go test ./internal/manifests/collector -v -run TestDeployment

8.3 E2E 測試深度實踐

8.3.1 E2E 測試架構

OpenTelemetry Operator 使用 Chainsaw 進行 E2E 測試。

測試目錄結構

tests/
├── e2e/                       # 基礎功能
│   ├── smoke/                 # 煙霧測試
│   ├── smoke-targetallocator/ # TA 煙霧測試
│   └── smoke-sidecar/         # Sidecar 測試
├── e2e-instrumentation/       # 自動埋點測試
├── e2e-targetallocator/       # Target Allocator
├── e2e-opampbridge/           # OpAMP Bridge
└── e2e-upgrade/               # 升級測試

8.3.2 Chainsaw 測試範例

檔案: tests/e2e/smoke/00-install.yaml

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: simplest
spec:
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    exporters:
      debug:
        verbosity: detailed
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]

檔案: tests/e2e/smoke/01-assert.yaml

# 斷言 Deployment 被創建並就緒
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simplest-collector
spec:
  replicas: 1
status:
  readyReplicas: 1
  availableReplicas: 1
---
# 斷言 Service 被創建
apiVersion: v1
kind: Service
metadata:
  name: simplest-collector
spec:
  ports:
    - name: otlp-grpc
      port: 4317
      protocol: TCP
      targetPort: 4317

檔案: tests/e2e/smoke/chainsaw-test.yaml

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: smoke
spec:
  steps:
    - name: Install collector
      try:
        - apply:
            file: 00-install.yaml
        - assert:
            file: 01-assert.yaml
      catch:
        - describe:
            apiVersion: opentelemetry.io/v1beta1
            kind: OpenTelemetryCollector
        - describe:
            apiVersion: apps/v1
            kind: Deployment
        - podLogs:
            selector: app.kubernetes.io/name=simplest-collector

8.3.3 運行 E2E 測試

# ========== 1. 創建測試集群 ==========
make kind-cluster

# ========== 2. 構建並載入鏡像 ==========
make docker-build
make docker-build-targetallocator
make docker-build-operator-opamp-bridge

# 載入鏡像到 Kind
kind load docker-image \
  ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:latest \
  --name otel-operator

# ========== 3. 部署 Operator ==========
make deploy IMG=ghcr.io/open-telemetry/opentelemetry-operator/opentelemetry-operator:latest

# ========== 4. 運行 E2E 測試 ==========
make e2e

# 運行特定測試
make e2e-instrumentation
make e2e-targetallocator
make e2e-upgrade

# ========== 5. 查看測試結果 ==========
# Chainsaw 會輸出詳細的測試報告

# ========== 6. 清理 ==========
make kind-delete-cluster

8.3.4 編寫自定義 E2E 測試

創建測試目錄

mkdir -p tests/e2e-custom-feature
cd tests/e2e-custom-feature

創建測試清單 (00-install.yaml):

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: custom-test
spec:
  mode: deployment
  replicas: 2
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
    processors:
      batch:
        send_batch_size: 1000
        timeout: 10s
    exporters:
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [debug]

創建斷言 (01-assert.yaml):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-test-collector
spec:
  replicas: 2
status:
  readyReplicas: 2
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: (custom-test-collector-*)
data:
  collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

創建 Chainsaw 測試 (chainsaw-test.yaml):

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: custom-feature
spec:
  description: Test custom feature
  timeouts:
    apply: 30s
    assert: 60s
  steps:
    - name: Install collector
      try:
        - apply:
            file: 00-install.yaml
        - assert:
            file: 01-assert.yaml
            timeout: 2m
      catch:
        - describe:
            apiVersion: opentelemetry.io/v1beta1
            kind: OpenTelemetryCollector
        - podLogs:
            selector: app.kubernetes.io/name=custom-test-collector

    - name: Test configuration update
      try:
        - patch:
            resource:
              apiVersion: opentelemetry.io/v1beta1
              kind: OpenTelemetryCollector
              metadata:
                name: custom-test
            merge:
              spec:
                replicas: 3
        - sleep:
            duration: 10s
        - assert:
            resource:
              apiVersion: apps/v1
              kind: Deployment
              metadata:
                name: custom-test-collector
              spec:
                replicas: 3

    - name: Cleanup
      try:
        - delete:
            ref:
              apiVersion: opentelemetry.io/v1beta1
              kind: OpenTelemetryCollector
              name: custom-test

運行自定義測試

# 安裝 Chainsaw
go install github.com/kyverno/chainsaw@latest

# 運行測試
chainsaw test --test-dir ./tests/e2e-custom-feature

8.4 測試覆蓋率

# ========== 生成覆蓋率報告 ==========
go test ./... -coverprofile=coverage.out

# ========== 查看覆蓋率摘要 ==========
go tool cover -func=coverage.out

# 輸出範例:
# github.com/open-telemetry/opentelemetry-operator/internal/controllers/opentelemetrycollector_controller.go:234:    Reconcile        85.7%
# github.com/open-telemetry/opentelemetry-operator/internal/manifests/collector/deployment.go:17:        Deployment        92.3%

# ========== 生成 HTML 報告 ==========
go tool cover -html=coverage.out -o coverage.html

# 在瀏覽器中打開
open coverage.html  # macOS
xdg-open coverage.html  # Linux

# ========== 按包查看覆蓋率 ==========
go test ./internal/controllers -coverprofile=controllers.out
go tool cover -func=controllers.out

8.5 性能測試與基準測試

8.5.1 Benchmark 測試

創建: internal/controllers/reconcile_bench_test.go

package controllers_test

import (
    "context"
    "testing"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "sigs.k8s.io/controller-runtime/pkg/client/fake"
    "sigs.k8s.io/controller-runtime/pkg/reconcile"

    "github.com/open-telemetry/opentelemetry-operator/apis/v1beta1"
)

func BenchmarkReconcile(b *testing.B) {
    // 準備測試數據
    otelCol := &v1beta1.OpenTelemetryCollector{
        ObjectMeta: metav1.ObjectMeta{
            Name:      "benchmark-test",
            Namespace: "default",
        },
        Spec: v1beta1.OpenTelemetryCollectorSpec{
            Mode: v1beta1.ModeDeployment,
            Config: v1beta1.Config{
                // ... 配置
            },
        },
    }

    k8sClient := fake.NewClientBuilder().
        WithObjects(otelCol).
        Build()

    reconciler := &controllers.OpenTelemetryCollectorReconciler{
        Client: k8sClient,
        // ... 其他字段
    }

    req := reconcile.Request{
        NamespacedName: types.NamespacedName{
            Name:      "benchmark-test",
            Namespace: "default",
        },
    }

    ctx := context.Background()

    // 運行 benchmark
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        _, err := reconciler.Reconcile(ctx, req)
        if err != nil {
            b.Fatal(err)
        }
    }
}

運行 Benchmark

# 運行 benchmark
go test -bench=. ./internal/controllers

# 輸出範例:
# BenchmarkReconcile-8           1000       1234567 ns/op

# 帶內存分析
go test -bench=. -benchmem ./internal/controllers

# 輸出範例:
# BenchmarkReconcile-8           1000       1234567 ns/op      123456 B/op        1234 allocs/op

# 生成 CPU profile
go test -bench=. -cpuprofile=cpu.prof ./internal/controllers

# 分析 profile
go tool pprof cpu.prof
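
如果想比較不同 Mode 下 Reconcile 的成本,可以用 b.Run 建立子 benchmark(簡化示意;setupBenchmark 為假設的輔助函式,用來沿用上面 BenchmarkReconcile 的準備邏輯並替換 Spec.Mode):

func BenchmarkReconcileByMode(b *testing.B) {
    modes := []v1beta1.Mode{
        v1beta1.ModeDeployment,
        v1beta1.ModeDaemonSet,
        v1beta1.ModeStatefulSet,
    }

    for _, mode := range modes {
        b.Run(string(mode), func(b *testing.B) {
            // reconciler, req := setupBenchmark(b, mode)  // 假設的輔助函式
            b.ReportAllocs() // 同時回報每次操作的記憶體配置
            b.ResetTimer()
            for i := 0; i < b.N; i++ {
                // _, _ = reconciler.Reconcile(context.Background(), req)
            }
        })
    }
}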

九、實戰專案:Nginx Operator

在本章中,我們將從零開始實作一個完整的 Nginx Operator,應用前面學到的所有概念。

9.1 專案目標與架構

9.1.1 功能需求

我們的 Nginx Operator 需要支援:

  1. 自動部署 Nginx:根據 CR 創建 Deployment

  2. 配置管理:支援自定義 nginx.conf

  3. 服務暴露:自動創建 Service

  4. 配置熱更新:ConfigMap 變更後自動觸發滾動更新

  5. 版本管理:支援 Nginx 版本升級

  6. 健康檢查:配置 liveness 和 readiness probe

9.1.2 CRD 設計

API 定義 (apis/v1alpha1/nginx_types.go):

package v1alpha1

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// NginxSpec defines the desired state of Nginx
type NginxSpec struct {
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=10
    // +kubebuilder:default=1
    Replicas *int32 `json:"replicas,omitempty"`

    // +kubebuilder:validation:Pattern=`^nginx:.*$`
    // +kubebuilder:default="nginx:1.25"
    Image string `json:"image,omitempty"`

    // +optional
    Resources corev1.ResourceRequirements `json:"resources,omitempty"`

    // +optional
    Config string `json:"config,omitempty"`

    // +optional
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=65535
    // +kubebuilder:default=80
    Port int32 `json:"port,omitempty"`

    // +optional
    // +kubebuilder:validation:Enum=ClusterIP;NodePort;LoadBalancer
    // +kubebuilder:default=ClusterIP
    ServiceType corev1.ServiceType `json:"serviceType,omitempty"`
}

// NginxStatus defines the observed state of Nginx
type NginxStatus struct {
    // Conditions represent the latest available observations of the Nginx state
    // +optional
    Conditions []metav1.Condition `json:"conditions,omitempty"`

    // ReadyReplicas is the number of ready replicas
    // +optional
    ReadyReplicas int32 `json:"readyReplicas,omitempty"`

    // ConfigVersion tracks the version of the ConfigMap
    // +optional
    ConfigVersion string `json:"configVersion,omitempty"`

    // LastUpdateTime is the last time the status was updated
    // +optional
    LastUpdateTime *metav1.Time `json:"lastUpdateTime,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:subresource:scale:specpath=.spec.replicas,statuspath=.status.readyReplicas
// +kubebuilder:printcolumn:name="Replicas",type=integer,JSONPath=`.spec.replicas`
// +kubebuilder:printcolumn:name="Ready",type=integer,JSONPath=`.status.readyReplicas`
// +kubebuilder:printcolumn:name="Image",type=string,JSONPath=`.spec.image`
// +kubebuilder:printcolumn:name="Age",type=date,JSONPath=`.metadata.creationTimestamp`

// Nginx is the Schema for the nginxes API
type Nginx struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   NginxSpec   `json:"spec,omitempty"`
    Status NginxStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// NginxList contains a list of Nginx
type NginxList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []Nginx `json:"items"`
}

func init() {
    SchemeBuilder.Register(&Nginx{}, &NginxList{})
}
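
上面用到的 SchemeBuilder 來自同一個 package 的 groupversion_info.go,這是 kubebuilder 的標準腳手架(簡化示意):

// apis/v1alpha1/groupversion_info.go
// +kubebuilder:object:generate=true
// +groupName=webapp.example.com
package v1alpha1

import (
    "k8s.io/apimachinery/pkg/runtime/schema"
    "sigs.k8s.io/controller-runtime/pkg/scheme"
)

var (
    // GroupVersion 是這組 API 的 group/version
    GroupVersion = schema.GroupVersion{Group: "webapp.example.com", Version: "v1alpha1"}

    // SchemeBuilder 供 nginx_types.go 的 init() 註冊型別
    SchemeBuilder = &scheme.Builder{GroupVersion: GroupVersion}

    // AddToScheme 會在 main.go 中被呼叫,把型別加進 manager 的 scheme
    AddToScheme = SchemeBuilder.AddToScheme
)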

關鍵註解說明

  • +kubebuilder:subresource:status:啟用 status subresource

  • +kubebuilder:subresource:scale:支援 kubectl scale 命令

  • +kubebuilder:printcolumn:自定義 kubectl get 輸出列

  • +kubebuilder:validation:欄位驗證規則

9.2 Controller 實作

9.2.1 Reconciler 主邏輯

Controller (internal/controller/nginx_controller.go):

package controller

import (
    "context"
    "crypto/sha256"
    "fmt"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    "k8s.io/apimachinery/pkg/api/meta"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/types"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    "sigs.k8s.io/controller-runtime/pkg/log"

    webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)

const (
    nginxFinalizer = "nginx.webapp.example.com/finalizer"

    // Condition types
    TypeAvailable   = "Available"
    TypeProgressing = "Progressing"
    TypeDegraded    = "Degraded"
)

// NginxReconciler reconciles a Nginx object
type NginxReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services;configmaps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=pods,verbs=get;list;watch

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)
    logger.Info("Reconciling Nginx", "namespace", req.Namespace, "name", req.Name)

    // 1. 獲取 Nginx 資源
    nginx := &webappv1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Nginx resource not found, likely deleted")
            return ctrl.Result{}, nil
        }
        logger.Error(err, "Failed to get Nginx")
        return ctrl.Result{}, err
    }

    // 2. 處理刪除邏輯 (Finalizer)
    if !nginx.ObjectMeta.DeletionTimestamp.IsZero() {
        return r.reconcileDelete(ctx, nginx)
    }

    // 3. 添加 Finalizer
    if !controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
        controllerutil.AddFinalizer(nginx, nginxFinalizer)
        if err := r.Update(ctx, nginx); err != nil {
            logger.Error(err, "Failed to add finalizer")
            return ctrl.Result{}, err
        }
        return ctrl.Result{Requeue: true}, nil
    }

    // 4. Reconcile ConfigMap
    configMap, configVersion, err := r.reconcileConfigMap(ctx, nginx)
    if err != nil {
        r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "ConfigMapFailed", err.Error())
        _ = r.Status().Update(ctx, nginx)
        return ctrl.Result{}, err
    }

    // 5. Reconcile Deployment
    if err := r.reconcileDeployment(ctx, nginx, configVersion); err != nil {
        r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "DeploymentFailed", err.Error())
        _ = r.Status().Update(ctx, nginx)
        return ctrl.Result{}, err
    }

    // 6. Reconcile Service
    if err := r.reconcileService(ctx, nginx); err != nil {
        r.setCondition(nginx, TypeDegraded, metav1.ConditionTrue, "ServiceFailed", err.Error())
        _ = r.Status().Update(ctx, nginx)
        return ctrl.Result{}, err
    }

    // 7. 更新 Status
    if err := r.updateStatus(ctx, nginx, configVersion); err != nil {
        logger.Error(err, "Failed to update status")
        return ctrl.Result{}, err
    }

    // 8. 設置成功 Condition
    r.setCondition(nginx, TypeAvailable, metav1.ConditionTrue, "ReconcileSuccess", "Nginx is available")
    r.setCondition(nginx, TypeDegraded, metav1.ConditionFalse, "ReconcileSuccess", "")
    if err := r.Status().Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }

    logger.Info("Successfully reconciled Nginx")
    return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}

// reconcileConfigMap 管理 ConfigMap
func (r *NginxReconciler) reconcileConfigMap(ctx context.Context, nginx *webappv1alpha1.Nginx) (*corev1.ConfigMap, string, error) {
    logger := log.FromContext(ctx)

    // 默認 nginx 配置
    defaultConfig := `
events {
    worker_connections 1024;
}

http {
    server {
        listen 80;
        location / {
            root /usr/share/nginx/html;
            index index.html;
        }
    }
}`

    config := nginx.Spec.Config
    if config == "" {
        config = defaultConfig
    }

    // 計算配置的版本(hash)
    hash := sha256.Sum256([]byte(config))
    configVersion := fmt.Sprintf("%x", hash[:8])

    configMap := &corev1.ConfigMap{
        ObjectMeta: metav1.ObjectMeta{
            Name:      nginx.Name + "-config",
            Namespace: nginx.Namespace,
            Labels: map[string]string{
                "app":     "nginx",
                "nginx":   nginx.Name,
                "version": configVersion,
            },
        },
        Data: map[string]string{
            "nginx.conf": config,
        },
    }

    // 設置 Owner Reference
    if err := controllerutil.SetControllerReference(nginx, configMap, r.Scheme); err != nil {
        return nil, "", err
    }

    // 創建或更新 ConfigMap
    existing := &corev1.ConfigMap{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      configMap.Name,
        Namespace: configMap.Namespace,
    }, existing)

    if err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Creating ConfigMap", "name", configMap.Name)
            if err := r.Create(ctx, configMap); err != nil {
                return nil, "", err
            }
            return configMap, configVersion, nil
        }
        return nil, "", err
    }

    // 更新 ConfigMap
    existing.Data = configMap.Data
    existing.Labels = configMap.Labels
    logger.Info("Updating ConfigMap", "name", configMap.Name)
    if err := r.Update(ctx, existing); err != nil {
        return nil, "", err
    }

    return existing, configVersion, nil
}

// reconcileDeployment 管理 Deployment
func (r *NginxReconciler) reconcileDeployment(ctx context.Context, nginx *webappv1alpha1.Nginx, configVersion string) error {
    logger := log.FromContext(ctx)

    replicas := int32(1)
    if nginx.Spec.Replicas != nil {
        replicas = *nginx.Spec.Replicas
    }

    image := "nginx:1.25"
    if nginx.Spec.Image != "" {
        image = nginx.Spec.Image
    }

    port := int32(80)
    if nginx.Spec.Port != 0 {
        port = nginx.Spec.Port
    }

    deployment := &appsv1.Deployment{
        ObjectMeta: metav1.ObjectMeta{
            Name:      nginx.Name,
            Namespace: nginx.Namespace,
            Labels: map[string]string{
                "app":   "nginx",
                "nginx": nginx.Name,
            },
        },
        Spec: appsv1.DeploymentSpec{
            Replicas: &replicas,
            Selector: &metav1.LabelSelector{
                MatchLabels: map[string]string{
                    "app":   "nginx",
                    "nginx": nginx.Name,
                },
            },
            Template: corev1.PodTemplateSpec{
                ObjectMeta: metav1.ObjectMeta{
                    Labels: map[string]string{
                        "app":           "nginx",
                        "nginx":         nginx.Name,
                        "config-version": configVersion,  // 配置版本變更時觸發滾動更新
                    },
                },
                Spec: corev1.PodSpec{
                    Containers: []corev1.Container{
                        {
                            Name:  "nginx",
                            Image: image,
                            Ports: []corev1.ContainerPort{
                                {
                                    Name:          "http",
                                    ContainerPort: port,
                                    Protocol:      corev1.ProtocolTCP,
                                },
                            },
                            VolumeMounts: []corev1.VolumeMount{
                                {
                                    Name:      "config",
                                    MountPath: "/etc/nginx/nginx.conf",
                                    SubPath:   "nginx.conf",
                                },
                            },
                            Resources: nginx.Spec.Resources,
                            LivenessProbe: &corev1.Probe{
                                ProbeHandler: corev1.ProbeHandler{
                                    HTTPGet: &corev1.HTTPGetAction{
                                        Path: "/",
                                        Port: intstr.FromInt(int(port)),
                                    },
                                },
                                InitialDelaySeconds: 10,
                                PeriodSeconds:       10,
                            },
                            ReadinessProbe: &corev1.Probe{
                                ProbeHandler: corev1.ProbeHandler{
                                    HTTPGet: &corev1.HTTPGetAction{
                                        Path: "/",
                                        Port: intstr.FromInt(int(port)),
                                    },
                                },
                                InitialDelaySeconds: 5,
                                PeriodSeconds:       5,
                            },
                        },
                    },
                    Volumes: []corev1.Volume{
                        {
                            Name: "config",
                            VolumeSource: corev1.VolumeSource{
                                ConfigMap: &corev1.ConfigMapVolumeSource{
                                    LocalObjectReference: corev1.LocalObjectReference{
                                        Name: nginx.Name + "-config",
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
    }

    // 設置 Owner Reference
    if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
        return err
    }

    // 創建或更新 Deployment
    existing := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      deployment.Name,
        Namespace: deployment.Namespace,
    }, existing)

    if err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Creating Deployment", "name", deployment.Name)
            return r.Create(ctx, deployment)
        }
        return err
    }

    // 更新 Deployment
    existing.Spec = deployment.Spec
    logger.Info("Updating Deployment", "name", deployment.Name)
    return r.Update(ctx, existing)
}

// reconcileService 管理 Service
func (r *NginxReconciler) reconcileService(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    logger := log.FromContext(ctx)

    port := int32(80)
    if nginx.Spec.Port != 0 {
        port = nginx.Spec.Port
    }

    serviceType := corev1.ServiceTypeClusterIP
    if nginx.Spec.ServiceType != "" {
        serviceType = nginx.Spec.ServiceType
    }

    service := &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      nginx.Name,
            Namespace: nginx.Namespace,
            Labels: map[string]string{
                "app":   "nginx",
                "nginx": nginx.Name,
            },
        },
        Spec: corev1.ServiceSpec{
            Type: serviceType,
            Selector: map[string]string{
                "app":   "nginx",
                "nginx": nginx.Name,
            },
            Ports: []corev1.ServicePort{
                {
                    Name:       "http",
                    Protocol:   corev1.ProtocolTCP,
                    Port:       port,
                    TargetPort: intstr.FromInt(int(port)),
                },
            },
        },
    }

    // 設置 Owner Reference
    if err := controllerutil.SetControllerReference(nginx, service, r.Scheme); err != nil {
        return err
    }

    // 創建或更新 Service
    existing := &corev1.Service{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      service.Name,
        Namespace: service.Namespace,
    }, existing)

    if err != nil {
        if apierrors.IsNotFound(err) {
            logger.Info("Creating Service", "name", service.Name)
            return r.Create(ctx, service)
        }
        return err
    }

    // 更新 Service(保留 ClusterIP)
    existing.Spec.Ports = service.Spec.Ports
    existing.Spec.Selector = service.Spec.Selector
    existing.Spec.Type = service.Spec.Type
    logger.Info("Updating Service", "name", service.Name)
    return r.Update(ctx, existing)
}

// updateStatus 更新 Nginx 狀態
func (r *NginxReconciler) updateStatus(ctx context.Context, nginx *webappv1alpha1.Nginx, configVersion string) error {
    // 獲取 Deployment 狀態
    deployment := &appsv1.Deployment{}
    err := r.Get(ctx, types.NamespacedName{
        Name:      nginx.Name,
        Namespace: nginx.Namespace,
    }, deployment)
    if err != nil {
        return err
    }

    // 更新狀態
    nginx.Status.ReadyReplicas = deployment.Status.ReadyReplicas
    nginx.Status.ConfigVersion = configVersion
    now := metav1.Now()
    nginx.Status.LastUpdateTime = &now

    // 設置 Progressing condition
    if deployment.Status.UpdatedReplicas < *deployment.Spec.Replicas {
        r.setCondition(nginx, TypeProgressing, metav1.ConditionTrue, "Updating", "Deployment is being updated")
    } else {
        r.setCondition(nginx, TypeProgressing, metav1.ConditionFalse, "Updated", "All replicas are updated")
    }

    return r.Status().Update(ctx, nginx)
}

// reconcileDelete 處理刪除邏輯
func (r *NginxReconciler) reconcileDelete(ctx context.Context, nginx *webappv1alpha1.Nginx) (ctrl.Result, error) {
    logger := log.FromContext(ctx)
    logger.Info("Deleting Nginx", "name", nginx.Name)

    if controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
        // 執行清理邏輯(例如清理外部資源)
        logger.Info("Performing cleanup for Nginx", "name", nginx.Name)

        // 移除 finalizer
        controllerutil.RemoveFinalizer(nginx, nginxFinalizer)
        if err := r.Update(ctx, nginx); err != nil {
            return ctrl.Result{}, err
        }
    }

    return ctrl.Result{}, nil
}

// setCondition 設置 Condition
func (r *NginxReconciler) setCondition(nginx *webappv1alpha1.Nginx, condType string, status metav1.ConditionStatus, reason, message string) {
    condition := metav1.Condition{
        Type:               condType,
        Status:             status,
        Reason:             reason,
        Message:            message,
        LastTransitionTime: metav1.Now(),
        ObservedGeneration: nginx.Generation,
    }
    meta.SetStatusCondition(&nginx.Status.Conditions, condition)
}

// SetupWithManager sets up the controller with the Manager.
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        Owns(&appsv1.Deployment{}).
        Owns(&corev1.Service{}).
        Owns(&corev1.ConfigMap{}).
        Complete(r)
}
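
最後在 main.go 把 Reconciler 掛到 Manager 上,這部分同樣是 kubebuilder 標準腳手架(簡化示意,省略 logger、metrics、健康檢查等設定):

// cmd/main.go(簡化示意)
package main

import (
    "os"

    "k8s.io/apimachinery/pkg/runtime"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    ctrl "sigs.k8s.io/controller-runtime"

    webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
    "example.com/nginx-operator/internal/controller"
)

var scheme = runtime.NewScheme()

func init() {
    _ = clientgoscheme.AddToScheme(scheme) // 內建型別(Deployment、Service、ConfigMap...)
    _ = webappv1alpha1.AddToScheme(scheme) // 自訂的 Nginx CRD
}

func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
    if err != nil {
        os.Exit(1)
    }

    if err := (&controller.NginxReconciler{
        Client: mgr.GetClient(),
        Scheme: mgr.GetScheme(),
    }).SetupWithManager(mgr); err != nil {
        os.Exit(1)
    }

    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        os.Exit(1)
    }
}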

9.2.2 核心概念詳解

1. ConfigMap 版本管理

// 計算配置的 SHA256 hash 作為版本號
hash := sha256.Sum256([]byte(config))
configVersion := fmt.Sprintf("%x", hash[:8])

// 將版本號添加到 Pod labels
Labels: map[string]string{
    "config-version": configVersion,  // 版本變更時自動觸發 rolling update
}
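
另一種常見做法是把 hash 放進 Pod template 的 annotation(例如慣用的 checksum/config 鍵),效果相同,且 annotation 沒有 label 的字元與長度限制(簡化示意):

// 簡化示意:以 annotation 記錄配置 checksum,值改變時同樣會觸發滾動更新
deployment.Spec.Template.ObjectMeta.Annotations = map[string]string{
    "checksum/config": configVersion,
}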

2. Owner Reference 自動清理

// 設置 Owner Reference 後,刪除 Nginx 時會自動刪除子資源
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
    return err
}

3. Finalizer 清理邏輯

// 添加 finalizer 防止資源被立即刪除
if !controllerutil.ContainsFinalizer(nginx, nginxFinalizer) {
    controllerutil.AddFinalizer(nginx, nginxFinalizer)
    if err := r.Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }
}

// 刪除時執行清理
if !nginx.ObjectMeta.DeletionTimestamp.IsZero() {
    // 執行清理邏輯
    // ...

    // 移除 finalizer 允許刪除
    controllerutil.RemoveFinalizer(nginx, nginxFinalizer)
    if err := r.Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }
}

9.3 使用範例

9.3.1 部署 Operator

# 1. 生成 CRD 和 RBAC manifests
make manifests

# 2. 安裝 CRD
make install

# 3. 運行 operator(開發模式)
make run

# 或者部署到集群
make docker-build docker-push IMG=your-registry/nginx-operator:v1.0.0
make deploy IMG=your-registry/nginx-operator:v1.0.0

9.3.2 創建 Nginx 實例

基本範例 (config/samples/nginx_basic.yaml):

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-sample
  namespace: default
spec:
  replicas: 3
  image: nginx:1.25
  port: 80
  serviceType: ClusterIP

自定義配置範例 (config/samples/nginx_custom.yaml):

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-custom
  namespace: default
spec:
  replicas: 2
  image: nginx:1.25
  port: 8080
  serviceType: LoadBalancer
  resources:
    requests:
      memory: "128Mi"
      cpu: "100m"
    limits:
      memory: "256Mi"
      cpu: "200m"
  config: |
    events {
        worker_connections 2048;
    }

    http {
        log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" "$http_x_forwarded_for"';

        access_log /var/log/nginx/access.log main;

        upstream backend {
            server backend-1.example.com;
            server backend-2.example.com;
        }

        server {
            listen 8080;

            location / {
                proxy_pass http://backend;
                proxy_set_header Host $host;
                proxy_set_header X-Real-IP $remote_addr;
            }

            location /health {
                access_log off;
                return 200 "healthy\n";
            }
        }
    }

9.3.3 測試和驗證

# 1. 創建 Nginx 實例
kubectl apply -f config/samples/nginx_basic.yaml

# 2. 查看 Nginx 狀態
kubectl get nginx nginx-sample

# 輸出:
# NAME           REPLICAS   READY   IMAGE         AGE
# nginx-sample   3          3       nginx:1.25    2m

# 3. 查看詳細狀態
kubectl describe nginx nginx-sample

# 4. 查看生成的資源
kubectl get deployment,svc,cm -l nginx=nginx-sample

# 5. 測試服務
kubectl port-forward svc/nginx-sample 8080:80
curl http://localhost:8080

# 6. 更新配置(觸發滾動更新)
kubectl edit nginx nginx-sample
# 修改 spec.config 或 spec.replicas

# 7. 觀察滾動更新
kubectl rollout status deployment/nginx-sample

# 8. 查看 Conditions
kubectl get nginx nginx-sample -o jsonpath='{.status.conditions}' | jq

# 9. Scale 測試
kubectl scale nginx nginx-sample --replicas=5
kubectl get nginx nginx-sample

# 10. 刪除測試
kubectl delete nginx nginx-sample
# 驗證子資源被自動清理
kubectl get deployment,svc,cm -l nginx=nginx-sample

9.4 進階功能實作

9.4.1 支援 Ingress 自動創建

擴展 CRD

type NginxSpec struct {
    // ... 現有字段

    // +optional
    Ingress *IngressSpec `json:"ingress,omitempty"`
}

type IngressSpec struct {
    // +kubebuilder:validation:Required
    Host string `json:"host"`

    // +optional
    TLS bool `json:"tls,omitempty"`

    // +optional
    Annotations map[string]string `json:"annotations,omitempty"`
}

Controller 添加 Ingress reconcile

func (r *NginxReconciler) reconcileIngress(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    if nginx.Spec.Ingress == nil {
        return nil
    }

    pathType := networkingv1.PathTypePrefix
    ingress := &networkingv1.Ingress{
        ObjectMeta: metav1.ObjectMeta{
            Name:        nginx.Name,
            Namespace:   nginx.Namespace,
            Annotations: nginx.Spec.Ingress.Annotations,
        },
        Spec: networkingv1.IngressSpec{
            Rules: []networkingv1.IngressRule{
                {
                    Host: nginx.Spec.Ingress.Host,
                    IngressRuleValue: networkingv1.IngressRuleValue{
                        HTTP: &networkingv1.HTTPIngressRuleValue{
                            Paths: []networkingv1.HTTPIngressPath{
                                {
                                    Path:     "/",
                                    PathType: &pathType,
                                    Backend: networkingv1.IngressBackend{
                                        Service: &networkingv1.IngressServiceBackend{
                                            Name: nginx.Name,
                                            Port: networkingv1.ServiceBackendPort{
                                                Number: nginx.Spec.Port,
                                            },
                                        },
                                    },
                                },
                            },
                        },
                    },
                },
            },
        },
    }

    if nginx.Spec.Ingress.TLS {
        ingress.Spec.TLS = []networkingv1.IngressTLS{
            {
                Hosts:      []string{nginx.Spec.Ingress.Host},
                SecretName: nginx.Name + "-tls",
            },
        }
    }

    if err := controllerutil.SetControllerReference(nginx, ingress, r.Scheme); err != nil {
        return err
    }

    // 創建或更新邏輯...
    return nil
}
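
上面的「創建或更新邏輯」可以照 Deployment/Service 的寫法手寫 Get + Create/Update,也可以改用 controller-runtime 內建的 controllerutil.CreateOrUpdate(簡化示意):

// 簡化示意:用 CreateOrUpdate 取代手寫的 Get/Create/Update 分支
desiredSpec := ingress.Spec // 先保存期望的 Spec,因為 CreateOrUpdate 會先用線上物件覆蓋 ingress

op, err := controllerutil.CreateOrUpdate(ctx, r.Client, ingress, func() error {
    // mutate 函式只負責把物件改成期望狀態;物件不存在時 Create,存在時比對後 Update
    ingress.Spec = desiredSpec
    return controllerutil.SetControllerReference(nginx, ingress, r.Scheme)
})
if err != nil {
    return err
}
log.FromContext(ctx).Info("reconciled ingress", "operation", op)
return nil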

9.4.2 支援多端口暴露

擴展 CRD

type PortSpec struct {
    // +kubebuilder:validation:Required
    Name string `json:"name"`

    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=65535
    Port int32 `json:"port"`

    // +optional
    // +kubebuilder:default="TCP"
    Protocol corev1.Protocol `json:"protocol,omitempty"`
}

type NginxSpec struct {
    // ... 現有字段

    // +optional
    AdditionalPorts []PortSpec `json:"additionalPorts,omitempty"`
}

Controller 處理多端口

func (r *NginxReconciler) buildServicePorts(nginx *webappv1alpha1.Nginx) []corev1.ServicePort {
    ports := []corev1.ServicePort{
        {
            Name:       "http",
            Port:       nginx.Spec.Port,
            TargetPort: intstr.FromInt(int(nginx.Spec.Port)),
            Protocol:   corev1.ProtocolTCP,
        },
    }

    for _, p := range nginx.Spec.AdditionalPorts {
        ports = append(ports, corev1.ServicePort{
            Name:       p.Name,
            Port:       p.Port,
            TargetPort: intstr.FromInt(int(p.Port)),
            Protocol:   p.Protocol,
        })
    }

    return ports
}

9.5 完整測試範例

9.5.1 單元測試

Controller 測試 (internal/controller/nginx_controller_test.go):

package controller

import (
    "context"
    "time"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/utils/pointer"

    webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)

var _ = Describe("Nginx Controller", func() {
    const (
        NginxName      = "test-nginx"
        NginxNamespace = "default"
        timeout        = time.Second * 10
        interval       = time.Millisecond * 250
    )

    Context("When creating a Nginx resource", func() {
        It("Should create Deployment, Service, and ConfigMap", func() {
            ctx := context.Background()

            nginx := &webappv1alpha1.Nginx{
                ObjectMeta: metav1.ObjectMeta{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                },
                Spec: webappv1alpha1.NginxSpec{
                    Replicas: pointer.Int32(2),
                    Image:    "nginx:1.25",
                    Port:     80,
                },
            }

            Expect(k8sClient.Create(ctx, nginx)).Should(Succeed())

            // 驗證 Deployment 被創建
            deployment := &appsv1.Deployment{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, deployment)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            Expect(*deployment.Spec.Replicas).Should(Equal(int32(2)))
            Expect(deployment.Spec.Template.Spec.Containers[0].Image).Should(Equal("nginx:1.25"))

            // 驗證 Service 被創建
            service := &corev1.Service{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, service)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            Expect(service.Spec.Ports[0].Port).Should(Equal(int32(80)))

            // 驗證 ConfigMap 被創建
            configMap := &corev1.ConfigMap{}
            Eventually(func() bool {
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName + "-config",
                    Namespace: NginxNamespace,
                }, configMap)
                return err == nil
            }, timeout, interval).Should(BeTrue())

            Expect(configMap.Data).Should(HaveKey("nginx.conf"))
        })
    })

    Context("When updating Nginx config", func() {
        It("Should trigger rolling update", func() {
            ctx := context.Background()

            // 獲取當前 Nginx
            nginx := &webappv1alpha1.Nginx{}
            Expect(k8sClient.Get(ctx, types.NamespacedName{
                Name:      NginxName,
                Namespace: NginxNamespace,
            }, nginx)).Should(Succeed())

            // 更新配置
            nginx.Spec.Config = "# updated config"
            Expect(k8sClient.Update(ctx, nginx)).Should(Succeed())

            // 驗證 ConfigMap 被更新
            configMap := &corev1.ConfigMap{}
            Eventually(func() string {
                k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName + "-config",
                    Namespace: NginxNamespace,
                }, configMap)
                return configMap.Data["nginx.conf"]
            }, timeout, interval).Should(ContainSubstring("updated config"))

            // 驗證 Deployment Pod template 的 config-version label 改變
            deployment := &appsv1.Deployment{}
            oldVersion := ""
            Eventually(func() bool {
                k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, deployment)
                newVersion := deployment.Spec.Template.Labels["config-version"]
                if oldVersion == "" {
                    oldVersion = newVersion
                    return false
                }
                return newVersion != oldVersion
            }, timeout, interval).Should(BeTrue())
        })
    })

    Context("When deleting Nginx", func() {
        It("Should cleanup all resources", func() {
            ctx := context.Background()

            nginx := &webappv1alpha1.Nginx{}
            Expect(k8sClient.Get(ctx, types.NamespacedName{
                Name:      NginxName,
                Namespace: NginxNamespace,
            }, nginx)).Should(Succeed())

            Expect(k8sClient.Delete(ctx, nginx)).Should(Succeed())

            // 驗證所有子資源被刪除
            Eventually(func() bool {
                deployment := &appsv1.Deployment{}
                err := k8sClient.Get(ctx, types.NamespacedName{
                    Name:      NginxName,
                    Namespace: NginxNamespace,
                }, deployment)
                return errors.IsNotFound(err)
            }, timeout, interval).Should(BeTrue())
        })
    })
})
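
上面測試中的 k8sClient 來自 Ginkgo 套件共用的 suite_test.go,通常用 envtest 啟動臨時的 API Server(簡化示意;實際套件中還會在 BeforeSuite 建立 manager、註冊 NginxReconciler 並以 goroutine 執行 mgr.Start):

// internal/controller/suite_test.go(簡化示意)
package controller

import (
    "testing"

    . "github.com/onsi/ginkgo/v2"
    . "github.com/onsi/gomega"
    "k8s.io/client-go/kubernetes/scheme"
    "sigs.k8s.io/controller-runtime/pkg/client"
    "sigs.k8s.io/controller-runtime/pkg/envtest"

    webappv1alpha1 "example.com/nginx-operator/apis/v1alpha1"
)

var (
    k8sClient client.Client
    testEnv   *envtest.Environment
)

func TestControllers(t *testing.T) {
    RegisterFailHandler(Fail)
    RunSpecs(t, "Controller Suite")
}

var _ = BeforeSuite(func() {
    // envtest 會啟動本地的 etcd 與 kube-apiserver,並載入 CRD YAML
    testEnv = &envtest.Environment{
        CRDDirectoryPaths: []string{"../../config/crd/bases"},
    }
    cfg, err := testEnv.Start()
    Expect(err).NotTo(HaveOccurred())

    Expect(webappv1alpha1.AddToScheme(scheme.Scheme)).To(Succeed())

    k8sClient, err = client.New(cfg, client.Options{Scheme: scheme.Scheme})
    Expect(err).NotTo(HaveOccurred())
})

var _ = AfterSuite(func() {
    Expect(testEnv.Stop()).To(Succeed())
})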

9.5.2 E2E 測試(Chainsaw)

測試套件 (tests/e2e/nginx/chainsaw-test.yaml):

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: nginx-e2e
spec:
  timeouts:
    apply: 30s
    assert: 1m
    cleanup: 30s
  steps:
    - name: Create Nginx instance
      try:
        - apply:
            file: 00-nginx-sample.yaml
        - assert:
            file: 01-assert-nginx-ready.yaml

    - name: Test service connectivity
      try:
        - script:
            content: |
              kubectl run curl-test --image=curlimages/curl:latest --rm -i --restart=Never -- \
                curl -s http://nginx-sample.default.svc.cluster.local
            check:
              ($error == null): true

    - name: Update configuration
      try:
        - apply:
            file: 02-update-config.yaml
        - assert:
            file: 03-assert-rolling-update.yaml

    - name: Scale up
      try:
        - script:
            content: |
              kubectl scale nginx nginx-sample --replicas=5
        - assert:
            file: 04-assert-scaled.yaml

    - name: Cleanup
      try:
        - delete:
            ref:
              apiVersion: webapp.example.com/v1alpha1
              kind: Nginx
              name: nginx-sample
        - assert:
            file: 05-assert-deleted.yaml

00-nginx-sample.yaml

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-sample
  namespace: default
spec:
  replicas: 3
  image: nginx:1.25
  port: 80
  serviceType: ClusterIP

01-assert-nginx-ready.yaml

apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: nginx-sample
  namespace: default
status:
  readyReplicas: 3
  conditions:
    - type: Available
      status: "True"
      reason: ReconcileSuccess
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-sample
  namespace: default
spec:
  replicas: 3
status:
  readyReplicas: 3
  updatedReplicas: 3
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-sample
  namespace: default
spec:
  type: ClusterIP
  ports:
    - port: 80

9.6 部署到生產環境

9.6.1 Kustomize 配置

Base (config/default/kustomization.yaml):

namePrefix: nginx-operator-
namespace: nginx-operator-system

resources:
  - ../crd
  - ../rbac
  - ../manager

images:
  - name: controller
    newName: your-registry/nginx-operator
    newTag: v1.0.0

Overlay for Production (config/overlays/production/kustomization.yaml):

bases:
  - ../../default

patchesStrategicMerge:
  - manager_resources.yaml

configMapGenerator:
  - name: manager-config
    literals:
      - LOG_LEVEL=info
      - ENABLE_WEBHOOKS=true

manager_resources.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller-manager
  namespace: system
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: manager
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

9.6.2 部署命令

# 構建鏡像
make docker-build IMG=your-registry/nginx-operator:v1.0.0

# 推送鏡像
make docker-push IMG=your-registry/nginx-operator:v1.0.0

# 部署到生產環境
kubectl apply -k config/overlays/production

# 驗證部署
kubectl get deployment -n nginx-operator-system
kubectl get pods -n nginx-operator-system

# 查看日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager -f

十、進階主題與最佳實踐

10.1 性能優化

10.1.1 使用 Field Indexer 加速查詢

當需要頻繁根據特定字段查詢資源時,Field Indexer 可以大幅提升性能。

範例:根據 Owner 快速查詢 Pod

參考 OpenTelemetry Operator 的實作 (main.go:126-139):

// 為 Pod 添加 owner.name 索引
if err := mgr.GetFieldIndexer().IndexField(
    context.Background(),
    &corev1.Pod{},
    "metadata.ownerReferences.name",
    func(rawObj client.Object) []string {
        pod := rawObj.(*corev1.Pod)
        var owners []string
        for _, ref := range pod.GetOwnerReferences() {
            owners = append(owners, ref.Name)
        }
        return owners
    },
); err != nil {
    setupLog.Error(err, "failed to create pod index")
    os.Exit(1)
}

// 使用索引快速查詢
pods := &corev1.PodList{}
err := r.List(ctx, pods, client.MatchingFields{
    "metadata.ownerReferences.name": collectorName,
})

更多索引範例

// 1. 索引 ConfigMap 的標籤
mgr.GetFieldIndexer().IndexField(
    context.Background(),
    &corev1.ConfigMap{},
    "metadata.labels.app",
    func(obj client.Object) []string {
        cm := obj.(*corev1.ConfigMap)
        if app, ok := cm.Labels["app"]; ok {
            return []string{app}
        }
        return nil
    },
)

// 查詢
configMaps := &corev1.ConfigMapList{}
r.List(ctx, configMaps, client.MatchingFields{
    "metadata.labels.app": "nginx",
})

// 2. 索引 Pod 的 Node
mgr.GetFieldIndexer().IndexField(
    context.Background(),
    &corev1.Pod{},
    "spec.nodeName",
    func(obj client.Object) []string {
        pod := obj.(*corev1.Pod)
        if pod.Spec.NodeName != "" {
            return []string{pod.Spec.NodeName}
        }
        return nil
    },
)

// 查詢特定 Node 上的 Pod
pods := &corev1.PodList{}
r.List(ctx, pods, client.MatchingFields{
    "spec.nodeName": "node-1",
})

10.1.2 使用 Cache 減少 API 調用

Controller Runtime 自動提供 cache,但需要正確配置:

// main.go - 配置 Cache
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme: scheme,
    Cache: cache.Options{
        // 限制 cache 的 namespace(多租戶場景)
        Namespaces: []string{"namespace1", "namespace2"},

        // 配置同步週期
        SyncPeriod: pointer.Duration(10 * time.Minute),
    },
    // 配置 metrics 和 health 端點
    Metrics: server.Options{
        BindAddress: ":8080",
    },
    HealthProbeBindAddress: ":8081",
})

// Controller 中使用 cache(自動)
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // r.Get 自動從 cache 讀取
    nginx := &v1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        return ctrl.Result{}, err
    }

    // r.List 也從 cache 讀取
    deployments := &appsv1.DeploymentList{}
    if err := r.List(ctx, deployments, client.InNamespace(req.Namespace)); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil
}
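
補充:若某些讀取不能容忍 cache 延遲(例如需要讀取其他元件剛建立、cache 尚未同步的資源),可以改用 Manager 的 GetAPIReader() 取得不經 cache 的 Reader。以下是假設性的示意寫法(APIReader 欄位與 getSecretDirect 皆為本文自行命名,import 部分省略 context、corev1、types 等):

// NginxReconciler 額外持有一個不走 cache 的 Reader,
// 在 main.go 中以 mgr.GetAPIReader() 注入
type NginxReconciler struct {
    client.Client
    APIReader client.Reader
    Scheme    *runtime.Scheme
}

// getSecretDirect 直接向 API Server 讀取,避免 cache 尚未同步的問題
func (r *NginxReconciler) getSecretDirect(ctx context.Context, ns, name string) (*corev1.Secret, error) {
    secret := &corev1.Secret{}
    if err := r.APIReader.Get(ctx, types.NamespacedName{Namespace: ns, Name: name}, secret); err != nil {
        return nil, err
    }
    return secret, nil
}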

10.1.3 優化 Reconcile 邏輯

1. 使用 Predicate 過濾不必要的事件

import (
    "reflect"

    "sigs.k8s.io/controller-runtime/pkg/event"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        Owns(&appsv1.Deployment{}).
        Owns(&corev1.Service{}).
        WithEventFilter(predicate.Funcs{
            // 忽略 status-only 更新
            UpdateFunc: func(e event.UpdateEvent) bool {
                oldObj := e.ObjectOld.(*webappv1alpha1.Nginx)
                newObj := e.ObjectNew.(*webappv1alpha1.Nginx)

                // 只在 spec 或 metadata 改變時 reconcile
                return oldObj.Generation != newObj.Generation ||
                       !reflect.DeepEqual(oldObj.Labels, newObj.Labels) ||
                       !reflect.DeepEqual(oldObj.Annotations, newObj.Annotations)
            },
            // 忽略刪除事件(由 finalizer 處理)
            DeleteFunc: func(e event.DeleteEvent) bool {
                return false
            },
        }).
        Complete(r)
}

2. 批量處理更新

func (r *NginxReconciler) reconcileMultipleResources(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    // 並行處理多個資源
    var wg sync.WaitGroup
    errCh := make(chan error, 3)

    wg.Add(3)

    // ConfigMap
    go func() {
        defer wg.Done()
        if _, _, err := r.reconcileConfigMap(ctx, nginx); err != nil {
            errCh <- fmt.Errorf("configmap: %w", err)
        }
    }()

    // Deployment
    go func() {
        defer wg.Done()
        if err := r.reconcileDeployment(ctx, nginx, ""); err != nil {
            errCh <- fmt.Errorf("deployment: %w", err)
        }
    }()

    // Service
    go func() {
        defer wg.Done()
        if err := r.reconcileService(ctx, nginx); err != nil {
            errCh <- fmt.Errorf("service: %w", err)
        }
    }()

    wg.Wait()
    close(errCh)

    // 收集錯誤
    for err := range errCh {
        if err != nil {
            return err
        }
    }

    return nil
}

3. 使用 Generation 判斷資源變更

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    nginx := &webappv1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // 如果 ObservedGeneration 和 Generation 相同,說明 spec 沒變
    if nginx.Status.ObservedGeneration == nginx.Generation {
        // 只檢查子資源狀態,不重新創建
        return r.reconcileStatus(ctx, nginx)
    }

    // spec 改變了,執行完整 reconcile
    return r.fullReconcile(ctx, nginx)
}
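
上面引用的 reconcileStatus 與 fullReconcile 並非 controller-runtime 的內建方法,以下是一個假設性的簡化實作(方法名稱為本文自行定義,import 省略 appsv1、types、client 等):

func (r *NginxReconciler) reconcileStatus(ctx context.Context, nginx *webappv1alpha1.Nginx) (ctrl.Result, error) {
    // 只同步子資源狀態,不重新建立資源
    deployment := &appsv1.Deployment{}
    if err := r.Get(ctx, types.NamespacedName{Namespace: nginx.Namespace, Name: nginx.Name}, deployment); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    nginx.Status.ReadyReplicas = deployment.Status.ReadyReplicas
    return ctrl.Result{}, r.Status().Update(ctx, nginx)
}

func (r *NginxReconciler) fullReconcile(ctx context.Context, nginx *webappv1alpha1.Nginx) (ctrl.Result, error) {
    // spec 已改變,重建/更新所有子資源(reconcileMultipleResources 見 10.1.3)
    if err := r.reconcileMultipleResources(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }
    // 記錄已處理的 Generation,下次 spec 未變時即可走捷徑
    nginx.Status.ObservedGeneration = nginx.Generation
    return ctrl.Result{}, r.Status().Update(ctx, nginx)
}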

10.1.4 限制並發和 Rate Limiting

// main.go - 配置 Controller 選項
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        // ...
    })

    if err = (&controller.NginxReconciler{
        Client: mgr.GetClient(),
        Scheme: mgr.GetScheme(),
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller")
        os.Exit(1)
    }
}

// SetupWithManager - 配置並發和限流
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        WithOptions(controller.Options{
            // 最大並發 reconcile 數
            MaxConcurrentReconciles: 3,

            // Rate Limiter
            RateLimiter: workqueue.NewItemExponentialFailureRateLimiter(
                5*time.Millisecond,  // base delay
                1000*time.Second,    // max delay
            ),
        }).
        Complete(r)
}

10.2 安全最佳實踐

10.2.1 RBAC 最小權限原則

僅授予必要權限

// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=webapp.example.com,resources=nginxes/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services;configmaps,verbs=get;list;watch;create;update;patch;delete

// 不要使用通配符
// 錯誤示範:
// +kubebuilder:rbac:groups=*,resources=*,verbs=*
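
上述 +kubebuilder:rbac 標記經 make manifests 後,大致會產生如下的 ClusterRole(config/rbac/role.yaml 節錄示意,實際內容以 controller-gen 產出為準):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: manager-role
rules:
  - apiGroups: ["webapp.example.com"]
    resources: ["nginxes"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: ["webapp.example.com"]
    resources: ["nginxes/status"]
    verbs: ["get", "patch", "update"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]
  - apiGroups: [""]
    resources: ["configmaps", "services"]
    verbs: ["create", "delete", "get", "list", "patch", "update", "watch"]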

使用 ServiceAccount 隔離

# config/rbac/service_account.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nginx-operator
  namespace: nginx-operator-system
---
# config/rbac/role_binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: nginx-operator-manager-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: nginx-operator-manager-role
subjects:
  - kind: ServiceAccount
    name: nginx-operator
    namespace: nginx-operator-system

10.2.2 Webhook 安全

配置 TLS 和證書管理

# config/certmanager/certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: nginx-operator-serving-cert
  namespace: nginx-operator-system
spec:
  dnsNames:
    - nginx-operator-webhook-service.nginx-operator-system.svc
    - nginx-operator-webhook-service.nginx-operator-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: nginx-operator-selfsigned-issuer
  secretName: webhook-server-cert
---
apiVersion: cert-manager.io/v1
kind: Issuer
metadata:
  name: nginx-operator-selfsigned-issuer
  namespace: nginx-operator-system
spec:
  selfSigned: {}

Validating Webhook 範例

// +kubebuilder:webhook:path=/validate-webapp-example-com-v1alpha1-nginx,mutating=false,failurePolicy=fail,groups=webapp.example.com,resources=nginxes,verbs=create;update,versions=v1alpha1,name=vnginx.kb.io,sideEffects=None,admissionReviewVersions=v1

func (r *Nginx) ValidateCreate() (admission.Warnings, error) {
    // 驗證 replicas 範圍
    if r.Spec.Replicas != nil && (*r.Spec.Replicas < 1 || *r.Spec.Replicas > 10) {
        return nil, fmt.Errorf("replicas must be between 1 and 10")
    }

    // 驗證 image 格式
    if !strings.HasPrefix(r.Spec.Image, "nginx:") {
        return nil, fmt.Errorf("image must start with 'nginx:'")
    }

    // 驗證 config 語法(可選)
    if r.Spec.Config != "" {
        if err := validateNginxConfig(r.Spec.Config); err != nil {
            return nil, fmt.Errorf("invalid nginx config: %w", err)
        }
    }

    return nil, nil
}

func (r *Nginx) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
    oldNginx := old.(*Nginx)

    // 防止降級(可選規則)
    if r.Spec.Image != "" && oldNginx.Spec.Image != "" {
        oldVersion := extractVersion(oldNginx.Spec.Image)
        newVersion := extractVersion(r.Spec.Image)
        if newVersion < oldVersion {
            return admission.Warnings{"Downgrading nginx version may cause issues"}, nil
        }
    }

    return r.ValidateCreate()
}
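
上面用到的 validateNginxConfig 與 extractVersion 並未在前文定義,以下是假設性的簡化實作(版本比較實務上建議改用 semver 函式庫,這裡的字串比較僅為示意;strings 與 fmt 在同一檔案中已被引用):

// extractVersion 從 "nginx:1.25" 這類 image 字串取出 tag 部分
func extractVersion(image string) string {
    parts := strings.SplitN(image, ":", 2)
    if len(parts) == 2 {
        return parts[1]
    }
    return ""
}

// validateNginxConfig 只做最基本的大括號配對檢查,
// 更嚴謹的作法是在 init container 中執行 nginx -t 驗證
func validateNginxConfig(config string) error {
    if strings.Count(config, "{") != strings.Count(config, "}") {
        return fmt.Errorf("unbalanced braces in nginx config")
    }
    return nil
}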

10.2.3 Secret 管理

從 Secret 讀取敏感配置

func (r *NginxReconciler) reconcileDeployment(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    // 從 Secret 掛載 TLS 證書
    volumes := []corev1.Volume{
        {
            Name: "config",
            VolumeSource: corev1.VolumeSource{
                ConfigMap: &corev1.ConfigMapVolumeSource{
                    LocalObjectReference: corev1.LocalObjectReference{
                        Name: nginx.Name + "-config",
                    },
                },
            },
        },
    }

    volumeMounts := []corev1.VolumeMount{
        {
            Name:      "config",
            MountPath: "/etc/nginx/nginx.conf",
            SubPath:   "nginx.conf",
        },
    }

    // 如果配置了 TLS
    if nginx.Spec.TLS != nil {
        volumes = append(volumes, corev1.Volume{
            Name: "tls",
            VolumeSource: corev1.VolumeSource{
                Secret: &corev1.SecretVolumeSource{
                    SecretName: nginx.Spec.TLS.SecretName,
                },
            },
        })

        volumeMounts = append(volumeMounts, corev1.VolumeMount{
            Name:      "tls",
            MountPath: "/etc/nginx/tls",
            ReadOnly:  true,
        })
    }

    // 將 volumes 與 volumeMounts 套用到 Deployment 的 Pod spec 中...(組裝過程省略)
    return nil
}

加密 etcd 中的 Secret(集群級配置):

# /etc/kubernetes/manifests/kube-apiserver.yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
spec:
  containers:
    - name: kube-apiserver
      command:
        - kube-apiserver
        - --encryption-provider-config=/etc/kubernetes/encryption-config.yaml
      volumeMounts:
        - name: encryption-config
          mountPath: /etc/kubernetes/encryption-config.yaml
          readOnly: true
  volumes:
    - name: encryption-config
      hostPath:
        path: /etc/kubernetes/encryption-config.yaml
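
上面只掛載了 encryption-config.yaml,檔案本身的內容大致如下(示意;金鑰可用 head -c 32 /dev/urandom | base64 自行產生):

# /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
  - resources:
      - secrets
    providers:
      - aescbc:
          keys:
            - name: key1
              secret: <base64 編碼的 32-byte 金鑰>
      - identity: {}     # fallback,允許讀取尚未加密的舊資料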

10.3 可觀測性

10.3.1 結構化日誌

使用 logr 進行結構化日誌記錄

import (
    "sigs.k8s.io/controller-runtime/pkg/log"
)

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := log.FromContext(ctx)

    // 基本日誌
    logger.Info("Reconciling Nginx",
        "namespace", req.Namespace,
        "name", req.Name,
    )

    // 帶錯誤的日誌
    if err := r.Get(ctx, req.NamespacedName, &nginx); err != nil {
        logger.Error(err, "Failed to get Nginx",
            "namespace", req.Namespace,
            "name", req.Name,
        )
        return ctrl.Result{}, err
    }

    // Debug 級別日誌
    logger.V(1).Info("Detailed debug info",
        "spec", nginx.Spec,
        "status", nginx.Status,
    )

    // 使用 WithValues 添加上下文
    logger = logger.WithValues(
        "nginx-version", nginx.Spec.Image,
        "replicas", nginx.Spec.Replicas,
    )

    logger.Info("Starting reconciliation")

    return ctrl.Result{}, nil
}

配置日誌級別 (main.go):

import (
    "flag"

    "go.uber.org/zap/zapcore"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"
)

func main() {
    var logLevel int
    flag.IntVar(&logLevel, "log-level", 0, "Log level (0=info, 1=debug, 2=trace)")

    opts := zap.Options{
        Development: true,
        Level:       zapcore.Level(-logLevel),
    }

    ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

    // ...
}

10.3.2 Metrics 暴露

使用 Prometheus metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
    nginxReconcileTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "nginx_operator_reconcile_total",
            Help: "Total number of reconciliations",
        },
        []string{"namespace", "name", "result"},
    )

    nginxReconcileDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "nginx_operator_reconcile_duration_seconds",
            Help:    "Duration of reconciliations",
            Buckets: prometheus.DefBuckets,
        },
        []string{"namespace", "name"},
    )

    nginxCount = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "nginx_operator_nginx_count",
            Help: "Number of Nginx instances",
        },
        []string{"namespace"},
    )
)

func init() {
    // 註冊 metrics
    metrics.Registry.MustRegister(
        nginxReconcileTotal,
        nginxReconcileDuration,
        nginxCount,
    )
}

func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    start := time.Now()
    defer func() {
        duration := time.Since(start).Seconds()
        nginxReconcileDuration.WithLabelValues(
            req.Namespace,
            req.Name,
        ).Observe(duration)
    }()

    nginx := &webappv1alpha1.Nginx{}
    if err := r.Get(ctx, req.NamespacedName, nginx); err != nil {
        nginxReconcileTotal.WithLabelValues(
            req.Namespace,
            req.Name,
            "error",
        ).Inc()
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // ... reconcile 邏輯

    nginxReconcileTotal.WithLabelValues(
        req.Namespace,
        req.Name,
        "success",
    ).Inc()

    // 簡化示意:實務上應以 List 計算該 namespace 的實例數,這裡僅示範 Gauge 的用法
    nginxCount.WithLabelValues(req.Namespace).Set(1)

    return ctrl.Result{}, nil
}

配置 ServiceMonitor (Prometheus Operator):

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nginx-operator-metrics
  namespace: nginx-operator-system
spec:
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
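
ServiceMonitor 的 selector 需要有一個帶對應 label 與 port 名稱的 Service 才抓得到資料,以下為假設性的 metrics Service 示意:

apiVersion: v1
kind: Service
metadata:
  name: nginx-operator-metrics
  namespace: nginx-operator-system
  labels:
    control-plane: controller-manager
spec:
  selector:
    control-plane: controller-manager
  ports:
    - name: metrics       # 對應 ServiceMonitor 的 endpoints.port
      port: 8080
      targetPort: 8080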

10.3.3 健康檢查

配置 Health 和 Readiness Probes (main.go):

import (
    "sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        HealthProbeBindAddress: ":8081",
        // ...
    })

    // 添加 health check
    if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up health check")
        os.Exit(1)
    }

    // 添加 readiness check
    if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
        setupLog.Error(err, "unable to set up ready check")
        os.Exit(1)
    }

    // ...
}

Deployment 配置

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-operator-controller-manager
spec:
  template:
    spec:
      containers:
        - name: manager
          image: nginx-operator:latest
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
          livenessProbe:
            httpGet:
              path: /healthz
              port: health
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: health
            initialDelaySeconds: 5
            periodSeconds: 10

10.4 生產環境部署

10.4.1 高可用性配置

多副本 + Leader Election

// main.go
func main() {
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        // 啟用 Leader Election
        LeaderElection:          true,
        LeaderElectionID:        "nginx-operator-lock",
        LeaderElectionNamespace: "nginx-operator-system",

        // 配置 lease 參數
        LeaseDuration: pointer.Duration(15 * time.Second),
        RenewDeadline: pointer.Duration(10 * time.Second),
        RetryPeriod:   pointer.Duration(2 * time.Second),
    })

    // ...
}

Deployment 高可用配置

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-operator-controller-manager
spec:
  replicas: 3
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
    spec:
      # Pod 反親和性 - 分散到不同節點
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    control-plane: controller-manager
                topologyKey: kubernetes.io/hostname

      # 優先級
      priorityClassName: system-cluster-critical

      containers:
        - name: manager
          image: nginx-operator:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 512Mi

          # 安全上下文
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsNonRoot: true
            runAsUser: 65532
            seccompProfile:
              type: RuntimeDefault

10.4.2 資源限制

配置 Resource Quotas

apiVersion: v1
kind: ResourceQuota
metadata:
  name: nginx-operator-quota
  namespace: nginx-operator-system
spec:
  hard:
    requests.cpu: "2"
    requests.memory: "2Gi"
    limits.cpu: "4"
    limits.memory: "4Gi"
    persistentvolumeclaims: "10"

配置 LimitRange

apiVersion: v1
kind: LimitRange
metadata:
  name: nginx-operator-limitrange
  namespace: nginx-operator-system
spec:
  limits:
    - max:
        cpu: "1"
        memory: "1Gi"
      min:
        cpu: "50m"
        memory: "64Mi"
      default:
        cpu: "200m"
        memory: "256Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      type: Container

10.4.3 監控告警

Prometheus 告警規則

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nginx-operator-alerts
  namespace: nginx-operator-system
spec:
  groups:
    - name: nginx-operator
      interval: 30s
      rules:
        - alert: NginxOperatorDown
          expr: up{job="nginx-operator-metrics"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Nginx Operator is down"
            description: "Nginx Operator has been down for more than 5 minutes"

        - alert: NginxOperatorHighReconcileErrors
          expr: rate(nginx_operator_reconcile_total{result="error"}[5m]) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High reconcile error rate"
            description: "Nginx Operator has high reconcile error rate: {{ $value }}"

        - alert: NginxInstanceNotReady
          expr: nginx_operator_nginx_count{} - nginx_operator_nginx_ready{} > 0
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Nginx instance not ready"
            description: "Nginx instance in {{ $labels.namespace }} is not ready for 15 minutes"

10.4.4 備份和恢復

備份 CRD 和 CR

#!/bin/bash
# backup-nginx-operator.sh

BACKUP_DIR="/backup/nginx-operator/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# 備份 CRD
kubectl get crd nginxes.webapp.example.com -o yaml > "$BACKUP_DIR/crd.yaml"

# 備份所有 Nginx CR
kubectl get nginx --all-namespaces -o yaml > "$BACKUP_DIR/nginx-crs.yaml"

# 備份 Operator 配置
kubectl get deployment,service,configmap,secret -n nginx-operator-system -o yaml > "$BACKUP_DIR/operator-resources.yaml"

echo "Backup completed: $BACKUP_DIR"

恢復流程

#!/bin/bash
# restore-nginx-operator.sh

BACKUP_DIR=$1

if [ -z "$BACKUP_DIR" ]; then
    echo "Usage: $0 <backup-directory>"
    exit 1
fi

# 1. 恢復 CRD
kubectl apply -f "$BACKUP_DIR/crd.yaml"

# 2. 恢復 Operator
kubectl apply -f "$BACKUP_DIR/operator-resources.yaml"

# 3. 等待 Operator ready
kubectl wait --for=condition=available --timeout=300s \
    deployment/nginx-operator-controller-manager -n nginx-operator-system

# 4. 恢復 Nginx CR
kubectl apply -f "$BACKUP_DIR/nginx-crs.yaml"

echo "Restore completed from: $BACKUP_DIR"

10.5 多租戶支援

10.5.1 Namespace 隔離

限制 Operator 監聽特定 Namespace

// main.go
func main() {
    // 從環境變數讀取允許的 namespaces
    watchNamespaces := os.Getenv("WATCH_NAMESPACES")
    var namespaces []string
    if watchNamespaces != "" {
        namespaces = strings.Split(watchNamespaces, ",")
    }

    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme: scheme,
        Cache: cache.Options{
            Namespaces: namespaces,  // 只 watch 這些 namespaces
        },
    })

    // ...
}

Deployment 配置

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-operator-controller-manager
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: WATCH_NAMESPACES
              value: "tenant-a,tenant-b,tenant-c"

10.5.2 資源配額

為每個租戶配置 ResourceQuota

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-nginx-quota
  namespace: tenant-a
spec:
  hard:
    count/nginxes.webapp.example.com: "10"  # 最多 10 個 Nginx 實例
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["tenant-a"]

10.5.3 驗證租戶權限

Validating Webhook 檢查權限

func (r *Nginx) ValidateCreate() (admission.Warnings, error) {
    // 檢查租戶配額
    quota, err := getTenantQuota(r.Namespace)
    if err != nil {
        return nil, err
    }

    currentCount, err := getNginxCount(r.Namespace)
    if err != nil {
        return nil, err
    }

    if currentCount >= quota.MaxInstances {
        return nil, fmt.Errorf(
            "tenant %s has reached quota limit: %d/%d",
            r.Namespace, currentCount, quota.MaxInstances,
        )
    }

    // 檢查資源限制
    if r.Spec.Replicas != nil && *r.Spec.Replicas > quota.MaxReplicas {
        return nil, fmt.Errorf(
            "replicas %d exceeds tenant limit %d",
            *r.Spec.Replicas, quota.MaxReplicas,
        )
    }

    return nil, nil
}
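
getTenantQuota 與 getNginxCount 同樣是本文假設的輔助函式,以下是一個簡化示意(假設 SetupWebhookWithManager 時把 mgr.GetClient() 存到套件層級的 webhookClient;TenantQuota 結構亦為自行定義,import 省略 context 與 client):

type TenantQuota struct {
    MaxInstances int   // 租戶可建立的 Nginx 實例上限
    MaxReplicas  int32 // 單一實例的 replicas 上限
}

var webhookClient client.Client // 於 SetupWebhookWithManager 中注入 mgr.GetClient()

func getTenantQuota(namespace string) (TenantQuota, error) {
    // 實務上可從 ConfigMap 或專用 CRD 讀取,此處以固定值示意
    return TenantQuota{MaxInstances: 10, MaxReplicas: 10}, nil
}

func getNginxCount(namespace string) (int, error) {
    list := &NginxList{}
    if err := webhookClient.List(context.Background(), list, client.InNamespace(namespace)); err != nil {
        return 0, err
    }
    return len(list.Items), nil
}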

十一、常見問題與調試技巧

11.1 常見錯誤及解決方案

11.1.1 CRD 相關錯誤

錯誤 1:CRD 未安裝

Error: the server could not find the requested resource (get nginxes.webapp.example.com)

解決方案

# 檢查 CRD 是否存在
kubectl get crd | grep nginx

# 安裝 CRD
make install

# 或手動安裝
kubectl apply -f config/crd/bases/webapp.example.com_nginxes.yaml

# 驗證
kubectl get crd nginxes.webapp.example.com -o yaml

錯誤 2:CRD 版本不匹配

error: error validating "nginx.yaml": error validating data: ValidationError(Nginx.spec): unknown field "newField"

解決方案

# 1. 更新 CRD 定義
make manifests

# 2. 重新安裝 CRD
kubectl replace -f config/crd/bases/webapp.example.com_nginxes.yaml

# 3. 如果 replace 失敗,需要先刪除(危險操作!會刪除所有 CR)
kubectl delete crd nginxes.webapp.example.com
make install

錯誤 3:Validation 規則不生效

# CRD 中定義了 validation,但創建時沒有被驗證
spec:
  replicas: 100  # 超過 maximum: 10 但沒報錯

解決方案

# 1. 確認 CRD 中包含 validation rules
kubectl get crd nginxes.webapp.example.com -o yaml | grep -A 10 validation

# 2. 重新生成並安裝 CRD
make manifests
kubectl apply -f config/crd/bases/webapp.example.com_nginxes.yaml

# 3. 驗證 validation 生效
kubectl apply -f - <<EOF
apiVersion: webapp.example.com/v1alpha1
kind: Nginx
metadata:
  name: test
spec:
  replicas: 100  # 應該報錯
EOF

11.1.2 Controller 錯誤

錯誤 1:Reconcile 死循環

# 日誌顯示不停 reconcile
INFO    Reconciling Nginx
INFO    Reconciling Nginx
INFO    Reconciling Nginx
...

原因分析

  1. Status 更新觸發了 reconcile

  2. 沒有正確設置 Predicate 過濾

  3. Requeue 邏輯錯誤

解決方案

// 1. 使用 Predicate 過濾 status-only 更新
func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        WithEventFilter(predicate.Funcs{
            UpdateFunc: func(e event.UpdateEvent) bool {
                oldObj := e.ObjectOld.(*webappv1alpha1.Nginx)
                newObj := e.ObjectNew.(*webappv1alpha1.Nginx)

                // 只在 spec 改變時 reconcile
                return oldObj.Generation != newObj.Generation
            },
        }).
        Complete(r)
}

// 2. 正確使用 Status().Update()
func (r *NginxReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // ... reconcile logic

    // 使用 Status() subresource 更新,不會改變 Generation;搭配上面的 Predicate,就不會再觸發新的 reconcile
    if err := r.Status().Update(ctx, nginx); err != nil {
        return ctrl.Result{}, err
    }

    return ctrl.Result{}, nil  // 不要返回 Requeue: true
}
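
補充:controller-runtime 也內建 predicate.GenerationChangedPredicate,效果等同上面手寫的 Generation 比較,可以直接使用:

func (r *NginxReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        For(&webappv1alpha1.Nginx{}).
        // 只有 metadata.generation 改變(即 spec 改變)才觸發 reconcile
        WithEventFilter(predicate.GenerationChangedPredicate{}).
        Complete(r)
}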

錯誤 2:無法更新 Status

ERROR   Failed to update status    error="the server could not find the requested resource"

解決方案

// 確保 CRD 定義包含 status subresource
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// ↑ 上面這一行很重要!缺少它時 Status().Update() 會回報找不到資源

type Nginx struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`
    Spec   NginxSpec   `json:"spec,omitempty"`
    Status NginxStatus `json:"status,omitempty"`
}
# 重新生成並安裝 CRD
make manifests
make install

錯誤 3:Owner Reference 錯誤導致資源無法刪除

ERROR   Failed to delete Deployment    error="cannot delete resource, owner references are set"

解決方案

// 正確設置 Owner Reference
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
    return err
}

// 刪除時會自動清理子資源(Garbage Collection)
// 如果需要手動清理:
func (r *NginxReconciler) cleanupResources(ctx context.Context, nginx *webappv1alpha1.Nginx) error {
    // 列出所有關聯資源
    deployments := &appsv1.DeploymentList{}
    if err := r.List(ctx, deployments, client.InNamespace(nginx.Namespace), client.MatchingLabels{
        "nginx": nginx.Name,
    }); err != nil {
        return err
    }

    // 刪除資源
    for _, deployment := range deployments.Items {
        if err := r.Delete(ctx, &deployment); err != nil {
            return err
        }
    }

    return nil
}

11.1.3 RBAC 權限錯誤

錯誤:403 Forbidden

ERROR   Failed to list Deployments    error="deployments.apps is forbidden: User \"system:serviceaccount:nginx-operator-system:default\" cannot list resource \"deployments\" in API group \"apps\" in the namespace \"default\""

解決方案

# 1. 檢查 RBAC 註解是否正確
# Controller 代碼中:
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete

# 2. 重新生成 RBAC manifests
make manifests

# 3. 檢查生成的 Role/ClusterRole
cat config/rbac/role.yaml

# 4. 重新部署 Operator
kubectl apply -k config/default

# 5. 驗證 ServiceAccount 權限
kubectl auth can-i list deployments \
    --as=system:serviceaccount:nginx-operator-system:nginx-operator-controller-manager \
    -n default

跨 Namespace 權限問題

# 如果 Operator 需要操作多個 namespace,需要使用 ClusterRole
# 修改 config/default/kustomization.yaml:

# 註釋掉這一行(使用 Role)
# - ../rbac/role.yaml
# - ../rbac/role_binding.yaml

# 使用 ClusterRole
- ../rbac/cluster_role.yaml
- ../rbac/cluster_role_binding.yaml

11.1.4 Webhook 錯誤

錯誤:Webhook 連接超時

Error from server (InternalError): Internal error occurred: failed calling webhook "vnginx.kb.io": Post "https://nginx-operator-webhook-service.nginx-operator-system.svc:443/validate-webapp-example-com-v1alpha1-nginx?timeout=10s": context deadline exceeded

解決方案

# 1. 檢查 webhook service 是否存在
kubectl get svc -n nginx-operator-system

# 2. 檢查 webhook pod 是否運行
kubectl get pods -n nginx-operator-system

# 3. 檢查證書是否正確配置
kubectl get secret -n nginx-operator-system | grep webhook

# 4. 檢查 ValidatingWebhookConfiguration
kubectl get validatingwebhookconfiguration

# 5. 查看 webhook 配置詳情
kubectl get validatingwebhookconfiguration nginx-operator-validating-webhook-configuration -o yaml

# 6. 測試 webhook service 連接
kubectl run test-curl --image=curlimages/curl --rm -it --restart=Never -- \
    curl -k https://nginx-operator-webhook-service.nginx-operator-system.svc:443

# 7. 如果開發環境不需要 webhook,可以禁用
export ENABLE_WEBHOOKS=false
make run

錯誤:證書驗證失敗

Error: x509: certificate signed by unknown authority

解決方案

# 1. 確保安裝了 cert-manager
kubectl get pods -n cert-manager

# 如果沒有安裝:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.13.0/cert-manager.yaml

# 2. 檢查 Certificate 資源
kubectl get certificate -n nginx-operator-system

# 3. 檢查證書狀態
kubectl describe certificate nginx-operator-serving-cert -n nginx-operator-system

# 4. 手動觸發證書重新生成
kubectl delete certificate nginx-operator-serving-cert -n nginx-operator-system
kubectl apply -f config/certmanager/certificate.yaml

# 5. 等待證書 ready
kubectl wait --for=condition=ready certificate/nginx-operator-serving-cert \
    -n nginx-operator-system --timeout=300s

11.2 調試技巧

11.2.1 本地調試

使用 Delve 調試

# 1. 安裝 Delve
go install github.com/go-delve/delve/cmd/dlv@latest

# 2. 啟動調試模式
dlv debug . -- --zap-devel

# 3. 設置斷點
(dlv) break internal/controller/nginx_controller.go:60
(dlv) continue

# 4. 查看變量
(dlv) print nginx
(dlv) print err

# 5. 查看調用棧
(dlv) stack

VS Code 調試配置 (.vscode/launch.json):

{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Debug Operator",
            "type": "go",
            "request": "launch",
            "mode": "debug",
            "program": "${workspaceFolder}/main.go",
            "args": ["--zap-devel"],
            "env": {
                "ENABLE_WEBHOOKS": "false",
                "KUBECONFIG": "${env:HOME}/.kube/config"
            }
        },
        {
            "name": "Attach to Process",
            "type": "go",
            "request": "attach",
            "mode": "local",
            "processId": "${command:pickProcess}"
        }
    ]
}

11.2.2 日誌分析

增加日誌詳細度

# 運行時指定 log level(直接以 go run 帶參數執行)
go run ./main.go --zap-log-level=debug

# 或在代碼中:
logger.V(1).Info("Debug message", "key", value)  // debug
logger.V(2).Info("Trace message", "key", value)  // trace

結構化日誌查詢

# 使用 jq 過濾日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | jq 'select(.level == "error")'

# 查找特定資源的日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | \
    jq 'select(.nginx == "nginx-sample")'

# 統計錯誤類型
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | \
    jq -r 'select(.level == "error") | .msg' | sort | uniq -c

11.2.3 性能分析

CPU Profiling

// main.go
import (
    "net/http"
    _ "net/http/pprof"
)

func main() {
    // 啟動 pprof server
    go func() {
        http.ListenAndServe("localhost:6060", nil)
    }()

    // ... 其他代碼
}
# 收集 30 秒的 CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# 交互式分析
(pprof) top10
(pprof) list reconcileDeployment
(pprof) web  # 生成可視化圖表(需要 graphviz)

Memory Profiling

# 收集 heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# 分析內存分配
(pprof) top10
(pprof) list NginxReconciler.Reconcile

# 查看內存洩漏
go tool pprof -alloc_space http://localhost:6060/debug/pprof/heap

Goroutine 分析

# 查看所有 goroutine
curl http://localhost:6060/debug/pprof/goroutine?debug=2

# 分析 goroutine 洩漏
go tool pprof http://localhost:6060/debug/pprof/goroutine

11.2.4 資源狀態檢查

完整的資源檢查腳本

#!/bin/bash
# debug-nginx.sh - Nginx Operator 診斷腳本

NAMESPACE=${1:-default}
NGINX_NAME=${2:-nginx-sample}

echo "=== Nginx Resource ==="
kubectl get nginx $NGINX_NAME -n $NAMESPACE -o yaml

echo -e "\n=== Nginx Status ==="
kubectl get nginx $NGINX_NAME -n $NAMESPACE -o jsonpath='{.status}' | jq .

echo -e "\n=== Nginx Events ==="
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$NGINX_NAME

echo -e "\n=== Deployment ==="
kubectl get deployment $NGINX_NAME -n $NAMESPACE -o yaml

echo -e "\n=== Deployment Events ==="
kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$NGINX_NAME,involvedObject.kind=Deployment

echo -e "\n=== Pods ==="
kubectl get pods -n $NAMESPACE -l nginx=$NGINX_NAME

echo -e "\n=== Pod Events ==="
for pod in $(kubectl get pods -n $NAMESPACE -l nginx=$NGINX_NAME -o name); do
    echo "Events for $pod:"
    kubectl get events -n $NAMESPACE --field-selector involvedObject.name=$(basename $pod)
done

echo -e "\n=== Service ==="
kubectl get svc $NGINX_NAME -n $NAMESPACE -o yaml

echo -e "\n=== ConfigMap ==="
kubectl get cm ${NGINX_NAME}-config -n $NAMESPACE -o yaml

echo -e "\n=== Operator Logs (last 50 lines) ==="
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager --tail=50

使用:

chmod +x debug-nginx.sh
./debug-nginx.sh default nginx-sample

11.3 故障排除流程

11.3.1 問題診斷流程圖

1. CR 能否創建成功?
   ├─ No  → 檢查 CRD 安裝、Validation Webhook
   └─ Yes → 繼續

2. Operator 能否收到事件?
   ├─ No  → 檢查 Operator 運行狀態、RBAC 權限、Namespace 配置
   └─ Yes → 繼續

3. Reconcile 是否執行?
   ├─ No  → 檢查 Predicate 過濾、事件觸發條件
   └─ Yes → 繼續

4. 子資源是否創建?
   ├─ No  → 檢查 RBAC、資源定義、錯誤日誌
   └─ Yes → 繼續

5. 子資源狀態是否正常?
   ├─ No  → 檢查資源配置、鏡像、探針、資源限制
   └─ Yes → 問題解決

11.3.2 常見場景排查

場景 1:Pod 無法啟動

# 1. 查看 Pod 狀態
kubectl get pods -l nginx=nginx-sample

# 2. 查看詳細錯誤
kubectl describe pod <pod-name>

# 3. 查看日誌
kubectl logs <pod-name>

# 常見原因:
# - ImagePullBackOff → 鏡像不存在或無權限
# - CrashLoopBackOff → 容器啟動失敗,檢查配置和日誌
# - Pending → 資源不足或調度失敗

場景 2:配置更新不生效

# 1. 檢查 ConfigMap 是否更新
kubectl get cm nginx-sample-config -o yaml

# 2. 檢查 Deployment 的 config-version label
kubectl get deployment nginx-sample -o jsonpath='{.spec.template.metadata.labels.config-version}'

# 3. 如果 label 沒變,檢查 reconcile 日誌
kubectl logs -n nginx-operator-system deployment/nginx-operator-controller-manager | grep nginx-sample

# 4. 手動觸發 reconcile
kubectl annotate nginx nginx-sample force-reconcile="$(date +%s)" --overwrite

場景 3:資源洩漏

# 1. 列出所有孤立資源(沒有 owner reference)
kubectl get deployment,svc,cm --all-namespaces -o json | \
    jq -r '.items[] | select(.metadata.ownerReferences == null) | "\(.metadata.namespace)/\(.kind)/\(.metadata.name)"'

# 2. 清理孤立資源
kubectl delete deployment <name> -n <namespace>

# 3. 確保 Controller 設置了 Owner Reference
# 代碼檢查:
if err := controllerutil.SetControllerReference(nginx, deployment, r.Scheme); err != nil {
    return err
}

11.4 最佳實踐總結

11.4.1 開發階段

  1. 使用 Kind 本地測試

     make kind-cluster
     make install
     make run
    
  2. 頻繁運行測試

     make test
     make test-e2e
    
  3. 使用 Linter

     golangci-lint run ./...
    

11.4.2 部署階段

  1. 使用 Kustomize 管理環境

     config/
     ├── base/
     └── overlays/
         ├── development/
         ├── staging/
         └── production/
    
  2. 配置資源限制

     resources:
       requests:
         cpu: 100m
         memory: 128Mi
       limits:
         cpu: 500m
         memory: 512Mi
    
  3. 啟用 Leader Election

     LeaderElection: true
    

11.4.3 運維階段

  1. 監控關鍵指標

    • Reconcile 成功率

    • Reconcile 延遲

    • 資源使用率

  2. 配置告警

    • Operator Down

    • 高錯誤率

    • 資源洩漏

  3. 定期備份

     kubectl get nginx --all-namespaces -o yaml > backup.yaml
    

總結

本教學從基礎概念到高級實踐,完整覆蓋了 Kubernetes Operator 開發的各個方面:

  1. 第一~三章:Operator 基礎概念、架構設計

  2. 第四~六章:深入 OpenTelemetry Operator 代碼分析

  3. 第七~八章:開發環境、測試實踐

  4. 第九章:完整的 Nginx Operator 實戰

  5. 第十章:性能、安全、可觀測性等進階主題

  6. 第十一章:故障排查和最佳實踐

下一步建議

  1. 閱讀 Operator SDK 文檔

  2. 研究優秀的開源 Operator 項目

  3. 在實際項目中實踐所學知識

成為 Kubernetes Operator 開發專家! Go 🚀
