Maison  >  Article  >  développement back-end  >  Implémentation d'un système de traitement des commandes : surveillance des pièces et alertes

Implémentation d'un système de traitement des commandes : surveillance des pièces et alertes

王林
王林original
2024-09-05 22:41:14584parcourir

Implementing an Order Processing System: Part  Monitoring and Alerting

1. Introduction et objectifs

Bienvenue dans le quatrième volet de notre série sur la mise en œuvre d'un système sophistiqué de traitement des commandes ! Dans nos articles précédents, nous avons jeté les bases de notre projet, exploré les flux de travail temporels avancés et approfondi les opérations avancées de base de données. Aujourd'hui, nous nous concentrons sur un aspect tout aussi crucial de tout système prêt pour la production : la surveillance et les alertes.

Récapitulatif des articles précédents

  1. Dans la première partie, nous avons mis en place la structure de notre projet et implémenté une API CRUD de base.
  2. Dans la deuxième partie, nous avons élargi notre utilisation de Temporal, en mettant en œuvre des flux de travail complexes et en explorant des concepts avancés.
  3. Dans la troisième partie, nous nous sommes concentrés sur les opérations avancées de base de données, notamment l'optimisation, le partitionnement et la garantie de la cohérence dans les systèmes distribués.

Importance de la surveillance et des alertes dans l’architecture des microservices

Dans une architecture de microservices, en particulier celle qui gère des processus complexes tels que la gestion des commandes, une surveillance et des alertes efficaces sont cruciales. Ils nous permettent de :

  1. Comprendre le comportement et les performances de notre système en temps réel
  2. Identifiez et diagnostiquez rapidement les problèmes avant qu'ils n'affectent les utilisateurs
  3. Prendre des décisions basées sur les données pour la mise à l'échelle et l'optimisation
  4. Assurer la fiabilité et la disponibilité de nos services

Aperçu de Prometheus et de son écosystème

Prometheus est une boîte à outils open source de surveillance et d'alerte des systèmes. Il est devenu un standard dans le monde du cloud natif en raison de ses fonctionnalités puissantes et de son vaste écosystème. Les composants clés incluent :

  1. Prometheus Server : récupère et stocke les données de séries chronologiques
  2. Bibliothèques clientes : Permettent une instrumentation facile du code de l'application
  3. Alertmanager : Gère les alertes du serveur Prometheus
  4. Pushgateway : Permet aux tâches éphémères et par lots d'exposer des métriques
  5. Exportateurs : Autoriser les systèmes tiers à exposer des métriques à Prometheus

Nous utiliserons également Grafana, une plateforme open source populaire pour la surveillance et l'observabilité, pour créer des tableaux de bord et visualiser nos données Prometheus.

Objectifs pour cette partie de la série

À la fin de cet article, vous pourrez :

  1. Configurer Prometheus pour surveiller notre système de traitement des commandes
  2. Mettre en œuvre des métriques personnalisées dans nos services Go
  3. Créez des tableaux de bord informatifs à l'aide de Grafana
  4. Configurez des règles d'alerte pour nous informer des problèmes potentiels
  5. Surveiller efficacement les performances de la base de données et les flux de travail temporels

Plongeons-nous !

2. Contexte théorique et concepts

Avant de commencer la mise en œuvre, passons en revue quelques concepts clés qui seront cruciaux pour notre configuration de surveillance et d'alerte.

Observabilité dans les systèmes distribués

L'observabilité fait référence à la capacité de comprendre l'état interne d'un système en examinant ses sorties. Dans les systèmes distribués comme notre système de traitement des commandes, l'observabilité englobe généralement trois piliers principaux :

  1. Metrics : Représentations numériques des données mesurées sur des intervalles de temps
  2. Journaux : enregistrements détaillés des événements discrets au sein du système
  3. Traces : Représentations de chaînes causales d'événements à travers les composants

Dans cet article, nous nous concentrerons principalement sur les métriques, mais nous aborderons la manière dont celles-ci peuvent être intégrées aux journaux et aux traces.

Architecture de Prométhée

Prometheus suit une architecture basée sur le pull :

  1. Collecte de données : Prometheus récupère les métriques des tâches instrumentées via HTTP
  2. Stockage des données : Les métriques sont stockées dans une base de données de séries temporelles sur le stockage local
  3. Requête : PromQL permet une interrogation flexible de ces données
  4. Alertes : Prometheus peut déclencher des alertes en fonction des résultats de requête
  5. Visualisation : Bien que Prometheus dispose d'une interface utilisateur de base, il est souvent associé à Grafana pour des visualisations plus riches

Types de métriques dans Prometheus

Prometheus propose quatre types de métriques de base :

  1. 計數器:只會上升的累積指標(例如,處理的請求數量)
  2. Gauge:可以上下波動的指標(例如,目前記憶體使用)
  3. 直方圖:對觀察結果進行取樣並在可配置的儲存桶中對它們進行計數(例如,請求持續時間)
  4. 摘要:與直方圖類似,但計算滑動時間視窗上的可設定分位數

PromQL 簡介

PromQL(Prometheus Query Language)是一種用於查詢 Prometheus 資料的強大函數式語言。它允許您即時選擇和聚合時間序列資料。主要功能包括:

  • 即時向量選擇器
  • 範圍向量選擇器
  • 偏移修改器
  • 聚合運算子
  • 二元運算子

在建立儀表板和警報時,我們將看到 PromQL 查詢的範例。

Grafana 概述

Grafana 是一個多平台開源分析和互動式視覺化 Web 應用程式。當連接到受支援的資料來源(Prometheus 就是其中之一)時,它會為網路提供圖表、圖形和警報。主要功能包括:

  • 靈活的儀表板建立
  • 廣泛的視覺化選項
  • 警報功能
  • 使用者認證與授權
  • 可擴充性的插件系統

現在我們已經介紹了這些概念,讓我們開始實施我們的監控和警報系統。

3. 為我們的訂單處理系統設定 Prometheus

讓我們先設定 Prometheus 來監控我們的訂單處理系統。

安裝和配置普羅米修斯

首先,讓我們將 Prometheus 加入 docker-compose.yml 檔案:

services:
  # ... other services ...

  prometheus:
    image: prom/prometheus:v2.30.3
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090

volumes:
  # ... other volumes ...
  prometheus_data: {}

接下來,在./prometheus目錄下建立prometheus.yml檔:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'order_processing_api'
    static_configs:
      - targets: ['order_processing_api:8080']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']

此配置告訴 Prometheus 從自身、我們的訂單處理 API 和 Postgres 導出器(我們稍後將設定)中獲取指標。

為我們的 Go 服務實施 Prometheus Exporters

為了公開 Go 服務的指標,我們將使用 Prometheus 用戶端程式庫。首先,將其添加到您的 go.mod 中:

go get github.com/prometheus/client_golang

現在,讓我們修改我們的主 Go 檔案以公開指標:

package main

import (
    "net/http"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "Duration of HTTP requests in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func main() {
    r := gin.Default()

    // Middleware to record metrics
    r.Use(func(c *gin.Context) {
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(c.Request.Method, c.FullPath()))
        c.Next()
        timer.ObserveDuration()
        httpRequestsTotal.WithLabelValues(c.Request.Method, c.FullPath(), string(c.Writer.Status())).Inc()
    })

    // Expose metrics endpoint
    r.GET("/metrics", gin.WrapH(promhttp.Handler()))

    // ... rest of your routes ...

    r.Run(":8080")
}

此程式碼設定了兩個指標:

  1. http_requests_total:追蹤 HTTP 請求總數的計數器
  2. http_request_duration_seconds:追蹤 HTTP 請求持續時間的直方圖

為動態環境設定服務發現

對於更動態的環境,Prometheus 支援各種服務發現機制。例如,如果您在 Kubernetes 上運行,則可以使用 Kubernetes SD 配置:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

此配置將自動發現並從具有適當註解的 pod 中抓取指標。

配置 Prometheus 資料的保留和存儲

Prometheus 將資料儲存在本機檔案系統上的時間序列資料庫中。您可以在 Prometheus 配置中配置保留時間和儲存大小:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

storage:
  tsdb:
    retention.time: 15d
    retention.size: 50GB

# ... rest of the configuration ...

此組態設定保留期為 15 天,最大儲存大小為 50GB。

在下一節中,我們將深入研究為訂單處理系統定義和實現自訂指標。

4. 定義和實施自訂指標

現在我們已經設定了 Prometheus 並實現了基本的 HTTP 指標,讓我們定義並實現特定於我們的訂單處理系統的自訂指標。

為我們的訂單處理系統設計指標架構

在設計指標時,重要的是要考慮我們希望從系統中獲得哪些見解。對於我們的訂單處理系統,我們可能想要追蹤:

  1. 訂單創建率
  2. 訂單處理時間
  3. 訂單狀態分佈
  4. 支付處理成功/失敗率
  5. 庫存更新操作
  6. 出貨安排時間

讓我們實現這些指標:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    OrdersCreated = promauto.NewCounter(prometheus.CounterOpts{
        Name: "orders_created_total",
        Help: "The total number of created orders",
    })

    OrderProcessingTime = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "order_processing_seconds",
        Help: "Time taken to process an order",
        Buckets: prometheus.LinearBuckets(0, 30, 10), // 0-300 seconds, 30-second buckets
    })

    OrderStatusGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "orders_by_status",
        Help: "Number of orders by status",
    }, []string{"status"})

    PaymentProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "payments_processed_total",
        Help: "The total number of processed payments",
    }, []string{"status"})

    InventoryUpdates = promauto.NewCounter(prometheus.CounterOpts{
        Name: "inventory_updates_total",
        Help: "The total number of inventory updates",
    })

    ShippingArrangementTime = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "shipping_arrangement_seconds",
        Help: "Time taken to arrange shipping",
        Buckets: prometheus.LinearBuckets(0, 60, 5), // 0-300 seconds, 60-second buckets
    })
)

在我們的 Go 服務中實施特定於應用程式的指標

現在我們已經定義了指標,讓我們在我們的服務中實現它們:

package main

import (
    "time"

    "github.com/yourusername/order-processing-system/metrics"
)

func createOrder(order Order) error {
    startTime := time.Now()

    // Order creation logic...

    metrics.OrdersCreated.Inc()
    metrics.OrderProcessingTime.Observe(time.Since(startTime).Seconds())
    metrics.OrderStatusGauge.WithLabelValues("pending").Inc()

    return nil
}

func processPayment(payment Payment) error {
    // Payment processing logic...

    if paymentSuccessful {
        metrics.PaymentProcessed.WithLabelValues("success").Inc()
    } else {
        metrics.PaymentProcessed.WithLabelValues("failure").Inc()
    }

    return nil
}

func updateInventory(item Item) error {
    // Inventory update logic...

    metrics.InventoryUpdates.Inc()

    return nil
}

func arrangeShipping(order Order) error {
    startTime := time.Now()

    // Shipping arrangement logic...

    metrics.ShippingArrangementTime.Observe(time.Since(startTime).Seconds())

    return nil
}

命名和標記指標的最佳實踐

命名和標記指標時,請考慮以下最佳實踐:

  1. Use a consistent naming scheme (e.g., __)
  2. Use clear, descriptive names
  3. Include units in the metric name (e.g., _seconds, _bytes)
  4. Use labels to differentiate instances of a metric, but be cautious of high cardinality
  5. Keep the number of labels manageable

Instrumenting Key Components: API Endpoints, Database Operations, Temporal Workflows

For API endpoints, we’ve already implemented basic instrumentation. For database operations, we can add metrics like this:

func (s *Store) GetOrder(ctx context.Context, id int64) (Order, error) {
    startTime := time.Now()
    defer func() {
        metrics.DBOperationDuration.WithLabelValues("GetOrder").Observe(time.Since(startTime).Seconds())
    }()

    // Existing GetOrder logic...
}

For Temporal workflows, we can add metrics in our activity implementations:

func ProcessOrderActivity(ctx context.Context, order Order) error {
    startTime := time.Now()
    defer func() {
        metrics.WorkflowActivityDuration.WithLabelValues("ProcessOrder").Observe(time.Since(startTime).Seconds())
    }()

    // Existing ProcessOrder logic...
}

5. Creating Dashboards with Grafana

Now that we have our metrics set up, let’s visualize them using Grafana.

Installing and Configuring Grafana

First, let’s add Grafana to our docker-compose.yml:

services:
  # ... other services ...

  grafana:
    image: grafana/grafana:8.2.2
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  # ... other volumes ...
  grafana_data: {}

Connecting Grafana to Our Prometheus Data Source

  1. Access Grafana at http://localhost:3000 (default credentials are admin/admin)
  2. Go to Configuration > Data Sources
  3. Click “Add data source” and select Prometheus
  4. Set the URL to http://prometheus:9090 (this is the Docker service name)
  5. Click “Save & Test”

Designing Effective Dashboards for Our Order Processing System

Let’s create a dashboard for our order processing system:

  1. Click “Create” > “Dashboard”
  2. Add a new panel

For our first panel, let’s create a graph of order creation rate:

  1. In the query editor, enter: rate(orders_created_total[5m])
  2. Set the panel title to “Order Creation Rate”
  3. Under Settings, set the unit to “orders/second”

Let’s add another panel for order processing time:

  1. Add a new panel
  2. Query: histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m]))
  3. Title: “95th Percentile Order Processing Time”
  4. Unit: “seconds”

For order status distribution:

  1. Add a new panel
  2. Query: orders_by_status
  3. Visualization: Pie Chart
  4. Title: “Order Status Distribution”

Continue adding panels for other metrics we’ve defined.

Implementing Variable Templating for Flexible Dashboards

Grafana allows us to create variables that can be used across the dashboard. Let’s create a variable for time range:

  1. Go to Dashboard Settings > Variables
  2. Click “Add variable”
  3. Name: time_range
  4. Type: Interval
  5. Values: 5m,15m,30m,1h,6h,12h,24h,7d

Now we can use this in our queries like this: rate(orders_created_total[$time_range])

Best Practices for Dashboard Design and Organization

  1. Group related panels together
  2. Use consistent color schemes
  3. Include a description for each panel
  4. Use appropriate visualizations for each metric type
  5. Consider creating separate dashboards for different aspects of the system (e.g., Orders, Inventory, Shipping)

In the next section, we’ll set up alerting rules to notify us of potential issues in our system.

6. Implementing Alerting Rules

Now that we have our metrics and dashboards set up, let’s implement alerting to proactively notify us of potential issues in our system.

Designing an Alerting Strategy for Our System

When designing alerts, consider the following principles:

  1. Alert on symptoms, not causes
  2. Ensure alerts are actionable
  3. Avoid alert fatigue by only alerting on critical issues
  4. Use different severity levels for different types of issues

For our order processing system, we might want to alert on:

  1. High error rate in order processing
  2. Slow order processing time
  3. Unusual spike or drop in order creation rate
  4. Low inventory levels
  5. High rate of payment failures

Implementing Prometheus Alerting Rules

Let’s create an alerts.yml file in our Prometheus configuration directory:

groups:
- name: order_processing_alerts
  rules:
  - alert: HighOrderProcessingErrorRate
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: High order processing error rate
      description: "Error rate is over the last 5 minutes"

  - alert: SlowOrderProcessing
    expr: histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m])) > 300
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Slow order processing
      description: "95th percentile of order processing time is over the last 5 minutes"

  - alert: UnusualOrderRate
    expr: abs(rate(orders_created_total[1h]) - rate(orders_created_total[1h] offset 1d)) > (rate(orders_created_total[1h] offset 1d) * 0.3)
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: Unusual order creation rate
      description: "Order creation rate has changed by more than 30% compared to the same time yesterday"

  - alert: LowInventory
    expr: inventory_level < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Low inventory level
      description: "Inventory level for is "

  - alert: HighPaymentFailureRate
    expr: rate(payments_processed_total{status="failure"}[15m]) / rate(payments_processed_total[15m]) > 0.1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: High payment failure rate
      description: "Payment failure rate is over the last 15 minutes"

Update your prometheus.yml to include this alerts file:

rule_files:
  - "alerts.yml"

Setting Up Alertmanager for Alert Routing and Grouping

Now, let’s set up Alertmanager to handle our alerts. Add Alertmanager to your docker-compose.yml:

services:
  # ... other services ...

  alertmanager:
    image: prom/alertmanager:v0.23.0
    ports:
      - 9093:9093
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

Create an alertmanager.yml in the ./alertmanager directory:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'team@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_identity: 'alertmanager@example.com'
    auth_password: 'password'

Update your prometheus.yml to point to Alertmanager:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

Configuring Notification Channels

In the Alertmanager configuration above, we’ve set up email notifications. You can also configure other channels like Slack, PagerDuty, or custom webhooks.

Implementing Alert Severity Levels and Escalation Policies

In our alerts, we’ve used severity labels. We can use these in Alertmanager to implement different routing or notification strategies based on severity:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
  - match:
      severity: warning
    receiver: 'slack-warnings'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'team@example.com'
- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: '<your-pagerduty-service-key>'
- name: 'slack-warnings'
  slack_configs:
  - api_url: '<your-slack-webhook-url>'
    channel: '#alerts'

7. Monitoring Database Performance

Monitoring database performance is crucial for maintaining a responsive and reliable system. Let’s set up monitoring for our PostgreSQL database.

Implementing the Postgres Exporter for Prometheus

First, add the Postgres exporter to your docker-compose.yml:

services:
  # ... other services ...

  postgres_exporter:
    image: wrouesnel/postgres_exporter:latest
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - 9187:9187

Make sure to replace user, password, and dbname with your actual PostgreSQL credentials.

Key Metrics to Monitor for Postgres Performance

Some important PostgreSQL metrics to monitor include:

  1. Number of active connections
  2. Database size
  3. Query execution time
  4. Cache hit ratio
  5. Replication lag (if using replication)
  6. Transaction rate
  7. Tuple operations (inserts, updates, deletes)

Creating a Database Performance Dashboard in Grafana

Let’s create a new dashboard for database performance:

  1. Create a new dashboard in Grafana
  2. Add a panel for active connections:
    • Query: pg_stat_activity_count{datname="your_database_name"}
    • Title: “Active Connections”
  3. Add a panel for database size:
    • Query: pg_database_size_bytes{datname="your_database_name"}
    • Title: “Database Size”
    • Unit: bytes(IEC)
  4. Add a panel for query execution time:
    • Query: rate(pg_stat_database_xact_commit{datname="your_database_name"}[5m]) + rate(pg_stat_database_xact_rollback{datname="your_database_name"}[5m])
    • Title: “Transactions per Second”
  5. Add a panel for cache hit ratio:
    • Query: pg_stat_database_blks_hit{datname="your_database_name"} / (pg_stat_database_blks_hit{datname="your_database_name"} + pg_stat_database_blks_read{datname="your_database_name"})
    • Title: “Cache Hit Ratio”

Setting Up Alerts for Database Issues

Let’s add some database-specific alerts to our alerts.yml:

  - alert: HighDatabaseConnections
    expr: pg_stat_activity_count > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High number of database connections
      description: "There are active database connections"

  - alert: LowCacheHitRatio
    expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.9
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Low database cache hit ratio
      description: "Cache hit ratio is "

8. Monitoring Temporal Workflows

Monitoring Temporal workflows is essential for ensuring the reliability and performance of our order processing system.

Implementing Temporal Metrics in Our Go Services

Temporal provides a metrics client that we can use to expose metrics to Prometheus. Let’s update our Temporal worker to include metrics:

import (
    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
    "go.temporal.io/sdk/contrib/prometheus"
)

func main() {
    // ... other setup ...

    // Create Prometheus metrics handler
    metricsHandler := prometheus.NewPrometheusMetricsHandler()

    // Create Temporal client with metrics
    c, err := client.NewClient(client.Options{
        MetricsHandler: metricsHandler,
    })
    if err != nil {
        log.Fatalln("Unable to create Temporal client", err)
    }
    defer c.Close()

    // Create worker with metrics
    w := worker.New(c, "order-processing-task-queue", worker.Options{
        MetricsHandler: metricsHandler,
    })

    // ... register workflows and activities ...

    // Run the worker
    err = w.Run(worker.InterruptCh())
    if err != nil {
        log.Fatalln("Unable to start worker", err)
    }
}

Key Metrics to Monitor for Temporal Workflows

Important Temporal metrics to monitor include:

  1. Workflow start rate
  2. Workflow completion rate
  3. Workflow execution time
  4. Activity success/failure rate
  5. Activity execution time
  6. Task queue latency

Creating a Temporal Workflow Dashboard in Grafana

Let’s create a dashboard for Temporal workflows:

  1. Create a new dashboard in Grafana
  2. Add a panel for workflow start rate:
    • Query: rate(temporal_workflow_start_total[5m])
    • Title: “Workflow Start Rate”
  3. Add a panel for workflow completion rate:
    • Query: rate(temporal_workflow_completed_total[5m])
    • Title: “Workflow Completion Rate”
  4. Add a panel for workflow execution time:
    • Query: histogram_quantile(0.95, rate(temporal_workflow_execution_time_bucket[5m]))
    • Title: “95th Percentile Workflow Execution Time”
    • Unit: seconds
  5. Add a panel for activity success rate:
    • Query: rate(temporal_activity_success_total[5m]) / (rate(temporal_activity_success_total[5m]) + rate(temporal_activity_fail_total[5m]))
    • Title: “Activity Success Rate”

Setting Up Alerts for Workflow Issues

Let’s add some Temporal-specific alerts to our alerts.yml:

  - alert: HighWorkflowFailureRate
    expr: rate(temporal_workflow_failed_total[15m]) / rate(temporal_workflow_completed_total[15m]) > 0.05
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: High workflow failure rate
      description: "Workflow failure rate is over the last 15 minutes"

  - alert: LongRunningWorkflow
    expr: histogram_quantile(0.95, rate(temporal_workflow_execution_time_bucket[1h])) > 3600
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: Long-running workflows detected
      description: "95th percentile of workflow execution time is over 1 hour"

These alerts will help you detect issues with your Temporal workflows, such as high failure rates or unexpectedly long-running workflows.

In the next sections, we’ll cover some advanced Prometheus techniques and discuss testing and validation of our monitoring setup.

9. Advanced Prometheus Techniques

As our monitoring system grows more complex, we can leverage some advanced Prometheus techniques to improve its efficiency and capabilities.

Using Recording Rules for Complex Queries and Aggregations

Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. This can significantly speed up the evaluation of dashboards and alerts.

Let’s add some recording rules to our Prometheus configuration. Create a rules.yml file:

groups:
- name: example_recording_rules
  interval: 5m
  rules:
  - record: job:order_processing_rate:5m
    expr: rate(orders_created_total[5m])

  - record: job:order_processing_error_rate:5m
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m])

  - record: job:payment_success_rate:5m
    expr: rate(payments_processed_total{status="success"}[5m]) / rate(payments_processed_total[5m])

Add this file to your Prometheus configuration:

rule_files:
  - "alerts.yml"
  - "rules.yml"

Now you can use these precomputed metrics in your dashboards and alerts, which can be especially helpful for complex queries that you use frequently.

Implementing Push Gateway for Batch Jobs and Short-Lived Processes

The Pushgateway allows you to push metrics from jobs that can’t be scraped, such as batch jobs or serverless functions. Let’s add a Pushgateway to our docker-compose.yml:

services:
  # ... other services ...

  pushgateway:
    image: prom/pushgateway
    ports:
      - 9091:9091

Now, you can push metrics to the Pushgateway from your batch jobs or short-lived processes. Here’s an example using the Go client:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

func runBatchJob() {
    // Define a counter for the batch job
    batchJobCounter := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "batch_job_processed_total",
        Help: "Total number of items processed by the batch job",
    })

    // Run your batch job and update the counter
    // ...

    // Push the metric to the Pushgateway
    pusher := push.New("http://pushgateway:9091", "batch_job")
    pusher.Collector(batchJobCounter)
    if err := pusher.Push(); err != nil {
        log.Printf("Could not push to Pushgateway: %v", err)
    }
}

Don’t forget to add the Pushgateway as a target in your Prometheus configuration:

scrape_configs:
  # ... other configs ...

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']

Federated Prometheus Setups for Large-Scale Systems

For large-scale systems, you might need to set up Prometheus federation, where one Prometheus server scrapes data from other Prometheus servers. This allows you to aggregate metrics from multiple Prometheus instances.

Here’s an example configuration for a federated Prometheus setup:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="order_processing_api"}'
        - '{job="postgres_exporter"}'
    static_configs:
      - targets:
        - 'prometheus-1:9090'
        - 'prometheus-2:9090'

This configuration allows a higher-level Prometheus server to scrape specific metrics from other Prometheus servers.

Using Exemplars for Tracing Integration

Exemplars allow you to link metrics to trace data, providing a way to drill down from a high-level metric to a specific trace. This is particularly useful when integrating Prometheus with distributed tracing systems like Jaeger or Zipkin.

To use exemplars, you need to enable them in your Prometheus configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  exemplar_storage:
    enable: true

Then, when instrumenting your code, you can add exemplars to your metrics:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    orderProcessingDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "order_processing_duration_seconds",
            Help: "Duration of order processing in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"status"},
    )
)

func processOrder(order Order) {
    start := time.Now()
    // Process the order...
    duration := time.Since(start)

    orderProcessingDuration.WithLabelValues(order.Status).Observe(duration.Seconds(),
        prometheus.Labels{
            "traceID": getCurrentTraceID(),
        },
    )
}

This allows you to link from a spike in order processing duration directly to the trace of a slow order, greatly aiding in debugging and performance analysis.

10. Testing and Validation

Ensuring the reliability of your monitoring system is crucial. Let’s explore some strategies for testing and validating our Prometheus setup.

Unit Testing Metric Instrumentation

When unit testing your Go code, you can use the prometheus/testutil package to verify that your metrics are being updated correctly:

import (
    "testing"

    "github.com/prometheus/client_golang/prometheus/testutil"
)

func TestOrderProcessing(t *testing.T) {
    // Process an order
    processOrder(Order{ID: 1, Status: "completed"})

    // Check if the metric was updated
    expected := `
        # HELP order_processing_duration_seconds Duration of order processing in seconds
        # TYPE order_processing_duration_seconds histogram
        order_processing_duration_seconds_bucket{status="completed",le="0.005"} 1
        order_processing_duration_seconds_bucket{status="completed",le="0.01"} 1
        # ... other buckets ...
        order_processing_duration_seconds_sum{status="completed"} 0.001
        order_processing_duration_seconds_count{status="completed"} 1
    `
    if err := testutil.CollectAndCompare(orderProcessingDuration, strings.NewReader(expected)); err != nil {
        t.Errorf("unexpected collecting result:\n%s", err)
    }
}

Integration Testing for Prometheus Scraping

To test that Prometheus is correctly scraping your metrics, you can set up an integration test that starts your application, waits for Prometheus to scrape it, and then queries Prometheus to verify the metrics:

func TestPrometheusIntegration(t *testing.T) {
    // Start your application
    go startApp()

    // Wait for Prometheus to scrape (adjust the sleep time as needed)
    time.Sleep(30 * time.Second)

    // Query Prometheus
    client, err := api.NewClient(api.Config{
        Address: "http://localhost:9090",
    })
    if err != nil {
        t.Fatalf("Error creating client: %v", err)
    }

    v1api := v1.NewAPI(client)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    result, warnings, err := v1api.Query(ctx, "order_processing_duration_seconds_count", time.Now())
    if err != nil {
        t.Fatalf("Error querying Prometheus: %v", err)
    }
    if len(warnings) > 0 {
        t.Logf("Warnings: %v", warnings)
    }

    // Check the result
    if result.(model.Vector).Len() == 0 {
        t.Errorf("Expected non-empty result")
    }
}

Load Testing and Observing Metrics Under Stress

It’s important to verify that your monitoring system performs well under load. You can use tools like hey or vegeta to generate load on your system while observing your metrics:

hey -n 10000 -c 100 http://localhost:8080/orders

While the load test is running, observe your Grafana dashboards and check that your metrics are updating as expected and that Prometheus is able to keep up with the increased load.

Validating Alerting Rules and Notification Channels

To test your alerting rules, you can temporarily adjust the thresholds to trigger alerts, or use Prometheus’s API to manually fire alerts:

curl -H "Content-Type: application/json" -d '{
  "alerts": [
    {
      "labels": {
        "alertname": "HighOrderProcessingErrorRate",
        "severity": "critical"
      },
      "annotations": {
        "summary": "High order processing error rate"
      }
    }
  ]
}' http://localhost:9093/api/v1/alerts

This will send a test alert to your Alertmanager, allowing you to verify that your notification channels are working correctly.

11. Challenges and Considerations

As you implement and scale your monitoring system, keep these challenges and considerations in mind:

Managing Cardinality in High-Dimensional Data

High cardinality can lead to performance issues in Prometheus. Be cautious when adding labels to metrics, especially labels with many possible values (like user IDs or IP addresses). Instead, consider using histogram metrics or reducing the cardinality by grouping similar values.

Scaling Prometheus for Large-Scale Systems

For large-scale systems, consider:

  • Using the Pushgateway for batch jobs
  • Implementing federation for large-scale setups
  • Using remote storage solutions for long-term storage of metrics

Ensuring Monitoring System Reliability and Availability

Your monitoring system is critical infrastructure. Consider:

  • 為 Prometheus 和 Alertmanager 實現高可用性
  • 監控您的監控系統(元監控)
  • 定期備份您的 Prometheus 資料

指標和警報的安全注意事項

確保:

  • 對 Prometheus 和 Grafana 的存取得到適當保護
  • 敏感資訊不會在指標或警報中暴露
  • TLS 用於監控堆疊中的所有通訊

處理暫時性問題和警報

減少警報噪音:

  • 在警報規則中使用適當的時間視窗
  • 在Alertmanager中實作警報分組
  • 考慮對相關警報使用警報抑制

12. 後續步驟和第 5 部分的預覽

在這篇文章中,我們介紹了使用 Prometheus 和 Grafana 對訂單處理系統進行全面監控和警報。我們設定了自訂指標,創建了資訊豐富的儀表板,實施了警報,並探索了先進的技術和注意事項。

在我們系列的下一部分中,我們將專注於分散式追蹤和日誌記錄。我們將介紹:

  1. 使用 OpenTelemetry 實作分散式追蹤
  2. 使用 ELK 堆疊設定集中式日誌記錄
  3. 關聯日誌、追蹤和指標以進行有效調試
  4. 實現日誌聚合和分析
  5. 登入微服務架構的最佳實務

請繼續關注我們繼續增強我們的訂單處理系統,接下來的重點是更深入地了解我們的分散式系統的行為和性能!


需要幫助嗎?

您是否面臨著具有挑戰性的問題,或需要外部視角來看待新想法或專案?我可以幫忙!無論您是想在進行更大投資之前建立技術概念驗證,還是需要解決困難問題的指導,我都會為您提供協助。

提供的服務:

  • 解決問題:透過創新的解決方案解決複雜問題。
  • 諮詢:為您的專案提供專家建議和新觀點。
  • 概念驗證:開發初步模型來測試和驗證您的想法。

如果您有興趣與我合作,請透過電子郵件與我聯繫:hungaikevin@gmail.com。

讓我們將挑戰轉化為機會!

Ce qui précède est le contenu détaillé de. pour plus d'informations, suivez d'autres articles connexes sur le site Web de PHP en chinois!

Déclaration:
Le contenu de cet article est volontairement contribué par les internautes et les droits d'auteur appartiennent à l'auteur original. Ce site n'assume aucune responsabilité légale correspondante. Si vous trouvez un contenu suspecté de plagiat ou de contrefaçon, veuillez contacter admin@php.cn