>백엔드 개발 >Golang >주문 처리 시스템 구현: 부품 모니터링 및 경고

주문 처리 시스템 구현: 부품 모니터링 및 경고

王林
王林원래의
2024-09-05 22:41:14651검색

Implementing an Order Processing System: Part  Monitoring and Alerting

1. 소개 및 목표

정교한 주문 처리 시스템 구현에 관한 시리즈의 네 번째 기사에 오신 것을 환영합니다! 이전 게시물에서는 프로젝트의 기초를 마련하고, 고급 임시 워크플로를 탐색하고, 고급 데이터베이스 작업을 자세히 살펴봤습니다. 오늘 우리는 생산 준비 시스템에서 똑같이 중요한 측면인 모니터링과 경고에 초점을 맞추고 있습니다.

이전 게시물 요약

  1. 1부에서는 프로젝트 구조를 설정하고 기본 CRUD API를 구현했습니다.
  2. 2부에서는 Temporal의 사용을 확장하여 복잡한 워크플로를 구현하고 고급 개념을 탐구했습니다.
  3. 3부에서는 최적화, 샤딩, 분산 시스템의 일관성 보장을 포함한 고급 데이터베이스 운영에 중점을 두었습니다.

마이크로서비스 아키텍처에서 모니터링 및 경고의 중요성

마이크로서비스 아키텍처, 특히 주문 관리와 같은 복잡한 프로세스를 처리하는 아키텍처에서는 효과적인 모니터링 및 알림이 매우 중요합니다. 이를 통해 우리는 다음을 수행할 수 있습니다.

  1. 실시간으로 시스템의 동작과 성능을 이해하세요
  2. 문제가 사용자에게 영향을 미치기 전에 신속하게 식별 및 진단
  3. 확장 및 최적화를 위해 데이터 기반 의사결정
  4. 서비스의 신뢰성과 가용성 보장

프로메테우스와 생태계 개요

Prometheus는 오픈 소스 시스템 모니터링 및 경고 도구 키트입니다. 강력한 기능과 광범위한 생태계로 인해 클라우드 네이티브 세계의 표준이 되었습니다. 주요 구성 요소는 다음과 같습니다.

  1. 프로메테우스 서버 : 시계열 데이터를 스크랩하여 저장
  2. 클라이언트 라이브러리: 애플리케이션 코드를 쉽게 계측할 수 있습니다
  3. Alertmanager : Prometheus 서버의 알림을 처리합니다
  4. Pushgateway: 임시 및 일괄 작업을 통해 측정항목을 노출할 수 있습니다
  5. 내보내기: 타사 시스템이 측정항목을 Prometheus에 노출하도록 허용

또한 모니터링 및 관찰을 위한 인기 있는 오픈 소스 플랫폼인 Grafana를 사용하여 대시보드를 만들고 Prometheus 데이터를 시각화할 예정입니다.

이번 시리즈의 목표

이 게시물이 끝나면 다음을 수행할 수 있습니다.

  1. Prometheus를 설정하여 주문 처리 시스템을 모니터링하세요
  2. Go 서비스에 맞춤 측정항목 구현
  3. Grafana를 사용하여 유용한 대시보드 만들기
  4. 잠재적인 문제를 알리는 경고 규칙 설정
  5. 데이터베이스 성능 및 임시 워크플로우를 효과적으로 모니터링

들어가자!

2. 이론적 배경과 개념

구현을 시작하기 전에 모니터링 및 알림 설정에 중요한 몇 가지 주요 개념을 검토해 보겠습니다.

분산 시스템의 관찰 가능성

관찰 가능성은 출력을 검사하여 시스템의 내부 상태를 이해하는 능력을 의미합니다. 주문 처리 시스템과 같은 분산 시스템에서 관찰 가능성은 일반적으로 세 가지 주요 요소를 포함합니다.

  1. 메트릭 : 시간 간격에 따라 측정된 데이터를 수치로 표현
  2. 로그 : 시스템 내 개별 이벤트에 대한 자세한 기록
  3. 추적 : 구성 요소 전반에 걸친 인과 관계 체인 표현

이 게시물에서는 주로 측정항목에 초점을 맞추지만 이를 로그 및 추적과 통합할 수 있는 방법에 대해서도 다루겠습니다.

프로메테우스 건축

Prometheus는 풀 기반 아키텍처를 따릅니다.

  1. 데이터 수집 : Prometheus는 HTTP를 통해 계측된 작업에서 측정항목을 스크랩합니다
  2. 데이터 저장소 : 지표는 로컬 저장소의 시계열 데이터베이스에 저장됩니다
  3. 쿼리 : PromQL을 사용하면 이 데이터를 유연하게 쿼리할 수 있습니다
  4. 알림 : Prometheus는 쿼리 결과에 따라 알림을 트리거할 수 있습니다
  5. 시각화 : Prometheus에는 기본 UI가 있지만 보다 풍부한 시각화를 위해 Grafana와 결합되는 경우가 많습니다

Prometheus의 측정항목 유형

Prometheus는 네 가지 핵심 측정항목 유형을 제공합니다.

  1. Kaunter : Metrik kumulatif yang hanya meningkat (mis., bilangan permintaan yang diproses)
  2. Tolok : Metrik yang boleh naik dan turun (mis., penggunaan memori semasa)
  3. Histogram : Sampel pemerhatian dan mengiranya dalam baldi boleh dikonfigurasikan (cth., meminta tempoh)
  4. Ringkasan : Serupa dengan histogram, tetapi mengira kuantiti boleh dikonfigurasikan pada tetingkap masa gelongsor

Pengenalan kepada PromQL

PromQL (Prometheus Query Language) ialah bahasa berfungsi yang berkuasa untuk menanyakan data Prometheus. Ia membolehkan anda memilih dan mengagregat data siri masa dalam masa nyata. Ciri utama termasuk:

  • Pemilih vektor segera
  • Pemilih vektor julat
  • Pengubah suai offset
  • Pengendali pengagregatan
  • Pengendali binari

Kami akan melihat contoh pertanyaan PromQL semasa kami membina papan pemuka dan makluman kami.

Gambaran keseluruhan Grafana

Grafana ialah analitik sumber terbuka berbilang platform dan aplikasi web visualisasi interaktif. Ia menyediakan carta, graf dan makluman untuk web apabila disambungkan kepada sumber data yang disokong, yang mana Prometheus adalah salah satunya. Ciri utama termasuk:

  • Penciptaan papan pemuka yang fleksibel
  • Pelbagai pilihan visualisasi
  • Keupayaan memberi amaran
  • Pengesahan dan kebenaran pengguna
  • Sistem pemalam untuk kebolehlanjutan

Sekarang kita telah membincangkan konsep ini, mari kita mula melaksanakan sistem pemantauan dan amaran kami.

3. Menyediakan Prometheus untuk Sistem Pemprosesan Pesanan Kami

Mari mulakan dengan menyediakan Prometheus untuk memantau sistem pemprosesan pesanan kami.

Memasang dan Mengkonfigurasi Prometheus

Pertama, mari tambah Prometheus pada fail docker-compose.yml kami:

services:
  # ... other services ...

  prometheus:
    image: prom/prometheus:v2.30.3
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090

volumes:
  # ... other volumes ...
  prometheus_data: {}

Seterusnya, buat fail prometheus.yml dalam direktori ./prometheus:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'order_processing_api'
    static_configs:
      - targets: ['order_processing_api:8080']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']

Konfigurasi ini memberitahu Prometheus untuk mengikis metrik daripada dirinya sendiri, API pemprosesan pesanan kami dan pengeksport Postgres (yang akan kami sediakan kemudian).

Melaksanakan Pengeksport Prometheus untuk Perkhidmatan Go Kami

Untuk mendedahkan metrik daripada perkhidmatan Go kami, kami akan menggunakan perpustakaan pelanggan Prometheus. Mula-mula, tambahkannya pada go.mod anda:

go get github.com/prometheus/client_golang

Sekarang, mari ubah suai fail Go utama kami untuk mendedahkan metrik:

package main

import (
    "net/http"

    "github.com/gin-gonic/gin"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "http_request_duration_seconds",
            Help: "Duration of HTTP requests in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )
)

func init() {
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
}

func main() {
    r := gin.Default()

    // Middleware to record metrics
    r.Use(func(c *gin.Context) {
        timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(c.Request.Method, c.FullPath()))
        c.Next()
        timer.ObserveDuration()
        httpRequestsTotal.WithLabelValues(c.Request.Method, c.FullPath(), string(c.Writer.Status())).Inc()
    })

    // Expose metrics endpoint
    r.GET("/metrics", gin.WrapH(promhttp.Handler()))

    // ... rest of your routes ...

    r.Run(":8080")
}

Kod ini menyediakan dua metrik:

  1. http_requests_total: Kaunter yang menjejaki jumlah permintaan HTTP
  2. http_request_duration_seconds: Histogram yang menjejaki tempoh permintaan HTTP

Menyediakan Penemuan Perkhidmatan untuk Persekitaran Dinamik

Untuk persekitaran yang lebih dinamik, Prometheus menyokong pelbagai mekanisme penemuan perkhidmatan. Contohnya, jika anda menjalankan Kubernetes, anda mungkin menggunakan konfigurasi SD Kubernetes:

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Konfigurasi ini secara automatik akan menemui dan mengikis metrik daripada pod dengan anotasi yang sesuai.

Mengkonfigurasi Pengekalan dan Penyimpanan untuk Data Prometheus

Prometheus menyimpan data dalam pangkalan data siri masa pada sistem fail tempatan. Anda boleh mengkonfigurasi masa pengekalan dan saiz storan dalam konfigurasi Prometheus:

global:
  scrape_interval: 15s
  evaluation_interval: 15s

storage:
  tsdb:
    retention.time: 15d
    retention.size: 50GB

# ... rest of the configuration ...

Konfigurasi ini menetapkan tempoh pengekalan selama 15 hari dan saiz storan maksimum 50GB.

Dalam bahagian seterusnya, kami akan menyelami dalam menentukan dan melaksanakan metrik tersuai untuk sistem pemprosesan pesanan kami.

4. Menentukan dan Melaksanakan Metrik Tersuai

Sekarang kami telah menyediakan Prometheus dan metrik HTTP asas dilaksanakan, mari tentukan dan laksanakan metrik tersuai khusus untuk sistem pemprosesan pesanan kami.

Mereka bentuk Skema Metrik untuk Sistem Pemprosesan Pesanan Kami

Apabila mereka bentuk metrik, adalah penting untuk memikirkan tentang cerapan yang ingin kami peroleh daripada sistem kami. Untuk sistem pemprosesan pesanan kami, kami mungkin mahu menjejaki:

  1. Kadar penciptaan pesanan
  2. Masa pemprosesan pesanan
  3. Pengagihan status pesanan
  4. Kadar kejayaan/kegagalan pemprosesan pembayaran
  5. Operasi kemas kini inventori
  6. Masa urusan penghantaran

Mari laksanakan metrik ini:

package metrics

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    OrdersCreated = promauto.NewCounter(prometheus.CounterOpts{
        Name: "orders_created_total",
        Help: "The total number of created orders",
    })

    OrderProcessingTime = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "order_processing_seconds",
        Help: "Time taken to process an order",
        Buckets: prometheus.LinearBuckets(0, 30, 10), // 0-300 seconds, 30-second buckets
    })

    OrderStatusGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
        Name: "orders_by_status",
        Help: "Number of orders by status",
    }, []string{"status"})

    PaymentProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "payments_processed_total",
        Help: "The total number of processed payments",
    }, []string{"status"})

    InventoryUpdates = promauto.NewCounter(prometheus.CounterOpts{
        Name: "inventory_updates_total",
        Help: "The total number of inventory updates",
    })

    ShippingArrangementTime = promauto.NewHistogram(prometheus.HistogramOpts{
        Name: "shipping_arrangement_seconds",
        Help: "Time taken to arrange shipping",
        Buckets: prometheus.LinearBuckets(0, 60, 5), // 0-300 seconds, 60-second buckets
    })
)

Melaksanakan Metrik Khusus Aplikasi dalam Perkhidmatan Go Kami

Sekarang kami telah menentukan metrik kami, mari laksanakan metrik tersebut dalam perkhidmatan kami:

package main

import (
    "time"

    "github.com/yourusername/order-processing-system/metrics"
)

func createOrder(order Order) error {
    startTime := time.Now()

    // Order creation logic...

    metrics.OrdersCreated.Inc()
    metrics.OrderProcessingTime.Observe(time.Since(startTime).Seconds())
    metrics.OrderStatusGauge.WithLabelValues("pending").Inc()

    return nil
}

func processPayment(payment Payment) error {
    // Payment processing logic...

    if paymentSuccessful {
        metrics.PaymentProcessed.WithLabelValues("success").Inc()
    } else {
        metrics.PaymentProcessed.WithLabelValues("failure").Inc()
    }

    return nil
}

func updateInventory(item Item) error {
    // Inventory update logic...

    metrics.InventoryUpdates.Inc()

    return nil
}

func arrangeShipping(order Order) error {
    startTime := time.Now()

    // Shipping arrangement logic...

    metrics.ShippingArrangementTime.Observe(time.Since(startTime).Seconds())

    return nil
}

Amalan Terbaik untuk Penamaan dan Metrik Pelabelan

Apabila menamakan dan melabel metrik, pertimbangkan amalan terbaik ini:

  1. Use a consistent naming scheme (e.g., __)
  2. Use clear, descriptive names
  3. Include units in the metric name (e.g., _seconds, _bytes)
  4. Use labels to differentiate instances of a metric, but be cautious of high cardinality
  5. Keep the number of labels manageable

Instrumenting Key Components: API Endpoints, Database Operations, Temporal Workflows

For API endpoints, we’ve already implemented basic instrumentation. For database operations, we can add metrics like this:

func (s *Store) GetOrder(ctx context.Context, id int64) (Order, error) {
    startTime := time.Now()
    defer func() {
        metrics.DBOperationDuration.WithLabelValues("GetOrder").Observe(time.Since(startTime).Seconds())
    }()

    // Existing GetOrder logic...
}

For Temporal workflows, we can add metrics in our activity implementations:

func ProcessOrderActivity(ctx context.Context, order Order) error {
    startTime := time.Now()
    defer func() {
        metrics.WorkflowActivityDuration.WithLabelValues("ProcessOrder").Observe(time.Since(startTime).Seconds())
    }()

    // Existing ProcessOrder logic...
}

5. Creating Dashboards with Grafana

Now that we have our metrics set up, let’s visualize them using Grafana.

Installing and Configuring Grafana

First, let’s add Grafana to our docker-compose.yml:

services:
  # ... other services ...

  grafana:
    image: grafana/grafana:8.2.2
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  # ... other volumes ...
  grafana_data: {}

Connecting Grafana to Our Prometheus Data Source

  1. Access Grafana at http://localhost:3000 (default credentials are admin/admin)
  2. Go to Configuration > Data Sources
  3. Click “Add data source” and select Prometheus
  4. Set the URL to http://prometheus:9090 (this is the Docker service name)
  5. Click “Save & Test”

Designing Effective Dashboards for Our Order Processing System

Let’s create a dashboard for our order processing system:

  1. Click “Create” > “Dashboard”
  2. Add a new panel

For our first panel, let’s create a graph of order creation rate:

  1. In the query editor, enter: rate(orders_created_total[5m])
  2. Set the panel title to “Order Creation Rate”
  3. Under Settings, set the unit to “orders/second”

Let’s add another panel for order processing time:

  1. Add a new panel
  2. Query: histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m]))
  3. Title: “95th Percentile Order Processing Time”
  4. Unit: “seconds”

For order status distribution:

  1. Add a new panel
  2. Query: orders_by_status
  3. Visualization: Pie Chart
  4. Title: “Order Status Distribution”

Continue adding panels for other metrics we’ve defined.

Implementing Variable Templating for Flexible Dashboards

Grafana allows us to create variables that can be used across the dashboard. Let’s create a variable for time range:

  1. Go to Dashboard Settings > Variables
  2. Click “Add variable”
  3. Name: time_range
  4. Type: Interval
  5. Values: 5m,15m,30m,1h,6h,12h,24h,7d

Now we can use this in our queries like this: rate(orders_created_total[$time_range])

Best Practices for Dashboard Design and Organization

  1. Group related panels together
  2. Use consistent color schemes
  3. Include a description for each panel
  4. Use appropriate visualizations for each metric type
  5. Consider creating separate dashboards for different aspects of the system (e.g., Orders, Inventory, Shipping)

In the next section, we’ll set up alerting rules to notify us of potential issues in our system.

6. Implementing Alerting Rules

Now that we have our metrics and dashboards set up, let’s implement alerting to proactively notify us of potential issues in our system.

Designing an Alerting Strategy for Our System

When designing alerts, consider the following principles:

  1. Alert on symptoms, not causes
  2. Ensure alerts are actionable
  3. Avoid alert fatigue by only alerting on critical issues
  4. Use different severity levels for different types of issues

For our order processing system, we might want to alert on:

  1. High error rate in order processing
  2. Slow order processing time
  3. Unusual spike or drop in order creation rate
  4. Low inventory levels
  5. High rate of payment failures

Implementing Prometheus Alerting Rules

Let’s create an alerts.yml file in our Prometheus configuration directory:

groups:
- name: order_processing_alerts
  rules:
  - alert: HighOrderProcessingErrorRate
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: High order processing error rate
      description: "Error rate is over the last 5 minutes"

  - alert: SlowOrderProcessing
    expr: histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m])) > 300
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Slow order processing
      description: "95th percentile of order processing time is over the last 5 minutes"

  - alert: UnusualOrderRate
    expr: abs(rate(orders_created_total[1h]) - rate(orders_created_total[1h] offset 1d)) > (rate(orders_created_total[1h] offset 1d) * 0.3)
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: Unusual order creation rate
      description: "Order creation rate has changed by more than 30% compared to the same time yesterday"

  - alert: LowInventory
    expr: inventory_level < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Low inventory level
      description: "Inventory level for is "

  - alert: HighPaymentFailureRate
    expr: rate(payments_processed_total{status="failure"}[15m]) / rate(payments_processed_total[15m]) > 0.1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: High payment failure rate
      description: "Payment failure rate is over the last 15 minutes"

Update your prometheus.yml to include this alerts file:

rule_files:
  - "alerts.yml"

Setting Up Alertmanager for Alert Routing and Grouping

Now, let’s set up Alertmanager to handle our alerts. Add Alertmanager to your docker-compose.yml:

services:
  # ... other services ...

  alertmanager:
    image: prom/alertmanager:v0.23.0
    ports:
      - 9093:9093
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'

Create an alertmanager.yml in the ./alertmanager directory:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'team@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_identity: 'alertmanager@example.com'
    auth_password: 'password'

Update your prometheus.yml to point to Alertmanager:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

Configuring Notification Channels

In the Alertmanager configuration above, we’ve set up email notifications. You can also configure other channels like Slack, PagerDuty, or custom webhooks.

Implementing Alert Severity Levels and Escalation Policies

In our alerts, we’ve used severity labels. We can use these in Alertmanager to implement different routing or notification strategies based on severity:

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
  - match:
      severity: warning
    receiver: 'slack-warnings'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'team@example.com'
- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: '<your-pagerduty-service-key>'
- name: 'slack-warnings'
  slack_configs:
  - api_url: '<your-slack-webhook-url>'
    channel: '#alerts'

7. Monitoring Database Performance

Monitoring database performance is crucial for maintaining a responsive and reliable system. Let’s set up monitoring for our PostgreSQL database.

Implementing the Postgres Exporter for Prometheus

First, add the Postgres exporter to your docker-compose.yml:

services:
  # ... other services ...

  postgres_exporter:
    image: wrouesnel/postgres_exporter:latest
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - 9187:9187

Make sure to replace user, password, and dbname with your actual PostgreSQL credentials.

Key Metrics to Monitor for Postgres Performance

Some important PostgreSQL metrics to monitor include:

  1. Number of active connections
  2. Database size
  3. Query execution time
  4. Cache hit ratio
  5. Replication lag (if using replication)
  6. Transaction rate
  7. Tuple operations (inserts, updates, deletes)

Creating a Database Performance Dashboard in Grafana

Let’s create a new dashboard for database performance:

  1. Create a new dashboard in Grafana
  2. Add a panel for active connections:
    • Query: pg_stat_activity_count{datname="your_database_name"}
    • Title: “Active Connections”
  3. Add a panel for database size:
    • Query: pg_database_size_bytes{datname="your_database_name"}
    • Title: “Database Size”
    • Unit: bytes(IEC)
  4. Add a panel for query execution time:
    • Query: rate(pg_stat_database_xact_commit{datname="your_database_name"}[5m]) + rate(pg_stat_database_xact_rollback{datname="your_database_name"}[5m])
    • Title: “Transactions per Second”
  5. Add a panel for cache hit ratio:
    • Query: pg_stat_database_blks_hit{datname="your_database_name"} / (pg_stat_database_blks_hit{datname="your_database_name"} + pg_stat_database_blks_read{datname="your_database_name"})
    • Title: “Cache Hit Ratio”

Setting Up Alerts for Database Issues

Let’s add some database-specific alerts to our alerts.yml:

  - alert: HighDatabaseConnections
    expr: pg_stat_activity_count > 100
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: High number of database connections
      description: "There are active database connections"

  - alert: LowCacheHitRatio
    expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.9
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: Low database cache hit ratio
      description: "Cache hit ratio is "

8. Monitoring Temporal Workflows

Monitoring Temporal workflows is essential for ensuring the reliability and performance of our order processing system.

Implementing Temporal Metrics in Our Go Services

Temporal provides a metrics client that we can use to expose metrics to Prometheus. Let’s update our Temporal worker to include metrics:

import (
    "go.temporal.io/sdk/client"
    "go.temporal.io/sdk/worker"
    "go.temporal.io/sdk/contrib/prometheus"
)

func main() {
    // ... other setup ...

    // Create Prometheus metrics handler
    metricsHandler := prometheus.NewPrometheusMetricsHandler()

    // Create Temporal client with metrics
    c, err := client.NewClient(client.Options{
        MetricsHandler: metricsHandler,
    })
    if err != nil {
        log.Fatalln("Unable to create Temporal client", err)
    }
    defer c.Close()

    // Create worker with metrics
    w := worker.New(c, "order-processing-task-queue", worker.Options{
        MetricsHandler: metricsHandler,
    })

    // ... register workflows and activities ...

    // Run the worker
    err = w.Run(worker.InterruptCh())
    if err != nil {
        log.Fatalln("Unable to start worker", err)
    }
}

Key Metrics to Monitor for Temporal Workflows

Important Temporal metrics to monitor include:

  1. Workflow start rate
  2. Workflow completion rate
  3. Workflow execution time
  4. Activity success/failure rate
  5. Activity execution time
  6. Task queue latency

Creating a Temporal Workflow Dashboard in Grafana

Let’s create a dashboard for Temporal workflows:

  1. Create a new dashboard in Grafana
  2. Add a panel for workflow start rate:
    • Query: rate(temporal_workflow_start_total[5m])
    • Title: “Workflow Start Rate”
  3. Add a panel for workflow completion rate:
    • Query: rate(temporal_workflow_completed_total[5m])
    • Title: “Workflow Completion Rate”
  4. Add a panel for workflow execution time:
    • Query: histogram_quantile(0.95, rate(temporal_workflow_execution_time_bucket[5m]))
    • Title: “95th Percentile Workflow Execution Time”
    • Unit: seconds
  5. Add a panel for activity success rate:
    • Query: rate(temporal_activity_success_total[5m]) / (rate(temporal_activity_success_total[5m]) + rate(temporal_activity_fail_total[5m]))
    • Title: “Activity Success Rate”

Setting Up Alerts for Workflow Issues

Let’s add some Temporal-specific alerts to our alerts.yml:

  - alert: HighWorkflowFailureRate
    expr: rate(temporal_workflow_failed_total[15m]) / rate(temporal_workflow_completed_total[15m]) > 0.05
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: High workflow failure rate
      description: "Workflow failure rate is over the last 15 minutes"

  - alert: LongRunningWorkflow
    expr: histogram_quantile(0.95, rate(temporal_workflow_execution_time_bucket[1h])) > 3600
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: Long-running workflows detected
      description: "95th percentile of workflow execution time is over 1 hour"

These alerts will help you detect issues with your Temporal workflows, such as high failure rates or unexpectedly long-running workflows.

In the next sections, we’ll cover some advanced Prometheus techniques and discuss testing and validation of our monitoring setup.

9. Advanced Prometheus Techniques

As our monitoring system grows more complex, we can leverage some advanced Prometheus techniques to improve its efficiency and capabilities.

Using Recording Rules for Complex Queries and Aggregations

Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. This can significantly speed up the evaluation of dashboards and alerts.

Let’s add some recording rules to our Prometheus configuration. Create a rules.yml file:

groups:
- name: example_recording_rules
  interval: 5m
  rules:
  - record: job:order_processing_rate:5m
    expr: rate(orders_created_total[5m])

  - record: job:order_processing_error_rate:5m
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m])

  - record: job:payment_success_rate:5m
    expr: rate(payments_processed_total{status="success"}[5m]) / rate(payments_processed_total[5m])

Add this file to your Prometheus configuration:

rule_files:
  - "alerts.yml"
  - "rules.yml"

Now you can use these precomputed metrics in your dashboards and alerts, which can be especially helpful for complex queries that you use frequently.

Implementing Push Gateway for Batch Jobs and Short-Lived Processes

The Pushgateway allows you to push metrics from jobs that can’t be scraped, such as batch jobs or serverless functions. Let’s add a Pushgateway to our docker-compose.yml:

services:
  # ... other services ...

  pushgateway:
    image: prom/pushgateway
    ports:
      - 9091:9091

Now, you can push metrics to the Pushgateway from your batch jobs or short-lived processes. Here’s an example using the Go client:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/push"
)

func runBatchJob() {
    // Define a counter for the batch job
    batchJobCounter := prometheus.NewCounter(prometheus.CounterOpts{
        Name: "batch_job_processed_total",
        Help: "Total number of items processed by the batch job",
    })

    // Run your batch job and update the counter
    // ...

    // Push the metric to the Pushgateway
    pusher := push.New("http://pushgateway:9091", "batch_job")
    pusher.Collector(batchJobCounter)
    if err := pusher.Push(); err != nil {
        log.Printf("Could not push to Pushgateway: %v", err)
    }
}

Don’t forget to add the Pushgateway as a target in your Prometheus configuration:

scrape_configs:
  # ... other configs ...

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']

Federated Prometheus Setups for Large-Scale Systems

For large-scale systems, you might need to set up Prometheus federation, where one Prometheus server scrapes data from other Prometheus servers. This allows you to aggregate metrics from multiple Prometheus instances.

Here’s an example configuration for a federated Prometheus setup:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="order_processing_api"}'
        - '{job="postgres_exporter"}'
    static_configs:
      - targets:
        - 'prometheus-1:9090'
        - 'prometheus-2:9090'

This configuration allows a higher-level Prometheus server to scrape specific metrics from other Prometheus servers.

Using Exemplars for Tracing Integration

Exemplars allow you to link metrics to trace data, providing a way to drill down from a high-level metric to a specific trace. This is particularly useful when integrating Prometheus with distributed tracing systems like Jaeger or Zipkin.

To use exemplars, you need to enable them in your Prometheus configuration:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  exemplar_storage:
    enable: true

Then, when instrumenting your code, you can add exemplars to your metrics:

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    orderProcessingDuration = promauto.NewHistogramVec(
        prometheus.HistogramOpts{
            Name: "order_processing_duration_seconds",
            Help: "Duration of order processing in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"status"},
    )
)

func processOrder(order Order) {
    start := time.Now()
    // Process the order...
    duration := time.Since(start)

    orderProcessingDuration.WithLabelValues(order.Status).Observe(duration.Seconds(),
        prometheus.Labels{
            "traceID": getCurrentTraceID(),
        },
    )
}

This allows you to link from a spike in order processing duration directly to the trace of a slow order, greatly aiding in debugging and performance analysis.

10. Testing and Validation

Ensuring the reliability of your monitoring system is crucial. Let’s explore some strategies for testing and validating our Prometheus setup.

Unit Testing Metric Instrumentation

When unit testing your Go code, you can use the prometheus/testutil package to verify that your metrics are being updated correctly:

import (
    "testing"

    "github.com/prometheus/client_golang/prometheus/testutil"
)

func TestOrderProcessing(t *testing.T) {
    // Process an order
    processOrder(Order{ID: 1, Status: "completed"})

    // Check if the metric was updated
    expected := `
        # HELP order_processing_duration_seconds Duration of order processing in seconds
        # TYPE order_processing_duration_seconds histogram
        order_processing_duration_seconds_bucket{status="completed",le="0.005"} 1
        order_processing_duration_seconds_bucket{status="completed",le="0.01"} 1
        # ... other buckets ...
        order_processing_duration_seconds_sum{status="completed"} 0.001
        order_processing_duration_seconds_count{status="completed"} 1
    `
    if err := testutil.CollectAndCompare(orderProcessingDuration, strings.NewReader(expected)); err != nil {
        t.Errorf("unexpected collecting result:\n%s", err)
    }
}

Integration Testing for Prometheus Scraping

To test that Prometheus is correctly scraping your metrics, you can set up an integration test that starts your application, waits for Prometheus to scrape it, and then queries Prometheus to verify the metrics:

func TestPrometheusIntegration(t *testing.T) {
    // Start your application
    go startApp()

    // Wait for Prometheus to scrape (adjust the sleep time as needed)
    time.Sleep(30 * time.Second)

    // Query Prometheus
    client, err := api.NewClient(api.Config{
        Address: "http://localhost:9090",
    })
    if err != nil {
        t.Fatalf("Error creating client: %v", err)
    }

    v1api := v1.NewAPI(client)
    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()
    result, warnings, err := v1api.Query(ctx, "order_processing_duration_seconds_count", time.Now())
    if err != nil {
        t.Fatalf("Error querying Prometheus: %v", err)
    }
    if len(warnings) > 0 {
        t.Logf("Warnings: %v", warnings)
    }

    // Check the result
    if result.(model.Vector).Len() == 0 {
        t.Errorf("Expected non-empty result")
    }
}

Load Testing and Observing Metrics Under Stress

It’s important to verify that your monitoring system performs well under load. You can use tools like hey or vegeta to generate load on your system while observing your metrics:

hey -n 10000 -c 100 http://localhost:8080/orders

While the load test is running, observe your Grafana dashboards and check that your metrics are updating as expected and that Prometheus is able to keep up with the increased load.

Validating Alerting Rules and Notification Channels

To test your alerting rules, you can temporarily adjust the thresholds to trigger alerts, or use Prometheus’s API to manually fire alerts:

curl -H "Content-Type: application/json" -d '{
  "alerts": [
    {
      "labels": {
        "alertname": "HighOrderProcessingErrorRate",
        "severity": "critical"
      },
      "annotations": {
        "summary": "High order processing error rate"
      }
    }
  ]
}' http://localhost:9093/api/v1/alerts

This will send a test alert to your Alertmanager, allowing you to verify that your notification channels are working correctly.

11. Challenges and Considerations

As you implement and scale your monitoring system, keep these challenges and considerations in mind:

Managing Cardinality in High-Dimensional Data

High cardinality can lead to performance issues in Prometheus. Be cautious when adding labels to metrics, especially labels with many possible values (like user IDs or IP addresses). Instead, consider using histogram metrics or reducing the cardinality by grouping similar values.

Scaling Prometheus for Large-Scale Systems

For large-scale systems, consider:

  • Using the Pushgateway for batch jobs
  • Implementing federation for large-scale setups
  • Using remote storage solutions for long-term storage of metrics

Ensuring Monitoring System Reliability and Availability

Your monitoring system is critical infrastructure. Consider:

  • Melaksanakan ketersediaan tinggi untuk Prometheus dan Alertmanager
  • Memantau sistem pemantauan anda (meta-monitoring)
  • Sandarkan data Prometheus anda secara kerap

Pertimbangan Keselamatan untuk Metrik dan Makluman

Pastikan bahawa:

  • Akses kepada Prometheus dan Grafana dijamin dengan betul
  • Maklumat sensitif tidak didedahkan dalam metrik atau makluman
  • TLS digunakan untuk semua komunikasi dalam timbunan pemantauan anda

Menangani Isu Sementara dan Makluman Berkepak

Untuk mengurangkan bunyi amaran:

  • Gunakan tetingkap masa yang sesuai dalam peraturan makluman anda
  • Laksanakan pengumpulan makluman dalam Alertmanager
  • Pertimbangkan menggunakan perencatan amaran untuk makluman berkaitan

12. Langkah Seterusnya dan Pratonton Bahagian 5

Dalam siaran ini, kami telah merangkumi pemantauan dan amaran komprehensif untuk sistem pemprosesan pesanan kami menggunakan Prometheus dan Grafana. Kami telah menyediakan metrik tersuai, mencipta papan pemuka bermaklumat, melaksanakan makluman dan meneroka teknik dan pertimbangan lanjutan.

Dalam bahagian seterusnya siri kami, kami akan menumpukan pada pengesanan dan pembalakan yang diedarkan. Kami akan meliputi:

  1. Melaksanakan pengesanan teragih dengan OpenTelemetry
  2. Menyediakan pengelogan berpusat dengan timbunan ELK
  3. Menghubungkan log, jejak dan metrik untuk penyahpepijatan yang berkesan
  4. Melaksanakan pengagregatan dan analisis log
  5. Amalan terbaik untuk log masuk seni bina perkhidmatan mikro

Nantikan semasa kami terus mempertingkatkan sistem pemprosesan pesanan kami, memfokus seterusnya untuk mendapatkan cerapan yang lebih mendalam tentang tingkah laku dan prestasi sistem yang diedarkan kami!


Perlukan Bantuan?

Adakah anda menghadapi masalah yang mencabar, atau memerlukan perspektif luaran tentang idea atau projek baharu? Saya boleh tolong! Sama ada anda ingin membina konsep bukti teknologi sebelum membuat pelaburan yang lebih besar, atau anda memerlukan panduan tentang isu yang sukar, saya sedia membantu.

Perkhidmatan yang Ditawarkan:

  • Penyelesaian Masalah: Menangani isu yang rumit dengan penyelesaian yang inovatif.
  • Perundingan: Memberikan nasihat pakar dan pandangan baharu tentang projek anda.
  • Bukti Konsep: Membangunkan model awal untuk menguji dan mengesahkan idea anda.

Jika anda berminat untuk bekerja dengan saya, sila hubungi melalui e-mel di hungaikevin@gmail.com.

Mari jadikan cabaran anda sebagai peluang!

위 내용은 주문 처리 시스템 구현: 부품 모니터링 및 경고의 상세 내용입니다. 자세한 내용은 PHP 중국어 웹사이트의 기타 관련 기사를 참조하세요!

성명:
본 글의 내용은 네티즌들의 자발적인 기여로 작성되었으며, 저작권은 원저작자에게 있습니다. 본 사이트는 이에 상응하는 법적 책임을 지지 않습니다. 표절이나 침해가 의심되는 콘텐츠를 발견한 경우 admin@php.cn으로 문의하세요.