Welcome to the fourth part of our series on implementing a sophisticated order processing system! In previous posts we laid the foundation for our project, explored advanced Temporal workflows, and dove into advanced database operations. Today we turn to an equally important aspect of any production-ready system: monitoring and alerting.
In a microservices architecture, especially one handling a complex process like order management, effective monitoring and alerting are crucial. They allow us to:
- Detect and diagnose issues before they affect customers
- Understand how the system behaves under normal and peak load
- Make data-driven decisions about scaling and optimization
Prometheus is an open-source systems monitoring and alerting toolkit. Thanks to its powerful feature set and broad ecosystem, it has become a standard in the cloud-native world. Key components include:
- The Prometheus server, which scrapes and stores time-series data
- Client libraries for instrumenting application code
- Exporters for exposing metrics from third-party systems
- The Pushgateway for short-lived jobs
- The Alertmanager for handling alerts
We will also use Grafana, a popular open-source platform for monitoring and observability, to build dashboards and visualize our Prometheus data.
By the end of this post, you will be able to:
- Set up Prometheus to monitor our order processing system
- Define and expose custom application metrics from our Go services
- Build Grafana dashboards to visualize those metrics
- Configure alerting rules and Alertmanager notifications
- Monitor our PostgreSQL database and Temporal workflows
Let's dive in!
Before we start implementing, let's review a few key concepts that are central to our monitoring and alerting setup.
Observability refers to the ability to understand the internal state of a system by examining its outputs. In a distributed system like our order processing system, observability typically rests on three main pillars:
- Metrics: numeric measurements collected over time
- Logs: timestamped records of discrete events
- Traces: records of individual requests as they flow through the services
In this post we focus primarily on metrics, although we will also discuss how to integrate them with logs and traces.
Prometheus follows a pull-based architecture:
- Instrumented services expose their current metric values over HTTP, typically at /metrics
- The Prometheus server scrapes (pulls) these endpoints at a configured interval and stores the samples in its local time-series database
- Recording and alerting rules are evaluated against the stored data
- Alertmanager receives fired alerts and handles grouping, routing, and notification
Prometheus offers four core metric types:
- Counter: a cumulative value that only increases, such as the number of requests served
- Gauge: a value that can go up and down, such as the number of in-flight orders
- Histogram: samples observations, such as request durations, into configurable buckets
- Summary: similar to a histogram, but calculates quantiles on the client side
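To make the types concrete, here is a minimal, stand-alone sketch of how each one is declared with the Go client library. The metric names here are purely illustrative and are not part of our order processing code:

package metricsexamples

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Counter: a value that only ever goes up (e.g. number of jobs handled).
	jobsProcessed = promauto.NewCounter(prometheus.CounterOpts{
		Name: "jobs_processed_total", Help: "Total jobs processed",
	})

	// Gauge: a value that can go up and down (e.g. items currently queued).
	queueDepth = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "queue_depth", Help: "Current number of queued jobs",
	})

	// Histogram: samples observations into configurable buckets (e.g. latencies).
	jobDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "job_duration_seconds",
		Help:    "Job duration in seconds",
		Buckets: prometheus.DefBuckets,
	})

	// Summary: like a histogram, but computes quantiles on the client side.
	jobSize = promauto.NewSummary(prometheus.SummaryOpts{
		Name:       "job_size_bytes",
		Help:       "Size of processed jobs in bytes",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
)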
PromQL (the Prometheus Query Language) is a powerful functional language for querying Prometheus data. It lets you select and aggregate time-series data in real time. Key features include:
- Selecting and filtering series by metric name and labels
- Range vectors (for example, http_requests_total[5m]) for working with windows of data
- Aggregation operators such as sum, avg, and max
- Functions such as rate(), increase(), and histogram_quantile()
We will see examples of PromQL queries as we build dashboards and alerts.
Grafana is a multi-platform, open-source analytics and interactive visualization web application. When connected to a supported data source (Prometheus being one of them), it provides charts, graphs, and alerts for the web. Key features include:
- Flexible dashboards with a wide range of panel types
- Support for many data sources beyond Prometheus
- Template variables for building reusable dashboards
- Dashboard sharing and provisioning
- Built-in alerting
With these concepts covered, let's start implementing our monitoring and alerting system.
Let's begin by setting up Prometheus to monitor our order processing system.
First, let's add Prometheus to our docker-compose.yml file:
services:
  # ... other services ...

  prometheus:
    image: prom/prometheus:v2.30.3
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090

volumes:
  # ... other volumes ...
  prometheus_data: {}
Next, create a prometheus.yml file in the ./prometheus directory:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'order_processing_api'
    static_configs:
      - targets: ['order_processing_api:8080']

  - job_name: 'postgres'
    static_configs:
      - targets: ['postgres_exporter:9187']
This configuration tells Prometheus to scrape metrics from itself, our order processing API, and the Postgres exporter (which we will set up later).
To expose metrics from our Go services, we will use the Prometheus client library. First, add it to your module:
go get github.com/prometheus/client_golang
Now, let's modify our main Go file to expose metrics:
package main

import (
	"strconv"

	"github.com/gin-gonic/gin"
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	httpRequestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "http_requests_total",
			Help: "Total number of HTTP requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	httpRequestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "http_request_duration_seconds",
			Help:    "Duration of HTTP requests in seconds",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"method", "endpoint"},
	)
)

func init() {
	prometheus.MustRegister(httpRequestsTotal)
	prometheus.MustRegister(httpRequestDuration)
}

func main() {
	r := gin.Default()

	// Middleware to record metrics
	r.Use(func(c *gin.Context) {
		timer := prometheus.NewTimer(httpRequestDuration.WithLabelValues(c.Request.Method, c.FullPath()))
		c.Next()
		timer.ObserveDuration()
		// Use strconv.Itoa here: converting the int status code with string() would
		// produce a rune, not the text "200".
		httpRequestsTotal.WithLabelValues(c.Request.Method, c.FullPath(), strconv.Itoa(c.Writer.Status())).Inc()
	})

	// Expose metrics endpoint
	r.GET("/metrics", gin.WrapH(promhttp.Handler()))

	// ... rest of your routes ...

	r.Run(":8080")
}
This code sets up two metrics: http_requests_total, a counter of HTTP requests labeled by method, endpoint, and status code, and http_request_duration_seconds, a histogram of request latencies labeled by method and endpoint.
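With this in place, hitting the /metrics endpoint returns plain-text samples in the Prometheus exposition format. The values below are purely illustrative:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{endpoint="/orders",method="POST",status="200"} 42
# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{endpoint="/orders",method="POST",le="0.1"} 40
http_request_duration_seconds_sum{endpoint="/orders",method="POST"} 3.2
http_request_duration_seconds_count{endpoint="/orders",method="POST"} 42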
For more dynamic environments, Prometheus supports various service discovery mechanisms. For example, if you are running on Kubernetes, you can use the Kubernetes SD configuration:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
This configuration automatically discovers and scrapes metrics from pods that carry the appropriate prometheus.io/* annotations.
Prometheus stores its data in a time-series database on the local filesystem. Retention time and maximum storage size are configured with command-line flags rather than in prometheus.yml, so add them to the prometheus command in docker-compose.yml:

command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--storage.tsdb.path=/prometheus'
  - '--storage.tsdb.retention.time=15d'
  - '--storage.tsdb.retention.size=50GB'

This keeps 15 days of data and caps on-disk usage at 50GB; whichever limit is reached first triggers cleanup of the oldest data.
In the next section, we will dive into defining and implementing custom metrics for our order processing system.
Now that Prometheus is set up and basic HTTP metrics are in place, let's define and implement custom metrics specific to our order processing system.
When designing metrics, it is important to consider what insights we want from the system. For our order processing system, we might want to track:
- The rate at which orders are created
- How long orders take to process
- The number of orders in each status
- Payment success and failure counts
- How often inventory is updated
- How long it takes to arrange shipping
Let's implement these metrics:
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	OrdersCreated = promauto.NewCounter(prometheus.CounterOpts{
		Name: "orders_created_total",
		Help: "The total number of created orders",
	})

	OrderProcessingTime = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "order_processing_seconds",
		Help:    "Time taken to process an order",
		Buckets: prometheus.LinearBuckets(0, 30, 10), // 0-300 seconds, 30-second buckets
	})

	OrderStatusGauge = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "orders_by_status",
		Help: "Number of orders by status",
	}, []string{"status"})

	PaymentProcessed = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "payments_processed_total",
		Help: "The total number of processed payments",
	}, []string{"status"})

	InventoryUpdates = promauto.NewCounter(prometheus.CounterOpts{
		Name: "inventory_updates_total",
		Help: "The total number of inventory updates",
	})

	ShippingArrangementTime = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "shipping_arrangement_seconds",
		Help:    "Time taken to arrange shipping",
		Buckets: prometheus.LinearBuckets(0, 60, 5), // 0-300 seconds, 60-second buckets
	})
)
Now that the metrics are defined, let's use them in our services:
package main import ( "time" "github.com/yourusername/order-processing-system/metrics" ) func createOrder(order Order) error { startTime := time.Now() // Order creation logic... metrics.OrdersCreated.Inc() metrics.OrderProcessingTime.Observe(time.Since(startTime).Seconds()) metrics.OrderStatusGauge.WithLabelValues("pending").Inc() return nil } func processPayment(payment Payment) error { // Payment processing logic... if paymentSuccessful { metrics.PaymentProcessed.WithLabelValues("success").Inc() } else { metrics.PaymentProcessed.WithLabelValues("failure").Inc() } return nil } func updateInventory(item Item) error { // Inventory update logic... metrics.InventoryUpdates.Inc() return nil } func arrangeShipping(order Order) error { startTime := time.Now() // Shipping arrangement logic... metrics.ShippingArrangementTime.Observe(time.Since(startTime).Seconds()) return nil }
When naming and labeling metrics, keep these best practices in mind:
- Use base units (seconds, bytes) and include the unit in the metric name
- Suffix counters with _total
- Use labels for dimensions you will filter or aggregate by, and keep their value sets small
- Avoid high-cardinality labels such as user IDs or order IDs
- Keep names consistent; namespace_subsystem_name_unit is a common pattern
For API endpoints, we’ve already implemented basic instrumentation. For database operations, we can add metrics like this:
func (s *Store) GetOrder(ctx context.Context, id int64) (Order, error) {
	startTime := time.Now()
	defer func() {
		metrics.DBOperationDuration.WithLabelValues("GetOrder").Observe(time.Since(startTime).Seconds())
	}()

	// Existing GetOrder logic...
}
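DBOperationDuration is not part of the metrics package we defined above. A minimal sketch of how it might be declared is shown below; the metric name, label, and buckets are assumptions:

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// DBOperationDuration tracks how long each database operation takes,
// labeled by operation name.
var DBOperationDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "db_operation_duration_seconds",
	Help:    "Duration of database operations in seconds",
	Buckets: prometheus.DefBuckets,
}, []string{"operation"})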
For Temporal workflows, we can add metrics in our activity implementations:
func ProcessOrderActivity(ctx context.Context, order Order) error {
	startTime := time.Now()
	defer func() {
		metrics.WorkflowActivityDuration.WithLabelValues("ProcessOrder").Observe(time.Since(startTime).Seconds())
	}()

	// Existing ProcessOrder logic...
}
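Like DBOperationDuration above, WorkflowActivityDuration would live in the metrics package; the declaration below is a sketch with an assumed name and buckets:

// Added to the same metrics package as DBOperationDuration above.
var WorkflowActivityDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
	Name:    "workflow_activity_duration_seconds",
	Help:    "Duration of Temporal workflow activities in seconds",
	Buckets: prometheus.LinearBuckets(0, 30, 10), // 0-270 seconds in 30-second steps
}, []string{"activity"})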
Now that we have our metrics set up, let’s visualize them using Grafana.
First, let’s add Grafana to our docker-compose.yml:
services:
  # ... other services ...

  grafana:
    image: grafana/grafana:8.2.2
    ports:
      - 3000:3000
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  # ... other volumes ...
  grafana_data: {}
Let’s create a dashboard for our order processing system: log in to Grafana at http://localhost:3000 (the default credentials are admin/admin), add Prometheus as a data source pointing at http://prometheus:9090, and create a new dashboard.
For our first panel, let’s create a graph of the order creation rate: add a time series panel with the query rate(orders_created_total[5m]).
Let’s add another panel for order processing time: a good starting query is histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m])), which plots the 95th-percentile processing latency.
For order status distribution, a stat or pie chart panel over orders_by_status works well.
Continue adding panels for other metrics we’ve defined.
Grafana allows us to create variables that can be used across the dashboard. Let’s create a variable for the time range: in the dashboard settings, add a variable named time_range of type Interval with values such as 1m, 5m, 15m, 1h, and 6h.
Now we can use this in our queries like this: rate(orders_created_total[$time_range])
In the next section, we’ll set up alerting rules to notify us of potential issues in our system.
Now that we have our metrics and dashboards set up, let’s implement alerting to proactively notify us of potential issues in our system.
When designing alerts, consider the following principles:
- Alert on symptoms that affect users, not on every internal cause
- Make every alert actionable: it should be clear what to investigate or fix
- Use severity levels consistently so responders know how urgent an alert is
- Avoid alert fatigue by tuning thresholds and "for" durations
For our order processing system, we might want to alert on:
- A high order processing error rate
- Unusually slow order processing
- An abnormal order creation rate
- Low inventory levels
- A high payment failure rate
Let’s create an alerts.yml file in our Prometheus configuration directory:
groups:
- name: order_processing_alerts
  rules:
  - alert: HighOrderProcessingErrorRate
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: High order processing error rate
      description: "Error rate is {{ $value }} over the last 5 minutes"

  - alert: SlowOrderProcessing
    expr: histogram_quantile(0.95, rate(order_processing_seconds_bucket[5m])) > 300
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Slow order processing
      description: "95th percentile of order processing time is {{ $value }}s over the last 5 minutes"

  - alert: UnusualOrderRate
    expr: abs(rate(orders_created_total[1h]) - rate(orders_created_total[1h] offset 1d)) > (rate(orders_created_total[1h] offset 1d) * 0.3)
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: Unusual order creation rate
      description: "Order creation rate has changed by more than 30% compared to the same time yesterday"

  - alert: LowInventory
    expr: inventory_level < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Low inventory level
      description: "Inventory level is {{ $value }}"

  - alert: HighPaymentFailureRate
    expr: rate(payments_processed_total{status="failure"}[15m]) / rate(payments_processed_total[15m]) > 0.1
    for: 15m
    labels:
      severity: critical
    annotations:
      summary: High payment failure rate
      description: "Payment failure rate is {{ $value }} over the last 15 minutes"
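Note that two of these expressions reference series we have not instrumented yet: order_processing_errors_total and inventory_level. If you do not already expose them, a minimal sketch for the metrics package might look like this; the names mirror the rules above, and the label set is an assumption:

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Incremented whenever order processing fails; referenced by HighOrderProcessingErrorRate.
	OrderProcessingErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "order_processing_errors_total",
		Help: "The total number of order processing errors",
	})

	// Set to the current stock level; referenced by LowInventory. The product label is an assumption.
	InventoryLevel = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "inventory_level",
		Help: "Current inventory level per product",
	}, []string{"product"})
)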
Update your prometheus.yml to include this alerts file (you can sanity-check it first with promtool check rules alerts.yml):
rule_files: - "alerts.yml"
Now, let’s set up Alertmanager to handle our alerts. Add Alertmanager to your docker-compose.yml:
services:
  # ... other services ...

  alertmanager:
    image: prom/alertmanager:v0.23.0
    ports:
      - 9093:9093
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
Create an alertmanager.yml in the ./alertmanager directory:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'team@example.com'
    from: 'alertmanager@example.com'
    smarthost: 'smtp.example.com:587'
    auth_username: 'alertmanager@example.com'
    auth_identity: 'alertmanager@example.com'
    auth_password: 'password'
Update your prometheus.yml to point to Alertmanager:
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
In the Alertmanager configuration above, we’ve set up email notifications. You can also configure other channels like Slack, PagerDuty, or custom webhooks.
In our alerts, we’ve used severity labels. We can use these in Alertmanager to implement different routing or notification strategies based on severity:
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
  receiver: 'email-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'pagerduty-critical'
  - match:
      severity: warning
    receiver: 'slack-warnings'

receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'team@example.com'
- name: 'pagerduty-critical'
  pagerduty_configs:
  - service_key: '<your-pagerduty-service-key>'
- name: 'slack-warnings'
  slack_configs:
  - api_url: '<your-slack-webhook-url>'
    channel: '#alerts'
Monitoring database performance is crucial for maintaining a responsive and reliable system. Let’s set up monitoring for our PostgreSQL database.
First, add the Postgres exporter to your docker-compose.yml:
services:
  # ... other services ...

  postgres_exporter:
    image: wrouesnel/postgres_exporter:latest
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - 9187:9187
Make sure to replace user, password, and dbname with your actual PostgreSQL credentials. Note that the wrouesnel/postgres_exporter image is no longer actively maintained; the prometheus-community postgres-exporter image is its maintained successor if you want a current build.
Some important PostgreSQL metrics to monitor include:
- Number of active connections
- Cache hit ratio
- Transaction rate (commits and rollbacks)
- Slow or long-running queries
- Table and index sizes
- Replication lag, if you run replicas
Let’s create a new dashboard for database performance: good starting panels include active connections, cache hit ratio, and transactions per second, all built from the pg_stat_* series exposed by the exporter.
Let’s add some database-specific alerts to our alerts.yml:
- alert: HighDatabaseConnections
  expr: pg_stat_activity_count > 100
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: High number of database connections
    description: "There are {{ $value }} active database connections"

- alert: LowCacheHitRatio
  expr: pg_stat_database_blks_hit / (pg_stat_database_blks_hit + pg_stat_database_blks_read) < 0.9
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: Low database cache hit ratio
    description: "Cache hit ratio is {{ $value }}"
Monitoring Temporal workflows is essential for ensuring the reliability and performance of our order processing system.
Temporal provides a metrics client that we can use to expose metrics to Prometheus. Let’s update our Temporal worker to include metrics:
import ( "go.temporal.io/sdk/client" "go.temporal.io/sdk/worker" "go.temporal.io/sdk/contrib/prometheus" ) func main() { // ... other setup ... // Create Prometheus metrics handler metricsHandler := prometheus.NewPrometheusMetricsHandler() // Create Temporal client with metrics c, err := client.NewClient(client.Options{ MetricsHandler: metricsHandler, }) if err != nil { log.Fatalln("Unable to create Temporal client", err) } defer c.Close() // Create worker with metrics w := worker.New(c, "order-processing-task-queue", worker.Options{ MetricsHandler: metricsHandler, }) // ... register workflows and activities ... // Run the worker err = w.Run(worker.InterruptCh()) if err != nil { log.Fatalln("Unable to start worker", err) } }
Important Temporal metrics to monitor include:
- Workflow start, completion, and failure counts
- Workflow execution latency
- Activity success and failure rates
- Activity schedule-to-start latency (a sign of worker saturation)
- Task queue backlog
Let’s create a dashboard for Temporal workflows: useful panels include workflow completion and failure rates, workflow execution latency percentiles, and activity failure counts.
Let’s add some Temporal-specific alerts to our alerts.yml:
- alert: HighWorkflowFailureRate
  expr: rate(temporal_workflow_failed_total[15m]) / rate(temporal_workflow_completed_total[15m]) > 0.05
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: High workflow failure rate
    description: "Workflow failure rate is {{ $value }} over the last 15 minutes"

- alert: LongRunningWorkflow
  expr: histogram_quantile(0.95, rate(temporal_workflow_execution_time_bucket[1h])) > 3600
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: Long-running workflows detected
    description: "95th percentile of workflow execution time is over 1 hour"
These alerts will help you detect issues with your Temporal workflows, such as high failure rates or unexpectedly long-running workflows.
In the next sections, we’ll cover some advanced Prometheus techniques and discuss testing and validation of our monitoring setup.
As our monitoring system grows more complex, we can leverage some advanced Prometheus techniques to improve its efficiency and capabilities.
Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their result as a new set of time series. This can significantly speed up the evaluation of dashboards and alerts.
Let’s add some recording rules to our Prometheus configuration. Create a rules.yml file:
groups:
- name: example_recording_rules
  interval: 5m
  rules:
  - record: job:order_processing_rate:5m
    expr: rate(orders_created_total[5m])

  - record: job:order_processing_error_rate:5m
    expr: rate(order_processing_errors_total[5m]) / rate(orders_created_total[5m])

  - record: job:payment_success_rate:5m
    expr: rate(payments_processed_total{status="success"}[5m]) / rate(payments_processed_total[5m])
Add this file to your Prometheus configuration:
rule_files: - "alerts.yml" - "rules.yml"
Now you can use these precomputed metrics in your dashboards and alerts, which is especially helpful for complex queries you run frequently; for example, the HighOrderProcessingErrorRate alert could simply use job:order_processing_error_rate:5m > 0.05.
The Pushgateway allows you to push metrics from jobs that can’t be scraped, such as batch jobs or serverless functions. Let’s add a Pushgateway to our docker-compose.yml:
services:
  # ... other services ...

  pushgateway:
    image: prom/pushgateway
    ports:
      - 9091:9091
Now, you can push metrics to the Pushgateway from your batch jobs or short-lived processes. Here’s an example using the Go client:
import ( "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/push" ) func runBatchJob() { // Define a counter for the batch job batchJobCounter := prometheus.NewCounter(prometheus.CounterOpts{ Name: "batch_job_processed_total", Help: "Total number of items processed by the batch job", }) // Run your batch job and update the counter // ... // Push the metric to the Pushgateway pusher := push.New("http://pushgateway:9091", "batch_job") pusher.Collector(batchJobCounter) if err := pusher.Push(); err != nil { log.Printf("Could not push to Pushgateway: %v", err) } }
Don’t forget to add the Pushgateway as a target in your Prometheus configuration:
scrape_configs:
  # ... other configs ...

  - job_name: 'pushgateway'
    static_configs:
      - targets: ['pushgateway:9091']
For large-scale systems, you might need to set up Prometheus federation, where one Prometheus server scrapes data from other Prometheus servers. This allows you to aggregate metrics from multiple Prometheus instances.
Here’s an example configuration for a federated Prometheus setup:
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="order_processing_api"}'
        - '{job="postgres_exporter"}'
    static_configs:
      - targets:
          - 'prometheus-1:9090'
          - 'prometheus-2:9090'
This configuration allows a higher-level Prometheus server to scrape specific metrics from other Prometheus servers.
Exemplars allow you to link metrics to trace data, providing a way to drill down from a high-level metric to a specific trace. This is particularly useful when integrating Prometheus with distributed tracing systems like Jaeger or Zipkin.
To use exemplars, you need to enable exemplar storage, which in Prometheus is a feature flag passed on the command line rather than a prometheus.yml setting. In our docker-compose.yml, add it to the prometheus command:

command:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--enable-feature=exemplar-storage'
Then, when instrumenting your code, you can add exemplars to your metrics:
import ( "github.com/prometheus/client_golang/prometheus" "github.com/prometheus/client_golang/prometheus/promauto" ) var ( orderProcessingDuration = promauto.NewHistogramVec( prometheus.HistogramOpts{ Name: "order_processing_duration_seconds", Help: "Duration of order processing in seconds", Buckets: prometheus.DefBuckets, }, []string{"status"}, ) ) func processOrder(order Order) { start := time.Now() // Process the order... duration := time.Since(start) orderProcessingDuration.WithLabelValues(order.Status).Observe(duration.Seconds(), prometheus.Labels{ "traceID": getCurrentTraceID(), }, ) }
This allows you to link from a spike in order processing duration directly to the trace of a slow order, greatly aiding in debugging and performance analysis.
Ensuring the reliability of your monitoring system is crucial. Let’s explore some strategies for testing and validating our Prometheus setup.
When unit testing your Go code, you can use the prometheus/testutil package to verify that your metrics are being updated correctly:
import ( "testing" "github.com/prometheus/client_golang/prometheus/testutil" ) func TestOrderProcessing(t *testing.T) { // Process an order processOrder(Order{ID: 1, Status: "completed"}) // Check if the metric was updated expected := ` # HELP order_processing_duration_seconds Duration of order processing in seconds # TYPE order_processing_duration_seconds histogram order_processing_duration_seconds_bucket{status="completed",le="0.005"} 1 order_processing_duration_seconds_bucket{status="completed",le="0.01"} 1 # ... other buckets ... order_processing_duration_seconds_sum{status="completed"} 0.001 order_processing_duration_seconds_count{status="completed"} 1 ` if err := testutil.CollectAndCompare(orderProcessingDuration, strings.NewReader(expected)); err != nil { t.Errorf("unexpected collecting result:\n%s", err) } }
To test that Prometheus is correctly scraping your metrics, you can set up an integration test that starts your application, waits for Prometheus to scrape it, and then queries Prometheus to verify the metrics:
import (
	"context"
	"testing"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func TestPrometheusIntegration(t *testing.T) {
	// Start your application
	go startApp()

	// Wait for Prometheus to scrape (adjust the sleep time as needed)
	time.Sleep(30 * time.Second)

	// Query Prometheus
	client, err := api.NewClient(api.Config{
		Address: "http://localhost:9090",
	})
	if err != nil {
		t.Fatalf("Error creating client: %v", err)
	}

	v1api := v1.NewAPI(client)
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := v1api.Query(ctx, "order_processing_duration_seconds_count", time.Now())
	if err != nil {
		t.Fatalf("Error querying Prometheus: %v", err)
	}
	if len(warnings) > 0 {
		t.Logf("Warnings: %v", warnings)
	}

	// Check the result
	if result.(model.Vector).Len() == 0 {
		t.Errorf("Expected non-empty result")
	}
}
It’s important to verify that your monitoring system performs well under load. You can use tools like hey or vegeta to generate load on your system while observing your metrics:
hey -n 10000 -c 100 http://localhost:8080/orders
While the load test is running, observe your Grafana dashboards and check that your metrics are updating as expected and that Prometheus is able to keep up with the increased load.
To test your alerting rules, you can temporarily adjust the thresholds to trigger alerts, or post a synthetic alert directly to the Alertmanager API:
curl -H "Content-Type: application/json" -d '{ "alerts": [ { "labels": { "alertname": "HighOrderProcessingErrorRate", "severity": "critical" }, "annotations": { "summary": "High order processing error rate" } } ] }' http://localhost:9093/api/v1/alerts
This will send a test alert to your Alertmanager, allowing you to verify that your notification channels are working correctly.
As you implement and scale your monitoring system, keep these challenges and considerations in mind:
High cardinality can lead to performance issues in Prometheus. Be cautious when adding labels to metrics, especially labels with many possible values (like user IDs or IP addresses). Instead, consider using histogram metrics or reducing the cardinality by grouping similar values.
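As a hedged illustration of the difference, the sketch below contrasts an unbounded label with a bounded one; the metric names and labels are made up for the example:

package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// High cardinality (avoid): user_id is unbounded, so every user creates its own series.
	CheckoutRequestsByUser = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "checkout_requests_total",
		Help: "Checkout requests, labeled by user ID (high cardinality, avoid)",
	}, []string{"user_id"})

	// Bounded cardinality (prefer): label by a small, fixed set of values instead.
	CheckoutRequestsByChannel = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "checkout_requests_by_channel_total",
		Help: "Checkout requests, labeled by sales channel",
	}, []string{"channel"})
)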
For large-scale systems, consider:
- Prometheus federation to aggregate metrics from multiple instances
- Remote-write to long-term storage solutions such as Thanos, Cortex, or Mimir
- Sharding scrape targets across several Prometheus servers
Your monitoring system is critical infrastructure. Consider:
- Running Prometheus and Alertmanager in a highly available configuration
- Monitoring the monitoring stack itself, so an outage of Prometheus or Alertmanager is noticed
Ensure that:
- Scrape configs, rules, and Alertmanager routes live in version control
- Prometheus data and Grafana dashboards are backed up
To reduce alert noise:
- Tune thresholds and "for" durations so alerts only fire on sustained problems
- Group related alerts in Alertmanager and use inhibition where one failure explains others
- Regularly review firing alerts and remove or refine the ones nobody acts on
In this post, we covered comprehensive monitoring and alerting for our order processing system using Prometheus and Grafana. We set up custom metrics, created informative dashboards, implemented alerting, and explored advanced techniques and considerations.
In the next part of our series, we will focus on distributed tracing and logging. We will cover:
- Implementing distributed tracing across our services
- Setting up centralized, structured logging
- Correlating logs, traces, and metrics to debug issues faster
Stay tuned as we continue to enhance our order processing system, next with a deeper look into the behavior and performance of our distributed system!
Are you facing a challenging problem, or do you need an outside perspective on a new idea or project? I can help! Whether you want to build a technical proof of concept before making a larger investment, or you need guidance on a difficult problem, I'm here for you.
If you're interested in working with me, reach out by email at hungaikevin@gmail.com.
Let's turn your challenges into opportunities!