Monitoring & Observability
The Earna AI platform runs its monitoring stack in the observability namespace.
The stack reports on system performance, service health, and business metrics in real time.
Architecture Overview
Our monitoring infrastructure consists of:
- Prometheus: Time-series metrics collection and alerting
- Grafana: Visualization dashboards and analytics
- Istio Service Mesh: Distributed tracing and traffic monitoring
- Google Cloud Monitoring: Cloud infrastructure metrics integration
Prometheus Configuration
Metrics Collection
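Prometheus scrapes metrics from multiple sources. The global settings below set a 15-second scrape and rule-evaluation interval and label outgoing data with the cluster and environment: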
# Scraping configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'platform-production'
    environment: 'production'
Monitored Services
- Istio Service Mesh
  - Telemetry data from all mesh services
  - Request latency, throughput, and error rates
  - Circuit breaker status and health checks
- Application Pods
  - Custom application metrics via the /metrics endpoint
  - Kubernetes pod annotations for service discovery
  - Health check and readiness probe metrics
- Envoy Proxy Statistics
  - Connection pool metrics
  - Load balancing statistics
  - HTTP/gRPC request metrics
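Pod-annotation discovery of this kind is commonly implemented with a Kubernetes service-discovery job plus relabeling. The fragment below is a sketch, not the platform's actual scrape job: it assumes the conventional prometheus.io/scrape, prometheus.io/path, and prometheus.io/port pod annotations, and the job name is hypothetical.

```yaml
scrape_configs:
  - job_name: kubernetes-pods          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                      # discover every pod in the cluster
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Use prometheus.io/path as the metrics path when it is set
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the scrape address to use the port from prometheus.io/port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: $1:$2
        target_label: __address__
```

Pods without the scrape annotation are dropped by the first rule, so services opt in to monitoring instead of being scraped by default.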
Alerting Rules
Latency Monitoring
- HighLatency: 99th percentile request latency > 500ms (Warning)
- VeryHighLatency: 99th percentile request latency > 1000ms (Critical)
Traffic Analysis
- HighErrorRate: 5xx responses exceed 1% of requests for 5 minutes
- TrafficSpike: request rate is 2x that of the previous hour
Service Health
- CircuitBreakerOpen: Fires when a circuit breaker opens
- Service availability and dependency health checks
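As an illustration, the latency, error-rate, and traffic thresholds above could be written as Prometheus alerting rules along these lines. This is a sketch under stated assumptions: it uses Istio's standard metrics (istio_request_duration_milliseconds_bucket and istio_requests_total), and the group name, the `for` durations on the latency and traffic rules, and the severity labels on HighErrorRate and TrafficSpike are illustrative choices the source does not specify.

```yaml
groups:
  - name: mesh-latency-and-traffic     # hypothetical group name
    rules:
      - alert: HighLatency
        # p99 request latency per destination service, in milliseconds
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, destination_service) (
              rate(istio_request_duration_milliseconds_bucket[5m])
            )
          ) > 500
        for: 5m                        # assumed; not stated in the source
        labels:
          severity: warning
      - alert: HighErrorRate
        # share of 5xx responses among all requests
        expr: |
          sum by (destination_service) (rate(istio_requests_total{response_code=~"5.."}[5m]))
            /
          sum by (destination_service) (rate(istio_requests_total[5m]))
            > 0.01
        for: 5m
        labels:
          severity: critical           # assumed severity
      - alert: TrafficSpike
        # current request rate compared with the rate one hour earlier
        expr: |
          sum(rate(istio_requests_total[5m]))
            > 2 * sum(rate(istio_requests_total[5m] offset 1h))
        for: 5m                        # assumed; not stated in the source
        labels:
          severity: warning            # assumed severity
```

A VeryHighLatency rule would follow the HighLatency pattern with a threshold of 1000 and `severity: critical`.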
Grafana Dashboards
Data Sources
- Primary: Prometheus server at prometheus-server.monitoring.svc.cluster.local
- Secondary: Google Cloud Monitoring for infrastructure metrics
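A Grafana data source provisioning file for these two sources might look roughly like the following. It is a sketch, not the deployed configuration: the http scheme and default port for the Prometheus URL are assumptions, and so is the use of GCE default credentials for Google Cloud Monitoring.

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                      # Grafana's backend issues the queries
    # Assumes the service listens on plain HTTP on the default port
    url: http://prometheus-server.monitoring.svc.cluster.local
    isDefault: true
  - name: Google Cloud Monitoring
    type: stackdriver                  # Grafana's plugin ID for Google Cloud Monitoring
    access: proxy
    jsonData:
      authenticationType: gce          # assumed: use the node's GCE service account
```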
Dashboard Configuration
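# Pre-configured dashboards
dashboards:
  default:
    tigerbeetle-dashboard:
      # gnetId is the grafana.com dashboard ID to import
      # (ID 1860 is published there as "Node Exporter Full")
      gnetId: 1860
      datasource: Prometheus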
Key Metrics Visualized
- TigerBeetle Financial Ledger
  - Transaction throughput and latency
  - Account balance accuracy verification
  - Database performance metrics
- Service Mesh Traffic
  - Request success rates across services
  - Inter-service communication patterns
  - Load distribution visualization
- Infrastructure Health
  - Node resource utilization
  - Pod memory and CPU consumption
  - Storage I/O performance
Access & Authentication
Grafana Access
- Service Type: LoadBalancer with external IP
- Authentication: Admin password stored in Kubernetes secrets
- Persistence: 10Gi volume on the premium-rwo (SSD) storage class for dashboard retention
Security Configuration
# Persistent storage for Grafana
persistence:
  enabled: true
  size: 10Gi
  storageClassName: premium-rwo
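The service type and admin credential settings listed under Grafana Access could sit alongside the persistence block in the same values file. The sketch below assumes the community Grafana Helm chart, which the `dashboards` and `persistence` keys suggest but the source does not confirm. The Secret name is hypothetical.

```yaml
# Expose Grafana through a cloud load balancer with an external IP
service:
  type: LoadBalancer

# Read the admin credentials from an existing Kubernetes Secret
admin:
  existingSecret: grafana-admin-credentials   # hypothetical Secret name
  userKey: admin-user                         # key holding the username
  passwordKey: admin-password                 # key holding the password
```

Referencing an existing Secret keeps the password out of the values file and out of version control.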
Deployment Architecture
Namespace Organization
- observability: Monitoring infrastructure
- istio-system: Service mesh components
- tigerbeetle: Financial ledger monitoring
- temporal: Workflow engine metrics
High Availability
- Prometheus server with persistent storage
- Grafana with LoadBalancer service
- Alertmanager for notification routing
- Service mesh redundancy across nodes
Integration Points
TigerBeetle Monitoring
- Custom metrics for transaction validation
- Account balance integrity checks
- Performance monitoring for high-frequency trading
Temporal Workflows
- Workflow execution success rates
- Task queue depth and processing times
- Activity retry and failure tracking
API Gateway Metrics
- Request routing and load balancing
- Authentication success rates
- Rate limiting enforcement
Best Practices
Metric Collection
- Annotation-based Discovery
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
- Service Level Objectives
  - Define SLIs (Service Level Indicators)
  - Monitor SLO compliance
  - Alert on SLA threshold breaches
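One way to turn the SLO guidance into configuration is a recording rule that computes an availability SLI, plus an alert on the target. This is a sketch only: it assumes Istio's istio_requests_total metric as the SLI source, and the rule names, the 99.9% target, and the 10-minute window are hypothetical values chosen for illustration.

```yaml
groups:
  - name: slo-availability             # hypothetical group name
    rules:
      # SLI: fraction of requests that did not return a 5xx response
      - record: destination_service:istio_requests:success_ratio_rate5m
        expr: |
          sum by (destination_service) (rate(istio_requests_total{response_code!~"5.."}[5m]))
            /
          sum by (destination_service) (rate(istio_requests_total[5m]))
      # Alert when the SLI stays below the assumed 99.9% objective
      - alert: AvailabilitySLOBreach   # hypothetical alert name
        expr: destination_service:istio_requests:success_ratio_rate5m < 0.999
        for: 10m                       # assumed window
        labels:
          severity: critical
```

Recording the SLI once lets dashboards and alerts share the same definition instead of repeating the query.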
Dashboard Design
- Focus on business metrics over technical metrics
- Use consistent color schemes and naming
- Implement drill-down capabilities for troubleshooting
Alert Management
- Prioritize alerts by business impact
- Reduce alert fatigue by grouping related alerts and limiting repeat notifications
- Use runbook links for incident response
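These practices map onto Alertmanager routing in a fairly direct way. The configuration below is a minimal sketch: the receiver names, grouping labels, and timing values are assumptions, and the receivers are left without notification integrations because the source does not name any.

```yaml
route:
  receiver: default                    # fallback for everything not matched below
  # Group related alerts into one notification to reduce noise
  group_by: [alertname, destination_service]
  group_wait: 30s                      # wait briefly to batch alerts that fire together
  group_interval: 5m                   # minimum gap between updates for a group
  repeat_interval: 4h                  # re-notify for unresolved alerts at most this often
  routes:
    # Send the highest-impact alerts to a dedicated receiver
    - matchers:
        - severity="critical"
      receiver: pager

receivers:
  - name: default                      # hypothetical; add notification settings here
  - name: pager                        # hypothetical; add notification settings here
```

Runbook links are usually carried as an annotation on each alerting rule (a `runbook_url` annotation is a common convention), so every notification includes the link.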