# Monitoring & Observability

## Overview
A comprehensive monitoring stack deployed on GKE for platform observability, featuring Prometheus for metrics collection and Grafana for visualization. The primary focus is monitoring the performance and health of our TigerBeetle financial ledger.
## Architecture

```
TigerBeetle → StatsD (8125) → StatsD Exporter → Prometheus → Grafana
```

TigerBeetle emits application metrics over the StatsD protocol on port 8125; the StatsD Exporter converts them to Prometheus format, Prometheus stores the time series, and Grafana provides the dashboards.
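To see the components of this pipeline running in the cluster, list the pods in their namespaces (namespace names are given in the Components section below):

```
# StatsD Exporter (metric conversion)
kubectl get pods -n observability

# Prometheus and Grafana (storage and dashboards)
kubectl get pods -n monitoring
```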
## Components

### Grafana
- URL: http://34.172.102.114
- Version: 7.0.8
- Access: LoadBalancer service
- Credentials: admin / [Check Secret Manager]
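If the admin password is kept in Secret Manager, it can be retrieved with `gcloud`; the secret name below is a hypothetical placeholder, so list the project's secrets first to find the real one:

```
# Find the Grafana credential (the exact secret name is an assumption)
gcloud secrets list --filter="name~grafana"

# Read the latest version of the assumed secret
gcloud secrets versions access latest --secret=grafana-admin-password
```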
### Prometheus
- Version: 25.8.0
- Storage: 50GB with 15-day retention
- Access: Internal only (prometheus-server.monitoring:80)
- Scrape Interval: 15s
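Because Prometheus is only reachable inside the cluster, a throwaway curl pod (image choice is an assumption) is a quick way to confirm the internal service responds:

```
# One-off pod that hits the internal Prometheus health endpoint and exits
kubectl run prom-check --rm -i --restart=Never --image=curlimages/curl -- \
  curl -s http://prometheus-server.monitoring:80/-/healthy
```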
### StatsD Exporter
- Port: 8125 (UDP/TCP)
- Namespace: observability
- Purpose: Converts StatsD metrics to Prometheus format
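As a rough sketch of the conversion (the exporter's service name is an assumption, and 9102 is the statsd_exporter default web port): a StatsD datagram sent to 8125 shows up on the exporter's `/metrics` endpoint in Prometheus format, which Prometheus then scrapes.

```
# Terminal 1: forward the exporter's StatsD (TCP) and web ports locally
kubectl port-forward -n observability svc/statsd-exporter 8125:8125 9102:9102

# Terminal 2: send a test counter and look for it in Prometheus format
echo "smoke.test:1|c" | nc -w1 127.0.0.1 8125
curl -s http://127.0.0.1:9102/metrics | grep smoke_test
```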
## Dashboards

### TigerBeetle Financial Ledger Dashboard
Real-time monitoring of our financial ledger (example panel queries are sketched after this list):
- Transactions Per Second (TPS)
- Request rates and latencies
- Storage I/O operations (IOPS)
- Database operations
- Active replica status
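The PromQL behind some of these panels could look roughly like the following; the TPS and replica-count queries match the examples under Accessing Metrics, while the IOPS expression is an assumption based on the tb_* metric names listed in the next section:

```
# Transactions per second
sum(rate(tb_replica_commit_us_count[1m]))

# Storage IOPS (reads + writes)
sum(rate(tb_storage_read_us_count[1m])) + sum(rate(tb_storage_write_us_count[1m]))

# Active replicas
count(count by (replica) (tb_replica_commit_us_count))
```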
## Available Metrics

### Application Metrics (tb_*)

```
# Transaction metrics
tb_replica_commit_us_*
tb_replica_request_us_*

# Storage metrics
tb_storage_read_us_*
tb_storage_write_us_*

# Database operations
tb_lookup_us_*
tb_scan_tree_us_*
tb_compact_mutable_suffix_us_*
```
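The `_us` metrics are latency timings in microseconds. Assuming they are exported as Prometheus histograms (i.e. `_bucket`/`_count`/`_sum` series exist; only `_count` is confirmed by the queries later on this page), a latency percentile can be derived like this:

```
# p99 commit latency in milliseconds over the last 5 minutes
histogram_quantile(0.99, sum by (le) (rate(tb_replica_commit_us_bucket[5m]))) / 1000
```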
### Infrastructure Metrics

```
# Kubernetes metrics
container_cpu_usage_seconds_total
container_memory_usage_bytes
kube_pod_status_phase
kube_node_status_condition

# GCP metrics (via Cloud Monitoring)
kubernetes.io/container/cpu/core_usage_time
kubernetes.io/container/memory/used_bytes
```
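For example, resource usage of the TigerBeetle pods (the `tigerbeetle` namespace is the one used in the kubectl examples below) can be charted with queries like:

```
# CPU cores used per TigerBeetle pod
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="tigerbeetle"}[5m]))

# Memory used per TigerBeetle pod
sum by (pod) (container_memory_usage_bytes{namespace="tigerbeetle"})
```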
## Accessing Metrics

### Grafana Web UI

```
# Direct access
http://34.172.102.114

# Dashboard: TigerBeetle Performance Dashboard
```
### Prometheus Queries

```
# Port-forward for local access
kubectl port-forward -n monitoring svc/prometheus-server 9090:80

# Example queries
sum(rate(tb_replica_commit_us_count[1m]))                # TPS
count(count by (replica) (tb_replica_commit_us_count))   # Active replicas
```
### kubectl Metrics

```
# Pod metrics
kubectl top pods -n tigerbeetle

# Node metrics
kubectl top nodes
```
## Deployment

### Deploy Monitoring Stack

```
# Run deployment script
./infrastructure/scripts/deploy-monitoring.sh

# This will:
# 1. Install Prometheus via Helm
# 2. Install Grafana via Helm
# 3. Configure datasources
# 4. Import dashboards
```
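The script wraps standard Helm installs. Run by hand, the equivalent steps would look roughly like this; the chart repositories are the public prometheus-community and grafana repos, while the values files are assumptions about what the script passes:

```
# Add chart repositories
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install into the monitoring namespace (values files are hypothetical paths)
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  --values infrastructure/kubernetes/monitoring/prometheus-values.yaml
helm install grafana grafana/grafana \
  --namespace monitoring \
  --values infrastructure/kubernetes/monitoring/grafana-values.yaml
```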
### Update Dashboard

```
# Edit dashboard
vim infrastructure/kubernetes/monitoring/tigerbeetle-dashboard.json

# Update ConfigMap
kubectl delete configmap grafana-dashboard-tigerbeetle -n monitoring
kubectl create configmap grafana-dashboard-tigerbeetle \
  --from-file=infrastructure/kubernetes/monitoring/tigerbeetle-dashboard.json \
  --namespace monitoring

# Label for auto-discovery
kubectl label configmap grafana-dashboard-tigerbeetle \
  grafana_dashboard=1 -n monitoring

# Restart Grafana
kubectl rollout restart deployment/grafana -n monitoring
```
## Alerting (Planned)

### Alert Rules
- TigerBeetle replica down
- High transaction latency (> 100ms)
- Storage usage > 80%
- Pod restart rate > 5/hour
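When alerting is implemented, the replica-down rule could be expressed against the metrics already collected. A minimal sketch, assuming a three-replica cluster and a local rule file (both assumptions):

```
# Hypothetical Prometheus alerting rule: fires when fewer than the expected
# number of TigerBeetle replicas are reporting commit metrics.
cat <<'EOF' > tigerbeetle-alerts.yaml
groups:
  - name: tigerbeetle
    rules:
      - alert: TigerBeetleReplicaDown
        expr: count(count by (replica) (tb_replica_commit_us_count)) < 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "A TigerBeetle replica has stopped reporting metrics"
EOF
```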
### Notification Channels
- Slack integration
- PagerDuty for critical alerts
- Email notifications
## Troubleshooting

### No Data in Dashboard
- Check that TigerBeetle is emitting StatsD metrics on port 8125 (checks for each step are sketched after this list)
- Verify the StatsD Exporter is receiving and translating them
- Confirm Prometheus is scraping the exporter
- Check the dashboard time range
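Each step can be checked from the command line using the port-forwards shown earlier; the exporter's service name is an assumption:

```
# Is the exporter receiving and translating tb_* metrics?
kubectl port-forward -n observability svc/statsd-exporter 9102:9102 &
curl -s http://127.0.0.1:9102/metrics | grep '^tb_' | head

# Is Prometheus scraping the exporter and ingesting tb_* series?
kubectl port-forward -n monitoring svc/prometheus-server 9090:80 &
curl -s http://127.0.0.1:9090/api/v1/targets | grep -i statsd
curl -sg 'http://127.0.0.1:9090/api/v1/query?query=count({__name__=~"tb_.+"})'
```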
### High Memory Usage
- Review retention settings
- Check cardinality of metrics
- Consider downsampling
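Helpful queries while investigating, run against the port-forwarded Prometheus from the sections above (the first is a standard Prometheus self-metric, the second a rough cardinality breakdown):

```
# Total active series held in memory
prometheus_tsdb_head_series

# Ten highest-cardinality tb_* metric names
topk(10, count by (__name__) ({__name__=~"tb_.+"}))
```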
## Cost

| Component | Resource | Monthly Cost |
|---|---|---|
| Prometheus Storage | 50GB | ~$8 |
| Grafana Storage | 10GB | ~$2 |
| LoadBalancer | 1 × TCP | ~$20 |
| **Total** | | **~$30/month** |
## Future Enhancements
- Long-term storage in Cloud Storage
- Custom alerts for business metrics
- Integration with Cloud Monitoring
- Distributed tracing with Jaeger
- Log aggregation with Loki