How to create and manage runbooks in Relvy for automated debugging
- Planner, which orchestrates the overall debugging strategy.
- Data Source Agents for logs, metrics (dashboards), events, and traces.

Title | General Debugging Instructions |
When to Use | When debugging any incident |
Instructions | 1. Check the RED metrics dashboard as a starting point for most investigations. 2. For user-facing issues, check the frontend service logs and metrics. 3. To check recent deployments, filter events by @source:kubernetes and look for pod restarts, scaling, or service deployments. 4. Check the Runtime metrics dashboard for CPU/memory utilization. 5. To locate traces from logs, use otel.trace_id and otel.span_id. |
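Step 5 above relies on log records carrying trace identifiers. As a minimal sketch, assuming logs are emitted as JSON lines with flat `otel.trace_id` and `otel.span_id` keys (the function name `trace_ref_from_log` and the log layout are hypothetical, not part of Relvy), the identifiers could be pulled out like this:

```python
import json
from typing import Optional, Tuple


def trace_ref_from_log(line: str) -> Optional[Tuple[str, str]]:
    """Return (trace_id, span_id) from a JSON log line, or None if either is missing.

    Assumption: the log line is a JSON object with flat keys
    "otel.trace_id" and "otel.span_id".
    """
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # not a JSON log line
    if not isinstance(record, dict):
        return None
    trace_id = record.get("otel.trace_id")
    span_id = record.get("otel.span_id")
    if not trace_id or not span_id:
        return None
    return str(trace_id), str(span_id)
```

The returned pair can then be used to look up the matching trace and span in whichever tracing backend is connected.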
Title | Latency in Core APIs |
When to Use | When alerts mention increased latency in APIs |
Instructions | 1. Begin with the API service metrics dashboard. 2. Compare current latency to the 1h and 24h baselines. 3. Check for saturation in DB or cache services connected to the API. 4. Investigate traces for slow spans and their associated service calls. 5. Review logs for errors or warnings in the same time range. |
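The baseline comparison in step 2 can be illustrated with a small sketch. The function name, the use of a single latency figure per window, and the 1.5x threshold are all assumptions chosen for illustration; the runbook itself does not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class BaselineComparison:
    ratio_1h: float    # current latency divided by the 1h baseline
    ratio_24h: float   # current latency divided by the 24h baseline
    regressed: bool    # True if either ratio meets the threshold


def compare_to_baselines(
    current_ms: float,
    baseline_1h_ms: float,
    baseline_24h_ms: float,
    threshold: float = 1.5,  # hypothetical cutoff, tune per service
) -> BaselineComparison:
    """Compare current latency against 1h and 24h baselines."""
    if baseline_1h_ms <= 0 or baseline_24h_ms <= 0:
        raise ValueError("baselines must be positive")
    ratio_1h = current_ms / baseline_1h_ms
    ratio_24h = current_ms / baseline_24h_ms
    regressed = ratio_1h >= threshold or ratio_24h >= threshold
    return BaselineComparison(ratio_1h, ratio_24h, regressed)
```

Checking both windows helps separate a sudden spike (high against 1h) from a slow drift that only shows against the 24h baseline.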
Title | Debugging Kafka issues |
When to Use | When debugging a Kafka-related alert |
Instructions | 1. Look at the Kafka dashboard for the appropriate topic / consumer group and identify the specific affected partitions. 2. Check whether any consumer pods are down. 3. Check logs for the consumer and producer services. 4. For lag issues, check whether the lag is caused by a traffic surge. 5. Finally, check the Kafka infra metrics dashboards for issues with Kafka itself. |
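For step 4, one rough way to reason about whether lag comes from a traffic surge is to compare produce and consume rates against a normal produce rate. The sketch below is a heuristic under stated assumptions: the function name, the rate inputs (messages per second, read from whatever dashboard is available), and the 1.5x surge factor are hypothetical and not taken from Relvy or Kafka.

```python
def classify_lag_cause(
    produce_rate: float,
    consume_rate: float,
    baseline_produce_rate: float,
    surge_factor: float = 1.5,  # hypothetical: how far above baseline counts as a surge
) -> str:
    """Roughly classify why consumer lag is (or is not) growing."""
    if baseline_produce_rate <= 0:
        raise ValueError("baseline_produce_rate must be positive")
    if consume_rate >= produce_rate:
        # Consumers are keeping up, so lag should be flat or shrinking.
        return "lag_not_growing"
    if produce_rate >= surge_factor * baseline_produce_rate:
        # Producers are well above normal volume.
        return "traffic_surge"
    # Normal inbound volume, but consumers are falling behind.
    return "consumer_slowdown"
```

A `consumer_slowdown` result points back to steps 2 and 3 (consumer pods and logs), while `traffic_surge` suggests looking at upstream traffic first.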
Title | Application and System Description |
When to Use | When debugging all incidents |
Instructions | This is an ecommerce application (Astronomy Shop). The critical services are: accounting, ad, cart, checkout, currency, frontend, payment, product-catalog, quote, recommendation, shipping. |
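Every runbook above shares the same three fields: Title, When to Use, and Instructions. As a sketch of how such entries could be kept in version control and rendered back to this layout, the following uses a hypothetical `Runbook` dataclass; it is an illustration only and does not reflect Relvy's actual storage format or API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Runbook:
    """Hypothetical in-memory form of one runbook entry."""
    title: str
    when_to_use: str
    instructions: str

    def to_rows(self) -> List[str]:
        """Render the runbook as the three pipe-delimited rows used above."""
        return [
            f"Title | {self.title} |",
            f"When to Use | {self.when_to_use} |",
            f"Instructions | {self.instructions} |",
        ]


kafka_runbook = Runbook(
    title="Debugging Kafka issues",
    when_to_use="When debugging a Kafka-related alert",
    instructions="1. Look at the Kafka dashboard for the affected topic.",
)
```

Calling `kafka_runbook.to_rows()` yields the three rows in the same order as the tables in this page.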