Runbooks to Guide Debugging
How to create and manage runbooks in Relvy for automated debugging
Create Runbooks to Guide Automated Debugging
Runbooks in Relvy do more than document investigation steps — they guide the behavior of Relvy’s AI agent during real-time incident analysis. When incidents occur, Relvy’s AI-based investigation engine uses your runbooks as strategic instructions to plan and execute the debugging process.
How Runbooks Work in Relvy
Relvy’s AI investigation system consists of:
- A Planner, which orchestrates the overall debugging strategy.
- Specialized Data Source Agents for logs, metrics (dashboards), events, and traces.
When an investigation is triggered (manually or via alert), the planner:
- Reads applicable runbooks relevant to the incident symptoms or tags.
- Uses the instructions to prioritize certain signals and tools.
- Dispatches tasks to the data source agents accordingly.
- Aggregates findings into a unified Root Cause Analysis.
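The flow above can be pictured with a small Python sketch. It is only an illustration, not Relvy's implementation: the `Runbook` class, the tag-overlap matching in `select_runbooks`, and the keyword routing in `AGENT_KEYWORDS` are all assumptions made for this example, and the final aggregation step is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    # Hypothetical model of a runbook: a title, ordered steps, and tags.
    title: str
    instructions: list[str]
    tags: set[str] = field(default_factory=set)


# Hypothetical keyword routing from an instruction to a data source agent.
AGENT_KEYWORDS = {
    "logs": ("log",),
    "metrics": ("metric", "dashboard"),
    "events": ("event", "deployment"),
    "traces": ("trace", "span"),
}


def select_runbooks(runbooks, incident_tags):
    """Keep runbooks that share at least one tag with the incident."""
    return [rb for rb in runbooks if rb.tags & incident_tags]


def route(instruction):
    """Pick the first agent whose keyword appears in the instruction."""
    text = instruction.lower()
    for agent, keywords in AGENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return agent
    return "metrics"  # arbitrary fallback chosen for this sketch


def plan(runbooks, incident_tags):
    """Turn the steps of every matching runbook into (agent, step) tasks."""
    tasks = []
    for rb in select_runbooks(runbooks, incident_tags):
        for step in rb.instructions:
            tasks.append((route(step), step))
    return tasks


runbooks = [
    Runbook(
        "Latency in Core APIs",
        [
            "Begin with the API service metrics dashboard",
            "Investigate traces for slow spans",
            "Review logs for errors or warnings",
        ],
        {"latency"},
    ),
    Runbook(
        "Debugging Kafka issues",
        ["Check logs for consumer and producer services"],
        {"kafka"},
    ),
]

for agent, step in plan(runbooks, {"latency", "user-facing"}):
    print(f"{agent}: {step}")
```

Only the latency runbook matches the incident tags here, so the Kafka runbook contributes no tasks. Substring routing is deliberately naive; a real planner would interpret the natural-language instructions rather than match keywords.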
Runbooks as AI Guidance
Runbooks provide a way for your team to configure and influence the planner using natural language. You define what your team considers best practices — and the planner follows them when forming its investigation plan.
This allows you to encode organizational knowledge and system-specific workflows directly into the AI.
Creating a Runbook
To add a runbook:
- Navigate to the Runbooks tab in the Discovery section on the left sidebar.
- Click Create New Runbook.
- Fill out the form:
  - Title: A clear and concise label.
  - Type: e.g., General Planning, Log Analysis, Event Analysis.
  - Symptom / When to Use: Describe when this runbook is relevant.
  - Instructions: Write your investigation steps in natural language.
  - Tags (optional): Add tags like user-facing, latency, kubernetes, etc.
- Click Create Runbook to save.
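If you draft runbooks outside the UI before entering them, it can help to mirror the form fields in a small structure. The sketch below is one hypothetical way to do that in Python. `RunbookDraft` and `missing_fields` are names invented for this example and are not part of Relvy, and treating every field except Tags as required is an assumption based on the form description.

```python
from dataclasses import dataclass, field


@dataclass
class RunbookDraft:
    """Hypothetical in-memory mirror of the Create New Runbook form."""

    title: str
    runbook_type: str  # e.g. "General Planning", "Log Analysis"
    when_to_use: str
    instructions: str
    tags: list[str] = field(default_factory=list)  # marked optional in the form

    def missing_fields(self) -> list[str]:
        """Return the names of non-optional fields that are blank."""
        required = ("title", "runbook_type", "when_to_use", "instructions")
        return [name for name in required if not getattr(self, name).strip()]


draft = RunbookDraft(
    title="Latency in Core APIs",
    runbook_type="General Planning",
    when_to_use="When alerts mention increased latency in APIs",
    instructions="1. Begin with the API service metrics dashboard",
    tags=["latency"],
)
print(draft.missing_fields())  # prints [] because no non-optional field is blank
```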
Example Runbook Instructions
Example 1: General Debugging Instructions
Title | General Debugging Instructions |
When to Use | When debugging any incident |
Instructions | 1. Check the RED metrics dashboard as a starting point for most investigations 2. For user-facing issues, check frontend service logs and metrics 3. To check recent deployments, filter events by @source:kubernetes and look for pod restarts, scaling, or service deployments 4. Check the Runtime metrics dashboard for CPU/memory utilization 5. To locate traces from logs, use otel.trace_id and otel.span_id |
💡 Relvy’s planner will use these general guidelines as a foundation for any investigation, adapting the approach based on the specific incident context.
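Step 5 of this runbook relies on log records carrying trace context. As a rough illustration of what that lookup involves, the Python sketch below collects the trace and span identifiers from a batch of log records. It assumes each record is a flat dictionary with `otel.trace_id` and `otel.span_id` keys; the record shape, the `trace_refs` helper, and the sample IDs are all made up for this example.

```python
def trace_refs(log_records):
    """Collect unique (trace_id, span_id) pairs, in first-seen order."""
    seen = []
    for record in log_records:
        trace_id = record.get("otel.trace_id")
        if not trace_id:
            continue  # this log line carries no trace context
        ref = (trace_id, record.get("otel.span_id"))
        if ref not in seen:
            seen.append(ref)
    return seen


# Placeholder log records; real IDs would be hex strings from your tracer.
logs = [
    {"message": "checkout failed", "otel.trace_id": "trace-a", "otel.span_id": "span-1"},
    {"message": "retrying", "otel.trace_id": "trace-a", "otel.span_id": "span-1"},
    {"message": "startup complete"},
]
print(trace_refs(logs))  # [('trace-a', 'span-1')]
```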
Example 2: Latency in Backend Services
Title | Latency in Core APIs |
When to Use | When alerts mention increased latency in APIs |
Instructions | 1. Begin with the API service metrics dashboard 2. Compare current latency to 1h and 24h baselines 3. Check for saturation in DB or cache services connected to the API 4. Investigate traces for slow spans and associated service calls 5. Review logs for errors or warnings in the same time range |
💡 The planner will dynamically follow these steps, dispatching tasks to metric, trace, and log agents to execute them in order.
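To make step 2 concrete, here is a minimal Python sketch of a baseline comparison. The `compare_to_baselines` function, the 1.5x threshold, and the sample latencies are assumptions chosen for illustration; they do not describe how Relvy evaluates latency.

```python
def compare_to_baselines(current_ms, baseline_1h_ms, baseline_24h_ms, threshold=1.5):
    """Return latency ratios and the baselines exceeded by at least `threshold`."""
    ratios = {
        "vs_1h": current_ms / baseline_1h_ms,
        "vs_24h": current_ms / baseline_24h_ms,
    }
    regressed = [name for name, ratio in ratios.items() if ratio >= threshold]
    return ratios, regressed


# Made-up p95 latencies in milliseconds.
ratios, regressed = compare_to_baselines(900.0, 450.0, 800.0)
print(ratios)     # {'vs_1h': 2.0, 'vs_24h': 1.125}
print(regressed)  # ['vs_1h']
```

Comparing against both windows helps separate a sudden change (high against the 1h baseline) from a level that was already present a day earlier.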
Example 3: Debugging Kafka Issues
Title | Debugging Kafka issues |
When to Use | When debugging a Kafka-related alert |
Instructions | 1. Look at the Kafka dashboard for the appropriate topic / consumer group and identify the specific affected partitions 2. Identify whether any consumer pods are down 3. Check logs for consumer and producer services 4. For lag issues, check whether the lag is caused by a traffic surge 5. Finally, check the Kafka infrastructure metrics dashboards for issues with Kafka itself |
💡 Relvy’s planner will interpret this and structure its investigation to answer the above questions.
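For readers less familiar with consumer lag, the partition check in step 1 comes down to simple arithmetic: a partition's lag is its latest offset minus the offset the consumer group has committed. The Python sketch below works through that with made-up offsets. The function names, the threshold, and the choice to treat a missing committed offset as 0 are assumptions for illustration only.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest offset minus the group's committed offset."""
    return {
        # A partition with no committed offset is treated as fully unread here.
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag, threshold):
    """Return the partitions whose lag exceeds `threshold`, sorted by ID."""
    return sorted(p for p, value in lag.items() if value > threshold)


# Made-up offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 1480, 2: 9000}
committed = {0: 1495, 1: 1480, 2: 4000}

lag = partition_lag(end_offsets, committed)
print(lag)                                      # {0: 5, 1: 0, 2: 5000}
print(lagging_partitions(lag, threshold=1000))  # [2]
```

Lag concentrated in one partition, as here, often points at a specific consumer or a hot key, while lag growing across all partitions is more consistent with a traffic surge or a consumer-wide slowdown.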
Example 4: Application Architecture Overview
Title | Application and System Description |
When to Use | When debugging all incidents |
Instructions | This is an e-commerce application (Astronomy Shop). Critical services: accounting, ad, cart, checkout, currency, frontend, payment, product-catalog, quote, recommendation, shipping |
💡 Relvy’s planner will use this architectural knowledge to prioritize services during investigations.
Runbook Management
- Runbooks can be searched and filtered by tags.
- You can edit them at any time so they evolve with your system.
- After an investigation, Relvy highlights the runbooks that were followed and lets you view or modify the instructions for future investigations.
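As a mental model for search and tag filtering, the Python sketch below filters a list of runbooks by a title query and a set of tags. It is not Relvy's search logic: `filter_runbooks`, the rule that a runbook must carry every requested tag, and the tags attached to the sample runbooks are assumptions made for this example.

```python
def filter_runbooks(runbooks, query="", tags=()):
    """Return runbooks whose title contains `query` and that carry every tag in `tags`."""
    wanted = set(tags)
    needle = query.lower()
    return [
        rb
        for rb in runbooks
        if needle in rb["title"].lower() and wanted <= set(rb["tags"])
    ]


# Titles come from the examples in this guide; the tags are hypothetical.
runbooks = [
    {"title": "General Debugging Instructions", "tags": ["general"]},
    {"title": "Latency in Core APIs", "tags": ["latency", "user-facing"]},
    {"title": "Debugging Kafka issues", "tags": ["kafka"]},
]

by_tag = [rb["title"] for rb in filter_runbooks(runbooks, tags=["latency"])]
by_query = [rb["title"] for rb in filter_runbooks(runbooks, query="debugging")]
print(by_tag)    # ['Latency in Core APIs']
print(by_query)  # ['General Debugging Instructions', 'Debugging Kafka issues']
```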
Benefits of Configurable AI Planning
With runbooks:
- You encode team knowledge into the AI, turning experience into automation.
- Investigations become standardized, repeatable, and transparent.
- New team members benefit from a guided process, and experts can continuously improve it.
Whether you’re debugging application errors, latency spikes, infrastructure issues, or deployment regressions, Relvy’s AI follows your instructions step by step.